Quite a good question. I assume you know the size of the cluster going in; you can then partition the data into some multiple of that and use a RangePartitioner to split the data roughly equally. By default, partitions are created based on the number of blocks on the filesystem, and the overhead of scheduling that many tasks mostly kills the performance.
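As a rough starting point (a common heuristic, not an official rule), you can derive the count from the cluster and the input size instead of hard-coding it: aim for a few tasks per core, or for partitions of roughly one HDFS block each, whichever gives more partitions. The byte estimate below is a made-up assumption for illustration:

// Sketch: choose a partition count from cluster parallelism and data size.
val totalCores = sc.defaultParallelism          // roughly executors * cores per executor
val inputBytes = 80000000L * 100L               // hypothetical: ~100 bytes per record
val targetPartitionBytes = 128L * 1024 * 1024   // aim for ~one HDFS block per partition
val numPartitions = math.max(
  totalCores * 3,                               // 2-4 tasks per core is a common rule of thumb
  (inputBytes / targetPartitionBytes).toInt + 1 // enough partitions to keep each one small
)

You can then pass numPartitions to coalesce()/repartition(), or to a RangePartitioner as below.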
import org.apache.spark.RangePartitioner

val file = sc.textFile("<my local path>")
// Pair each record with a dummy value so the RDD has keys to range-partition on.
val partitionedFile = file.map(x => (x, 1))
// 3 is a placeholder; a derived count like numPartitions above works the same way.
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Sat, Aug 16, 2014 at 7:04 AM, abhiguruvayya <sharath.abhis...@gmail.com> wrote:

> My use case is as follows.
>
> 1. Read input data from the local file system using sparkContext.textFile(input path).
> 2. Partition the input data (80 million records) into partitions using
> RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer
> function. Without coalesce() or repartition() on the input data, Spark
> executes really slowly and fails with an out-of-memory exception.
>
> The issue I am facing here is deciding the number of partitions to apply
> to the input data. *The input data size varies every time, so hard-coding a
> particular value is not an option. And Spark performs really well only when
> a certain optimum partitioning is applied to the input data, which I have
> to find through lots of iteration (trial and error). That is not an option
> in a production environment.*
>
> My question: is there a rule of thumb to decide the number of partitions
> required depending on the input data size and the cluster resources
> available (executors, cores, etc.)? If yes, please point me in that
> direction. Any help is much appreciated.
>
> I am using Spark 1.0 on YARN.
>
> Thanks,
> AG