Hi Naveen,

By default, parallelize splits the data into a default number of partitions, which you can control cluster-wide with the property spark.default.parallelism. If you just want a specific call to use a different number of partitions, you can instead call sc.parallelize(data, numPartitions). The default value of the property is documented at http://spark.apache.org/docs/latest/configuration.html#spark-properties
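To make the splitting concrete, here is a small plain-Java sketch of how a parallelized collection gets cut into roughly equal, contiguous slices. This is a simplified illustration, not Spark's actual source, and the class/method names (SliceDemo, slice) are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class SliceDemo {

    // Simplified sketch: cut data into numSlices contiguous chunks whose
    // sizes differ by at most one element.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) data.add(i);

        // With 2 slices (as in your 2-node example), 1000 integers
        // split into two partitions of 500 each.
        List<List<Integer>> twoWay = slice(data, 2);
        System.out.println(twoWay.get(0).size() + " " + twoWay.get(1).size()); // 500 500

        // Asking for 8 partitions instead gives 8 chunks of 125.
        List<List<Integer>> eightWay = slice(data, 8);
        System.out.println(eightWay.size()); // 8
    }
}
```

So if you want more (or fewer) tasks per executor, the lever is the partition count you hand to parallelize, not anything in the data itself.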
Cheers,

Holden :)

On Thu, Nov 6, 2014 at 10:43 PM, Naveen Kumar Pokala <npok...@spcapitaliq.com> wrote:

> Hi,
>
> JavaRDD<Integer> distData = sc.parallelize(data);
>
> On what basis does parallelize split the data into multiple datasets? How to
> handle it if we want these many datasets to be executed per executor?
>
> For example, my data is a list of 1000 integers and I have a 2-node YARN
> cluster. It is dividing it into 2 batches of size 500.
>
> Regards,
>
> Naveen.

--
Cell : 425-233-8271