Hi Naveen,

So by default, when we call parallelize, the data is split into the default
number of partitions (which we can control with the
property spark.default.parallelism). If we just want a specific call to
parallelize to produce a different number of partitions, we can instead
call sc.parallelize(data, numPartitions). The default value is
documented at
http://spark.apache.org/docs/latest/configuration.html#spark-properties
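For example, here's a rough sketch in Java (untested; the partition count
of 4 and the sample data are just placeholders):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ParallelizeExample");
    JavaSparkContext sc = new JavaSparkContext(conf);

    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

    // Uses the default number of partitions (spark.default.parallelism)
    JavaRDD<Integer> distData = sc.parallelize(data);
    System.out.println(distData.partitions().size());

    // Explicitly asks for 4 partitions, for this RDD only
    JavaRDD<Integer> distData4 = sc.parallelize(data, 4);
    System.out.println(distData4.partitions().size()); // prints 4

    sc.stop();
  }
}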

Cheers,

Holden :)

On Thu, Nov 6, 2014 at 10:43 PM, Naveen Kumar Pokala <
npok...@spcapitaliq.com> wrote:

> Hi,
>
> JavaRDD<Integer> distData = sc.parallelize(data);
>
> On what basis does parallelize split the data into multiple partitions? How
> can we control how many of these partitions are executed per executor?
>
> For example, my data is a list of 1000 integers and I have a 2-node YARN
> cluster. It is dividing into 2 batches of size 500.
>
> Regards,
>
> Naveen.
>



-- 
Cell : 425-233-8271
