I think it should use the default parallelism, which by default equals the number of cores in your cluster.
If you want to control it, specify a value for numSlices, the second parameter to parallelize().

-adrian

On 10/20/15, 6:13 PM, "t3l" <[email protected]> wrote:

>If I have a cluster with 7 nodes, each having an equal number of cores, and
>create an RDD with sc.parallelize(), it looks as if Spark will always
>try to distribute the partitions.
>
>Questions:
>(1) Is that something I can rely on?
>
>(2) Can I rely on sc.parallelize() assigning partitions to as many
>executors as possible? Meaning: let's say I request 7 partitions, is each
>node guaranteed to get 1 of these partitions? If I select 14 partitions, is
>each node guaranteed to grab 2 partitions?
>
>P.S.: I am aware that for other cases like sc.hadoopFile, this might depend
>on the actual storage location of the data. I am merely asking about the
>sc.parallelize() case.
>
>
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/Partition-for-each-executor-tp25141.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [email protected]
>For additional commands, e-mail: [email protected]
>
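For reference, numSlices only controls how the local collection is cut into partitions; where those partitions end up running is decided later by the task scheduler. The sketch below is a hypothetical stand-alone helper (not Spark's actual API) that illustrates, in simplified form, the contiguous-slicing scheme sc.parallelize() applies to an ordinary sequence:

```scala
// Hypothetical helper mimicking (simplified) how sc.parallelize()
// cuts a local sequence into numSlices contiguous slices.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  require(numSlices >= 1, "numSlices must be at least 1")
  (0 until numSlices).map { i =>
    // Integer arithmetic spreads any remainder over the later slices.
    val start = ((i.toLong * seq.length) / numSlices).toInt
    val end   = (((i + 1).toLong * seq.length) / numSlices).toInt
    seq.slice(start, end)
  }
}

// 14 elements in 7 slices -> 7 partitions of 2 elements each.
println(slice(1 to 14, 7).map(_.size))  // Vector(2, 2, 2, 2, 2, 2, 2)
// 10 elements in 3 slices -> sizes 3, 3, 4.
println(slice(1 to 10, 3).map(_.size))  // Vector(3, 3, 4)
```

Note that this only answers how the data is partitioned; parallelize() itself makes no per-node placement guarantee of the kind asked about in question (2).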
