You mean SPARK_WORKER_CORES in /conf/spark-env.sh? On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui <same...@databricks.com> wrote:
> In Standalone mode, a Worker JVM starts an Executor. Inside the Exec there > are slots for task threads. The slot count is configured by the num_cores > setting. Generally over subscribe this. So if you have 10 free CPU cores, > set num_cores to 20. > > > On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com> > wrote: > >> How is task slot different from # of Workers? >> >> >> >> so don't read into any performance metrics you've collected to >> extrapolate what may happen at scale. >> I did not get you in this. >> >> Thank You >> >> On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui <same...@databricks.com >> > wrote: >> >>> In general you should first figure out how many task slots are in the >>> cluster and then repartition the RDD to maybe 2x that #. So if you have a >>> 100 slots, then maybe RDDs with partition count of 100-300 would be normal. >>> >>> But also size of each partition can matter. You want a task to operate >>> on a partition for at least 200ms, but no longer than around 20 seconds. >>> >>> Even if you have 100 slots, it could be okay to have a RDD with 10,000 >>> partitions if you've read in a large file. >>> >>> So don't repartition your RDD to match the # of Worker JVMs, but rather >>> align it to the total # of task slots in the Executors. >>> >>> If you're running on a single node, shuffle operations become almost >>> free (because there's no network movement), so don't read into any >>> performance metrics you've collected to extrapolate what may happen at >>> scale. >>> >>> >>> On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> If I repartition my data by a factor equal to the number of worker >>>> instances, will the performance be better or worse? >>>> As far as I understand, the performance should be better, but in my >>>> case it is becoming worse. >>>> I have a single node standalone cluster, is it because of this? >>>> Am I guaranteed to have a better performance if I do the same thing in >>>> a multi-node cluster? >>>> >>>> Thank You >>>> >>> >>