Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there are slots for task threads. The slot count is configured by the num_cores setting. Generally you can oversubscribe this, so if you have 10 free CPU cores, set num_cores to 20. On Monday, February 23, 2015, Deep Pradhan
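
[A minimal conf/spark-env.sh sketch for Standalone mode, assuming the "num_cores" setting above corresponds to SPARK_WORKER_CORES, as the follow-up later in the thread asks; the core and memory values are illustrative:]

    # conf/spark-env.sh (Standalone mode)
    # The machine has 10 physical cores; advertising 20 oversubscribes 2x,
    # so this Worker offers 20 task slots to the Executors it launches.
    SPARK_WORKER_CORES=20
    # Memory the Worker can hand out to Executors (illustrative value).
    SPARK_WORKER_MEMORY=16g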

Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
In general, you should first figure out how many task slots are in the cluster and then repartition the RDD to maybe 2x that number. So if you have 100 slots, then RDDs with a partition count of 100-300 would be normal. But the size of each partition can also matter. You want a task to operate on a
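
[A short Scala sketch of that rule of thumb, assuming a Standalone cluster where sc.defaultParallelism reports the total core/slot count; the input path is hypothetical:]

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

        // In Standalone mode, defaultParallelism reflects the total number
        // of cores (task slots) available across the cluster.
        val totalSlots = sc.defaultParallelism

        // Rule of thumb from the thread: roughly 2x the slot count.
        val target = totalSlots * 2

        val rdd = sc.textFile("hdfs:///path/to/input")  // hypothetical path
        val repartitioned = rdd.repartition(target)

        println(s"slots: $totalSlots, partitions: " + repartitioned.partitions.length)
        sc.stop()
      }
    }

[With 100 slots this yields 200 partitions, inside the 100-300 range suggested above.]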

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
How is a task slot different from the # of Workers? "so don't read into any performance metrics you've collected to extrapolate what may happen at scale." I did not follow what you meant by this. Thank You On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com wrote: In general you should first

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
You mean SPARK_WORKER_CORES in /conf/spark-env.sh? On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui same...@databricks.com wrote: In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there are slots for task threads. The slot count is configured by the num_cores setting.