Hi,
If I repartition my data by a factor equal to the number of worker
instances, will the performance be better or worse?
As far as I understand, performance should improve, but in my case it gets
worse.
I have a single-node standalone cluster; could that be the reason?
Am I guaranteed better performance if I repartition this way?
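To make the setup concrete, here is a minimal sketch of the kind of repartitioning being asked about; the input RDD, its size, and the worker count are made up for illustration and are not from this thread:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // Master URL is supplied at spark-submit time; this is purely illustrative.
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

    val data = sc.parallelize(1 to 1000000)   // stand-in for the real dataset
    val workers = 4                            // stand-in for the number of worker instances

    // Repartition to one partition per worker, as described in the question above.
    val repartitioned = data.repartition(workers)
    println(s"partitions = ${repartitioned.partitions.length}")

    sc.stop()
  }
}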
In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there
are slots for task threads. The slot count is configured by the num_cores
setting. You can generally oversubscribe this: if you have 10 free CPU cores,
set num_cores to 20.
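As a rough illustration (not from this thread), the worker's slot count in standalone mode is typically set in conf/spark-env.sh; the values below are assumptions:

# conf/spark-env.sh (sketch): a machine with 10 physical cores
# advertising 20 task slots, i.e. the 2x oversubscription suggested above.
SPARK_WORKER_CORES=20
SPARK_WORKER_MEMORY=16g   # illustrative value, not from the thread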
On Monday, February 23, 2015, Deep Pradhan wrote:
In general you should first figure out how many task slots are in the
cluster and then repartition the RDD to maybe 2x that number. So if you have
100 slots, then RDDs with a partition count of 100-300 would be normal.
But the size of each partition also matters: you want a task to operate on a
partition that is neither too small nor too large.
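For illustration, a small Scala sketch of picking a partition count from both the slot count and a target partition size; the 128 MB target and the helper name are assumptions, not something stated in the thread:

import org.apache.spark.SparkContext

object PartitionCountSketch {
  // Hypothetical helper: aim for roughly 2x the available task slots,
  // while also keeping each partition near a target size.
  def choose(sc: SparkContext, inputBytes: Long): Int = {
    val slots = sc.defaultParallelism                  // total task slots Spark reports
    val targetPartitionBytes = 128L * 1024 * 1024      // assumed ~128 MB per partition
    val bySize  = math.max(1, (inputBytes / targetPartitionBytes).toInt)
    val bySlots = slots * 2
    math.max(bySize, bySlots)
  }
}

Usage would be something like rdd.repartition(PartitionCountSketch.choose(sc, estimatedInputBytes)), where estimatedInputBytes is whatever estimate of the dataset size you have.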
How is a task slot different from the # of Workers?
Also, your cluster is a single node, so don't read into any performance
metrics you've collected to extrapolate what may happen at scale.
I did not understand what you meant by this.
Thank You
On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com
wrote:
In general you should first
You mean SPARK_WORKER_CORES in /conf/spark-env.sh?
On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui same...@databricks.com
wrote:
In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there
are slots for task threads. The slot count is configured by the num_cores
setting.