Re: Repartition and Worker Instances

Deep Pradhan Mon, 23 Feb 2015 19:10:32 -0800

You mean SPARK_WORKER_CORES in /conf/spark-env.sh?

On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui <same...@databricks.com>
wrote:


> In Standalone mode, a Worker JVM starts an Executor. Inside the Exec there
> are slots for task threads. The slot count is configured by the num_cores
> setting. Generally over subscribe this. So if you have 10 free CPU cores,
> set num_cores to 20.
>
>
> On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com>
> wrote:
>
>> How is task slot different from # of Workers?
>>
>>
>> >> so don't read into any performance metrics you've collected to
>> extrapolate what may happen at scale.
>> I did not get you in this.
>>
>> Thank You
>>
>> On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui <same...@databricks.com
>> > wrote:
>>
>>> In general you should first figure out how many task slots are in the
>>> cluster and then repartition the RDD to maybe 2x that #. So if you have a
>>> 100 slots, then maybe RDDs with partition count of 100-300 would be normal.
>>>
>>> But also size of each partition can matter. You want a task to operate
>>> on a partition for at least 200ms, but no longer than around 20 seconds.
>>>
>>> Even if you have 100 slots, it could be okay to have a RDD with 10,000
>>> partitions if you've read in a large file.
>>>
>>> So don't repartition your RDD to match the # of Worker JVMs, but rather
>>> align it to the total # of task slots in the Executors.
>>>
>>> If you're running on a single node, shuffle operations become almost
>>> free (because there's no network movement), so don't read into any
>>> performance metrics you've collected to extrapolate what may happen at
>>> scale.
>>>
>>>
>>> On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> If I repartition my data by a factor equal to the number of worker
>>>> instances, will the performance be better or worse?
>>>> As far as I understand, the performance should be better, but in my
>>>> case it is becoming worse.
>>>> I have a single node standalone cluster, is it because of this?
>>>> Am I guaranteed to have a better performance if I do the same thing in
>>>> a multi-node cluster?
>>>>
>>>> Thank You
>>>>
>>>
>>

Re: Repartition and Worker Instances

Reply via email to