Ok, I set the number of Spark worker instances to 2 (my startup command is 
below).  This had the effect of increasing my number of workers from 3 to 6 
(which was good), but it also reduced my number of cores per worker from 8 to 
4 (which was not so good).  In the end, I would still only be able to process 
24 partitions in parallel.  I'm starting a stand-alone cluster using the 
Spark-provided ec2 scripts.  I tried setting the SPARK_WORKER_CORES env 
variable in spark_ec2.py, but this had no effect, so it's not clear whether 
SPARK_WORKER_CORES can even be set with the ec2 scripts.  Anyway, I'm not 
sure if there is anything else I can try, but I at least wanted to document 
what I did try and the net effect.  I'm open to any suggestions/advice.

 ./spark-ec2 -k key -i key.pem --hadoop-major-version=2 launch -s 3 \
   -t m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e \
   --worker-instances=2 my-cluster
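
If it would help, one more thing I could try is setting SPARK_WORKER_CORES 
directly on the launched cluster instead of through the ec2 scripts.  Roughly 
the following (an untested sketch; the value is a guess, and the paths are 
what I believe the ec2 scripts set up):

  # on the master, add to /root/spark/conf/spark-env.sh
  export SPARK_WORKER_CORES=16   # guess: oversubscribe the 8 physical cores

  # push the conf dir to the slaves and restart the standalone daemons
  /root/spark-ec2/copy-dir /root/spark/conf
  /root/spark/sbin/stop-all.sh
  /root/spark/sbin/start-all.sh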



________________________________
 From: Daniel Siegmann <daniel.siegm...@velos.io>
To: Darin McBeath <ddmcbe...@yahoo.com> 
Cc: Daniel Siegmann <daniel.siegm...@velos.io>; "user@spark.apache.org" 
<user@spark.apache.org> 
Sent: Thursday, July 31, 2014 10:04 AM
Subject: Re: Number of partitions and Number of concurrent tasks
 


I haven't configured this myself. I'd start with setting SPARK_WORKER_CORES to 
a higher value, since that's a bit simpler than adding more workers. This 
defaults to "all available cores" according to the documentation, so I'm not 
sure if you can actually set it higher. If not, you can get around this by 
adding more worker instances; I believe simply setting SPARK_WORKER_INSTANCES 
to 2 would be sufficient.
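
Concretely, I mean something like this in conf/spark-env.sh on each worker 
node (just a sketch, since as I said I haven't configured this myself):

  # conf/spark-env.sh (sketch)
  export SPARK_WORKER_CORES=16      # more cores than physical, if Spark honors it

  # or, if that doesn't work:
  export SPARK_WORKER_INSTANCES=2   # run two workers per node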

I don't think you have to set the cores if you have more workers - it will 
default to 8 cores per worker (in your case). But maybe 16 cores per node will 
be too many. You'll have to test. Keep in mind that more workers means more 
memory and such too, so you may need to tweak some other settings downward in 
this case.
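
For example (hypothetical numbers, assuming roughly 30GB of RAM on an 
m3.2xlarge):

  # with two workers per node, roughly halve the memory each worker offers
  export SPARK_WORKER_MEMORY=13g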

On a side note: I've read some people found performance was better when they 
had more workers with less memory each, instead of a single worker with tons of 
memory, because it cut down on garbage collection time. But I can't speak to 
that myself.

In any case, if you increase the number of cores available in your cluster 
(whether per worker, or adding more workers per node, or of course adding more 
nodes) you should see more tasks running concurrently. Whether this will 
actually be faster probably depends mainly on whether the CPUs in your nodes 
were really being fully utilized with the current number of cores.
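
(To put numbers on it: 3 nodes x 2 workers x 8 cores each would be 48 cores, 
so a stage with at least 48 partitions could have 48 tasks in flight at once.)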




On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:

>Thanks.
>
>
>So to make sure I understand: since I'm using a 'stand-alone' cluster, I 
>would set SPARK_WORKER_INSTANCES to something like 2 (instead of the default 
>value of 1).  Is that correct?  But it also sounds like I need to explicitly 
>set a value for SPARK_WORKER_CORES (based on what the documentation states). 
> What would I want that value to be, given my configuration below?  Or would 
>I leave that alone?
>
>
>
>________________________________
> From: Daniel Siegmann <daniel.siegm...@velos.io>
>To: user@spark.apache.org; Darin McBeath <ddmcbe...@yahoo.com> 
>Sent: Wednesday, July 30, 2014 5:58 PM
>Subject: Re: Number of partitions and Number of concurrent tasks
> 
>
>
>This is correct behavior. Each "core" can execute exactly one task at a time, 
>with each task corresponding to a partition. If your cluster only has 24 
>cores, you can only run at most 24 tasks at once.
>
>You could run multiple workers per node to get more executors. That would give 
>you more cores in the cluster. But however many cores you have, each core will 
>run only one task at a time.
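>
>For example, with your 24 cores a stage of 100 partitions will run as 
>roughly four waves of 24 tasks plus a final wave of 4.  Raising 
>spark.default.parallelism changes how many tasks a stage has, not how many 
>can run at the same time.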
>
>
>
>
>On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:
>
>>I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>>
>>
>>I have an RDD<String> which I've repartitioned so it has 100 partitions 
>>(hoping to increase the parallelism).
>>
>>
>>When I do a transformation (such as filter) on this RDD, I can't seem to get 
>>more than 24 tasks (my total number of cores across the 3 nodes) running at 
>>any one point in time.  By tasks, I mean the number of tasks that appear 
>>under the Application UI.  I tried explicitly setting spark.default.parallelism 
>>to 48 (hoping I would get 48 tasks running concurrently) and verified the 
>>setting in the Application UI for the running application, but it had no 
>>effect.  Perhaps this is ignored for a 'filter' and the default is the total 
>>number of cores available.
>>
>>
>>I'm fairly new to Spark, so maybe I'm just missing or misunderstanding 
>>something fundamental.  Any help would be appreciated.
>>
>>
>>Thanks.
>>
>>
>>Darin.
>>
>>
>
>
>-- 
>
>Daniel Siegmann, Software Developer
>Velos
>Accelerating Machine Learning
>
>
>440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>E: daniel.siegm...@velos.io  W: www.velos.io
>
>


-- 

Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning


440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io  W: www.velos.io
