So one worker is enough, and it will use all 4 cores? In what situation should I 
configure more workers in my single-node cluster?

Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541

From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Friday, January 16, 2015 9:44 AM
To: Wang, Ningjun (LNG-NPV)
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread

Spark will use the number of cores available in the cluster. If your cluster is 
1 node with 4 cores, Spark will execute up to 4 tasks in parallel.
Setting your number of partitions to 4 will ensure an even load across the cores.
Note that this is different from "threads": internally, Spark uses many threads 
(data block sender/receiver, listeners, notifications, scheduler, ...).
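
As a minimal sketch (the input path and the count of 4 are illustrative, 
chosen to match your 4-core node, not anything from your setup):

  val rdd = sc.textFile("/path/to/input").repartition(4)  // force 4 partitions
  rdd.partitions.size   // verify: should be 4

Note that repartition() shuffles the data, so do it once up front rather than 
before every operation.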

-kr, Gerard.

On Fri, Jan 16, 2015 at 3:14 PM, Wang, Ningjun (LNG-NPV) 
<ningjun.w...@lexisnexis.com> wrote:
Does parallel processing mean it is executed in multiple workers, or executed in 
one worker but with multiple threads? For example, if I have only one worker but 
my RDD has 4 partitions, will it be executed in parallel in 4 threads?

The reason I am asking is to try to decide whether I need to configure Spark to 
have multiple workers. By default, it just starts with one worker.

Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541


-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, January 15, 2015 11:04 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread

Check the number of partitions in your input. It may be much less than the 
available parallelism of your small cluster. For example, input that lives in 
just 1 partition will spawn just 1 task.

Beyond that, parallelism just happens. You can see the parallelism of each 
operation in the Spark UI.
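
As a hedged illustration (the HDFS path and the counts are made up):

  val rdd = sc.textFile("hdfs:///data/docs", 4)  // ask for at least 4 splits
  println(rdd.partitions.size)                   // see how many you really got

The second argument to textFile() is minPartitions; the actual number can be 
higher, depending on the input splits.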

On Thu, Jan 15, 2015 at 10:53 PM, Wang, Ningjun (LNG-NPV) 
<ningjun.w...@lexisnexis.com> wrote:
> Spark Standalone cluster.
>
> My program is running very slowly, and I suspect it is not doing parallel 
> processing of the RDD. How can I force it to run in parallel? Is there any way 
> to check whether it is processed in parallel?
>
> Regards,
>
> Ningjun Wang
> Consulting Software Engineer
> LexisNexis
> 121 Chanlon Road
> New Providence, NJ 07974-1541
>
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, January 15, 2015 4:29 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: How to force parallel processing of RDD using multiple
> thread
>
> What is your cluster manager? For example, on YARN you would specify 
> --executor-cores. Read:
> http://spark.apache.org/docs/latest/running-on-yarn.html
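>
> For your standalone master, a hedged example (the master URL is the one from 
> your message; the core count and jar name are illustrative):
>
>   spark-submit --master spark://10.125.21.15:7070 \
>     --total-executor-cores 4 your-app.jar
>
> --total-executor-cores caps how many cores the app uses across the cluster.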
>
> On Thu, Jan 15, 2015 at 8:54 PM, Wang, Ningjun (LNG-NPV) 
> <ningjun.w...@lexisnexis.com> wrote:
>> I have a standalone Spark cluster with only one node with 4 CPU cores.
>> How can I force Spark to do parallel processing of my RDD using
>> multiple threads? For example, I can do the following:
>>
>> spark-submit --master local[4]
>>
>> However, I really want to use the cluster as follows:
>>
>> spark-submit --master spark://10.125.21.15:7070
>>
>> In that case, how can I make sure the RDD is processed with multiple
>> threads/cores?
>>
>> Thanks
>>
>> Ningjun
