Spark will use the cores available in the cluster. If your cluster is one node with 4 cores, Spark will execute up to 4 tasks in parallel. Setting your number of partitions to 4 will ensure an even load across the cores. Note that this is different from saying "threads": internally, Spark uses many threads (data block sender/receiver, listeners, notifications, scheduler, ...).
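To make the "4 partitions for 4 cores" advice concrete, here is a minimal sketch of a submission that sets the default partition count. `spark.default.parallelism` is a standard Spark setting that controls the default number of partitions for `parallelize` and shuffle operations; the application jar and class names are hypothetical placeholders:

```shell
# Sketch: match the default partition count to the 4 available cores.
# (com.example.MyApp and myapp.jar are hypothetical names.)
spark-submit \
  --master spark://10.125.21.15:7070 \
  --conf spark.default.parallelism=4 \
  --class com.example.MyApp \
  myapp.jar
```

For an RDD that already exists with fewer partitions (e.g. read from a single file block), you can also call `rdd.repartition(4)` in the application code to redistribute the data across 4 tasks.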
-kr, Gerard.

On Fri, Jan 16, 2015 at 3:14 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:

> Does parallel processing mean it is executed in multiple workers, or
> executed in one worker but with multiple threads? For example, if I have
> only one worker but my RDD has 4 partitions, will it be executed in
> parallel in 4 threads?
>
> The reason I am asking is to decide whether I need to configure Spark to
> have multiple workers. By default, it just starts with one worker.
>
> Regards,
>
> Ningjun Wang
> Consulting Software Engineer
> LexisNexis
> 121 Chanlon Road
> New Providence, NJ 07974-1541
>
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, January 15, 2015 11:04 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: How to force parallel processing of RDD using multiple threads
>
> Check the number of partitions in your input. It may be much less than
> the available parallelism of your small cluster. For example, input that
> lives in just 1 partition will spawn just 1 task.
>
> Beyond that, parallelism just happens. You can see the parallelism of
> each operation in the Spark UI.
>
> On Thu, Jan 15, 2015 at 10:53 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:
> > Spark Standalone cluster.
> >
> > My program is running very slowly; I suspect it is not doing parallel
> > processing of the RDD. How can I force it to run in parallel? Is there
> > any way to check whether it is processed in parallel?
> >
> > Regards,
> >
> > Ningjun Wang
> > Consulting Software Engineer
> > LexisNexis
> > 121 Chanlon Road
> > New Providence, NJ 07974-1541
> >
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:so...@cloudera.com]
> > Sent: Thursday, January 15, 2015 4:29 PM
> > To: Wang, Ningjun (LNG-NPV)
> > Cc: user@spark.apache.org
> > Subject: Re: How to force parallel processing of RDD using multiple threads
> >
> > What is your cluster manager? For example, on YARN you would specify
> > --executor-cores. Read:
> > http://spark.apache.org/docs/latest/running-on-yarn.html
> >
> > On Thu, Jan 15, 2015 at 8:54 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:
> >> I have a standalone Spark cluster with only one node with 4 CPU cores.
> >> How can I force Spark to do parallel processing of my RDD using
> >> multiple threads? For example, I can do the following:
> >>
> >> spark-submit --master local[4]
> >>
> >> However, I really want to use the cluster, as follows:
> >>
> >> spark-submit --master spark://10.125.21.15:7070
> >>
> >> In that case, how can I make sure the RDD is processed with multiple
> >> threads/cores?
> >>
> >> Thanks
> >>
> >> Ningjun
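The `--executor-cores` flag mentioned above applies to YARN; for the standalone cluster in the original question, the equivalent cap is `spark.cores.max` (or the `--total-executor-cores` flag on `spark-submit`). A minimal sketch of both submission styles, with hypothetical jar and class names:

```shell
# Sketch: controlling cores per application (app names are hypothetical).

# On a standalone cluster: cap the total cores the application may use.
spark-submit \
  --master spark://10.125.21.15:7070 \
  --total-executor-cores 4 \
  --class com.example.MyApp \
  myapp.jar

# On YARN: request cores per executor and a number of executors.
spark-submit \
  --master yarn \
  --num-executors 1 \
  --executor-cores 4 \
  --class com.example.MyApp \
  myapp.jar
```

Either way, remember the point from earlier in the thread: the cores only help if the RDD has at least that many partitions, since each task processes one partition.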