Does parallel processing mean it is executed on multiple workers, or on one worker but with multiple threads? For example, if I have only one worker but my RDD has 4 partitions, will it be processed in parallel on 4 threads?
The reason I am asking is that I am trying to decide whether I need to configure Spark to have multiple workers. By default, it just starts with one worker.

Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, January 15, 2015 11:04 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread

Check the number of partitions in your input. It may be much less than
the available parallelism of your small cluster. For example, input that
lives in just 1 partition will spawn just 1 task.

Beyond that, parallelism just happens. You can see the parallelism of each
operation in the Spark UI.

On Thu, Jan 15, 2015 at 10:53 PM, Wang, Ningjun (LNG-NPV)
<ningjun.w...@lexisnexis.com> wrote:
> Spark Standalone cluster.
>
> My program is running very slowly, and I suspect it is not doing parallel
> processing of the RDD. How can I force it to run in parallel? Is there any
> way to check whether it is processed in parallel?
>
> Regards,
>
> Ningjun Wang
> Consulting Software Engineer
> LexisNexis
> 121 Chanlon Road
> New Providence, NJ 07974-1541
>
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, January 15, 2015 4:29 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: How to force parallel processing of RDD using multiple
> thread
>
> What is your cluster manager? For example, on YARN you would specify
> --executor-cores. Read:
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
> On Thu, Jan 15, 2015 at 8:54 PM, Wang, Ningjun (LNG-NPV)
> <ningjun.w...@lexisnexis.com> wrote:
>> I have a standalone Spark cluster with only one node with 4 CPU cores.
>> How can I force Spark to do parallel processing of my RDD using
>> multiple threads?
>> For example, I can do the following:
>>
>> spark-submit --master local[4]
>>
>> However, I really want to use the cluster as follows:
>>
>> spark-submit --master spark://10.125.21.15:7070
>>
>> In that case, how can I make sure the RDD is processed with multiple
>> threads/cores?
>>
>> Thanks
>>
>> Ningjun
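The options discussed in this thread can be sketched as spark-submit invocations. This is an illustrative fragment, not a tested recipe: `my-app.jar` and the input path are placeholders, the master URL is the one quoted above, and `--total-executor-cores` is the standalone-mode counterpart of the `--executor-cores` flag Sean mentions for YARN:

```shell
# Local mode: one JVM, up to 4 tasks running in parallel threads
spark-submit --master local[4] my-app.jar

# Standalone cluster: cap the total cores the application may use
spark-submit --master spark://10.125.21.15:7070 \
  --total-executor-cores 4 \
  my-app.jar

# Either way, parallelism is bounded by the partition count of the RDD.
# In the driver program (Scala), for example:
#   sc.textFile("input.txt", 4)   // request at least 4 partitions on load
#   rdd.repartition(4)            // or repartition an existing RDD
```

As Sean's reply notes, the Spark UI shows how many tasks each stage actually ran, which is the quickest way to confirm the work was split across cores.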