Can you share your entire spark-submit command? Are you missing "--executor-cores <num_cpu>"? Also, if you intend to use all 6 nodes, you also need "--num-executors 6".
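For reference, an invocation with both flags set might look like the sketch below. The app class, jar name, and memory size are placeholders for illustration, not values from this thread:

```shell
# Hypothetical spark-submit sketch: request 6 executors, each with
# multiple cores, so map tasks can run in parallel across all 6 nodes.
spark-submit \
  --master yarn-cluster \
  --num-executors 6 \
  --executor-cores 8 \
  --executor-memory 8g \
  --class com.example.MyApp \
  my-app.jar
```

Without --num-executors, YARN may allocate only the default number of executors (often 2, or just 1 container besides the driver), which would match the behavior described below.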
On Mon, May 4, 2015 at 2:07 AM, Xi Shen <davidshe...@gmail.com> wrote:

> Hi,
>
> I have two small RDDs, each with about 600 records. In my code, I did:
>
>     val rdd1 = sc...cache()
>     val rdd2 = sc...cache()
>
>     val result = rdd1.cartesian(rdd2).repartition(num_cpu).map { case (a, b) =>
>       some_expensive_job(a, b)
>     }
>
> I ran my job on a YARN cluster with "--master yarn-cluster". I have 6 executors, each with a large memory volume.
>
> However, I noticed my job is very slow. I went to the RM page and found there are two containers: one is the driver, the other is the worker. I guess this is correct?
>
> I went to the worker's log and monitored it in detail. My app prints some information, so I can use it to estimate the progress of the "map" operation. Looking at the log, it feels like the jobs are done one by one sequentially, rather than #cpu at a time.
>
> I checked the worker nodes, and their CPUs are all busy.
>
> Xi Shen
> about.me/davidshen <http://about.me/davidshen>