Re: Spark on Yarn vs Standalone

2015-09-21 Thread Saisai Shao
I think you need to increase the memory size of the executor through the command-line argument "--executor-memory", or the configuration "spark.executor.memory". Also increase yarn.scheduler.maximum-allocation-mb on the YARN side if necessary.
Thanks
Saisai
On Mon, Sep 21, 2015 at 5:13 PM, Alexander Pivovarov
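The reason yarn.scheduler.maximum-allocation-mb matters is that YARN rejects container requests above that cap, and a Spark executor's request is its heap plus the memory overhead. A minimal sketch of that check (the helper name `request_fits` and the example numbers are illustrative, not from Spark or YARN source):

```python
def request_fits(executor_memory_mb, overhead_mb, max_allocation_mb):
    """YARN will not grant a container larger than
    yarn.scheduler.maximum-allocation-mb, so the executor heap plus
    its memory overhead must stay under that limit."""
    return executor_memory_mb + overhead_mb <= max_allocation_mb

# e.g. a 20 GiB heap with 2 GiB overhead under a 24 GiB cap fits:
print(request_fits(20 * 1024, 2 * 1024, 24 * 1024))  # True
# ...but not under a 21 GiB cap:
print(request_fits(20 * 1024, 2 * 1024, 21 * 1024))  # False
```

So when raising --executor-memory, the YARN-side cap may need to rise with it.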

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Sandy Ryza
The warning you're seeing in Spark is not an issue by itself. The scratch space lives inside the heap, so it will never result in YARN killing the container on its own. The issue is that Spark is using some off-heap space on top of that. You'll need to bump the spark.yarn.executor.memoryOverhead property to give
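The distinction can be sketched numerically, using the executor sizes mentioned elsewhere in this thread. The default-overhead formula below (10% of the heap, minimum 384 MB) is an assumption based on Spark 1.x-era defaults, and `container_killed` is a hypothetical helper, not YARN code:

```python
def default_yarn_memory_overhead_mb(executor_memory_mb):
    # Assumed Spark 1.x default: max(384 MB, 10% of the executor heap).
    return max(384, int(executor_memory_mb * 0.10))

def container_killed(heap_used_mb, offheap_used_mb, executor_memory_mb, overhead_mb):
    # YARN tracks the whole process, so heap plus off-heap usage above the
    # container size (heap + overhead) is what gets the container killed.
    return heap_used_mb + offheap_used_mb > executor_memory_mb + overhead_mb

print(default_yarn_memory_overhead_mb(47924))  # 4792
# A completely full heap alone never triggers the kill...
print(container_killed(47924, 0, 47924, 5324))      # False
# ...but a few extra GB of off-heap allocation can:
print(container_killed(47924, 6000, 47924, 5324))   # True
```

That is why bumping the overhead, not the heap, is the fix here.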

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
I repartitioned the input RDD from 4,800 to 24,000 partitions. After that, the stage (24,000 tasks) was done in 22 min on 100 boxes.
Shuffle read/write: 905 GB / 710 GB
Task metrics (Dur/GC/Read/Write):
Min: 7s/1s/38MB/30MB
Med: 22s/9s/38MB/30MB
Max: 1.8min/1.6min/38MB/30MB
On Mon, Sep 21, 2015 at 5:55
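The reported per-task sizes are consistent with the totals: dividing the shuffle volume evenly across 24,000 tasks reproduces the 38 MB / 30 MB figures (the helper `per_task_mb` is just illustrative arithmetic):

```python
def per_task_mb(total_gb, num_tasks):
    # Average shuffle volume per task, in MB (1 GB = 1024 MB).
    return int(total_gb * 1024 / num_tasks)

# 905 GB read and 710 GB written across 24,000 tasks matches the
# reported ~38 MB read / ~30 MB write per task:
print(per_task_mb(905, 24000))  # 38
print(per_task_mb(710, 24000))  # 30
```

The identical Min/Med/Max read/write sizes also suggest the repartition evened out the skew.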

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
I noticed that some executors have an issue with scratch space. I see the following in the YARN app container stderr around the time when YARN killed the executor because it used too much memory.
-- App container stderr --
15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache rdd_6_346

Re: Spark on Yarn vs Standalone

2015-09-10 Thread Sandy Ryza
YARN will never kill processes for being unresponsive. It may kill processes for occupying more memory than it allows. To get around this, you can either bump spark.yarn.executor.memoryOverhead or turn off the memory checks entirely with yarn.nodemanager.pmem-check-enabled. -Sandy On Tue, Sep
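The second remedy lives on the YARN side rather than in Spark. A minimal yarn-site.xml fragment for it might look like the following (note that disabling the check trades container-level safety for stability, so the overhead bump is usually the gentler option):

```xml
<!-- yarn-site.xml: disable the physical-memory check so the NodeManager
     no longer kills containers that exceed their requested memory. -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
```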

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Sandy Ryza
Those settings seem reasonable to me. Are you observing performance that's worse than you would expect? -Sandy On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivovarov wrote: > Hi Sandy > > Thank you for your reply > Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB) >

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Alexander Pivovarov
The problem we have now is skewed data (2,360 tasks done in 5 min, 3 tasks in 40 min, and 1 task in 2 hours). Some people on the team worry that the executor which runs the longest task can be killed by YARN (because the executor might be unresponsive because of GC, or it might occupy more memory
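The numbers in the message show why the straggler dominates: with far more executor slots than long tasks, the stage's wall-clock time is pinned to the slowest task, not the average. A quick illustration:

```python
# Task durations from the message: 2,360 tasks at ~5 min, 3 at ~40 min,
# and 1 straggler at ~2 hours.
durations_min = [5] * 2360 + [40] * 3 + [120]

# With hundreds of slots, the short tasks all finish early, so the
# stage cannot complete before its longest task does:
print(max(durations_min))  # 120
# ...even though the average task takes only ~5 minutes:
print(round(sum(durations_min) / len(durations_min), 1))  # 5.1
```

That 24x gap between average and maximum is the skew being described.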

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Sandy Ryza
Hi Alex, If they're both configured correctly, there's no reason that Spark Standalone should provide performance or memory improvement over Spark on YARN. -Sandy On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov wrote: > Hi Everyone > > We are trying the latest aws

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Alexander Pivovarov
Hi Sandy
Thank you for your reply. Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB) with the EMR setting for Spark "maximizeResourceAllocation": "true". It is automatically converted to the Spark settings:
spark.executor.memory 47924M
spark.yarn.executor.memoryOverhead 5324
we also set
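Those two settings together determine the YARN container size. Checking the arithmetic on a 61 GiB box:

```python
# Sizing implied by maximizeResourceAllocation on an r3.2xlarge (61 GiB):
executor_memory_mb = 47924
overhead_mb = 5324
container_mb = executor_memory_mb + overhead_mb
print(container_mb)          # 53248
print(container_mb / 1024)   # 52.0 -> a 52 GiB container on a 61 GiB box
```

That leaves roughly 9 GiB of headroom for the OS, the NodeManager, and other daemons on each node.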

Spark on Yarn vs Standalone

2015-09-04 Thread Alexander Pivovarov
Hi Everyone
We are trying the latest aws emr-4.0.0 and Spark, and my question is about YARN vs Standalone mode. Our use case is:
- start a 100-150 node cluster every week
- run one heavy Spark job (5-6 hours)
- save data to s3
- stop the cluster
Officially aws emr-4.0.0 comes with Spark on Yarn. It's