Re: Spark Performance on Yarn

2015-04-22 Thread Ted Yu
In the master branch, the overhead is now 10%.
That would be 500 MB.

FYI
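
(As a rough sketch of that arithmetic, assuming a 5g executor and the usual
"max(percentage, 384 MB)" form of the default; the exact rounding is an
implementation detail:

    overhead  = max(0.10 * 5120 MB, 384 MB) = 512 MB, i.e. roughly 500 MB
    container = 5120 MB + 512 MB = 5632 MB requested from YARN)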



 On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote:
 
 +1 to setting executor-memory to 5g.
 Do check the overhead space for both the driver and the executor as per
 Wilfred's suggestion.
 
 Typically, 384 MB should suffice.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 




Re: Spark Performance on Yarn

2015-04-22 Thread nsalian
+1 to setting executor-memory to 5g.
Do check the overhead space for both the driver and the executor as per
Wilfred's suggestion.

Typically, 384 MB should suffice.
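
(If you do need to set them explicitly, here is a sketch of the relevant
YARN-mode settings; the values below are placeholders, not recommendations:

    spark-submit ... \
      --conf spark.yarn.executor.memoryOverhead=512 \
      --conf spark.yarn.driver.memoryOverhead=384 \
      ...)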



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Performance on Yarn

2015-04-22 Thread Neelesh Salian
Does it still hit the memory limit for the container?

Is there an expensive transformation involved?

On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu yuzhih...@gmail.com wrote:

 In the master branch, the overhead is now 10%.
 That would be 500 MB.

 FYI



  On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote:
 
  +1 to setting executor-memory to 5g.
  Do check the overhead space for both the driver and the executor as per
  Wilfred's suggestion.
 
  Typically, 384 MB should suffice.
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 



Re: Spark Performance on Yarn

2015-04-21 Thread hnahak
Try --executor-memory 5g, because you have 8 GB of RAM in each machine.
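
(For example, reusing the class and jar names from the original post; a
sketch only:

    spark-submit --class com.myco.SparkJob \
      --master yarn \
      --executor-memory 5g \
      /tmp/sparkjob.jar)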



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22603.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Performance on Yarn

2015-04-20 Thread Peng Cheng
I have exactly the same problem, except that I'm running on a standalone
master. Can you tell me the counterpart parameter on a standalone master for
increasing the same memory overhead?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22580.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Performance on Yarn

2015-02-23 Thread Lee Bierman
Thanks for the suggestions.

I removed the persist call from the program. Having done so, I started it with:

spark-submit --class com.xxx.analytics.spark.AnalyticsJob --master yarn
/tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/


This takes all the defaults and only runs 2 executors. It runs with no
failures but takes 17 hours.


After this I tried to run it with

spark-submit --class com.extole.analytics.spark.AnalyticsJob
--num-executors 5 --executor-cores 2 --master yarn /tmp/analytics.jar
--input_directory
hdfs://ip-10-142-198-50.ec2.internal:8020/flume/events/2015/02/

This results in lots of executor failures and restarts. I can't seem to get
any kind of parallelism or throughput. The next thing to try will be setting
the YARN memory overhead.
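
Something along these lines, as a sketch; the 1024 MB overhead is just a
starting guess, not a tested value:

    spark-submit --class com.extole.analytics.spark.AnalyticsJob \
      --master yarn \
      --num-executors 5 --executor-cores 2 \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      /tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/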


What other configs should I list to help figure out the sweet spot here?




On Sat, Feb 21, 2015 at 12:29 AM, Davies Liu dav...@databricks.com wrote:

 How many executors do you have per machine? It would be helpful if you
 could list all the configs.

 Could you also try to run it without persist? Caching can hurt more than
 it helps if you don't have enough memory.

 On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote:
  Thanks for the suggestions.
  I'm experimenting with different values for the Spark memoryOverhead and
  explicitly giving the executors more memory, but I still have not found the
  happy medium that gets it to finish in a reasonable time frame.
 
  Is my cluster massively undersized at 5 boxes with 8 GB and 2 CPUs each?
  Trying to figure out a memory setting and executor setting so it runs on
  many containers in parallel.
 
  I'm still struggling, as Pig and Hive jobs on the same whole data set
  don't take as long. I'm also wondering if the logic in our code is just
  doing something silly that causes multiple reads of all the data.
 
 
  On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  If that's the error you're hitting, the fix is to boost
  spark.yarn.executor.memoryOverhead, which will put some extra room in
  between the executor heap sizes and the amount of memory requested for
 them
  from YARN.
 
  -Sandy
 
  On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:
 
  A bit more context on this issue. From the container logs on the
 executor
 
  Given my cluster specs above what would be appropriate parameters to
 pass
  into :
  --num-executors --num-cores --executor-memory
 
  I had tried it with --executor-memory 2500MB
 
  015-02-20 06:50:09,056 WARN
 
 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
  Container
 [pid=23320,containerID=container_1423083596644_0238_01_004160]
  is
  running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
  physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
  container.
  Dump of the process-tree for container_1423083596644_0238_01_004160 :
  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
  SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
  |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash
 -c
  /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
  -Xms2400m
  -Xmx2400m
 
 
 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp
 
 
 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
  org.apache.spark.executor.CoarseGrainedExecutorBackend
 
  akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
  8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1
 
 
 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
  2
 
 
 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
  |- 23323 23320 23320 23320 (java) 922271 12263 461976
 724218
  /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
  -Xms2400m
  -Xmx2400m
 
 
 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp
 
 
 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
  org.apache.spark.executor.CoarseGrainedExecutorBackend
  akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
 
 
 



Re: Spark Performance on Yarn

2015-02-21 Thread Davies Liu
How many executors do you have per machine? It would be helpful if you
could list all the configs.

Could you also try to run it without persist? Caching can hurt more than
it helps if you don't have enough memory.
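
(Concretely, against the job code from the original post, that would be a
change like this sketch: either drop the persist or downgrade it to a
cheaper, non-replicated level.

    // was: events.persist(StorageLevel.MEMORY_AND_DISK_2());
    events.persist(StorageLevel.MEMORY_AND_DISK());
    // or remove the persist call entirely and let Spark recompute the RDD)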

On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote:
 Thanks for the suggestions.
 I'm experimenting with different values for the Spark memoryOverhead and
 explicitly giving the executors more memory, but I still have not found the
 happy medium that gets it to finish in a reasonable time frame.

 Is my cluster massively undersized at 5 boxes with 8 GB and 2 CPUs each?
 Trying to figure out a memory setting and executor setting so it runs on
 many containers in parallel.

 I'm still struggling, as Pig and Hive jobs on the same whole data set
 don't take as long. I'm also wondering if the logic in our code is just
 doing something silly that causes multiple reads of all the data.


 On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160]
 is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend

 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.








Re: Spark Performance on Yarn

2015-02-20 Thread Sean Owen
None of this really points to the problem. These indicate that workers
died but not why. I'd first go locate executor logs that reveal more
about what's happening. It sounds like a hard-er type of failure, like
JVM crash or running out of file handles, or GC thrashing.

On Fri, Feb 20, 2015 at 4:51 AM, lbierman leebier...@gmail.com wrote:
 I'm a bit new to Spark, but I had a question on performance. I suspect a lot of
 my issue is due to tuning and parameters. I have a Hive external table on
 this data, and running queries against it takes only minutes.

 The Job:
 + 40gb of avro events on HDFS (100 million+ avro events)
 + Read in the files from HDFS and dedupe events by key (mapToPair then a
 reduceByKey)
 + RDD returned and persisted (disk and memory)
 + Then passed to a job that takes the RDD, does a mapToPair to new object data,
 then a reduceByKey, and a foreachPartition to do the work

 The issue:
 When I run this in my environment on YARN, it takes 20+ hours. Running on
 YARN, we see the first stage run to build the deduped RDD, but when the next
 stage starts, things fail and data is lost. This results in stage 0 starting
 over and over and just dragging the job out.

 Errors I see in the driver logs:
 ERROR cluster.YarnClientClusterScheduler: Lost executor 1 on X: remote
 Akka client disassociated

 15/02/20 00:27:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.1
 (TID 1335,): FetchFailed(BlockManagerId(3, i, 33958), shuffleId=1,
 mapId=162, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Failed to connect
 toX/X:33958

 Also we see this, but I'm suspecting this is because the previous stage
 fails and the next one starts:
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
 location for shuffle 1

 Cluster:
 5 machines, each with 2 cores and 8 GB of RAM

 Spark-submit command:
  spark-submit --class com.myco.SparkJob \
 --master yarn \
 /tmp/sparkjob.jar \

 Any thoughts on where to look, how to start approaching this problem, or
 what more data points to present?

 Thanks..

 Code for the job:
 JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
     context.newAPIHadoopRDD(
         context.hadoopConfiguration(),
         AvroKeyInputFormat.class,
         AvroKey.class,
         NullWritable.class
     ).keys())
     .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
     .filter(key -> { return
         Optional.ofNullable(key.getStepEventKey()).isPresent(); })
     .mapToPair(event -> new Tuple2<AnalyticsEvent, Integer>(event, 1))
     .reduceByKey((analyticsEvent1, analyticsEvent2) -> analyticsEvent1)
     .map(tuple -> tuple._1());

 events.persist(StorageLevel.MEMORY_AND_DISK_2());
 events.mapToPair(event -> {
     return new Tuple2<T, RunningAggregates>(
         keySelector.select(event),
         new RunningAggregates(
             Optional.ofNullable(event.getVisitors()).orElse(0L),
             Optional.ofNullable(event.getImpressions()).orElse(0L),
             Optional.ofNullable(event.getAmount()).orElse(0.0D),
             Optional.ofNullable(event.getAmountSumOfSquares()).orElse(0.0D)));
     })
     .reduceByKey((left, right) -> { return left.add(right); })
     .foreachPartition(dostuff);






 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy,

I appreciate your clear explanation. Let me try again. It's the best way to
confirm I understand.

spark.executor.memory + spark.yarn.executor.memoryOverhead = the size of the
container that YARN allocates for the executor JVM

spark.executor.memory = the memory I can actually use in my JVM application
= part of it (spark.storage.memoryFraction) is reserved for caching + part
of it (spark.shuffle.memoryFraction) is reserved for shuffling + the
remainder is for bookkeeping & UDFs

If I am correct above, then one implication from them is:

(spark.executor.memory + spark.yarn.executor.memoryOverhead) * number of
executors per machine should be configured to be smaller than a single
machine's physical memory
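
(Sanity-checking that with the numbers from this thread, assuming one
executor per 8 GB node and roughly a 10% overhead, and ignoring what the OS
and the YARN NodeManager themselves need:

    5120 MB  spark.executor.memory
  +  512 MB  spark.yarn.executor.memoryOverhead
  = 5632 MB  per executor container, so only one such executor fits on an 8 GB box)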

Right? Again, thanks!

Kelvin

On Fri, Feb 20, 2015 at 11:50 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:

 Hi Kelvin,

 spark.executor.memory controls the size of the executor heaps.

 spark.yarn.executor.memoryOverhead is the amount of memory to request from
 YARN beyond the heap size.  This accounts for the fact that JVMs use some
 non-heap memory.

 The Spark heap is divided into spark.storage.memoryFraction (default 0.6)
 and spark.shuffle.memoryFraction (default 0.2), and the rest is for basic
 Spark bookkeeping and anything the user does inside UDFs.

 -Sandy



 On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu 2dot7kel...@gmail.com
 wrote:

 Hi Sandy,

 I am also doing memory tuning on YARN. Just want to confirm, is it
 correct to say:

 spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory
 I can actually use in my jvm application

 If it is not, what is the correct relationship? Any other variables or
 config parameters in play? Thanks.

 Kelvin

 On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the
 executor

 Given my cluster specs above what would be appropriate parameters to
 pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container
 [pid=23320,containerID=container_1423083596644_0238_01_004160] is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.








Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
That's all correct.

-Sandy

On Fri, Feb 20, 2015 at 1:23 PM, Kelvin Chu 2dot7kel...@gmail.com wrote:

 Hi Sandy,

 I appreciate your clear explanation. Let me try again. It's the best way
 to confirm I understand.

 spark.executor.memory + spark.yarn.executor.memoryOverhead = the size of the
 container that YARN allocates for the executor JVM

 spark.executor.memory = the memory I can actually use in my JVM
 application = part of it (spark.storage.memoryFraction) is reserved for
 caching + part of it (spark.shuffle.memoryFraction) is reserved for
 shuffling + the remainder is for bookkeeping & UDFs

 If I am correct above, then one implication from them is:

 (spark.executor.memory + spark.yarn.executor.memoryOverhead) * number of
 executors per machine should be configured to be smaller than a single
 machine's physical memory

 Right? Again, thanks!

 Kelvin

 On Fri, Feb 20, 2015 at 11:50 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Kelvin,

 spark.executor.memory controls the size of the executor heaps.

 spark.yarn.executor.memoryOverhead is the amount of memory to request
 from YARN beyond the heap size.  This accounts for the fact that JVMs use
 some non-heap memory.

 The Spark heap is divided into spark.storage.memoryFraction (default 0.6)
 and spark.shuffle.memoryFraction (default 0.2), and the rest is for basic
 Spark bookkeeping and anything the user does inside UDFs.

 -Sandy



 On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu 2dot7kel...@gmail.com
 wrote:

 Hi Sandy,

 I am also doing memory tuning on YARN. Just want to confirm, is it
 correct to say:

 spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory
 I can actually use in my jvm application

 If it is not, what is the correct relationship? Any other variables or
 config parameters in play? Thanks.

 Kelvin

 On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the
 executor

 Given my cluster specs above what would be appropriate parameters to
 pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container
 [pid=23320,containerID=container_1423083596644_0238_01_004160] is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash
 -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976
 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.









Re: Spark Performance on Yarn

2015-02-20 Thread Lee Bierman
Thanks for the suggestions.
I'm experimenting with different values for the Spark memoryOverhead and
explicitly giving the executors more memory, but I still have not found the
happy medium that gets it to finish in a reasonable time frame.

Is my cluster massively undersized at 5 boxes with 8 GB and 2 CPUs each?
Trying to figure out a memory setting and executor setting so it runs on
many containers in parallel.
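
One rough starting point for this 5-node, 8 GB / 2-core cluster, purely as a
sketch; the 4g plus 512m split is a guess that leaves room for the OS and the
NodeManager rather than a tested setting:

    spark-submit --class com.myco.SparkJob \
      --master yarn \
      --num-executors 5 --executor-cores 2 --executor-memory 4g \
      --conf spark.yarn.executor.memoryOverhead=512 \
      /tmp/sparkjob.jar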

I'm still struggling, as Pig and Hive jobs on the same whole data set
don't take as long. I'm also wondering if the logic in our code is just
doing something silly that causes multiple reads of all the data.


On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160]
 is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Spark Performance on Yarn

2015-02-20 Thread lbierman
A bit more context on this issue, from the container logs on the executor:

Given my cluster specs above, what would be appropriate parameters to pass
in for:
--num-executors --executor-cores --executor-memory

I had tried it with --executor-memory 2500MB

2015-02-20 06:50:09,056 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Container [pid=23320,containerID=container_1423083596644_0238_01_004160] is
running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
container.
Dump of the process-tree for container_1423083596644_0238_01_004160 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms2400m
-Xmx2400m 
-Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/CoarseGrainedScheduler
8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1
/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
2
/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
|- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m
-Xmx2400m
-Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Are you specifying the executor memory, cores, or number of executors
anywhere?  If not, you won't be taking advantage of the full resources on
the cluster.
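
(For example, something like the following on the spark-submit command line;
a sketch only, and the numbers would need tuning for the 8 GB / 2-core nodes
described below:

    --num-executors 5 --executor-cores 2 --executor-memory 4g)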

-Sandy

On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote:

 None of this really points to the problem. These indicate that workers
 died but not why. I'd first go locate executor logs that reveal more
 about what's happening. It sounds like a hard-er type of failure, like
 JVM crash or running out of file handles, or GC thrashing.

 On Fri, Feb 20, 2015 at 4:51 AM, lbierman leebier...@gmail.com wrote:
  I'm a bit new to Spark, but I had a question on performance. I suspect a lot
  of my issue is due to tuning and parameters. I have a Hive external table on
  this data, and running queries against it takes only minutes.
 
  The Job:
  + 40gb of avro events on HDFS (100 million+ avro events)
  + Read in the files from HDFS and dedupe events by key (mapToPair then a
  reduceByKey)
  + RDD returned and persisted (disk and memory)
  + Then passed to a job that takes the RDD, does a mapToPair to new object
  data, then a reduceByKey, and a foreachPartition to do the work
 
  The issue:
  When I run this in my environment on YARN, it takes 20+ hours. Running on
  YARN, we see the first stage run to build the deduped RDD, but when the next
  stage starts, things fail and data is lost. This results in stage 0 starting
  over and over and just dragging the job out.
 
  Errors I see in the driver logs:
  ERROR cluster.YarnClientClusterScheduler: Lost executor 1 on X:
 remote
  Akka client disassociated
 
  15/02/20 00:27:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
 3.1
  (TID 1335,): FetchFailed(BlockManagerId(3, i, 33958),
 shuffleId=1,
  mapId=162, reduceId=0, message=
  org.apache.spark.shuffle.FetchFailedException: Failed to connect
  toX/X:33958
 
  Also we see this, but I'm suspecting this is because the previous stage
  fails and the next one starts:
  org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
  location for shuffle 1
 
  Cluster:
  5 machines, each with 2 cores and 8 GB of RAM
 
  Spark-submit command:
   spark-submit --class com.myco.SparkJob \
  --master yarn \
  /tmp/sparkjob.jar \
 
  Any thoughts on where to look, how to start approaching this problem, or
  what more data points to present?
 
  Thanks..
 
  Code for the job:
  JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
      context.newAPIHadoopRDD(
          context.hadoopConfiguration(),
          AvroKeyInputFormat.class,
          AvroKey.class,
          NullWritable.class
      ).keys())
      .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
      .filter(key -> { return
          Optional.ofNullable(key.getStepEventKey()).isPresent(); })
      .mapToPair(event -> new Tuple2<AnalyticsEvent, Integer>(event, 1))
      .reduceByKey((analyticsEvent1, analyticsEvent2) -> analyticsEvent1)
      .map(tuple -> tuple._1());

  events.persist(StorageLevel.MEMORY_AND_DISK_2());
  events.mapToPair(event -> {
      return new Tuple2<T, RunningAggregates>(
          keySelector.select(event),
          new RunningAggregates(
              Optional.ofNullable(event.getVisitors()).orElse(0L),
              Optional.ofNullable(event.getImpressions()).orElse(0L),
              Optional.ofNullable(event.getAmount()).orElse(0.0D),
              Optional.ofNullable(event.getAmountSumOfSquares()).orElse(0.0D)));
      })
      .reduceByKey((left, right) -> { return left.add(right); })
      .foreachPartition(dostuff);
 
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 





Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
If that's the error you're hitting, the fix is to boost
spark.yarn.executor.memoryOverhead, which will put some extra room in
between the executor heap sizes and the amount of memory requested for them
from YARN.
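
(It can be set straight from the command line; a sketch, with an arbitrary
1 GB value:

    spark-submit ... --conf spark.yarn.executor.memoryOverhead=1024 ...)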

-Sandy

On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160] is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy,

I am also doing memory tuning on YARN. Just to confirm, is it correct
to say:

spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I
can actually use in my JVM application

If it is not, what is the correct relationship? Any other variables or
config parameters in play? Thanks.

Kelvin

On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160]
 is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Hi Kelvin,

spark.executor.memory controls the size of the executor heaps.

spark.yarn.executor.memoryOverhead is the amount of memory to request from
YARN beyond the heap size.  This accounts for the fact that JVMs use some
non-heap memory.

The Spark heap is divided into spark.storage.memoryFraction (default 0.6)
and spark.shuffle.memoryFraction (default 0.2), and the rest is for basic
Spark bookkeeping and anything the user does inside UDFs.
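
(So, for a hypothetical 5g executor heap, the rough split with those defaults
would be, ignoring the internal safety fractions Spark also applies:

    storage (caching):    ~0.6 * 5g = ~3g
    shuffle:              ~0.2 * 5g = ~1g
    bookkeeping + UDFs:   the remaining ~1g)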

-Sandy



On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu 2dot7kel...@gmail.com wrote:

 Hi Sandy,

 I am also doing memory tuning on YARN. Just want to confirm, is it correct
 to say:

 spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I
 can actually use in my jvm application

 If it is not, what is the correct relationship? Any other variables or
 config parameters in play? Thanks.

 Kelvin

 On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160]
 is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Spark Performance on Yarn

2015-02-19 Thread lbierman
I'm a bit new to Spark, but I had a question on performance. I suspect a lot of
my issue is due to tuning and parameters. I have a Hive external table on
this data, and running queries against it takes only minutes.

The Job:
+ 40gb of avro events on HDFS (100 million+ avro events)
+ Read in the files from HDFS and dedupe events by key (mapToPair then a
reduceByKey)
+ RDD returned and persisted (disk and memory)
+ Then passed to a job that takes the RDD, does a mapToPair to new object data,
then a reduceByKey, and a foreachPartition to do the work

The issue:
When I run this in my environment on YARN, it takes 20+ hours. Running on
YARN, we see the first stage run to build the deduped RDD, but when the next
stage starts, things fail and data is lost. This results in stage 0 starting
over and over and just dragging the job out.

Errors I see in the driver logs:
ERROR cluster.YarnClientClusterScheduler: Lost executor 1 on X: remote
Akka client disassociated

15/02/20 00:27:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.1
(TID 1335,): FetchFailed(BlockManagerId(3, i, 33958), shuffleId=1,
mapId=162, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect
toX/X:33958

Also we see this, but I'm suspecting this is because the previous stage
fails and the next one starts:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 1

Cluster:
5 machines, each with 2 cores and 8 GB of RAM

Spark-submit command:
 spark-submit --class com.myco.SparkJob \
--master yarn \
/tmp/sparkjob.jar \

Any thoughts on where to look, how to start approaching this problem, or
what more data points to present?

Thanks..

Code for the job:
JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
    context.newAPIHadoopRDD(
        context.hadoopConfiguration(),
        AvroKeyInputFormat.class,
        AvroKey.class,
        NullWritable.class
    ).keys())
    .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
    .filter(key -> { return
        Optional.ofNullable(key.getStepEventKey()).isPresent(); })
    .mapToPair(event -> new Tuple2<AnalyticsEvent, Integer>(event, 1))
    .reduceByKey((analyticsEvent1, analyticsEvent2) -> analyticsEvent1)
    .map(tuple -> tuple._1());

events.persist(StorageLevel.MEMORY_AND_DISK_2());
events.mapToPair(event -> {
    return new Tuple2<T, RunningAggregates>(
        keySelector.select(event),
        new RunningAggregates(
            Optional.ofNullable(event.getVisitors()).orElse(0L),
            Optional.ofNullable(event.getImpressions()).orElse(0L),
            Optional.ofNullable(event.getAmount()).orElse(0.0D),
            Optional.ofNullable(event.getAmountSumOfSquares()).orElse(0.0D)));
    })
    .reduceByKey((left, right) -> { return left.add(right); })
    .foreachPartition(dostuff);






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
