Re: Spark Performance on Yarn
In master branch, overhead is now 10%. That would be 500 MB, FYI.

On Apr 22, 2015, at 8:26 AM, nsalian <neeleshssal...@gmail.com> wrote:
> +1 to setting executor-memory to 5g. Do check the overhead space for both the driver and the executor, as per Wilfred's suggestion. Typically, 384 MB should suffice.
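(For illustration, the arithmetic behind that figure, assuming the 5g executor suggested in this thread: 10% of 5120 MB is 512 MB, i.e. roughly 500 MB, and the overhead is never allowed to fall below the 384 MB minimum.)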
Re: Spark Performance on Yarn
+1 to setting executor-memory to 5g. Do check the overhead space for both the driver and the executor, as per Wilfred's suggestion. Typically, 384 MB should suffice.
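For example, in spark-defaults.conf form (a sketch; the 512 MB values are illustrative starting points, not numbers from this thread):

spark.executor.memory               5g
spark.yarn.executor.memoryOverhead  512
spark.yarn.driver.memoryOverhead    512

The same settings can also be passed to spark-submit with --conf.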
Re: Spark Performance on Yarn
Does it still hit the memory limit for the container? Is there an expensive transformation?

On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> In master branch, overhead is now 10%. That would be 500 MB, FYI.
Re: Spark Performance on Yarn
Try --executor-memory 5g, because you have 8 GB of RAM on each machine.
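As a rough sizing check (assuming the roughly 10% overhead discussed earlier in the thread): a 5g heap plus about 512 MB of overhead means a YARN container request of roughly 5.5 GB per executor, leaving around 2.5 GB of each 8 GB machine for the OS and Hadoop daemons.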
Re: Spark Performance on Yarn
I got exactly the same problem, except that I'm running on a standalone master. Can you tell me the counterpart parameter on a standalone master for increasing the same memory overhead?
Re: Spark Performance on Yarn
Thanks for the suggestions. I removed the persist call from the program and started it with:

spark-submit --class com.xxx.analytics.spark.AnalyticsJob --master yarn /tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/

This takes all the defaults and only runs 2 executors. It runs with no failures but takes 17 hours. After this I tried to run it with:

spark-submit --class com.extole.analytics.spark.AnalyticsJob --num-executors 5 --executor-cores 2 --master yarn /tmp/analytics.jar --input_directory hdfs://ip-10-142-198-50.ec2.internal:8020/flume/events/2015/02/

This results in lots of executor failures and restarts. I can't seem to get any kind of parallelism or throughput. The next try will be to set the YARN memory overhead. What other configs should I list to help figure out the sweet spot here?

On Sat, Feb 21, 2015 at 12:29 AM, Davies Liu <dav...@databricks.com> wrote:
> How many executors do you have per machine? It would be helpful if you could list all the configs. Could you also try to run it without persist? Caching can hurt more than help if you don't have enough memory.
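One possible starting point for a 5-node cluster of 2-core/8 GB machines, as a sketch only (the memory numbers are assumptions to be tuned, not a recommendation from this thread):

spark-submit --class com.extole.analytics.spark.AnalyticsJob \
  --master yarn \
  --num-executors 5 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=512 \
  /tmp/analytics.jar \
  --input_directory hdfs://ip-10-142-198-50.ec2.internal:8020/flume/events/2015/02/

The idea is that one executor per node at a 4g heap plus 512 MB of overhead stays well under the 8 GB physical limit while using both cores.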
Re: Spark Performance on Yarn
How many executors do you have per machine? It would be helpful if you could list all the configs. Could you also try to run it without persist? Caching can hurt more than help if you don't have enough memory.

On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman <leebier...@gmail.com> wrote:
> Thanks for the suggestions. I'm experimenting with different values for spark.yarn.executor.memoryOverhead and explicitly giving the executors more memory, but still have not found the sweet spot that gets it to finish in a reasonable time frame. ...
Re: Spark Performance on Yarn
None of this really points to the problem. These indicate that workers died, but not why. I'd first go locate the executor logs, which should reveal more about what's happening. It sounds like a harder type of failure, like a JVM crash, running out of file handles, or GC thrashing.

On Fri, Feb 20, 2015 at 4:51 AM, lbierman <leebier...@gmail.com> wrote:
> I'm a bit new to Spark, but had a question on performance. I suspect a lot of my issue is due to tuning and parameters. ...
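For example, with YARN log aggregation enabled, the executor logs can be pulled after the application finishes with the standard YARN CLI (the application ID here is the one that appears in the container logs later in this thread):

yarn logs -applicationId application_1423083596644_0238

Without aggregation, the same stdout/stderr files live in the NodeManager's container log directory on each worker node.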
Re: Spark Performance on Yarn
Hi Sandy,

I appreciate your clear explanation. Let me try again; it's the best way to confirm I understand.

spark.executor.memory + spark.yarn.executor.memoryOverhead = the memory for the JVM that YARN will create

spark.executor.memory = the memory I can actually use in my JVM application
  = part of it (spark.storage.memoryFraction) is reserved for caching
  + part of it (spark.shuffle.memoryFraction) is reserved for shuffling
  + the remainder is for bookkeeping and UDFs

If I am correct above, then one implication is:

(spark.executor.memory + spark.yarn.executor.memoryOverhead) * (number of executors per machine) should be configured to be smaller than a single machine's physical memory

Right? Again, thanks!

Kelvin

On Fri, Feb 20, 2015 at 11:50 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> Hi Kelvin, spark.executor.memory controls the size of the executor heaps. spark.yarn.executor.memoryOverhead is the amount of memory to request from YARN beyond the heap size. ...
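A quick worked check of that implication, using the 8 GB machines from this thread (the executor size is an assumption for illustration): one executor with spark.executor.memory=5g plus roughly 512 MB of overhead asks YARN for about 5.5 GB, which fits on an 8 GB node; two such executors would ask for about 11 GB and would not.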
Re: Spark Performance on Yarn
That's all correct.

-Sandy

On Fri, Feb 20, 2015 at 1:23 PM, Kelvin Chu <2dot7kel...@gmail.com> wrote:
> Hi Sandy, I appreciate your clear explanation. Let me try again. It's the best way to confirm I understand. ...
Re: Spark Performance on Yarn
Thanks for the suggestions. I'm experimenting with different values for spark.yarn.executor.memoryOverhead and explicitly giving the executors more memory, but still have not found the sweet spot that gets it to finish in a reasonable time frame. Is my cluster massively undersized at 5 boxes with 8 GB and 2 CPUs each? I'm trying to figure out memory and executor settings so it runs in many containers in parallel. I'm still struggling, as Pig jobs and Hive jobs on the same whole data set don't take as long. I'm also wondering if the logic in our code is just doing something silly that causes multiple reads of all the data.

On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN.
Re: Spark Performance on Yarn
A bit more context on this issue, from the container logs on the executor. Given my cluster specs above, what would be appropriate parameters to pass in for --num-executors, --num-cores, and --executor-memory? I had tried it with --executor-memory 2500MB.

2015-02-20 06:50:09,056 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=23320,containerID=container_1423083596644_0238_01_004160] is running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing container.
Dump of the process-tree for container_1423083596644_0238_01_004160:
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms2400m -Xmx2400m -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/CoarseGrainedScheduler 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout 2 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
|- 23323 23320 23320 23320 (java) 922271 12263 461976 724218 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m -Xmx2400m -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse
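(A reading of that log, not something stated explicitly in the thread: the container runs with -Xmx2400m, and 2400 MB of heap plus the then-default 384 MB of spark.yarn.executor.memoryOverhead comes to about 2.7 GB, which is exactly the physical memory limit YARN enforced when it killed the container at 2.8 GB of usage.)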
Re: Spark Performance on Yarn
Are you specifying the executor memory, cores, or number of executors anywhere? If not, you won't be taking advantage of the full resources on the cluster.

-Sandy

On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen <so...@cloudera.com> wrote:
> None of this really points to the problem. These indicate that workers died but not why. I'd first go locate executor logs that reveal more about what's happening. ...
Re: Spark Performance on Yarn
If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN.

-Sandy

On Fri, Feb 20, 2015 at 9:40 AM, lbierman <leebier...@gmail.com> wrote:
> A bit more context on this issue. From the container logs on the executor ...
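Concretely, the setting can be passed straight through spark-submit; reusing the command from the original post (the 768 MB value is an illustrative guess, not a tested number):

spark-submit --class com.myco.SparkJob \
  --master yarn \
  --conf spark.yarn.executor.memoryOverhead=768 \
  /tmp/sparkjob.jar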
Re: Spark Performance on Yarn
Hi Sandy,

I am also doing memory tuning on YARN. Just want to confirm: is it correct to say

spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I can actually use in my JVM application

If it is not, what is the correct relationship? Are any other variables or config parameters in play? Thanks.

Kelvin

On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN.
Re: Spark Performance on Yarn
Hi Kelvin,

spark.executor.memory controls the size of the executor heaps. spark.yarn.executor.memoryOverhead is the amount of memory to request from YARN beyond the heap size; this accounts for the fact that JVMs use some non-heap memory. The Spark heap is divided into spark.storage.memoryFraction (default 0.6) and spark.shuffle.memoryFraction (default 0.2), and the rest is for basic Spark bookkeeping and anything the user does inside UDFs.

-Sandy

On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu <2dot7kel...@gmail.com> wrote:
> Hi Sandy, I am also doing memory tuning on YARN. Just want to confirm, is it correct to say: spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I can actually use in my jvm application? ...
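To put rough numbers on those defaults, using the 2400 MB executor heap from the container logs in this thread (and ignoring the safety fractions Spark also applies): spark.storage.memoryFraction 0.6 reserves about 1440 MB for caching, spark.shuffle.memoryFraction 0.2 reserves about 480 MB for shuffle, and the remaining ~480 MB is left for Spark bookkeeping and user code in UDFs.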
Spark Performance on Yarn
I'm a bit new to Spark, but had a question on performance. I suspect a lot of my issue is due to tuning and parameters. I have a Hive external table on this data, and queries against it run in minutes.

The job:
+ 40 GB of Avro events on HDFS (100 million+ Avro events)
+ Read in the files from HDFS and dedupe events by key (mapToPair, then a reduceByKey)
+ The RDD is returned and persisted (disk and memory)
+ It is then passed to a job that takes the RDD, does a mapToPair to new object data, then a reduceByKey, and a foreachPartition to do work

The issue: when I run this on my environment on YARN, it takes 20+ hours. Running on YARN we see the first stage run to build the deduped RDD, but then when the next stage starts, things fail and data is lost. This results in stage 0 starting over and over and just dragging it out.

Errors I see in the driver logs:

ERROR cluster.YarnClientClusterScheduler: Lost executor 1 on X: remote Akka client disassociated
15/02/20 00:27:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.1 (TID 1335,): FetchFailed(BlockManagerId(3, i, 33958), shuffleId=1, mapId=162, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to X/X:33958

Also we see this, but I suspect it's because the previous stage fails and the next one starts:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

Cluster: 5 machines, each with 2 cores and 8 GB of memory.

Spark-submit command:

spark-submit --class com.myco.SparkJob \
  --master yarn \
  /tmp/sparkjob.jar \

Any thoughts on where to look, how to start approaching this problem, or what other data points to present? Thanks.

Code for the job:

JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>) context.newAPIHadoopRDD(
        context.hadoopConfiguration(),
        AvroKeyInputFormat.class,
        AvroKey.class,
        NullWritable.class
    ).keys())
    .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
    .filter(key -> { return Optional.ofNullable(key.getStepEventKey()).isPresent(); })
    .mapToPair(event -> new Tuple2<AnalyticsEvent, Integer>(event, 1))
    .reduceByKey((analyticsEvent1, analyticsEvent2) -> analyticsEvent1)
    .map(tuple -> tuple._1());

events.persist(StorageLevel.MEMORY_AND_DISK_2());

events.mapToPair(event -> {
        return new Tuple2<T, RunningAggregates>(
            keySelector.select(event),
            new RunningAggregates(
                Optional.ofNullable(event.getVisitors()).orElse(0L),
                Optional.ofNullable(event.getImpressions()).orElse(0L),
                Optional.ofNullable(event.getAmount()).orElse(0.0D),
                Optional.ofNullable(event.getAmountSumOfSquares()).orElse(0.0D)));
    })
    .reduceByKey((left, right) -> { return left.add(right); })
    .foreachPartition(dostuff);