Here is the application log for this Spark job: http://pastebin.com/2UJS9L4e
Thanks,
Weifeng

From: "Aulakh, Sahib" <aula...@a9.com>
Date: Friday, May 20, 2016 at 12:43 PM
To: Ted Yu <yuzhih...@gmail.com>
Cc: Rodrick Brown <rodr...@orchardplatform.com>, Cui Weifeng <weife...@a9.com>, user <user@spark.apache.org>, "Zhao, Jun" <junz...@a9.com>
Subject: Re: Can not set spark dynamic resource allocation

Yes, it is YARN. We have configured the Spark shuffle service with the YARN NodeManager, but something must be off. We will send you the application log on Pastebin.

Sent from my iPhone

On May 20, 2016, at 12:35 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Since yarn-site.xml was cited, I assume the cluster runs YARN.

On Fri, May 20, 2016 at 12:30 PM, Rodrick Brown <rodr...@orchardplatform.com> wrote:

Is this YARN or Mesos? For the latter you need to start an external shuffle service.

On Fri, May 20, 2016 at 11:48 AM -0700, "Cui, Weifeng" <weife...@a9.com> wrote:

Hi guys,

Our team has a Hadoop 2.6.0 cluster with Spark 1.6.1. We wanted to enable dynamic resource allocation for Spark and followed the guide below. After the changes, all Spark jobs failed.
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

This test was on a test cluster with 1 master machine (running the NameNode, ResourceManager, and Hive server), 1 worker machine (running the DataNode and NodeManager), and 1 client machine (running the Spark shell).

What I updated in the config:

1. Update spark-defaults.conf:

   spark.dynamicAllocation.enabled true
   spark.shuffle.service.enabled   true

2. Update yarn-site.xml:

   <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle,spark_shuffle</value>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
     <value>org.apache.spark.network.yarn.YarnShuffleService</value>
   </property>
   <property>
     <name>spark.shuffle.service.enabled</name>
     <value>true</value>
   </property>

3.
Copy spark-1.6.1-yarn-shuffle.jar into yarn.application.classpath ($HADOOP_HOME/share/hadoop/yarn/*); we do this step in our Python deployment code.

4. Restart everything: the NameNode, DataNode, ResourceManager, NodeManager, and so on.

5. The config is updated on all machines, ResourceManager and NodeManager alike: we update it in one place and copy it to all machines.

What I tested:

1. I started a Scala Spark shell and checked its environment variables; spark.dynamicAllocation.enabled is true.

2. I ran the following code:

   scala> val line = sc.textFile("/spark-events/application_1463681113470_0006")
   line: org.apache.spark.rdd.RDD[String] = /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at textFile at <console>:27

   scala> line.count   // this command just got stuck here

3. In the beginning there was only 1 executor (this one is for the driver); after line.count I could see 3 executors, which then dropped back to 1.

4. Several jobs were launched and all of them failed.

   Tasks (for all stages): Succeeded/Total: 0/2 (4 failed)

Error messages:

I found the following messages in the Spark web UI, and in spark.log on the NodeManager machine as well.

ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_1463692924309_0002_01_000002 on host: xxxxxxxxxxxxxxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1463692924309_0002_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1

Thanks a lot for the help. We can provide more information if needed.

Thanks,
Weifeng
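[Editor's note] The yarn-site.xml wiring in step 2 above can be sanity-checked with a couple of greps before restarting the NodeManager. A minimal sketch follows; it writes the fragment to a throwaway file purely for illustration, whereas on a real cluster you would grep $HADOOP_CONF_DIR/yarn-site.xml directly (the temp-file path is an assumption, not from this thread):

```shell
# Write an illustrative yarn-site.xml fragment to a temp file.
# On a real node, point the greps below at $HADOOP_CONF_DIR/yarn-site.xml instead.
cat > /tmp/yarn-site-fragment.xml <<'EOF'
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
EOF

# spark_shuffle must appear in the same comma-separated aux-services value as
# mapreduce_shuffle; Hadoop's Configuration keeps only the last <property> with
# a given name, so a second aux-services entry would silently override the first.
grep -q 'mapreduce_shuffle,spark_shuffle' /tmp/yarn-site-fragment.xml \
  && echo "aux-services OK"

# The shuffle service implementation class must also be declared.
grep -q 'org.apache.spark.network.yarn.YarnShuffleService' /tmp/yarn-site-fragment.xml \
  && echo "shuffle class OK"
```

Once the NodeManager is restarted with the shuffle jar on its classpath, the external shuffle service listens on port 7337 by default (configurable via spark.shuffle.service.port), which can be checked on the worker with netstat or ss.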