Can you retrieve the log for application_1463681113470_0006 and pastebin it?
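If log aggregation is enabled on your cluster, something along these lines should pull the full container logs for that application (run it as the user who submitted the job; the application id is taken from your mail):

    yarn logs -applicationId application_1463681113470_0006 > application_1463681113470_0006.log

The stderr of the failed container in there is usually the most informative part.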
Thanks

On Fri, May 20, 2016 at 11:48 AM, Cui, Weifeng <weife...@a9.com> wrote:

> Hi guys,
>
> Our team has a Hadoop 2.6.0 cluster with Spark 1.6.1. We want to set up
> dynamic resource allocation for Spark, and we followed the link below.
> After the changes, all Spark jobs failed.
>
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> This test was on a test cluster with 1 master machine (running the
> namenode, resourcemanager and hive server), 1 worker machine (running the
> datanode and nodemanager) and 1 client machine (running the spark shell).
>
> What I updated in the config:
>
> 1. Updated spark-defaults.conf:
>
>        spark.dynamicAllocation.enabled  true
>        spark.shuffle.service.enabled    true
>
> 2. Updated yarn-site.xml:
>
>        <property>
>          <name>yarn.nodemanager.aux-services</name>
>          <value>mapreduce_shuffle,spark_shuffle</value>
>        </property>
>
>        <property>
>          <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>          <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>        </property>
>
>        <property>
>          <name>spark.shuffle.service.enabled</name>
>          <value>true</value>
>        </property>
>
> 3. Copied spark-1.6.1-yarn-shuffle.jar onto yarn.application.classpath
>    ($HADOOP_HOME/share/hadoop/yarn/*); this is done from our Python code.
>
> 4. Restarted the namenode, datanode, resourcemanager, nodemanager...
>    restarted everything.
>
> 5. The config is updated on all machines, resourcemanager and nodemanager:
>    we update it in one place and copy it to all machines.
>
> What I tested:
>
> 1. I started a Scala spark shell and checked its environment variables;
>    spark.dynamicAllocation.enabled is true.
>
> 2. I ran the following:
>
>        scala> val line = sc.textFile("/spark-events/application_1463681113470_0006")
>        line: org.apache.spark.rdd.RDD[String] =
>          /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at textFile at <console>:27
>
>        scala> line.count    # this command just gets stuck here
>
> 3. In the beginning there is only 1 executor (this one is for the driver);
>    after line.count I could see 3 executors, which then dropped back to 1.
>
> 4. Several jobs were launched and all of them failed. Tasks (for all
>    stages): Succeeded/Total: 0/2 (4 failed)
>
> Error messages:
>
> I found the following message in the Spark web UI, and the same thing in
> spark.log on the nodemanager machine:
>
> ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
> Reason: Container marked as failed:
> container_1463692924309_0002_01_000002 on host: xxxxxxxxxxxxxxx.com.
> Exit status: 1.
> Diagnostics: Exception from container-launch.
> Container id: container_1463692924309_0002_01_000002
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 1
>
> Thanks a lot for the help. We can provide more information if needed.
>
> Thanks,
> Weifeng
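Also, since the containers die right away with exit code 1, it may be worth double-checking on the nodemanager host that the external shuffle service actually came up after the restart. A rough sketch of what I would look at (the log locations below are assumptions; adjust them to your HADOOP_LOG_DIR / yarn.nodemanager.log-dirs):

    # Is the shuffle jar really visible where the NodeManager's classpath expects it?
    ls $HADOOP_HOME/share/hadoop/yarn/spark-1.6.1-yarn-shuffle.jar

    # Did the spark_shuffle aux service register? Look for YarnShuffleService
    # in the NodeManager log after the restart.
    grep -i YarnShuffleService $HADOOP_HOME/logs/yarn-*-nodemanager-*.log

    # The per-container stderr usually contains the real reason behind "exit code 1".
    find $HADOOP_HOME/logs/userlogs -name stderr | xargs tail -n 50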