Can you retrieve the log for application_1463681113470_0006 and pastebin it?
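
For reference, assuming log aggregation is enabled on the cluster, the
aggregated container logs can usually be pulled with the YARN CLI:

    yarn logs -applicationId application_1463681113470_0006 > application_1463681113470_0006.log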

Thanks

On Fri, May 20, 2016 at 11:48 AM, Cui, Weifeng <weife...@a9.com> wrote:

> Hi guys,
>
>
>
> Our team has a Hadoop 2.6.0 cluster with Spark 1.6.1. We want to enable
> dynamic resource allocation for Spark, and we followed the link below.
> After the changes, all Spark jobs failed.
>
>
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> This test was run on a test cluster with 1 master machine (running the
> namenode, resourcemanager and hive server), 1 worker machine (running the
> datanode and nodemanager), and 1 client machine (running the Spark shell).
>
>
>
> *What I updated in the config:*
>
>
>
> 1. Update spark-defaults.conf
>
>         spark.dynamicAllocation.enabled     true
>
>         spark.shuffle.service.enabled            true
>
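>
> (Per the docs, only the two settings above are required; if we also
> wanted to bound the executor count, a sketch with placeholder values
> would look like this:)
>
>         spark.dynamicAllocation.minExecutors        1
>         spark.dynamicAllocation.maxExecutors        4
>         spark.dynamicAllocation.executorIdleTimeout 60s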
>
>
> 2. Update yarn-site.xml
>
>         <property>
>
>              <name>yarn.nodemanager.aux-services</name>
>              <value>mapreduce_shuffle,spark_shuffle</value>
>         </property>
>
>         <property>
>             <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>             <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>         </property>
>
>         <property>
>             <name>spark.shuffle.service.enabled</name>
>              <value>true</value>
>         </property>
>
> 3. Copy spark-1.6.1-yarn-shuffle.jar into the yarn.application.classpath
> directory ($HADOOP_HOME/share/hadoop/yarn/*); the copy is done by our
> Python deployment code.
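>
> (For reference, the copy is roughly equivalent to the shell command
> below; the source path is an assumption based on the standard Spark
> 1.6.1 binary distribution layout:)
>
>         cp $SPARK_HOME/lib/spark-1.6.1-yarn-shuffle.jar $HADOOP_HOME/share/hadoop/yarn/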
>
> 4. Restart everything: namenode, datanode, resourcemanager,
> nodemanager, etc.
>
> 5. The config is updated on all machines (resourcemanager and
> nodemanager); we update it in one place and copy it to every machine.
>
>
>
> *What I tested:*
>
>
>
> 1. I started a Scala spark-shell and checked its configuration;
> spark.dynamicAllocation.enabled is true.
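>
> (For reference, a quick way to confirm this in the REPL:)
>
>         scala> sc.getConf.get("spark.dynamicAllocation.enabled")
>         res0: String = true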
>
> 2. I used the following code:
>
>         scala> val line = sc.textFile("/spark-events/application_1463681113470_0006")
>         line: org.apache.spark.rdd.RDD[String] = /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at textFile at <console>:27
>
>         scala> line.count   // this call just hangs here
>
>
>
> 3. In the beginning there was only 1 executor (the driver); after
> line.count I could see 3 executors, which then dropped back to 1.
>
> 4. Several jobs were launched and all of them failed. Tasks (for all
> stages): Succeeded/Total: 0/2 (4 failed)
>
>
>
> *Error messages:*
>
>
>
> I found the following message in the Spark web UI, and the same message
> in spark.log on the nodemanager machine.
>
>
>
> ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_1463692924309_0002_01_000002 on host: xxxxxxxxxxxxxxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
> Container id: container_1463692924309_0002_01_000002
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 1
>
>
>
> Thanks a lot for the help. We can provide more information if needed.
>
>
>
> Thanks,
> Weifeng
>