Hi Jeff and Prabhu,

Thanks for your help.
I looked deeper into the NodeManager log and found an error message like this:

2016-03-02 03:13:59,692 ERROR org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: error opening leveldb file file:/data/yarn/cache/yarn/nm-local-dir/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications

This error is also reported in the following JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-13622

The reason for this problem is that in core-site.xml I had set hadoop.tmp.dir as follows:

<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/home/xs6/hadoop-2.7.1/tmp</value>
</property>

I solved the problem by removing "file:" from the value field, so that it reads:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/xs6/hadoop-2.7.1/tmp</value>
</property>

Thanks!
Xiaoye

On Wed, Mar 2, 2016 at 10:02 PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:

> Were all NodeManager services restarted after the change in yarn-site.xml?
>
> On Thu, Mar 3, 2016 at 6:00 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> The executor may have failed to start. You need to check the executor
>> logs; if there is no executor log, then you need to check the NodeManager
>> log.
>>
>> On Wed, Mar 2, 2016 at 4:26 PM, Xiaoye Sun <sunxiaoy...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am very new to Spark and YARN.
>>>
>>> I am running the BroadcastTest example application using Spark 1.6.0 and
>>> Hadoop/YARN 2.7.1 on a 5-node cluster.
>>>
>>> I set up my configuration files according to
>>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>>
>>> 1. copied
>>> ./spark-1.6.0/network/yarn/target/scala-2.10/spark-1.6.0-yarn-shuffle.jar
>>> to /hadoop-2.7.1/share/hadoop/yarn/lib/
>>> 2. yarn-site.xml is like this:
>>> http://www.owlnet.rice.edu/~xs6/yarn-site.xml
>>> 3. spark-defaults.conf is like this:
>>> http://www.owlnet.rice.edu/~xs6/spark-defaults.conf
>>> 4. spark-env.sh is like this:
>>> http://www.owlnet.rice.edu/~xs6/spark-env.sh
>>> 5.
>>> the command I use to submit the Spark application is:
>>> ./bin/spark-submit --class org.apache.spark.examples.BroadcastTest
>>> --master yarn --deploy-mode cluster
>>> ./examples/target/spark-examples_2.10-1.6.0.jar 1 10000000 Http
>>>
>>> However, the job is stuck in the RUNNING state, and by looking at the
>>> log I found that the executor fails and is relaunched frequently.
>>> Here is the log output: http://www.owlnet.rice.edu/~xs6/stderr
>>> It shows something like:
>>>
>>> 16/03/02 02:07:35 WARN yarn.YarnAllocator: Container marked as failed:
>>> container_1456905762620_0002_01_000002 on host: bold-x.rice.edu. Exit
>>> status: 1. Diagnostics: Exception from container-launch.
>>>
>>> Does anybody know what the problem is here?
>>> Best,
>>> Xiaoye
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
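
P.S. For anyone hitting the same error: the failure mode can be illustrated outside of Hadoop. Plain local-filesystem APIs treat a "file:" scheme prefix as part of the literal path rather than as a URI, so a value like file:/home/... never resolves to the intended absolute directory. A minimal sketch in Python, using the paths from this thread:

```python
import os

# hadoop.tmp.dir value as written in the broken core-site.xml
bad_value = "file:/home/xs6/hadoop-2.7.1/tmp"
# the same value with the "file:" scheme removed (the fix described above)
good_value = "/home/xs6/hadoop-2.7.1/tmp"

# To a plain filesystem API on POSIX, "file:/..." is a *relative* path whose
# first component is a directory literally named "file:".
print(os.path.isabs(bad_value))    # False
print(os.path.isabs(good_value))   # True
print(bad_value.split("/")[0])     # file:
```

This is only an illustration of the path-vs-URI mismatch, not how leveldb itself opens the file, but it shows why stripping "file:" from the value fixes the error.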