Sorry for the late reply.

<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/local/output/logs/nm-log-dir</value>
</property>

We do not use file:// in the settings, so that should not be the problem. Any other guesses?

Weifeng

On 5/20/16, 2:40 PM, "David Newberger" <david.newber...@wandcorp.com> wrote:

>Hi All,
>
>The error you are seeing looks really similar to SPARK-13514 to me. I could be
>wrong though.
>
>https://issues.apache.org/jira/browse/SPARK-13514
>
>Can you check yarn.nodemanager.local-dirs in your YARN configuration for
>"file://"?
>
>Cheers!
>David Newberger
>
>-----Original Message-----
>From: Cui, Weifeng [mailto:weife...@a9.com]
>Sent: Friday, May 20, 2016 4:26 PM
>To: Marcelo Vanzin
>Cc: Ted Yu; Rodrick Brown; user; Zhao, Jun; Aulakh, Sahib; Song, Yiwei
>Subject: Re: Can not set spark dynamic resource allocation
>
>Sorry, here is the node-manager log; application_1463692924309_0002 is my
>test. Hope this will help.
>http://pastebin.com/0BPEcgcW
>
>On 5/20/16, 2:09 PM, "Marcelo Vanzin" <van...@cloudera.com> wrote:
>
>>Hi Weifeng,
>>
>>That's the Spark event log, not the YARN application log. You get the
>>latter using the "yarn logs" command.
>>
>>On Fri, May 20, 2016 at 1:14 PM, Cui, Weifeng <weife...@a9.com> wrote:
>>> Here is the application log for this spark job.
>>>
>>> http://pastebin.com/2UJS9L4e
>>>
>>> Thanks,
>>> Weifeng
>>>
>>> From: "Aulakh, Sahib" <aula...@a9.com>
>>> Date: Friday, May 20, 2016 at 12:43 PM
>>> To: Ted Yu <yuzhih...@gmail.com>
>>> Cc: Rodrick Brown <rodr...@orchardplatform.com>, Cui Weifeng <weife...@a9.com>,
>>> user <user@spark.apache.org>, "Zhao, Jun" <junz...@a9.com>
>>> Subject: Re: Can not set spark dynamic resource allocation
>>>
>>> Yes, it is YARN. We have configured the Spark shuffle service with the
>>> YARN node manager, but something must be off.
>>>
>>> We will send you the app log on Pastebin.
>>>
>>> On May 20, 2016, at 12:35 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> Since yarn-site.xml was cited, I assume the cluster runs YARN.
>>>
>>> On Fri, May 20, 2016 at 12:30 PM, Rodrick Brown
>>> <rodr...@orchardplatform.com> wrote:
>>>
>>> Is this YARN or Mesos? For the latter you need to start an external
>>> shuffle service.
>>>
>>> On Fri, May 20, 2016 at 11:48 AM -0700, "Cui, Weifeng" <weife...@a9.com>
>>> wrote:
>>>
>>> Hi guys,
>>>
>>> Our team has a Hadoop 2.6.0 cluster with Spark 1.6.1. We want to set up
>>> dynamic resource allocation for Spark, and we followed the link below.
>>> After the changes, all Spark jobs failed.
>>>
>>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>>
>>> This test was on a test cluster which has 1 master machine (running
>>> namenode, resourcemanager and hive server), 1 worker machine (running
>>> datanode and nodemanager) and 1 machine as client (running spark shell).
>>>
>>> What I updated in the config:
>>>
>>> 1. Update spark-defaults.conf
>>>
>>> spark.dynamicAllocation.enabled true
>>> spark.shuffle.service.enabled true
>>>
>>> 2. Update yarn-site.xml
>>>
>>> <property>
>>>   <name>yarn.nodemanager.aux-services</name>
>>>   <value>mapreduce_shuffle,spark_shuffle</value>
>>> </property>
>>>
>>> <property>
>>>   <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>>>   <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>>> </property>
>>>
>>> <property>
>>>   <name>spark.shuffle.service.enabled</name>
>>>   <value>true</value>
>>> </property>
>>>
>>> 3. Copy spark-1.6.1-yarn-shuffle.jar into yarn.application.classpath
>>> ($HADOOP_HOME/share/hadoop/yarn/*); this step is done in our Python code.
>>>
>>> 4. Restart namenode, datanode, resourcemanager, nodemanager... restart
>>> everything.
>>>
>>> 5. The config is updated on all machines, resourcemanager and nodemanager.
>>> We update the config in one place and copy it to all machines.
>>>
>>> What I tested:
>>>
>>> 1. I started a Scala spark shell and checked its environment variables;
>>> spark.dynamicAllocation.enabled is true.
>>>
>>> 2. I used the following code:
>>>
>>> scala> val line = sc.textFile("/spark-events/application_1463681113470_0006")
>>> line: org.apache.spark.rdd.RDD[String] =
>>> /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at
>>> textFile at <console>:27
>>>
>>> scala> line.count   // this command just got stuck here
>>>
>>> 3. In the beginning there is only 1 executor (this is for the driver), and
>>> after line.count I could see 3 executors, which then dropped back to 1.
>>>
>>> 4. Several jobs were launched and all of them failed. Tasks (for all
>>> stages): Succeeded/Total: 0/2 (4 failed)
>>>
>>> Error messages:
>>>
>>> I found the following messages in the Spark web UI. I found the same thing
>>> in spark.log on the nodemanager machine as well.
>>>
>>> ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
>>> Reason: Container marked as failed: container_1463692924309_0002_01_000002
>>> on host: xxxxxxxxxxxxxxx.com. Exit status: 1. Diagnostics: Exception
>>> from container-launch.
>>> Container id: container_1463692924309_0002_01_000002
>>> Exit code: 1
>>> Stack trace: ExitCodeException exitCode=1:
>>>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>>>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>>>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>>>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> Container exited with a non-zero exit code 1
>>>
>>> Thanks a lot for the help. We can provide more information if needed.
>>>
>>> Thanks,
>>> Weifeng
>>
>>
>>--
>>Marcelo
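
For anyone reproducing the setup above: the two spark-defaults.conf entries
from the thread can also be set programmatically on a SparkConf. The sketch
below is illustrative only; the application name and the min/max executor
bounds are assumptions added for the example, and only the two "enabled"
flags are taken from the thread.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch of a driver that enables dynamic allocation.
    // Only the two "enabled" flags come from the thread; the executor
    // bounds and app name below are illustrative, not Weifeng's values.
    object DynamicAllocationSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("dynamic-allocation-sketch")            // illustrative name
          .set("spark.dynamicAllocation.enabled", "true")     // as in spark-defaults.conf above
          .set("spark.shuffle.service.enabled", "true")       // requires the YARN shuffle service
          .set("spark.dynamicAllocation.minExecutors", "1")   // illustrative bound
          .set("spark.dynamicAllocation.maxExecutors", "10")  // illustrative bound

        val sc = new SparkContext(conf)
        // Same kind of test as in the thread: read a file and count its lines.
        val lines = sc.textFile("/spark-events/application_1463681113470_0006")
        println(s"line count: ${lines.count()}")
        sc.stop()
      }
    }

Submitted with spark-submit against YARN, this should show the same executor
scale-up and scale-down that Weifeng describes, provided the NodeManager-side
spark_shuffle service is registered correctly.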
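
The settings can also be confirmed from inside the spark-shell before running
anything heavy. A small sketch, assuming the shell's usual sc; the expected
values in the comments reflect the configuration reported in the thread.

    // Check that the dynamic-allocation flags were actually picked up.
    sc.getConf.getOption("spark.dynamicAllocation.enabled")  // expected: Some("true")
    sc.getConf.getOption("spark.shuffle.service.enabled")    // expected: Some("true")

    // Rough count of executors currently registered with the driver
    // (the driver itself also appears in this map).
    sc.getExecutorMemoryStatus.size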