[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173566#comment-14173566 ]
Christophe PRÉAUD commented on SPARK-3967:
------------------------------------------

After investigating, it turns out that the problem occurs when the executor fetches a jar file: the jar is downloaded to a temporary file, always in /d1/yarn/local/nm-local-dir (the first directory of yarn.nodemanager.local-dirs), and then moved to one of the directories of yarn.nodemanager.local-dirs:
--> if it is the same directory as the one holding the temporary file (i.e. /d1/yarn/local/nm-local-dir), then the application continues normally
--> if it is another one (e.g. /d2/yarn/local/nm-local-dir, /d3/yarn/local/nm-local-dir,...), it fails with the following error:

14/10/10 14:33:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 0)
java.io.FileNotFoundException: ./logReader-1.0.10.jar (Permission denied)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
        at com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
        at com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
        at com.google.common.io.Files.copy(Files.java:436)
        at com.google.common.io.Files.move(Files.java:651)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:440)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I have no idea why the move fails when the source and target files are not on the same partition (it is no longer atomic, but it should succeed anyway). For the moment, I have worked around the problem with the attached patch (i.e. I ensure that the temp file and the moved file are always on the same partition).

> Spark applications fail in yarn-cluster mode when the directories configured
> in yarn.nodemanager.local-dirs are located on different disks/partitions
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3967
>                 URL: https://issues.apache.org/jira/browse/SPARK-3967
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Christophe PRÉAUD
>         Attachments: spark-1.1.0-yarn_cluster_tmpdir.patch
>
>
> Spark applications fail from time to time in yarn-cluster mode (but not in
> yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is
> set to a comma-separated list of directories which are located on different
> disks/partitions.
> Steps to reproduce:
> 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of
> directories located on different partitions (the more you set, the more
> likely you are to reproduce the bug):
> (...)
> <property>
>   <name>yarn.nodemanager.local-dirs</name>
>   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
> </property>
> (...)
> 2. Launch an application in yarn-cluster mode several times; it will fail
> from time to time, apparently at random.
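
For readers following the analysis above: the stack trace shows Guava's Files.move ending up in Files.copy, which is what it does when a plain rename of the source file does not succeed, and the copy then fails while opening the destination. The workaround described in the comment (keep the temp file and the final file on the same partition) could be sketched roughly as below. This is not the attached patch, whose contents are not shown here; the class name SamePartitionFetch, the method fetchTo, and the temp-file prefix are hypothetical names chosen for illustration, and the sketch uses java.nio.file rather than Guava.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only: hypothetical helper, not Spark's Utils.fetchFile.
class SamePartitionFetch {

    /**
     * Writes the stream to targetDir/fileName by way of a temp file that is
     * created inside targetDir itself, so the final move stays on a single
     * partition and can be performed as a rename.
     */
    static Path fetchTo(InputStream in, Path targetDir, String fileName) throws IOException {
        // Assumption: the temp file lives next to its final location,
        // instead of always in the first configured local directory.
        Path tmp = Files.createTempFile(targetDir, "fetchFileTemp", ".tmp");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            Path target = targetDir.resolve(fileName);
            // Source and target share a directory, hence a filesystem.
            return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up if something failed.
            Files.deleteIfExists(tmp);
        }
    }
}
```

A caller would pass the directory chosen for the executor as targetDir, for example fetchTo(stream, Paths.get("/d2/yarn/local/nm-local-dir"), "logReader-1.0.10.jar"). Because the temp file and the target share a directory, the move never needs the copy fallback seen in the stack trace.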