Thomas Graves created SPARK-19812:
-------------------------------------

             Summary: YARN shuffle service fix moving recovery DB directories
                 Key: SPARK-19812
                 URL: https://issues.apache.org/jira/browse/SPARK-19812
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 2.0.1
            Reporter: Thomas Graves
            Assignee: Thomas Graves


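The YARN shuffle service tries to switch from the YARN local directories to the
real recovery directory, but it can fail to move the existing recovery DBs. The
move fails because Files.move does not move a directory that has contents when
the move cannot be done as a simple rename (for example, when source and target
are on different file systems); it throws DirectoryNotEmptyException instead.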

2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move recovery file sparkShuffleRecovery.ldb to the path /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
java.nio.file.DirectoryNotEmptyException: /yarn-local/sparkShuffleRecovery.ldb
        at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
        at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
        at java.nio.file.Files.move(Files.java:1395)
        at org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
        at org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
        at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)

This used to use f.renameTo; we switched it in the PR due to review comments
and apparently did not do a final real test. The tests use files rather than
directories, so they did not catch this. We need to fix the tests as well.
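
For illustration only, here is a minimal sketch of a move that tolerates
non-empty directories. It is not the code in YarnShuffleService and not
necessarily how this issue will be fixed: the class RecoveryDbMover and its
methods moveTree, copyTree and deleteTree are hypothetical names. It assumes
that falling back to copy-then-delete is acceptable when Files.move refuses the
directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for illustration; not part of Spark.
final class RecoveryDbMover {
  private RecoveryDbMover() {}

  // Try a plain move first; if the source is a non-empty directory that
  // cannot simply be renamed, copy the tree and then delete the source.
  static void moveTree(Path src, Path dst) throws IOException {
    try {
      Files.move(src, dst);
    } catch (DirectoryNotEmptyException e) {
      copyTree(src, dst);
      deleteTree(src);
    }
  }

  // Recursively copy src (file or directory) to dst.
  static void copyTree(Path src, Path dst) throws IOException {
    Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
          throws IOException {
        // Create the matching directory under dst before visiting its entries.
        Files.createDirectories(dst.resolve(src.relativize(dir)));
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.copy(file, dst.resolve(src.relativize(file)),
            StandardCopyOption.COPY_ATTRIBUTES);
        return FileVisitResult.CONTINUE;
      }
    });
  }

  // Recursively delete root, removing entries before their parent directory.
  static void deleteTree(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException exc)
          throws IOException {
        if (exc != null) {
          throw exc;
        }
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
```

Whatever the eventual fix, a test for it should build its fixture as a
directory containing at least one file, since an empty directory or a plain
file does not exercise the failing path. Note also that a temp directory on a
single file system will usually take the rename path, so the fallback may need
to be tested directly.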

history: 
https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440


