I tested an application on RC10 with Hadoop 2.3.0 in yarn-cluster mode; the same application had run successfully with Spark 0.9.1 on Hadoop 2.3 or 2.2.  The application ran to conclusion, but it ultimately failed.

There were 2 anomalies...

1. ASM reported only that the application was "ACCEPTED". It never indicated that the application was "RUNNING".
14/05/21 16:06:12 INFO yarn.Client: Application report from ASM:
     application identifier: application_1400696988985_0007
     appId: 7
     clientToAMToken: null
     appDiagnostics:
     appMasterHost: N/A
     appQueue: default
     appMasterRpcPort: -1
     appStartTime: 1400709970857
     yarnAppState: ACCEPTED
     distributedFinalState: UNDEFINED
     appTrackingUrl: http://Sleepycat:8088/proxy/application_1400696988985_0007/
     appUser: hduser
Furthermore, because the application appeared never to have started, YARN launched a second container, so two partly overlapping drivers were running.  Each container ran to conclusion as described above, taking twice as long as usual for both to complete, and both instances ended with the same concluding failure.
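For what it's worth, the report above is only what the spark-submit client prints.  The following self-contained sketch (not part of my application) asks the ResourceManager directly for its view of the same application through the hadoop-yarn-client API, which would show whether the RM itself ever considers the application RUNNING; the id components 1400696988985 and 7 are taken from the application id above.

    import org.apache.hadoop.yarn.api.records.ApplicationId
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    object CheckAppState {
      def main(args: Array[String]): Unit = {
        val yarnClient = YarnClient.createYarnClient()
        yarnClient.init(new YarnConfiguration())   // picks up yarn-site.xml from the classpath
        yarnClient.start()

        // application_1400696988985_0007 -> cluster timestamp 1400696988985, sequence number 7
        val appId  = ApplicationId.newInstance(1400696988985L, 7)
        val report = yarnClient.getApplicationReport(appId)

        println(s"state=${report.getYarnApplicationState}, " +
                s"finalStatus=${report.getFinalApplicationStatus}, " +
                s"diagnostics=${report.getDiagnostics}")

        yarnClient.stop()
      }
    }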

2. As the stderr log shows, each instance failed when cleaning up its staging directory, finding that the filesystem was already closed.

14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863
14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver
14/05/21 16:08:24 INFO Executor: Finished task ID 1453
14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2)
14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1)
14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool
14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s
14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s
14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250
14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler
14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted!
14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped
14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared
14/05/21 16:08:25 INFO BlockManager: BlockManager stopped
14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped
14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/05/21 16:08:25 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED
14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal.
14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1400696988985_0007
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/05/21 16:08:25 ERROR ApplicationMaster: Failed to cleanup staging dir .sparkStaging/application_1400696988985_0007
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
    at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685)
    at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591)
    at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:587)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:371)
    at org.apache.spark.deploy.yarn.ApplicationMaster$AppMasterShutdownHook.run(ApplicationMaster.scala:386)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
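"Filesystem closed" generally means that the shared instance from Hadoop's JVM-wide FileSystem cache was closed by someone else before the ApplicationMaster's shutdown hook ran its delete.  I have not confirmed that this is what happens here, but the following self-contained sketch, using a hypothetical hdfs://namenode:8020 URI rather than my cluster, produces exactly this exception:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object FsCacheClosedSketch {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val uri  = URI.create("hdfs://namenode:8020")   // hypothetical namenode address

        // FileSystem.get returns a JVM-wide cached instance keyed by (scheme, authority, user),
        // so fsA and fsB refer to the same object.
        val fsA = FileSystem.get(uri, conf)
        val fsB = FileSystem.get(uri, conf)

        fsA.close()   // some other code path (e.g. another shutdown hook) closes "its" filesystem

        // Any later use of the shared instance now throws java.io.IOException: Filesystem closed
        fsB.delete(new Path("/user/hduser/.sparkStaging/application_1400696988985_0007"), true)
      }
    }

(FileSystem.newInstance(uri, conf) bypasses the cache and returns a private instance, which is one way cleanup code can insulate itself from a close elsewhere in the JVM.)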

There is nothing about the staging directories themselves that looks suspicious... 

drwx------   - hduser supergroup          0 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007
-rw-r--r--   3 hduser supergroup   92881278 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/app.jar
-rw-r--r--   3 hduser supergroup  118900783 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/spark-assembly-1.0.0-hadoop2.3.0.jar

Just prior to the staging directory cleanup, the application concluded by writing results to 3 HDFS files.  That occurred without incident. 
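(For context, the writes themselves are nothing exotic; they are roughly of the shape below, with placeholder names and a placeholder output path rather than my actual code.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative shape of the final output step only; names and paths are placeholders.
    object WriteResultsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WriteResultsSketch"))
        val results = sc.parallelize(Seq("a", "b", "c"))               // stand-in for the real result RDD
        results.saveAsTextFile("hdfs:///user/hduser/output/results")   // placeholder output path
        sc.stop()
      }
    }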

This particular test was run using ...

1. RC10 compiled as follows:  mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
2. Ran in yarn-cluster mode using spark-submit

Is there any configuration new to 1.0.0 that I might be missing?  I walked through all the changes on the YARN deployment web page, updated my scripts and configuration accordingly, and everything runs except for these two anomalies.

Thanks
Kevin Markey


