I tested an application on RC-10 with Hadoop 2.3.0 in yarn-cluster
mode; the same application had run successfully with Spark 0.9.1 on
Hadoop 2.3 and 2.2. The application ran to conclusion, but it
ultimately failed during shutdown. There were two anomalies.

1. ASM reported only that the application was "ACCEPTED"; it never indicated that the application was "RUNNING":

14/05/21 16:06:12 INFO yarn.Client: Application report from ASM:

Furthermore, when it appeared that the application never started, YARN launched a second container, so two partly overlapping drivers ran. Each container ran to conclusion as explained above, taking twice as long as usual for both to complete, and both instances hit the same concluding failure.
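For anyone who wants to confirm the reported state independently of Spark's yarn.Client, something like the following sketch against the stock Hadoop 2.3 YarnClient API should work (the object name is mine; the hard-coded application id is just the one from this run):

    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import org.apache.hadoop.yarn.util.ConverterUtils

    object CheckAppState {
      def main(args: Array[String]): Unit = {
        val yarn = YarnClient.createYarnClient()
        yarn.init(new YarnConfiguration())
        yarn.start()
        try {
          // Application id as it appears in the logs above
          val appId = ConverterUtils.toApplicationId("application_1400696988985_0007")
          val report = yarn.getApplicationReport(appId)
          // A YARN app normally moves NEW -> SUBMITTED -> ACCEPTED -> RUNNING;
          // it should not sit in ACCEPTED once the AM has registered.
          println("yarn state  = " + report.getYarnApplicationState)
          println("final state = " + report.getFinalApplicationStatus)
        } finally {
          yarn.stop()
        }
      }
    }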
2. Each instance failed, as indicated by the stderr log, finding that the filesystem was closed when it tried to clean up the staging directory:

14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863
14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver
14/05/21 16:08:24 INFO Executor: Finished task ID 1453
14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2)
14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1)
14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool
14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s
14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s
14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250
14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler
14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted!
14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped
14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared
14/05/21 16:08:25 INFO BlockManager: BlockManager stopped
14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped
14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/05/21 16:08:25 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED
14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal.
14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1400696988985_0007
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/05/21 16:08:25 ERROR ApplicationMaster: Failed to cleanup staging dir .sparkStaging/application_1400696988985_0007
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
    at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685)
    at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591)
    at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:587)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:371)
    at org.apache.spark.deploy.yarn.ApplicationMaster$AppMasterShutdownHook.run(ApplicationMaster.scala:386)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
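Reading the trace, the delete runs inside a JVM shutdown hook against the process-wide cached FileSystem handle, and something else (presumably Hadoop's own ClientFinalizer shutdown hook) has already closed that handle. For what it's worth, here is a minimal sketch of a cleanup that should dodge that race, assuming nothing beyond the stock Hadoop 2.3 client API (the object name and hard-coded path are mine, for illustration only):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object StagingCleanupSketch {
      // Stand-in for Spark's cleanupStagingDir, as if run from a shutdown hook.
      def cleanup(stagingDir: Path, conf: Configuration): Unit = {
        // FileSystem.newInstance() bypasses the process-wide FileSystem cache,
        // so no other shutdown hook can close this handle out from under us
        // the way a shared FileSystem.get() handle can be closed.
        val fs = FileSystem.newInstance(conf)
        try {
          if (fs.exists(stagingDir)) fs.delete(stagingDir, true) // recursive
        } finally {
          fs.close()
        }
      }

      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // The blunter alternative: disable the cache for hdfs:// entirely.
        // conf.setBoolean("fs.hdfs.impl.disable.cache", true)
        cleanup(new Path("/user/hduser/.sparkStaging/application_1400696988985_0007"), conf)
      }
    }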
There is nothing about the staging directory itself that looks suspicious:

drwx------   - hduser supergroup          0 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007
-rw-r--r--   3 hduser supergroup   92881278 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/app.jar
-rw-r--r--   3 hduser supergroup  118900783 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/spark-assembly-1.0.0-hadoop2.3.0.jar

Just prior to the staging directory cleanup, the application concluded by writing its results to 3 HDFS files. That occurred without incident.

This particular test was run using:

1. RC10 compiled as follows:
   mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
2. yarn-cluster mode via spark-submit

Is there any configuration new to 1.0.0 that I might be missing? I walked through all the changes on the YARN deployment web page, updating my scripts and configuration accordingly, and everything runs except for these two anomalies.

Thanks
Kevin Markey