Update: partly user error, but I'm still getting the FS closed error. Yes, we are running plain vanilla Hadoop 2.3.0, but that probably doesn't matter...

1. Tried Colin McCabe's suggestion to patch with pull 850 (https://issues.apache.org/jira/browse/SPARK-1898). No effect.

2. While testing Colin's patch, I realized that the master was being set in two places. In spark-submit it is set with "--master yarn-cluster", but it was also being set via the command line of my application (for SparkContext initialization), which I had not modified specifically for testing 1.0. When I changed my scripts to use --master, I removed the user option --sm, which had the effect of setting the master to "local" -- local mode, running inside two YARN containers! (This is almost too funny to imagine. Each of the two container logs was identical, looking like a local-mode log.) See the SparkConf sketch below.

3. I corrected this option to also specify "yarn-cluster". Now ASM reports that the application is RUNNING:

    14/05/22 00:09:08 INFO yarn.Client: Application report from ASM:
         application identifier: application_1400738053238_0002
         appId: 2
         clientToAMToken: null
         appDiagnostics:
         appMasterHost: 192.168.1.21
         appQueue: default
         appMasterRpcPort: 0
         appStartTime: 1400738864762
         yarnAppState: RUNNING
         distributedFinalState: UNDEFINED
         appTrackingUrl: http://Sleepycat:8088/proxy/application_1400738053238_0002/
         appUser: hduser

   And it is reported as a SUCCESS, despite a "Filesystem closed" IOException when the AM attempts to clean up the staging directory. (Now the two container logs differ: one is a driver log, the other an executor log.)

4. But the "Filesystem closed" error is still there. It is reported only in the driver's/AM's stderr log, and it does not affect the final success of the job. I'm not sure whether my app's opening and closing of the FS has any bearing; it never has before. Before the application concludes, it saves several text files to HDFS using Hadoop utilities: for each file to be written, it (1) gets an instance of the filesystem, (2) opens an output stream to a path in HDFS, (3) writes to it, (4) closes the stream, and (5) closes the filesystem. But this is done by the driver thread, not by the application master. (Does the FS object owned by ApplicationMaster interact with an arbitrary FS instance in the driver?) Furthermore, the app opens and closes the filesystem anywhere from 3 times to hundreds of times per run, and has done so in the past without side effects. We've had dozens of arguments about whether to close such FS instances: sometimes we have problems if we close them, sometimes when we don't! I shall experiment with the FileSystem handling when it's convenient; see the FileSystem sketch below.

5. Finally, I don't know whether pull 850 had any effect. I've not rolled it back to retest with the correct SparkContext master setting.
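For what it's worth, here is a minimal sketch of a SparkContext initialization that avoids the double-setting problem in item 2: leave the master out of the application code entirely, so that spark-submit's --master flag wins. The names here (MyApp and so on) are illustrative, not our actual application:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        // Deliberately no setMaster() call: a master set in code overrides
        // spark-submit's --master flag, which is exactly how our old --sm
        // option silently forced "local" inside the YARN containers.
        val conf = new SparkConf().setAppName("MyApp")
        val sc = new SparkContext(conf)
        // ... application logic ...
        sc.stop()
      }
    }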
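On the FS question in item 4, one plausible mechanism (my assumption, not verified against the 1.0 sources) is Hadoop's FileSystem cache: FileSystem.get(conf) returns a JVM-wide cached instance keyed by URI and user, so close() invalidates it for every caller in that JVM, and in yarn-cluster mode the driver runs inside the AM's JVM. Here is a sketch of the two alternatives we keep arguing about; the helper names and the write pattern are illustrative:

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Option A: use the shared, cached instance and never close it.
    // FileSystem.get(conf) hands the same object to every caller in the
    // JVM, so fs.close() here could also kill the ApplicationMaster's
    // handle and break its staging-directory cleanup.
    def writeTextFile(conf: Configuration, path: String, text: String): Unit = {
      val fs = FileSystem.get(conf)
      val out = fs.create(new Path(path))
      try out.write(text.getBytes(StandardCharsets.UTF_8))
      finally out.close() // close the stream, leave fs open
    }

    // Option B: take a private, uncached instance that is safe to close.
    def writeTextFileClosing(conf: Configuration, path: String, text: String): Unit = {
      val fs = FileSystem.newInstance(conf) // bypasses the cache
      try {
        val out = fs.create(new Path(path))
        try out.write(text.getBytes(StandardCharsets.UTF_8))
        finally out.close()
      } finally fs.close() // closing a private instance affects no one else
    }

If the cache theory is right, Option B would let us keep the close-everything discipline without stepping on the AM.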
Thanks for your feedback.

Kevin

On 05/21/2014 11:54 PM, Tathagata Das wrote: