[ https://issues.apache.org/jira/browse/MAPREDUCE-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ravi Prakash updated MAPREDUCE-3614: ------------------------------------ Attachment: MAPREDUCE-3614.patch Thanks Daryn and Vinod for your careful reviews. Attaching a patch addressing comments from them. Here are some clarifications @Daryn: bq. There's some issues with the filesystem handling. It's creating a filesystem to get the fs scheme so it can create the same filesystem again. I chose not to modify the code which had already been checked in. For myself I'm now getting the schema directly from the config. bq. The code is a bit misleading about only running when a SIGTERM hits. Unless something unregisters the shutdown hook, it runs on a clean exit too. You have to make sure that none of the changes are going to be detrimental to a normal shutdown. You are right! I have renamed the variables to more accurately say isSignalled etc. I reviewed the code and it will run fine in case of a normal shutdown too and ran it on my 1 node cluster a few times and checked the logs. For JHEH, all jobIds should have been closed in case of a normal shutdown, so nothing will be different. In RMCommunicator there is the stopped variable which makes sure it isn't run on stopped objects. bq. Is there a compelling need to maintain a parallel set of JobIds? Why not check the fileMap values to see if the writer is open? The event construction looks a bit odd, but I don't know much about event injection. You are right! Again! :) I'm checking for open eventWriters in fileMap now. I had to modify EventWriter for that. But I like it better. bq. I'd suggest reverting all the fs manipulation code, and try this: I had already tried that approach. And failed :( The problem with using "fs.automatic.close" was that a FileSystem object had already been opened with that parameter set to true. If you see FileSystem.java:2080, it doesn't overwrite that value. As a result on SIGTERM, the FileSystem object was being closed. @Vinod bq. FindBugs warning is still not fixed as Jenkins reports. Sorry about that. Should be fixed now. bq. Not sure why you need all the magic with the FileSystem objects, can you explain? FileSystem registers its own Runtime shutdownhook (FileSystem.java:2076) This closes all instances of FileSystem objects without the "magic". :( So when JHEH wants to write the JobUnsuccessfulCompletionEvent to HDFS / move the files to done_intermediate, it would throw an IOException. bq. You can make MRAppMaster.MRAppMasterShutdownHook extend CompositeServiceShutdownHook Daryn pointed out I'd made another boo boo here =D I should not have been starting off another thread in MRAppMasterShutdownHook for CompositeServiceShutdownHook. I just needed to call stop() so there was no need for extending anymore bq. RMCommunicator: It seems unnatural to set killed final state if the job was running. I think you should set a flag with RMCommunicator similar to JobHistoryEventHandler. Thanks for the suggestion. Done! :) bq. Trivial: isSigTermed == false can be written as !isSigTermed Done! bq. Again trivial: The sigterm flag will never go back to false from true. So the code snippet "while (isSigTermed && jobIt.hasNext()) .." can be rewritten as "if (isSigTermed) { while .. }" Done bq. For the same snippet, can you simply add a comment saying, we are doing this so as to drain everything and only write finish-events to shutdown as fast as possible. Done! bq. The test seems very wired into the code. Can we instead do something like this: Done! I've tried being as close as possible to the SIGTERM without the test becoming too onerous. Please let me know if it needs some change still. > finalState UNDEFINED if AM is killed by hand > --------------------------------------------- > > Key: MAPREDUCE-3614 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-3614 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv2 > Affects Versions: 0.23.0 > Reporter: Ravi Prakash > Assignee: Ravi Prakash > Fix For: 0.23.2 > > Attachments: MAPREDUCE-3614.branch-0.23.patch, MAPREDUCE-3614.patch, > MAPREDUCE-3614.patch, MAPREDUCE-3614.patch, MAPREDUCE-3614.patch, > MAPREDUCE-3614.patch > > > Courtesy [~dcapwell] > {quote} > If the AM is running and you kill the process (sudo kill #pid), the State in > Yarn would be FINISHED and FinalStatus is UNDEFINED. The Tracking UI would > say "History" and point to the proxy url (which will redirect to the history > server). > The state should be more descriptive that the job failed and the tracker url > shouldn't point to the history server. > {quote} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira