[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4813:
---
    Resolution: Fixed
    Fix Version/s: 0.23.6
                   2.0.3-alpha
                   3.0.0
    Status: Resolved (was: Patch Available)

Thanks Jason, I looked at the differences between the 0.23 patch and the trunk patch and they all look fine. I put this into trunk, branch-2, and branch-0.23.

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Fix For: 3.0.0, 2.0.3-alpha, 0.23.6
>
> Attachments: JobImplStateMachine.pdf,
> MAPREDUCE-4813-2-branch-0.23.patch, MAPREDUCE-4813-2.patch,
> MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch,
> MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, MAPREDUCE-4813.patch
>
>
> The AM calls the output committer's {{commitJob}} method synchronously during
> JobImpl state transitions, which means the JobImpl write lock is held the
> entire time the job is being committed. Holding the write lock prevents the
> RM allocator thread from heartbeating to the RM. Therefore if committing the
> job takes too long (e.g.: the job has tons of files to commit and/or the
> namenode is bogged down) then the AM appears to be unresponsive to the RM and
> the RM kills the AM attempt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Attachment: MAPREDUCE-4813-2-branch-0.23.patch

Thanks, Bobby. Adding the patch for branch-0.23.
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Attachment: MAPREDUCE-4813-2.patch

Thanks for the review, Bobby. Updated the patch to address the review comments:

* I went ahead and fixed the JOB_START in KILLED race since it's a one-line fix to ignore that event.
* I noticed that JOB_TASK_COMPLETED also needs to be handled in the FAIL_ABORT state.
* The commit cancel wait loop was busted, as it needs to recompute {{now}} within the loop to function properly.
* Standardized on StringUtils.stringifyException to report committer exceptions in diagnostics. I had previously preserved the existing behavior, but I agree we should be consistent.
* Updated IOException to Exception within the try block of committer calls to make sure we don't "leak" exceptions and deadlock.
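The wait-loop fix described in the review response above can be sketched roughly as follows. This is not the code from the patch: the class name CommitCancelWaiter, its fields, and both method names are hypothetical, and the use of System.nanoTime() is a choice made for this sketch. The only point it illustrates is that `now` has to be refreshed on every pass; otherwise the remaining time never shrinks and a wakeup that arrives before the cancel has completed restarts the full timeout.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not a class from the patch. */
final class CommitCancelWaiter {
    private final Object monitor = new Object();
    private boolean cancelDone = false;

    /** Called (for example by the committer thread) once the in-flight commit has stopped. */
    void markCancelDone() {
        synchronized (monitor) {
            cancelDone = true;
            monitor.notifyAll();
        }
    }

    /** Waits up to timeoutMs for the cancel to finish and returns whether it did. */
    boolean awaitCancel(long timeoutMs) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (monitor) {
            long now = System.nanoTime();
            while (!cancelDone && deadline - now > 0) {
                // Object.wait(0) means "wait forever", so never pass less than 1 ms.
                long remainingMs = Math.max(1L, TimeUnit.NANOSECONDS.toMillis(deadline - now));
                try {
                    monitor.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    break;
                }
                // The essential step: recompute now inside the loop so the
                // remaining time shrinks and the loop can actually time out.
                now = System.nanoTime();
            }
            return cancelDone;
        }
    }
}
```

A caller would invoke awaitCancel with whatever cancel timeout it is configured with and treat a false return as "the committer did not stop in time".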
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Attachment: MAPREDUCE-4813-2.patch

Updated patch to trunk and cleaned up some unused import warnings.
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Attachment: MAPREDUCE-4813-2.patch

Updated patch to address FindBugs warnings.
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Attachment: MAPREDUCE-4813-2.patch
                JobImplStateMachine.pdf

Per Vinod's suggestions, I updated the patch to move all of the committer interactions in JobImpl to a separate CommitterEventHandler, which was previously known as TaskCleaner. The asynchronous processing of committer callbacks required adding new internal states to JobImpl, specifically:

* SETUP, which occurs after INITED while processing the setupJob callback
* COMMITTING, which occurs after RUNNING while processing the commitJob callback
* FAIL_ABORT, which occurs prior to FAILED while processing the abortJob callback
* KILL_ABORT, which occurs prior to KILLED while processing the abortJob callback

One significant shift with this rework is that the committer's setupJob call is now performed *after* INITED and after the job reports externally that it is RUNNING. Previously the setupJob callback was processed synchronously within the MRAppMaster.start method; moving it after INITED seemed like the cleanest way to handle the now asynchronous nature of the committer callback.
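The ordering of the new internal states listed above can be sketched as a small transition table. This is a deliberately simplified illustration, not the JobImpl state machine: the class JobStateSketch and its canMove method are hypothetical, the real state machine has more states and is driven by events, and every edge other than the four orderings stated in the comment above (SETUP after INITED, COMMITTING after RUNNING, FAIL_ABORT before FAILED, KILL_ABORT before KILLED) is an assumption made for this sketch.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

/** Hypothetical, simplified view of the internal job states; not the real JobImpl. */
final class JobStateSketch {
    enum Internal {
        NEW, INITED, SETUP, RUNNING, COMMITTING, SUCCEEDED,
        FAIL_ABORT, FAILED, KILL_ABORT, KILLED
    }

    // Allowed next states. The abort edges out of SETUP, RUNNING and COMMITTING
    // are assumptions of this sketch.
    private static final Map<Internal, EnumSet<Internal>> NEXT = new EnumMap<>(Internal.class);
    static {
        NEXT.put(Internal.NEW, EnumSet.of(Internal.INITED));
        // setupJob now runs in its own wait state after INITED.
        NEXT.put(Internal.INITED, EnumSet.of(Internal.SETUP));
        NEXT.put(Internal.SETUP,
                EnumSet.of(Internal.RUNNING, Internal.FAIL_ABORT, Internal.KILL_ABORT));
        // commitJob runs in COMMITTING rather than inside a RUNNING transition.
        NEXT.put(Internal.RUNNING,
                EnumSet.of(Internal.COMMITTING, Internal.FAIL_ABORT, Internal.KILL_ABORT));
        NEXT.put(Internal.COMMITTING,
                EnumSet.of(Internal.SUCCEEDED, Internal.FAIL_ABORT, Internal.KILL_ABORT));
        // abortJob runs in the *_ABORT wait states before the terminal state.
        NEXT.put(Internal.FAIL_ABORT, EnumSet.of(Internal.FAILED));
        NEXT.put(Internal.KILL_ABORT, EnumSet.of(Internal.KILLED));
    }

    /** Returns true if the sketch allows moving directly from one state to the other. */
    static boolean canMove(Internal from, Internal to) {
        EnumSet<Internal> allowed = NEXT.get(from);
        return allowed != null && allowed.contains(to);
    }
}
```

In this sketch RUNNING can no longer reach SUCCEEDED directly, which mirrors the idea that each committer callback gets its own wait state.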
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Status: Patch Available (was: Open)
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Attachment: MAPREDUCE-4813.patch

Thanks for the review, Vinod! I've attached a patch that hopefully addresses most of your comments.

I agree that abortJob, setupJob, etc. need to be handled too, as those could also take an arbitrary amount of time. Adding a new top-level service, associated events for that service, and new state machine wait states will be a bit involved, and I'm keen on getting a fix for the now common case of long job commits. If it's OK with you, I'd like to tackle that review comment in a separate JIRA.
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-4813:
---
    Status: Open (was: Patch Available)
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Target Version/s: 2.0.3-alpha, 0.23.6
    Status: Patch Available (was: Open)
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Attachment: MAPREDUCE-4813.patch

Patch that fixes the unit test failures and adds some testing of the new COMMITTING state. As a bonus, most of the tests in TestJobImpl actually test a JobImpl object rather than a mock of it.
[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
---
    Attachment: MAPREDUCE-4813.patch

Posting a rough patch for comment. It adds a new, internal COMMITTING state to JobImpl. It's missing the state transition tests for the new state and also breaks a fair number of tests that are confused by the new state. I wanted to get this out there for initial comment in case this isn't the direction people think this should go.
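The idea behind the COMMITTING state can be sketched as follows: take the write lock only long enough to record the state change, run commitJob on another thread with no job lock held, and record the outcome when it finishes. This is a minimal sketch under stated assumptions, not the patch itself. AsyncCommitSketch, its Committer interface, and all method names are hypothetical stand-ins; the real JobImpl is event driven and uses Hadoop's own OutputCommitter type rather than this simplified interface.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical illustration of an asynchronous job commit; not the real JobImpl. */
final class AsyncCommitSketch {
    enum State { RUNNING, COMMITTING, SUCCEEDED, FAILED }

    /** Simplified stand-in for the output committer (an assumption of this sketch). */
    interface Committer { void commitJob() throws Exception; }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final CountDownLatch finished = new CountDownLatch(1);
    // Daemon thread so the sketch never keeps a JVM alive.
    private final ExecutorService committerThread = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "committer-sketch");
        t.setDaemon(true);
        return t;
    });
    private final Committer committer;
    private State state = State.RUNNING;

    AsyncCommitSketch(Committer committer) { this.committer = committer; }

    /** All tasks are done: enter COMMITTING under the lock, then hand off the slow work. */
    void onAllTasksCompleted() {
        setState(State.COMMITTING);
        committerThread.execute(() -> {
            boolean ok;
            try {
                committer.commitJob(); // potentially slow; no job lock is held here
                ok = true;
            } catch (Exception e) {    // catch broadly so an outcome is always reported
                ok = false;
            }
            setState(ok ? State.SUCCEEDED : State.FAILED);
            finished.countDown();
        });
    }

    private void setState(State next) {
        lock.writeLock().lock();
        try { state = next; } finally { lock.writeLock().unlock(); }
    }

    /** What a heartbeat-style reader does: a brief read lock that is never blocked by the commit. */
    State getState() {
        lock.readLock().lock();
        try { return state; } finally { lock.readLock().unlock(); }
    }

    /** Waits for the commit outcome; returns false on timeout or interrupt. */
    boolean awaitFinished(long timeoutMs) {
        try {
            return finished.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

While the commit is in flight, getState() returns COMMITTING immediately instead of blocking, which is the property the allocator thread needs in order to keep heartbeating.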