[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-12-28 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4813:
---

Resolution: Fixed
Fix Version/s: 0.23.6, 2.0.3-alpha, 3.0.0
Status: Resolved  (was: Patch Available)

Thanks Jason,

I looked at the differences between the 0.23 patch and the trunk patch and they 
all look fine.  I put this into trunk, branch-2, and branch-0.23.

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Fix For: 3.0.0, 2.0.3-alpha, 0.23.6
>
> Attachments: JobImplStateMachine.pdf, 
> MAPREDUCE-4813-2-branch-0.23.patch, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, MAPREDUCE-4813.patch
>
>
> The AM calls the output committer's {{commitJob}} method synchronously during 
> JobImpl state transitions, which means the JobImpl write lock is held the 
> entire time the job is being committed.  Holding the write lock prevents the 
> RM allocator thread from heartbeating to the RM.  Therefore if committing the 
> job takes too long (e.g.: the job has tons of files to commit and/or the 
> namenode is bogged down) then the AM appears to be unresponsive to the RM and 
> the RM kills the AM attempt.
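
The contention described above can be reproduced in miniature. The following is a minimal sketch, not the real JobImpl: the class name, method names, and the assumption that the heartbeat path reads job state under the read lock are all hypothetical stand-ins used only to illustrate why a long commit under the write lock starves the heartbeat.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the job object; not the real JobImpl.
class LockedJobSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Simulates a state transition that commits synchronously under the write lock.
    void commitUnderWriteLock(CountDownLatch committing, CountDownLatch release) {
        lock.writeLock().lock();
        try {
            committing.countDown();   // the commit has started
            release.await();          // stands in for a slow commitJob call
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Simulates the heartbeat path (assumed to need the read lock to inspect job state).
    boolean tryHeartbeat(long timeoutMs) {
        try {
            if (!lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                return false;         // blocked: no heartbeat would reach the RM
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Returns {heartbeat result during the commit, heartbeat result after the commit}.
    static boolean[] demo() {
        LockedJobSketch job = new LockedJobSketch();
        CountDownLatch committing = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread committer = new Thread(() -> job.commitUnderWriteLock(committing, release));
        committer.start();
        try {
            committing.await();
            boolean during = job.tryHeartbeat(50);
            release.countDown();
            committer.join();
            boolean after = job.tryHeartbeat(50);
            return new boolean[] {during, after};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the heartbeat attempt made while the commit holds the write lock times out, and the attempt made after the commit releases the lock succeeds.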

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-12-27 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Attachment: MAPREDUCE-4813-2-branch-0.23.patch

Thanks, Bobby.  Adding the patch for branch-0.23.

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: JobImplStateMachine.pdf, 
> MAPREDUCE-4813-2-branch-0.23.patch, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-12-27 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Attachment: MAPREDUCE-4813-2.patch

Thanks for the review, Bobby.  Updated the patch to address the review comments:

* I went ahead and fixed the JOB_START in KILLED race since it's a one-line fix 
to ignore that event.
* I noticed that JOB_TASK_COMPLETED also needs to be handled in the FAIL_ABORT 
state.
* The commit cancel wait loop was busted, as it needs to recompute {{now}} 
within the loop to function properly.
* Standardized on StringUtils.stringifyException to report committer exceptions 
in diagnostics.  I had previously preserved the existing behavior, but I agree 
we should be consistent.
* Updated IOException to Exception within the try block of committer calls to 
make sure we don't "leak" exceptions and deadlock.
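
Two of the points above (recomputing {{now}} inside the wait loop, and catching Exception rather than IOException around committer calls) can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: the class, the method names, and the returned status strings are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the code from the patch.
class CommitCancelWaitSketch {
    private final Object monitor = new Object();
    private boolean commitFinished = false;

    private static long nowMs() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }

    // Signals any waiter that the committer call has returned or thrown.
    private void markFinished() {
        synchronized (monitor) {
            commitFinished = true;
            monitor.notifyAll();
        }
    }

    // Runs a committer call. Catching Exception (not only IOException) means an
    // unexpected runtime failure is still reported, and the finally block
    // guarantees the waiter is released instead of waiting forever.
    String runCommit(Callable<Void> commitCall) {
        try {
            commitCall.call();
            return "COMMIT_COMPLETED";
        } catch (Exception e) {
            return "COMMIT_FAILED: " + e;
        } finally {
            markFinished();
        }
    }

    // Waits up to timeoutMs for the commit to finish; returns true if it did.
    boolean waitForCommit(long timeoutMs) {
        long now = nowMs();
        long deadline = now + timeoutMs;
        synchronized (monitor) {
            while (!commitFinished && now < deadline) {
                try {
                    monitor.wait(deadline - now);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
                // Recompute now on every pass. With a stale value, a spurious
                // wakeup would restart the full wait and the loop could overrun
                // the deadline indefinitely.
                now = nowMs();
            }
            return commitFinished;
        }
    }
}
```

Because {{now < deadline}} is checked before each wait, the argument to {{wait}} is always positive, which avoids the wait-forever meaning of a zero timeout.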

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: JobImplStateMachine.pdf, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-12-26 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Attachment: MAPREDUCE-4813-2.patch

Updated patch to trunk and cleaned up some unused import warnings.

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: JobImplStateMachine.pdf, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813-2.patch, MAPREDUCE-4813-2.patch, MAPREDUCE-4813.patch, 
> MAPREDUCE-4813.patch, MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-12-17 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Attachment: MAPREDUCE-4813-2.patch

Updated patch to address FindBug warnings.

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: JobImplStateMachine.pdf, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813-2.patch, MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, 
> MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-12-14 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Attachment: MAPREDUCE-4813-2.patch, JobImplStateMachine.pdf

Per Vinod's suggestions, I updated the patch to move all of the committer 
interactions in JobImpl to a separate CommitterEventHandler, which was 
previously known as TaskCleaner.
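
A rough sketch of the idea, assuming a single worker thread and a simple callback for reporting results back to the job: committer calls are handed to a separate handler so the caller never blocks on them. Every name here (the event types, the Committer interface, the result strings) is hypothetical and does not reflect the classes in the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch; names are illustrative, not the Hadoop classes.
class CommitterEventHandlerSketch {
    enum CommitterEventType { JOB_SETUP, JOB_COMMIT, JOB_ABORT }

    // Stand-in for the output committer callbacks mentioned in this thread.
    interface Committer {
        void setupJob() throws Exception;
        void commitJob() throws Exception;
        void abortJob() throws Exception;
    }

    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Committer committer;
    private final Consumer<String> jobEvents;

    CommitterEventHandlerSketch(Committer committer, Consumer<String> jobEvents) {
        this.committer = committer;
        this.jobEvents = jobEvents;
    }

    // Returns immediately; the committer call runs on the worker thread and
    // its outcome is reported back to the job as a follow-up event.
    void handle(CommitterEventType type) {
        worker.execute(() -> {
            try {
                switch (type) {
                    case JOB_SETUP:  committer.setupJob();  break;
                    case JOB_COMMIT: committer.commitJob(); break;
                    case JOB_ABORT:  committer.abortJob();  break;
                }
                jobEvents.accept(type + "_COMPLETED");
            } catch (Exception e) {
                jobEvents.accept(type + "_FAILED");
            }
        });
    }

    // Stops accepting events and waits for queued ones to finish.
    boolean stop(long timeoutMs) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Drives the handler with a committer whose commitJob always fails.
    static List<String> demo() {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        Committer flaky = new Committer() {
            public void setupJob() {}
            public void commitJob() throws Exception { throw new Exception("namenode busy"); }
            public void abortJob() {}
        };
        CommitterEventHandlerSketch handler = new CommitterEventHandlerSketch(flaky, seen::add);
        handler.handle(CommitterEventType.JOB_SETUP);
        handler.handle(CommitterEventType.JOB_COMMIT);
        handler.handle(CommitterEventType.JOB_ABORT);
        handler.stop(5000);
        return new ArrayList<>(seen);
    }
}
```

The single worker thread keeps committer calls ordered, while the thread that posts the event is free to keep processing other work such as heartbeats.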

The asynchronous processing of committer callbacks required adding new internal 
states to JobImpl, specifically:

* SETUP which occurs after INITED while processing the setupJob callback
* COMMITTING which occurs after RUNNING while processing the commitJob callback
* FAIL_ABORT which occurs prior to FAILED while processing the abortJob 
callback
* KILL_ABORT which occurs prior to KILLED while processing the abortJob callback

One significant shift with this rework is that the committer's setupJob call is 
now performed *after* INITED and after the job reports externally that it is 
RUNNING.  Previously the setupJob callback was processed synchronously within 
the MRAppMaster.start method; moving it after INITED seemed like the cleanest 
way to handle the now asynchronous nature of the committer callback.
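
The ordering of the new internal states can be summarized as a small transition table. This is only a sketch: the four new states and their positions relative to INITED, RUNNING, FAILED, and KILLED come from the list above, while the remaining edges (for example which states may enter FAIL_ABORT or KILL_ABORT) are assumptions added to make the example complete. The attached JobImplStateMachine.pdf is the reference for the real transitions.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the internal job states; not the real JobImpl state machine.
class JobStateSketch {
    enum State {
        NEW, INITED, SETUP, RUNNING, COMMITTING, SUCCEEDED,
        FAIL_ABORT, FAILED, KILL_ABORT, KILLED
    }

    private static final Map<State, Set<State>> NEXT = new EnumMap<>(State.class);
    static {
        NEXT.put(State.NEW, EnumSet.of(State.INITED));
        // SETUP follows INITED while the setupJob callback is processed.
        NEXT.put(State.INITED, EnumSet.of(State.SETUP));
        // Assumed edges: setup success leads to RUNNING, failure or kill to an abort state.
        NEXT.put(State.SETUP, EnumSet.of(State.RUNNING, State.FAIL_ABORT, State.KILL_ABORT));
        // COMMITTING follows RUNNING while the commitJob callback is processed.
        NEXT.put(State.RUNNING, EnumSet.of(State.COMMITTING, State.FAIL_ABORT, State.KILL_ABORT));
        // Assumed edges out of COMMITTING.
        NEXT.put(State.COMMITTING, EnumSet.of(State.SUCCEEDED, State.FAIL_ABORT, State.KILL_ABORT));
        // The abort states precede their terminal states while abortJob is processed.
        NEXT.put(State.FAIL_ABORT, EnumSet.of(State.FAILED));
        NEXT.put(State.KILL_ABORT, EnumSet.of(State.KILLED));
    }

    // True if the sketch allows a direct transition from one state to another.
    static boolean canMove(State from, State to) {
        return NEXT.getOrDefault(from, EnumSet.noneOf(State.class)).contains(to);
    }
}
```

In this sketch RUNNING can no longer move straight to SUCCEEDED, and FAILED and KILLED are only reachable through their abort states, which mirrors the wait states introduced for the asynchronous callbacks.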


> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: JobImplStateMachine.pdf, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-11-29 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Status: Patch Available  (was: Open)

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 2.0.1-alpha, 0.23.3
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, 
> MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-11-29 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Attachment: MAPREDUCE-4813.patch

Thanks for the review, Vinod!  I've attached a patch that hopefully addresses 
most of your comments.

I agree that abortJob, setupJob, etc. need to be handled too, since those 
could also take an arbitrary amount of time.  Adding a new top-level 
service, associated events for that service, and new state machine wait states 
will be a bit involved, and I'm keen on getting a fix for the now common case 
of long job commits.  If it's OK with you, I'd like to tackle that review 
comment in a separate JIRA.

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, 
> MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-11-29 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-4813:
---

Status: Open  (was: Patch Available)

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 2.0.1-alpha, 0.23.3
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4813.patch, MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-11-28 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Target Version/s: 2.0.3-alpha, 0.23.6
  Status: Patch Available  (was: Open)

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 2.0.1-alpha, 0.23.3
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4813.patch, MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-11-28 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Attachment: MAPREDUCE-4813.patch

Patch that fixes the unit test failures and adds some testing of the new 
COMMITTING state.  As a bonus, most of the tests in TestJobImpl actually test a 
JobImpl object rather than a mock of it.


> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4813.patch, MAPREDUCE-4813.patch


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

2012-11-26 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
--

Attachment: MAPREDUCE-4813.patch

Posting a rough patch for comment.  It adds a new, internal COMMITTING state to 
JobImpl.  It's missing the state transition tests for the new state and also 
breaks a fair number of tests that are confused by the new state.  I wanted to 
get this out there for initial comment in case this isn't the direction people 
think this should go.

> AM timing out during job commit
> ---
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4813.patch