[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535126#comment-14535126 ]

Siddharth Seth commented on TEZ-2426:
-------------------------------------

Thanks for the review Jeff. Committing.

> Task input not complete before sending Task completed event
> -----------------------------------------------------------
>
> Key: TEZ-2426
> URL: https://issues.apache.org/jira/browse/TEZ-2426
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Bikas Saha
> Assignee: Siddharth Seth
> Priority: Critical
> Attachments: TEZ-2426.1.txt, TEZ-2426.2.txt, am.log, container.log
>
>
> Sequence of events
> 1) Task A starts in a container
> 2) Task A complete event comes to AM
> 3) Task B starts in the same container
> 4) Task A's input calls some method on its context. Crashes with NPE
> 5) The crash sends an input failed event for Task A to the AM
> 6) Task A state machine crashes saying cannot handle failed after success
> In some cases, it could be that status update event is also sent after
> completion, though not sure if it's related to the failed event being sent.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534404#comment-14534404 ]

Jeff Zhang commented on TEZ-2426:

lgtm +1
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533862#comment-14533862 ]

Siddharth Seth commented on TEZ-2426:

Tested on a large noop job - ran through without any issues.
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533861#comment-14533861 ]

Siddharth Seth commented on TEZ-2426:

[~rajesh.balamohan], [~bikassaha], [~zjffdu] - please review.
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533824#comment-14533824 ]

TezQA commented on TEZ-2426:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12731348/TEZ-2426.2.txt
against master revision 05f77fe.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/654//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/654//console

This message is automatically generated.
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533764#comment-14533764 ]

TezQA commented on TEZ-2426:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12731316/TEZ-2426.1.txt
against master revision 05f77fe.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/651//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/651//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/651//console

This message is automatically generated.
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533406#comment-14533406 ]

Siddharth Seth commented on TEZ-2426:

Longer term (0.8), it may be worthwhile to rework some of this, along with protocol changes.
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533404#comment-14533404 ]

Siddharth Seth commented on TEZ-2426:

Alright. Have a theory on what's happening. Lots of threads involved. This ignores the LOG lines showing up in the wrong log files (assuming the logger doesn't guarantee ordering when logging from different threads).

- TaskEventRouter for 456 sees an error. (This can happen because of clean up / some fields not being volatile in inputContext.)
- TaskEventRouter is swapped out.
- Task completes, sends out its success message (heartbeat).
- TaskEventRouter thread regains control - tries sending out the TaskFailed message. (This is all before the next task has started. It may or may not have got an interrupt by this point.)
- Main thread falls off. Starts running another task. This thread can heartbeat since it doesn't synchronize with the previous task's heartbeats.
- The TaskEventRouter for 465 regains control. Goes into the IPC layer and tries sending the FAILED message (via a future). There's a context switch before the future.get(). The future runs. future.get() is interrupted, because the thread has seen its interrupt status by this point. Leads to the various errors in the logs.

This doesn't however explain a status_update after the failed message is sent. Don't really see what can cause that.

Couple of things which need fixing here:
1) Join on the TaskEventRouter
2) Join on the last task's heartbeat thread
3) Fixes to *Context to revert fields back to final, or volatile
4) Avoid sending any more messages once any one final message has been sent.
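
Item 4 in the list above (no further messages once a final message has gone out) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class name TerminalMessageGuard, its methods, and the message strings are all hypothetical, and a plain in-memory list stands in for the real heartbeat / IPC path to the AM.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: allows at most one terminal message (e.g. SUCCEEDED or
 * FAILED) per task attempt, and drops everything sent after it.
 * All names here are illustrative and are not Tez APIs.
 */
final class TerminalMessageGuard {
    // Stand-in for the real channel to the AM.
    private final List<String> sent = new ArrayList<>();
    private boolean terminalSent = false;

    /** Sends a terminal message; returns false if one was already sent. */
    synchronized boolean sendTerminal(String message) {
        if (terminalSent) {
            return false; // e.g. a late FAILED after SUCCEEDED is dropped
        }
        terminalSent = true;
        sent.add(message);
        return true;
    }

    /** Sends a status update; returns false once a terminal message went out. */
    synchronized boolean sendStatusUpdate(String message) {
        if (terminalSent) {
            return false;
        }
        sent.add(message);
        return true;
    }

    /** Snapshot of everything that actually went out, in order. */
    synchronized List<String> sentMessages() {
        return new ArrayList<>(sent);
    }
}
```

Making both send methods synchronized on the same object means the check and the send happen as one step, so a router thread that regains control late cannot slip a FAILED or a status update in after the success message. It does not replace the joins in items 1 and 2; it only keeps the AM from seeing contradictory final states.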
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533096#comment-14533096 ]

Siddharth Seth commented on TEZ-2426:

The status update event after the task failed is also strange. Will look into that. The thread for the last running task may not be exiting properly.
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533094#comment-14533094 ]

Siddharth Seth commented on TEZ-2426:

[~bikassaha] - do you have additional logs, specifically the entire AM log? There also seems to be a discrepancy between the AM and task log times; I'm assuming the nodes' clocks are out of sync.

I can see how the exception happens during execution of the next task, since we don't join on the eventRouter thread. However, I'm not sure how the FAILED message would go through for the previous attempt as a result of this. It should have gone through for the currently running task. If it went for the previous task, the AM should have thrown an error related to an invalid taskAttemptId. That leads me to believe something else is broken at the same time.
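
The missing join on the eventRouter thread mentioned above can be illustrated with a small sketch. It is written under assumptions and is not the Tez task runner: SequentialTaskRunner, runTask, and the log strings are hypothetical names, and a short sleep stands in for a router thread that is still draining events when the task's main work finishes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical sketch: each task starts a helper "event router" thread and
 * joins on it before returning, so the router cannot outlive its task and
 * run while the next task is executing in the same container.
 */
final class SequentialTaskRunner {
    private final List<String> log = Collections.synchronizedList(new ArrayList<>());

    void runTask(String taskName) {
        log.add(taskName + ":start");
        Thread router = new Thread(() -> {
            try {
                Thread.sleep(50); // simulate the router still processing events
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            log.add(taskName + ":router-done");
        }, "event-router-" + taskName);
        router.start();
        log.add(taskName + ":main-done");
        try {
            router.join(); // without this, the router may run during the next task
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining router", e);
        }
    }

    /** Snapshot of the recorded events, in the order they were logged. */
    List<String> log() {
        synchronized (log) {
            return new ArrayList<>(log);
        }
    }
}
```

Calling runTask("A") and then runTask("B") always records A's router finishing before B starts. If the join is removed, that ordering is no longer guaranteed, which is the window in which a stale router could act on a cleaned-up context.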