[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-08 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535126#comment-14535126
 ] 

Siddharth Seth commented on TEZ-2426:
-

Thanks for the review Jeff. Committing.

> Task input not complete before sending Task completed event
> ---
>
> Key: TEZ-2426
> URL: https://issues.apache.org/jira/browse/TEZ-2426
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Bikas Saha
> Assignee: Siddharth Seth
> Priority: Critical
> Attachments: TEZ-2426.1.txt, TEZ-2426.2.txt, am.log, container.log
>
>
> Sequence of events
> 1) Task A starts in a container
> 2) Task A complete event comes to AM
> 3) Task B starts in the same container
> 4) Task A's input calls some method on its context. Crashes with NPE
> 5) The crash sends an input failed event for Task A to the AM
> 6) Task A state machine crashes saying cannot handle failed after success
> In some cases, a status update event may also be sent after completion, 
> though I'm not sure whether it's related to the failed event being sent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534404#comment-14534404
 ] 

Jeff Zhang commented on TEZ-2426:
-

lgtm +1



[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-07 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533862#comment-14533862
 ] 

Siddharth Seth commented on TEZ-2426:
-

Tested on a large noop job - ran through without any issues.



[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-07 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533861#comment-14533861
 ] 

Siddharth Seth commented on TEZ-2426:
-

[~rajesh.balamohan], [~bikassaha], [~zjffdu] - please review.



[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-07 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533824#comment-14533824
 ] 

TezQA commented on TEZ-2426:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12731348/TEZ-2426.2.txt
  against master revision 05f77fe.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/654//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/654//console

This message is automatically generated.



[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-07 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533764#comment-14533764
 ] 

TezQA commented on TEZ-2426:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12731316/TEZ-2426.1.txt
  against master revision 05f77fe.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/651//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/651//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/651//console

This message is automatically generated.



[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-07 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533406#comment-14533406
 ] 

Siddharth Seth commented on TEZ-2426:
-

Longer term (0.8), it may be worthwhile to rework some of this, along with 
protocol changes.



[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-07 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533404#comment-14533404
 ] 

Siddharth Seth commented on TEZ-2426:
-

Alright. I have a theory about what's happening. Lots of threads are involved. 
This ignores the LOG lines showing up in the wrong log files (assuming the 
logger doesn't guarantee ordering when logging from different threads).

- TaskEventRouter for 456 sees an error. (This can happen because of cleanup, 
or because some fields in inputContext are not volatile.)
- TaskEventRouter is swapped out.
- The task completes and sends out its success message (heartbeat).
- The TaskEventRouter thread regains control and tries sending out the 
TaskFailed message. (This is all before the next task has started. It may or 
may not have received an interrupt by this point.)
- The main thread falls off and starts running another task. This thread can 
heartbeat since it doesn't synchronize with the previous task's heartbeats.
- The TaskEventRouter for 465 regains control. It goes into the IPC layer and 
tries sending the FAILED message (via a future). There's a context switch 
before the future.get(). The future runs. future.get() is interrupted, because 
the thread has seen its interrupt status by this point. This leads to the 
various errors in the logs.

This doesn't, however, explain a status_update after the failed message is 
sent. I don't really see what can cause that.

A couple of things need fixing here:
1) Join on the TaskEventRouter
2) Join on the last task's heartbeat thread
3) Fixes to *Context to revert fields back to final, or volatile
4) Avoid sending any more messages once any one final message has been sent.
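
A minimal sketch of fixes 1) and 4) from the list above, for illustration 
only. This is not the attached patch: TerminalEventGuard, TaskRunDemo and the 
event strings are hypothetical names, and the in-memory list stands in for the 
channel to the AM. The guard drops every message that arrives after the first 
final message, and the main thread joins on the router thread before moving on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; these are not Tez classes.
class TerminalEventGuard {
    // Set once, when the first final (terminal) message is accepted.
    private boolean terminalSent = false;
    // Stand-in for the messages that actually reach the AM.
    private final List<String> sent = new ArrayList<>();

    // Sends a final message (success or failure) only if none was sent before.
    synchronized boolean sendTerminal(String event) {
        if (terminalSent) {
            return false; // a final message already went out; drop this one
        }
        terminalSent = true;
        sent.add(event);
        return true;
    }

    // Sends a status update unless a final message has already been sent.
    synchronized boolean sendStatusUpdate(String event) {
        if (terminalSent) {
            return false;
        }
        sent.add(event);
        return true;
    }

    // Returns a copy of everything that was actually sent.
    synchronized List<String> sentEvents() {
        return new ArrayList<>(sent);
    }
}

class TaskRunDemo {
    static List<String> runOneTask() {
        TerminalEventGuard guard = new TerminalEventGuard();
        // The router thread hits a late error and tries to report a failure.
        Thread eventRouter =
            new Thread(() -> guard.sendTerminal("TASK_FAILED"), "event-router");

        guard.sendStatusUpdate("STATUS_UPDATE");
        guard.sendTerminal("TASK_SUCCEEDED");   // the task finishes first

        eventRouter.start();
        try {
            // Fix 1: do not start the next task until the router has exited.
            eventRouter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining router", e);
        }

        guard.sendStatusUpdate("STATUS_UPDATE"); // late heartbeat, dropped
        return guard.sentEvents();
    }
}
```

With this ordering, TaskRunDemo.runOneTask() returns only the first status 
update and the success message; the late failure and the late status update 
are both dropped. Making the guard's methods synchronized is one way to keep 
the check and the send atomic; other designs are possible.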



[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-07 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533096#comment-14533096
 ] 

Siddharth Seth commented on TEZ-2426:
-

The status update event after the task failed is also strange. Will look into 
that. The thread for the last running task may not be exiting properly.

> Task input not complete before sending Task completed event
> ---
>
> Key: TEZ-2426
> URL: https://issues.apache.org/jira/browse/TEZ-2426
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
>Priority: Critical
> Attachments: am.log, container.log
>
>
> Sequence of events
> 1) Task A starts in a container
> 2) Task A complete event comes to AM
> 3) Task B starts in the same container
> 4) Task A's input calls some method on its context. Crashes with NPE
> 5) The crash sends an input failed event for Task A to the AM
> 6) Task A state machine crashes saying cannot handle failed after success
> In some cases, it could be that status update event is also sent after 
> completion, though not sure if its related to the failed event being sent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event

2015-05-07 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533094#comment-14533094
 ] 

Siddharth Seth commented on TEZ-2426:
-

[~bikassaha] - do you have additional logs, specifically the entire AM log? 
There seems to be a discrepancy between the AM and task log times as well. I'm 
assuming the nodes' clocks are out of sync. 

I can see how the exception happens during execution of the next task, since 
we don't join on the eventRouter thread.
However, I'm not sure how the FAILED message would go through for the previous 
attempt as a result of this. It should have gone through for the currently 
running task. If it went for the previous task, the AM should have thrown an 
error related to an invalid taskAttemptId. That leads me to believe something 
else is broken at the same time.
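
The expectation above, that an event carrying a stale attempt ID is rejected, 
can be sketched roughly as follows. This is an assumption-laden illustration, 
not the AM code: AttemptIdValidator, assign and checkEvent are hypothetical 
names, and attempt IDs are plain strings here.

```java
import java.util.Objects;

// Hypothetical illustration; this is not the Tez AM implementation.
class AttemptIdValidator {
    // The attempt currently assigned to the container (null if none yet).
    private String currentAttemptId;

    // Records that the container is now running the given attempt.
    synchronized void assign(String attemptId) {
        currentAttemptId = Objects.requireNonNull(attemptId);
    }

    // Rejects any event whose attempt ID is not the one currently running.
    synchronized void checkEvent(String attemptId) {
        if (!Objects.equals(currentAttemptId, attemptId)) {
            throw new IllegalArgumentException(
                "Invalid taskAttemptId " + attemptId
                    + "; container is running " + currentAttemptId);
        }
    }
}
```

In this sketch, after assign("attempt_A") followed by assign("attempt_B"), a 
call to checkEvent("attempt_A") throws, while checkEvent("attempt_B") passes.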
