[jira] [Updated] (TEZ-1424) Fixes to DAG text representation in debug mode

2014-10-17 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1424:
--
Attachment: TEZ-1424.1.patch

[~sseth] - Can you please review?

> Fixes to DAG text representation in debug mode
> --
>
> Key: TEZ-1424
> URL: https://issues.apache.org/jira/browse/TEZ-1424
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Priority: Critical
> Attachments: TEZ-1424.1.patch
>
>
> Several fixes required
> - Don't log entire tokens; just the identifier should be enough
> - A DAG ID (or another unique number) needs to be used. Otherwise we get only
> one file per session
> - This should not go into the local-directory - since that isn't accessible. 
> Instead the log directory would be a better place.
> Marking as critical for 0.5.1 since this is very useful for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174939#comment-14174939
 ] 

Jeff Zhang commented on TEZ-1584:
-

Committed to master 

> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Closed] (TEZ-1677) Add Jeff Zhang to team list

2014-10-17 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang closed TEZ-1677.
---

> Add Jeff Zhang to team list
> ---
>
> Key: TEZ-1677
> URL: https://issues.apache.org/jira/browse/TEZ-1677
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: Tez-1677.patch
>
>






[jira] [Closed] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang closed TEZ-1584.
---

> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175129#comment-14175129
 ] 

Hitesh Shah commented on TEZ-1584:
--

[~zjffdu] Jiras should not be closed until the version they are committed to 
has been released. Also, should this also go into 0.5 ? 

> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Comment Edited] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175129#comment-14175129
 ] 

Hitesh Shah edited comment on TEZ-1584 at 10/17/14 3:10 PM:


[~zjffdu] Jiras should not be closed until the version they are committed to 
has been released. Only resolve the jira as fixed and set the fix version. 
Also, shouldn't this also go into 0.5 ? 


was (Author: hitesh):
[~zjffdu] Jiras should not be closed until the version they are committed to 
has been released. Also, should this also go into 0.5 ? 

> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Comment Edited] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175134#comment-14175134
 ] 

Jeff Zhang edited comment on TEZ-1584 at 10/17/14 3:16 PM:
---

[~hitesh] It looks like the jira cannot be reopened once it is closed. Next time 
I will close it after it is released. Do you mean it should go into 0.5.2? 0.5 
is already released.


was (Author: zjffdu):
[~hitesh] It looks like the jira cannot be reopened once it is closed. Next time 
I will only close it after it is released. Do you mean it should go into 0.5.2? 
0.5 is already released.

> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175141#comment-14175141
 ] 

Hitesh Shah commented on TEZ-1584:
--

Yes - this jira can no longer be re-opened. Was the patch committed only to 
master and not branch-0.5?

Only it looks like there are a couple of issues that need to be fixed with the 
commit. Depending on what is the final release target of the jira, the 
CHANGES.txt should have been updated. If committed to master only, the 0.6.0 
section should be updated. If committed to master and cherry-picked to 
branch-0.5, the 0.5.2 section should be updated. 

To cherry-pick to a branch, use "git cherry-pick -x". 
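
The backport flow described above could be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, the commit hash is a placeholder rather than the actual TEZ-1584 commit, and `branch-0.5` is taken from the comment. The `-x` option makes git append a "(cherry picked from commit ...)" line to the new commit message, which keeps the link back to the commit on master.

```python
# Sketch of the backport flow: the fix is committed to master first, then
# cherry-picked onto the release branch with -x so that the new commit
# message records the original commit hash.
# Helper names and the placeholder hash are hypothetical.
import subprocess
from typing import List


def backport_commands(commit: str, branch: str = "branch-0.5") -> List[List[str]]:
    """Return the git invocations for cherry-picking `commit` onto `branch`."""
    return [
        ["git", "checkout", branch],
        ["git", "cherry-pick", "-x", commit],
    ]


def run_backport(commit: str, branch: str = "branch-0.5") -> None:
    """Run the commands in the current repository, stopping at the first failure."""
    for cmd in backport_commands(commit, branch):
        subprocess.run(cmd, check=True)


# Show the commands without executing them; call run_backport() to execute.
for cmd in backport_commands("<commit-hash>"):
    print(" ".join(cmd))
```

After the cherry-pick, CHANGES.txt still has to be updated by hand in the section that matches the release target, as noted in the comment.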



> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175134#comment-14175134
 ] 

Jeff Zhang commented on TEZ-1584:
-

[~hitesh] It looks like the jira cannot be reopened once it is closed. Next time 
I will only close it after it is released. Do you mean it should go into 0.5.2? 
0.5 is already released.

> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Comment Edited] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175141#comment-14175141
 ] 

Hitesh Shah edited comment on TEZ-1584 at 10/17/14 3:24 PM:


Yes - this jira can no longer be re-opened. Was the patch committed only to 
master and not branch-0.5?

It looks like there are a couple of issues that need to be fixed with the 
commit. Depending on what is the final release target of the jira, the 
CHANGES.txt should have been updated. If committed to master only, the 0.6.0 
section should be updated. If committed to master and cherry-picked to 
branch-0.5, the 0.5.2 section should be updated. 

To cherry-pick to a branch, use "git cherry-pick -x". 




was (Author: hitesh):
Yes - this jira can no longer be re-opened. Was the patch committed only to 
master and not branch-0.5?

Only it looks like there are a couple of issues that need to be fixed with the 
commit. Depending on what is the final release target of the jira, the 
CHANGES.txt should have been updated. If committed to master only, the 0.6.0 
section should be updated. If committed to master and cherry-picked to 
branch-0.5, the 0.5.2 section should be updated. 

To cherry-pick to a branch, use "git cherry-pick -x". 



> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175145#comment-14175145
 ] 

Hitesh Shah commented on TEZ-1584:
--

In any case, I think this is a bug fix which should be committed to branch-0.5 
so that it can be part of the 0.5.2 release. Please cherry-pick the commit into 
the relevant branch and also update CHANGES.txt for both master and branch-0.5. 

> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175146#comment-14175146
 ] 

Hitesh Shah commented on TEZ-1584:
--

bq.  I will close it after it is released next time

Regarding this, it is the release manager's responsibility to close out all 
jiras fixed in the release they are pushing out. For committers, the general 
guideline is to just mark the jira as fixed/resolved. 

> Restore counters from DAGFinishedEvent when DAG is completed
> 
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG 
> is completed, the recovery data may be incomplete, so we need to recover 
> counters from DAGFinishedEvent 





[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node

2014-10-17 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175262#comment-14175262
 ] 

Siddharth Seth commented on TEZ-1673:
-

Also, the counter update interval from tasks, and the number of events per 
heartbeat.

> Increase the default value of allowed failures per node
> ---
>
> Key: TEZ-1673
> URL: https://issues.apache.org/jira/browse/TEZ-1673
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>
> The current number - 3 is something that was inherited from MapReduce.
> Since Tez is affected more by a node being marked as bad - where retries 
> could be triggered several levels up, I think a higher default value would be 
> better. I'd propose changing this to 10.





[jira] [Updated] (TEZ-1673) Increase the default value of allowed failures per node

2014-10-17 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1673:

Attachment: TEZ-1673.1.txt

Trivial patch to change 3 defaults. Review please.

> Increase the default value of allowed failures per node
> ---
>
> Key: TEZ-1673
> URL: https://issues.apache.org/jira/browse/TEZ-1673
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-1673.1.txt
>
>
> The current number - 3 is something that was inherited from MapReduce.
> Since Tez is affected more by a node being marked as bad - where retries 
> could be triggered several levels up, I think a higher default value would be 
> better. I'd propose changing this to 10.





[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175292#comment-14175292
 ] 

Hitesh Shah commented on TEZ-1673:
--

Looks good except for the no. of heartbeat events change. Are there any events 
that may be large in size that would cause a concern here? 

> Increase the default value of allowed failures per node
> ---
>
> Key: TEZ-1673
> URL: https://issues.apache.org/jira/browse/TEZ-1673
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-1673.1.txt
>
>
> The current number - 3 is something that was inherited from MapReduce.
> Since Tez is affected more by a node being marked as bad - where retries 
> could be triggered several levels up, I think a higher default value would be 
> better. I'd propose changing this to 10.





[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node

2014-10-17 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175301#comment-14175301
 ] 

Siddharth Seth commented on TEZ-1673:
-

I have used this value (and larger ones) on a fairly large cluster with 2500+ 
concurrent tasks without issues. The event size is typically less than 200 
bytes; that's less than 100K per heartbeat with a default of 500 events.
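
As a quick check of the estimate above, the worst-case event payload per heartbeat is the per-event size multiplied by the event limit. The sketch below uses the figures quoted in the comment (about 200 bytes per event, 500 events per heartbeat); the names are illustrative only and are not Tez configuration keys.

```python
# Back-of-the-envelope check of the heartbeat payload estimate.
# Both figures come from the comment above; the names are illustrative only.
EVENT_SIZE_BYTES = 200      # "typically less than 200 bytes" per event
EVENTS_PER_HEARTBEAT = 500  # proposed default number of events per heartbeat


def max_heartbeat_payload(event_size: int, events: int) -> int:
    """Upper bound on the event payload, in bytes, carried by one heartbeat."""
    return event_size * events


payload = max_heartbeat_payload(EVENT_SIZE_BYTES, EVENTS_PER_HEARTBEAT)
print(f"{payload} bytes (~{payload / 1024:.1f} KiB) per heartbeat at most")
```

Since real events are typically smaller than 200 bytes, the actual payload should stay under this bound.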

> Increase the default value of allowed failures per node
> ---
>
> Key: TEZ-1673
> URL: https://issues.apache.org/jira/browse/TEZ-1673
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-1673.1.txt
>
>
> The current number - 3 is something that was inherited from MapReduce.
> Since Tez is affected more by a node being marked as bad - where retries 
> could be triggered several levels up, I think a higher default value would be 
> better. I'd propose changing this to 10.





[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175325#comment-14175325
 ] 

Hitesh Shah commented on TEZ-1673:
--

+1

> Increase the default value of allowed failures per node
> ---
>
> Key: TEZ-1673
> URL: https://issues.apache.org/jira/browse/TEZ-1673
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-1673.1.txt
>
>
> The current number - 3 is something that was inherited from MapReduce.
> Since Tez is affected more by a node being marked as bad - where retries 
> could be triggered several levels up, I think a higher default value would be 
> better. I'd propose changing this to 10.





[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node

2014-10-17 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175351#comment-14175351
 ] 

Siddharth Seth commented on TEZ-1673:
-

Thanks for the review. Committing.

> Increase the default value of allowed failures per node
> ---
>
> Key: TEZ-1673
> URL: https://issues.apache.org/jira/browse/TEZ-1673
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-1673.1.txt
>
>
> The current number - 3 is something that was inherited from MapReduce.
> Since Tez is affected more by a node being marked as bad - where retries 
> could be triggered several levels up, I think a higher default value would be 
> better. I'd propose changing this to 10.





[jira] [Updated] (TEZ-1673) Update the default value for allowed node failures, num events per heartbeat and counter update interval

2014-10-17 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1673:

Summary: Update the default value for allowed node failures, num events per 
heartbeat and counter update interval  (was: Increase the default value of 
allowed failures per node)

> Update the default value for allowed node failures, num events per heartbeat 
> and counter update interval
> 
>
> Key: TEZ-1673
> URL: https://issues.apache.org/jira/browse/TEZ-1673
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-1673.1.txt
>
>
> The current number - 3 is something that was inherited from MapReduce.
> Since Tez is affected more by a node being marked as bad - where retries 
> could be triggered several levels up, I think a higher default value would be 
> better. I'd propose changing this to 10.





[jira] [Updated] (TEZ-1643) DAGAppMaster kills DAG & shuts down, when RM is restarted

2014-10-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1643:
-
Attachment: TEZ-1643.5.patch

Attached patch with test added. 

[~bikassaha] review please. 

> DAGAppMaster kills DAG & shuts down, when RM is restarted
> -
>
> Key: TEZ-1643
> URL: https://issues.apache.org/jira/browse/TEZ-1643
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Hitesh Shah
> Attachments: TEZ-1643.3.patch, TEZ-1643.4.patch, TEZ-1643.5.patch, 
> TEZ-1643.wip.2.patch, TEZ-1643.wip.patch
>
>
> Scenario:
> 1. Start a long running job
> 2. Kill RM (recovery is enabled in RM. No RM-HA configured)
> 3. AMRMClientAsyncImpl$HeartbeatThread throws error (EOFException) which 
> internally causes the appmaster to kill DAG.
> 2014-10-08 02:24:06,705 INFO [IPC Server handler 6 on 55291] 
> org.apache.tez.dag.app.dag.impl.TaskImpl: 
> TaskAttempt:attempt_1412734988643_0001_1_00_00_0 sent events: (0-1)
> 2014-10-08 02:24:12,255 ERROR [AMRM Heartbeater thread] 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Exception 
> on heartbeat
> java.io.IOException: Failed on local exception: java.io.EOFException; Host 
> Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "
> m-tez-uns-try-3":8030;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy27.allocate(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy28.allocate(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,256 INFO [AMRM Callback Handler Thread] 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Interrupted 
> while waiting for queue
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
> 2014-10-08 02:24:12,257 ERROR [AMRM Callback Handler Thread] 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Stopping 
> callback due to:
> java.io.IOException: Failed on local exception: java.io.EOFException; Host 
> Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: 
> "m-tez-uns-try-3":8030;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy27.allocate(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apach

[jira] [Commented] (TEZ-1633) TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>

2014-10-17 Thread Alexander Pivovarov (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175381#comment-14175381
 ] 

Alexander Pivovarov commented on TEZ-1633:
--

+1

> TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>
> --
>
> Key: TEZ-1633
> URL: https://issues.apache.org/jira/browse/TEZ-1633
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Alexander Pivovarov
>Assignee: Alexander Pivovarov
> Attachments: TEZ-1633.1.patch, Tez-1633-2.patch
>
>
> $ mvn clean package
> {code}
> Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.747 sec 
> <<< FAILURE!
> testRecovery_OneTAStarted(org.apache.tez.dag.app.dag.impl.TestTaskRecovery)  
> Time elapsed: 0.051 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.dag.impl.TestTaskRecovery.testRecovery_OneTAStarted(TestTaskRecovery.java:277)
> Running org.apache.tez.dag.app.dag.impl.TestVertexImpl
> {code}





[jira] [Created] (TEZ-1678) [Umbrella] Improve swimlanes tool usability

2014-10-17 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1678:


 Summary: [Umbrella] Improve swimlanes tool usability
 Key: TEZ-1678
 URL: https://issues.apache.org/jira/browse/TEZ-1678
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah








[jira] [Created] (TEZ-1679) yarn-swimlanes is not OS X friendly

2014-10-17 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1679:


 Summary: yarn-swimlanes is not OS X friendly 
 Key: TEZ-1679
 URL: https://issues.apache.org/jira/browse/TEZ-1679
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah


Use of mktemp requires a template for it to work on OS X





[jira] [Created] (TEZ-1680) Better error handling in swimlanes tool

2014-10-17 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1680:


 Summary: Better error handling in swimlanes tool 
 Key: TEZ-1680
 URL: https://issues.apache.org/jira/browse/TEZ-1680
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah


If the yarn command is not found on the PATH, the script fails silently. 
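
One way to avoid the silent failure is to check for the command up front and report a clear error. The sketch below is in Python for illustration only; the actual yarn-swimlanes tool is a shell script, and the function names here are hypothetical.

```python
# Sketch: fail loudly when a required external command is missing.
# The function names are hypothetical; only the idea carries over to the
# shell script.
import shutil
import sys


def require_command(name: str) -> str:
    """Return the full path of `name` as found on PATH, or raise if absent."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"required command '{name}' not found on PATH")
    return path


def main(command: str = "yarn") -> int:
    """Return 0 if the command is available, 1 (with a message) otherwise."""
    try:
        print("using", require_command(command))
    except FileNotFoundError as err:
        # Report the problem on stderr instead of failing silently.
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0
```

A script would typically end with `sys.exit(main())` so that the non-zero status reaches the caller.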





[jira] [Created] (TEZ-1681) Script should be robust enough to be called from outside of its location dir

2014-10-17 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1681:


 Summary: Script should be robust enough to be called from outside 
of its location dir 
 Key: TEZ-1681
 URL: https://issues.apache.org/jira/browse/TEZ-1681
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah


The script does not check its actual physical location. It assumes the other 
helper Python scripts are in the current working directory. 
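
A common way to make this robust is to resolve helper paths against the script's own location instead of the working directory. The following is a minimal Python sketch of that idea; the function names are hypothetical, and the wrapper script itself is a shell script that would need the equivalent logic.

```python
# Sketch: locate helper scripts relative to the script's real location
# rather than the caller's current working directory.
# Function names are hypothetical.
import os


def script_dir(script_path: str) -> str:
    """Directory that contains the script, with symlinks resolved."""
    return os.path.dirname(os.path.realpath(script_path))


def helper_path(script_path: str, helper_name: str) -> str:
    """Absolute path of a helper that lives next to the script."""
    return os.path.join(script_dir(script_path), helper_name)
```

Inside a script one would pass `__file__`, for example `helper_path(__file__, "amlogparser.py")`, so the result does not depend on where the script is invoked from.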





[jira] [Updated] (TEZ-1678) [Umbrella] Improve swimlanes tool usability

2014-10-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1678:
-
Priority: Minor  (was: Major)

> [Umbrella] Improve swimlanes tool usability
> ---
>
> Key: TEZ-1678
> URL: https://issues.apache.org/jira/browse/TEZ-1678
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Priority: Minor
>






[jira] [Updated] (TEZ-1633) TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>

2014-10-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1633:
-
Attachment: TEZ-1632.2.rebased.patch

Patch looks fine. Rebased and committing shortly. 

> TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>
> --
>
> Key: TEZ-1633
> URL: https://issues.apache.org/jira/browse/TEZ-1633
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Alexander Pivovarov
>Assignee: Alexander Pivovarov
> Attachments: TEZ-1632.2.rebased.patch, TEZ-1633.1.patch, 
> Tez-1633-2.patch
>
>
> $ mvn clean package
> {code}
> Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.747 sec 
> <<< FAILURE!
> testRecovery_OneTAStarted(org.apache.tez.dag.app.dag.impl.TestTaskRecovery)  
> Time elapsed: 0.051 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.dag.impl.TestTaskRecovery.testRecovery_OneTAStarted(TestTaskRecovery.java:277)
> Running org.apache.tez.dag.app.dag.impl.TestVertexImpl
> {code}





[jira] [Updated] (TEZ-1633) Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted

2014-10-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1633:
-
Summary: Fixed expected values in 
TestTaskRecovery.testRecovery_OneTAStarted  (was: 
TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>)

> Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted
> ---
>
> Key: TEZ-1633
> URL: https://issues.apache.org/jira/browse/TEZ-1633
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Alexander Pivovarov
>Assignee: Alexander Pivovarov
> Attachments: TEZ-1632.2.rebased.patch, TEZ-1633.1.patch, 
> Tez-1633-2.patch
>
>
> $ mvn clean package
> {code}
> Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.747 sec <<< FAILURE!
> testRecovery_OneTAStarted(org.apache.tez.dag.app.dag.impl.TestTaskRecovery)  Time elapsed: 0.051 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at org.apache.tez.dag.app.dag.impl.TestTaskRecovery.testRecovery_OneTAStarted(TestTaskRecovery.java:277)
> Running org.apache.tez.dag.app.dag.impl.TestVertexImpl
> {code}





[jira] [Commented] (TEZ-1669) yarn-swimlanes.sh throws error

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175421#comment-14175421
 ] 

Hitesh Shah commented on TEZ-1669:
--

+1. Works fine after the patch with latest code. 

> yarn-swimlanes.sh throws error
> --
>
> Key: TEZ-1669
> URL: https://issues.apache.org/jira/browse/TEZ-1669
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Critical
> Attachments: TEZ-1669.1.patch
>
>
> Traceback (most recent call last):
>   File "swimlane.py", line 201, in <module>
>     sys.exit(main(sys.argv[1:]))
>   File "swimlane.py", line 121, in main
>     log = AMLog(args[0]).structure()
>   File "/yyy/tez-autobuild/tez/tez-tools/swimlanes/amlogparser.py", line 185, in __init__
>     self.events = filter(lambda a:a, [self.parse(l.strip()) for l in fp])
>   File "/yyy/tez-autobuild/tez/tez-tools/swimlanes/amlogparser.py", line 246, in parse
>     ts = m.group("ts")
> AttributeError: 'NoneType' object has no attribute 'group'
> Not sure if it has got anything to do with the recent logging changes 
> introduced in TEZ-1566 (which trims the package name to just 2 levels).
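
An `AttributeError` of this kind typically means the regular-expression match came back as `None`, that is, a log line no longer fits the pattern the parser expects. The sketch below is not the actual amlogparser.py code or the TEZ-1669 patch; the pattern `LINE_RE`, the field names other than `ts`, and the helper `parse_all` are assumptions made for illustration. It shows one defensive approach: return `None` for lines that do not match and let the caller drop them, which fits the existing `filter(lambda a:a, ...)` call in the traceback.

```python
import re

# Hypothetical log-line layout: "<timestamp> <LEVEL> [<thread>] <logger>: <message>".
# The real pattern used by amlogparser.py may differ.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<logger>[\w.$]+): "
    r"(?P<msg>.*)$"
)


def parse(line):
    """Return a dict of fields, or None when the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        # Without this guard, m.group("ts") raises AttributeError on None.
        return None
    return m.groupdict()


def parse_all(lines):
    """Parse every line and keep only the ones that matched."""
    parsed = (parse(line.strip()) for line in lines)
    return [event for event in parsed if event is not None]
```

Skipping unmatched lines keeps the tool running, but it can also hide a format change; logging or counting the skipped lines would make a mismatch like the one suspected here easier to spot.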





[jira] [Updated] (TEZ-1669) yarn-swimlanes.sh throws error post TEZ-1556

2014-10-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1669:
-
Summary: yarn-swimlanes.sh throws error post TEZ-1556  (was: 
yarn-swimlanes.sh throws error)

> yarn-swimlanes.sh throws error post TEZ-1556
> 
>
> Key: TEZ-1669
> URL: https://issues.apache.org/jira/browse/TEZ-1669
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Critical
> Attachments: TEZ-1669.1.patch
>
>
> Traceback (most recent call last):
>   File "swimlane.py", line 201, in <module>
>     sys.exit(main(sys.argv[1:]))
>   File "swimlane.py", line 121, in main
>     log = AMLog(args[0]).structure()
>   File "/yyy/tez-autobuild/tez/tez-tools/swimlanes/amlogparser.py", line 185, in __init__
>     self.events = filter(lambda a:a, [self.parse(l.strip()) for l in fp])
>   File "/yyy/tez-autobuild/tez/tez-tools/swimlanes/amlogparser.py", line 246, in parse
>     ts = m.group("ts")
> AttributeError: 'NoneType' object has no attribute 'group'
> Not sure if it has got anything to do with the recent logging changes 
> introduced in TEZ-1566 (which trims the package name to just 2 levels).





[jira] [Updated] (TEZ-1633) Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted

2014-10-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1633:
-
Attachment: TEZ-1633.addendum.patch

Fix for messed up rebase. 

> Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted
> ---
>
> Key: TEZ-1633
> URL: https://issues.apache.org/jira/browse/TEZ-1633
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Alexander Pivovarov
>Assignee: Alexander Pivovarov
> Fix For: 0.5.2
>
> Attachments: TEZ-1632.2.rebased.patch, TEZ-1633.1.patch, 
> TEZ-1633.addendum.patch, Tez-1633-2.patch
>
>
> $ mvn clean package
> {code}
> Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.747 sec <<< FAILURE!
> testRecovery_OneTAStarted(org.apache.tez.dag.app.dag.impl.TestTaskRecovery)  Time elapsed: 0.051 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at org.apache.tez.dag.app.dag.impl.TestTaskRecovery.testRecovery_OneTAStarted(TestTaskRecovery.java:277)
> Running org.apache.tez.dag.app.dag.impl.TestVertexImpl
> {code}





[jira] [Updated] (TEZ-1525) BroadcastLoadGen testcase

2014-10-17 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated TEZ-1525:
-
Attachment: TEZ-1525.2.patch

Rebase after TEZ-1479

> BroadcastLoadGen testcase
> -
>
> Key: TEZ-1525
> URL: https://issues.apache.org/jira/browse/TEZ-1525
> Project: Apache Tez
>  Issue Type: Test
>Affects Versions: 0.6.0
>Reporter: Gopal V
>Assignee: Gopal V
> Attachments: TEZ-1525.1.patch, TEZ-1525.2.patch
>
>
> Broadcast load generator test example





[jira] [Created] (TEZ-1682) Tez AM hangs at times when there are task failures

2014-10-17 Thread Siddharth Seth (JIRA)
Siddharth Seth created TEZ-1682:
---

 Summary: Tez AM hangs at times when there are task failures
 Key: TEZ-1682
 URL: https://issues.apache.org/jira/browse/TEZ-1682
 Project: Apache Tez
  Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Priority: Blocker


Reported by [~karams]. 

The Task does not move into its final state, and effectively does not send the 
relevant events to the Vertex.
This happens when there are multiple attempts for the task - caused by a node 
failure, for instance.





[jira] [Updated] (TEZ-1682) Tez AM hangs at times when there are task failures

2014-10-17 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1682:

Attachment: TEZ-1682.1.txt

Fairly straightforward patch. 
task.taskAttemptStatus.clear() on a KillRequest seems incorrect - since 
taskAttemptStatus is used to keep track of completed events.

Added a test to verify the Task state change.

[~hitesh], [~zjffdu] - please review - keeping in mind that multiple Finished 
events should not be generated.
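
One way such a hang could arise is sketched below as a toy model. This is not Tez code and not the attached patch: `ToyTask`, its state names, and its methods are hypothetical, and the real Task state machine is written in Java and is considerably more involved. The model assumes the task reaches a final state only once every launched attempt has a recorded completion, so clearing that record on a kill request discards completions that will never be reported again.

```python
class ToyTask:
    """Toy stand-in for a task that tracks attempt completion (hypothetical)."""

    def __init__(self, clear_on_kill):
        self.clear_on_kill = clear_on_kill
        self.launched = []        # attempt ids started so far
        self.attempt_status = {}  # attempt id -> True once completion was seen
        self.state = "RUNNING"

    def launch_attempt(self):
        attempt_id = len(self.launched)
        self.launched.append(attempt_id)
        return attempt_id

    def on_attempt_completed(self, attempt_id):
        self.attempt_status[attempt_id] = True
        self._maybe_finish()

    def on_kill_request(self):
        if self.clear_on_kill:
            # The suspect step: forgets attempts that already completed.
            self.attempt_status.clear()
        self.state = "KILL_WAIT"

    def _maybe_finish(self):
        all_done = all(self.attempt_status.get(a) for a in self.launched)
        if self.state == "KILL_WAIT" and all_done:
            self.state = "KILLED"


def run_scenario(clear_on_kill):
    task = ToyTask(clear_on_kill)
    first = task.launch_attempt()
    task.on_attempt_completed(first)   # e.g. first attempt lost to a node failure
    second = task.launch_attempt()     # retry attempt
    task.on_kill_request()
    task.on_attempt_completed(second)  # the retry eventually reports back
    return task.state
```

In this model, `run_scenario(True)` leaves the task stuck in `KILL_WAIT`, because the first attempt's completion was erased and is never reported a second time, while `run_scenario(False)` reaches `KILLED`.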

> Tez AM hangs at times when there are task failures
> --
>
> Key: TEZ-1682
> URL: https://issues.apache.org/jira/browse/TEZ-1682
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Blocker
> Attachments: TEZ-1682.1.txt
>
>
> Reported by [~karams]. 
> The Task does not move into its final state, and effectively does not send 
> the relevant events to the Vertex.
> This happens when there are multiple attempts for the task - caused by a node 
> failure, for instance.





[jira] [Commented] (TEZ-1682) Tez AM hangs at times when there are task failures

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175722#comment-14175722
 ] 

Hitesh Shah commented on TEZ-1682:
--

[~sseth]

Is the "+  taskAttemptStatus.put(attempt.getID().getId(), true);" line needed 
in killUnfinishedAttempt()? Adding it would imply that the attempt has 
completed even though it has not. 

> Tez AM hangs at times when there are task failures
> --
>
> Key: TEZ-1682
> URL: https://issues.apache.org/jira/browse/TEZ-1682
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Blocker
> Attachments: TEZ-1682.1.txt
>
>
> Reported by [~karams]. 
> The Task does not move into its final state, and effectively does not send 
> the relevant events to the Vertex.
> This happens when there are multiple attempts for the task - caused by a node 
> failure, for instance.





[jira] [Commented] (TEZ-1682) Tez AM hangs at times when there are task failures

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175725#comment-14175725
 ] 

Hitesh Shah commented on TEZ-1682:
--

I am guessing that the fix should just be to remove the clear() in the kill 
transition. 

> Tez AM hangs at times when there are task failures
> --
>
> Key: TEZ-1682
> URL: https://issues.apache.org/jira/browse/TEZ-1682
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Blocker
> Attachments: TEZ-1682.1.txt
>
>
> Reported by [~karams]. 
> The Task does not move into its final state, and effectively does not send 
> the relevant events to the Vertex.
> This happens when there are multiple attempts for the task - caused by a node 
> failure, for instance.





[jira] [Commented] (TEZ-1682) Tez AM hangs at times when there are task failures

2014-10-17 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175732#comment-14175732
 ] 

Hitesh Shah commented on TEZ-1682:
--

Good catch on the invalid clear(). My mistake for not catching it in the review 
of the original change. The patch looks good to commit once the previous 
comments are addressed. 

> Tez AM hangs at times when there are task failures
> --
>
> Key: TEZ-1682
> URL: https://issues.apache.org/jira/browse/TEZ-1682
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Blocker
> Attachments: TEZ-1682.1.txt
>
>
> Reported by [~karams]. 
> The Task does not move into its final state, and effectively does not send 
> the relevant events to the Vertex.
> This happens when there are multiple attempts for the task - caused by a node 
> failure, for instance.





[jira] [Commented] (TEZ-1682) Tez AM hangs at times when there are task failures

2014-10-17 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175756#comment-14175756
 ] 

Siddharth Seth commented on TEZ-1682:
-

bq. is the "+ taskAttemptStatus.put(attempt.getID().getId(), true);" needed in 
killUnfinishedAttempt()?
Removing it in the next patch, since it's handled when the TaskAttempt 
eventually reports back. Good catch. Uploading another patch and committing. 
Thanks for the review.
TaskAttempt itself deals with "FINISHING" vs "FINISHED" states a little 
differently, where all events are sent out when entering a FINISHING state 
instead of when reaching FINISHED. That should take care of terminating the DAG 
fast.

> Tez AM hangs at times when there are task failures
> --
>
> Key: TEZ-1682
> URL: https://issues.apache.org/jira/browse/TEZ-1682
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Blocker
> Attachments: TEZ-1682.1.txt
>
>
> Reported by [~karams]. 
> The Task does not move into its final state, and effectively does not send 
> the relevant events to the Vertex.
> This happens when there are multiple attempts for the task - caused by a node 
> failure, for instance.





[jira] [Updated] (TEZ-1682) Tez AM hangs at times when there are task failures

2014-10-17 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1682:

Attachment: TEZ-1682.2.txt

Updated patch addressing review comments.

> Tez AM hangs at times when there are task failures
> --
>
> Key: TEZ-1682
> URL: https://issues.apache.org/jira/browse/TEZ-1682
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Blocker
> Attachments: TEZ-1682.1.txt, TEZ-1682.2.txt
>
>
> Reported by [~karams]. 
> The Task does not move into its final state, and effectively does not send 
> the relevant events to the Vertex.
> This happens when there are multiple attempts for the task - caused by a node 
> failure, for instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1682) Tez AM hangs at times when there are task failures

2014-10-17 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1682:

Affects Version/s: 0.5.2

> Tez AM hangs at times when there are task failures
> --
>
> Key: TEZ-1682
> URL: https://issues.apache.org/jira/browse/TEZ-1682
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Blocker
> Fix For: 0.5.2
>
> Attachments: TEZ-1682.1.txt, TEZ-1682.2.txt
>
>
> Reported by [~karams]. 
> The Task does not move into its final state, and effectively does not send 
> the relevant events to the Vertex.
> This happens when there are multiple attempts for the task - caused by a node 
> failure, for instance.





[jira] [Created] (TEZ-1683) Do ugi::getGroups only when necessary when checking ACLs

2014-10-17 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1683:


 Summary: Do ugi::getGroups only when necessary when checking ACLs 
 Key: TEZ-1683
 URL: https://issues.apache.org/jira/browse/TEZ-1683
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah








[jira] [Updated] (TEZ-1141) DAGStatus.Progress should include number of failed attempts

2014-10-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1141:
-
Attachment: TEZ-1141.1.patch

[~gopalv] [~sseth] mind doing a review? 

> DAGStatus.Progress should include number of failed attempts
> ---
>
> Key: TEZ-1141
> URL: https://issues.apache.org/jira/browse/TEZ-1141
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.5.0
>Reporter: Bikas Saha
>Assignee: Hitesh Shah
> Attachments: TEZ-1141.1.patch
>
>
> Currently it's impossible to know whether a job is seeing a lot of issues and 
> failures because we only report running tasks. Eventually the job fails but 
> before that we have no indication that a bunch of task failures have been 
> happening.





[jira] [Updated] (TEZ-1683) Do ugi::getGroups only when necessary when checking ACLs

2014-10-17 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1683:
-
Attachment: TEZ-1683.1.patch

[~gopalv] [~rajesh.balamohan] mind reviewing the patch to see if it reduces 
the perf issues with getGroups calls? 

> Do ugi::getGroups only when necessary when checking ACLs 
> -
>
> Key: TEZ-1683
> URL: https://issues.apache.org/jira/browse/TEZ-1683
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: TEZ-1683.1.patch
>
>






[jira] [Resolved] (TEZ-1344) Combiner counters reported by Tez look wrong

2014-10-17 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov resolved TEZ-1344.
--
Resolution: Cannot Reproduce

> Combiner counters reported by Tez look wrong
> 
>
> Key: TEZ-1344
> URL: https://issues.apache.org/jira/browse/TEZ-1344
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Priority: Minor
>
> Combiner input/output counters reported by a Tez job seem wrong:
> {code}
> org.apache.hadoop.mapreduce.TaskCounter:
> COMBINE_OUTPUT_RECORDS 35,977,263,353
> COMBINE_INPUT_RECORDS 1,000,529,333
> {code}
> As can be seen, combiner output records > input records?!
> The same counters from an MR job look as follows:
> {code}
> Map-Reduce Framework:
> Combine output records 1,000,316,600
> Combine input records 35,977,049,632
> {code}
> Somehow input and output are swapped?





[jira] [Closed] (TEZ-1344) Combiner counters reported by Tez look wrong

2014-10-17 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov closed TEZ-1344.


> Combiner counters reported by Tez look wrong
> 
>
> Key: TEZ-1344
> URL: https://issues.apache.org/jira/browse/TEZ-1344
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Priority: Minor
>
> Combiner input/output counters reported by a Tez job seem wrong:
> {code}
> org.apache.hadoop.mapreduce.TaskCounter:
> COMBINE_OUTPUT_RECORDS 35,977,263,353
> COMBINE_INPUT_RECORDS 1,000,529,333
> {code}
> As can be seen, combiner output records > input records?!
> The same counters from an MR job look as follows:
> {code}
> Map-Reduce Framework:
> Combine output records 1,000,316,600
> Combine input records 35,977,049,632
> {code}
> Somehow input and output are swapped?





[jira] [Commented] (TEZ-1683) Do ugi::getGroups only when necessary when checking ACLs

2014-10-17 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175865#comment-14175865
 ] 

Gopal V commented on TEZ-1683:
--

+1 - confirmed that this does not trigger the shell fork for getGroups.
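
The general idea of deferring an expensive group lookup can be sketched as follows. This is an illustrative model only, not the TEZ-1683 patch or Tez's ACL code: `check_access`, `CountingLookup`, and the ACL shapes (a set of allowed users and a set of allowed groups) are assumptions. Groups are resolved only when the user-level check fails and there is at least one group ACL to compare against.

```python
def check_access(user, allowed_users, allowed_groups, lookup_groups):
    """Return True if `user` passes a user/group ACL check.

    `lookup_groups` is a callable standing in for an expensive group
    resolution; it is invoked only when the cheaper checks cannot decide.
    """
    if user in allowed_users:
        return True   # decided without resolving groups
    if not allowed_groups:
        return False  # no group ACLs, so a lookup could not help
    return any(group in allowed_groups for group in lookup_groups(user))


class CountingLookup:
    """Hypothetical group resolver that counts how often it is called."""

    def __init__(self, groups_by_user):
        self.groups_by_user = groups_by_user
        self.calls = 0

    def __call__(self, user):
        self.calls += 1
        return self.groups_by_user.get(user, [])
```

With `lookup = CountingLookup({"bob": ["analysts"]})`, a call such as `check_access("alice", {"alice"}, {"analysts"}, lookup)` returns without touching `lookup`, so `lookup.calls` stays at 0; only users who are not listed directly trigger a group resolution.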

> Do ugi::getGroups only when necessary when checking ACLs 
> -
>
> Key: TEZ-1683
> URL: https://issues.apache.org/jira/browse/TEZ-1683
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: TEZ-1683.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1141) DAGStatus.Progress should include number of failed attempts

2014-10-17 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175874#comment-14175874
 ] 

Gopal V commented on TEZ-1141:
--

LGTM. I found that this doesn't track NM blacklisting, but that is a completely 
different problem.

I've updated the patch on HIVE-7838 to use this, and it is useful for narrowing 
down query failures (particularly reducer OOMs).

> DAGStatus.Progress should include number of failed attempts
> ---
>
> Key: TEZ-1141
> URL: https://issues.apache.org/jira/browse/TEZ-1141
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.5.0
>Reporter: Bikas Saha
>Assignee: Hitesh Shah
> Attachments: TEZ-1141.1.patch
>
>
> Currently it's impossible to know whether a job is seeing a lot of issues and 
> failures because we only report running tasks. Eventually the job fails but 
> before that we have no indication that a bunch of task failures have been 
> happening.


