[jira] [Updated] (TEZ-1424) Fixes to DAG text representation in debug mode
[ https://issues.apache.org/jira/browse/TEZ-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1424:
--
    Attachment: TEZ-1424.1.patch

[~sseth] - Can you please review?

> Fixes to DAG text representation in debug mode
> --
>
> Key: TEZ-1424
> URL: https://issues.apache.org/jira/browse/TEZ-1424
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Siddharth Seth
> Priority: Critical
> Attachments: TEZ-1424.1.patch
>
>
> Several fixes required
> - Don't log entire tokens, just the identifier should be enough
> - DAG ID (or unique number) needs to be used. Otherwise we get only one file per session
> - This should not go into the local-directory - since that isn't accessible. Instead the log directory would be a better place.
> Marking as critical for 0.5.1 since this is very useful for debugging.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174939#comment-14174939 ] Jeff Zhang commented on TEZ-1584: - Committed to master > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TEZ-1677) Add Jeff Zhang to team list
[ https://issues.apache.org/jira/browse/TEZ-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang closed TEZ-1677. --- > Add Jeff Zhang to team list > --- > > Key: TEZ-1677 > URL: https://issues.apache.org/jira/browse/TEZ-1677 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: Tez-1677.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang closed TEZ-1584. --- > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175129#comment-14175129 ] Hitesh Shah commented on TEZ-1584: -- [~zjffdu] Jiras should not be closed until the version they are committed to has been released. Also, should this also go into 0.5 ? > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175129#comment-14175129 ] Hitesh Shah edited comment on TEZ-1584 at 10/17/14 3:10 PM: [~zjffdu] Jiras should not be closed until the version they are committed to has been released. Only resolve the jira as fixed and set the fix version. Also, shouldn't this also go into 0.5 ? was (Author: hitesh): [~zjffdu] Jiras should not be closed until the version they are committed to has been released. Also, should this also go into 0.5 ? > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175134#comment-14175134 ] Jeff Zhang edited comment on TEZ-1584 at 10/17/14 3:16 PM: --- [~hitesh] Looks like can not open it once closed, I will close it after it is released next time. Do you mean go into 0.5.2 , 0.5 is already released ? was (Author: zjffdu): [~hitesh] Looks like can not open it once closed, I will only close it after it is released next time. Do you mean go into 0.5.2 , 0.5 is already released ? > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175141#comment-14175141 ] Hitesh Shah commented on TEZ-1584: -- Yes - this jira can no longer be re-opened. Was the patch committed only to master and not branch-0.5? Only it looks like there are a couple of issues that need to be fixed with the commit. Depending on what is the final release target of the jira, the CHANGES.txt should have been updated. If committed to master only, the 0.6.0 section should be updated. If committed to master and cherry-picked to branch-0.5, the 0.5.2 section should be updated. To cherry-pick to a branch, use "git cherry-pick -x". > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175134#comment-14175134 ] Jeff Zhang commented on TEZ-1584: - [~hitesh] Looks like can not open it once closed, I will only close it after it is released next time. Do you mean go into 0.5.2 , 0.5 is already released ? > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175141#comment-14175141 ] Hitesh Shah edited comment on TEZ-1584 at 10/17/14 3:24 PM: Yes - this jira can no longer be re-opened. Was the patch committed only to master and not branch-0.5? It looks like there are a couple of issues that need to be fixed with the commit. Depending on what is the final release target of the jira, the CHANGES.txt should have been updated. If committed to master only, the 0.6.0 section should be updated. If committed to master and cherry-picked to branch-0.5, the 0.5.2 section should be updated. To cherry-pick to a branch, use "git cherry-pick -x". was (Author: hitesh): Yes - this jira can no longer be re-opened. Was the patch committed only to master and not branch-0.5? Only it looks like there are a couple of issues that need to be fixed with the commit. Depending on what is the final release target of the jira, the CHANGES.txt should have been updated. If committed to master only, the 0.6.0 section should be updated. If committed to master and cherry-picked to branch-0.5, the 0.5.2 section should be updated. To cherry-pick to a branch, use "git cherry-pick -x". > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175145#comment-14175145 ] Hitesh Shah commented on TEZ-1584: -- In any case, I think this is a bug fix which should be committed to branch-0.5 so that it can be part of the 0.5.2 release. Please cherry-pick the commit into the relevant branch and also update CHANGES.txt for both master and branch-0.5. > Restore counters from DAGFinishedEvent when DAG is completed > > > Key: TEZ-1584 > URL: https://issues.apache.org/jira/browse/TEZ-1584 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1584.patch, Tez-1584-2.patch > > > Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG > is completed, the recovery data may be incomplete, so we need to recover > counters from DAGFinishedEvent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1584) Restore counters from DAGFinishedEvent when DAG is completed
[ https://issues.apache.org/jira/browse/TEZ-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175146#comment-14175146 ]

Hitesh Shah commented on TEZ-1584:
--

bq. I will close it after it is released next time

Regarding this, it is the release manager's responsibility to close out all jiras fixed in the release they are pushing out. For committers, the general guideline is to just mark the jira as fixed/resolved.

> Restore counters from DAGFinishedEvent when DAG is completed
>
>
> Key: TEZ-1584
> URL: https://issues.apache.org/jira/browse/TEZ-1584
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-1584.patch, Tez-1584-2.patch
>
>
> Follow up [TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], when DAG is completed, the recovery data may be incomplete, so we need to recover counters from DAGFinishedEvent
[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node
[ https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175262#comment-14175262 ] Siddharth Seth commented on TEZ-1673: - Also, the counter update interval from tasks, and the number of events per heartbeat. > Increase the default value of allowed failures per node > --- > > Key: TEZ-1673 > URL: https://issues.apache.org/jira/browse/TEZ-1673 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > > The current number - 3 is something that was inherited from MapReduce. > Since Tez is affected more by a node being marked as bad - where retries > could be triggered several levels up, I think a higher default value would be > better. I'd propose changing this to 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1673) Increase the default value of allowed failures per node
[ https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1673: Attachment: TEZ-1673.1.txt Trivial patch to change 3 defaults. Review please. > Increase the default value of allowed failures per node > --- > > Key: TEZ-1673 > URL: https://issues.apache.org/jira/browse/TEZ-1673 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-1673.1.txt > > > The current number - 3 is something that was inherited from MapReduce. > Since Tez is affected more by a node being marked as bad - where retries > could be triggered several levels up, I think a higher default value would be > better. I'd propose changing this to 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node
[ https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175292#comment-14175292 ] Hitesh Shah commented on TEZ-1673: -- Looks good except for the no. of heartbeat events change. Are there any events that may be large in size that would cause a concern here? > Increase the default value of allowed failures per node > --- > > Key: TEZ-1673 > URL: https://issues.apache.org/jira/browse/TEZ-1673 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-1673.1.txt > > > The current number - 3 is something that was inherited from MapReduce. > Since Tez is affected more by a node being marked as bad - where retries > could be triggered several levels up, I think a higher default value would be > better. I'd propose changing this to 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node
[ https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175301#comment-14175301 ] Siddharth Seth commented on TEZ-1673: - Have used this value (larger ones) on a fairly large cluster with 2500+ concurrent tasks without issues. The event size is typically less than 200 bytes; that's less than 100K with a default of 500. > Increase the default value of allowed failures per node > --- > > Key: TEZ-1673 > URL: https://issues.apache.org/jira/browse/TEZ-1673 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-1673.1.txt > > > The current number - 3 is something that was inherited from MapReduce. > Since Tez is affected more by a node being marked as bad - where retries > could be triggered several levels up, I think a higher default value would be > better. I'd propose changing this to 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
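The size estimate in the comment above can be checked with a quick calculation. This is only a sketch: the constant and function names are illustrative, and the 200-byte figure is the upper bound quoted in the comment, not a measured value.

```python
# Worst-case event payload for a single heartbeat, using the figures
# quoted in the comment (both values are taken from that comment).
MAX_EVENT_SIZE_BYTES = 200   # "typically less than 200 bytes" per event
EVENTS_PER_HEARTBEAT = 500   # default number of events per heartbeat


def max_heartbeat_payload_bytes(event_size: int, events: int) -> int:
    """Return an upper bound on the event bytes carried by one heartbeat."""
    return event_size * events


payload = max_heartbeat_payload_bytes(MAX_EVENT_SIZE_BYTES, EVENTS_PER_HEARTBEAT)
print(payload)  # 100000
```

Since each event is stated to be under 200 bytes, the total stays under this bound, which matches the "less than 100K" figure in the comment.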
[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node
[ https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175325#comment-14175325 ] Hitesh Shah commented on TEZ-1673: -- +1 > Increase the default value of allowed failures per node > --- > > Key: TEZ-1673 > URL: https://issues.apache.org/jira/browse/TEZ-1673 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-1673.1.txt > > > The current number - 3 is something that was inherited from MapReduce. > Since Tez is affected more by a node being marked as bad - where retries > could be triggered several levels up, I think a higher default value would be > better. I'd propose changing this to 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1673) Increase the default value of allowed failures per node
[ https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175351#comment-14175351 ] Siddharth Seth commented on TEZ-1673: - Thanks for the review. Committing. > Increase the default value of allowed failures per node > --- > > Key: TEZ-1673 > URL: https://issues.apache.org/jira/browse/TEZ-1673 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-1673.1.txt > > > The current number - 3 is something that was inherited from MapReduce. > Since Tez is affected more by a node being marked as bad - where retries > could be triggered several levels up, I think a higher default value would be > better. I'd propose changing this to 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1673) Update the default value for allowed node failures, num events per heartbeat and counter update interval
[ https://issues.apache.org/jira/browse/TEZ-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1673: Summary: Update the default value for allowed node failures, num events per heartbeat and counter update interval (was: Increase the default value of allowed failures per node) > Update the default value for allowed node failures, num events per heartbeat > and counter update interval > > > Key: TEZ-1673 > URL: https://issues.apache.org/jira/browse/TEZ-1673 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-1673.1.txt > > > The current number - 3 is something that was inherited from MapReduce. > Since Tez is affected more by a node being marked as bad - where retries > could be triggered several levels up, I think a higher default value would be > better. I'd propose changing this to 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1643) DAGAppMaster kills DAG & shuts down, when RM is restarted
[ https://issues.apache.org/jira/browse/TEZ-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated TEZ-1643:
-
    Attachment: TEZ-1643.5.patch

Attached patch with test added. [~bikassaha] review please.

> DAGAppMaster kills DAG & shuts down, when RM is restarted
> -
>
> Key: TEZ-1643
> URL: https://issues.apache.org/jira/browse/TEZ-1643
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Hitesh Shah
> Attachments: TEZ-1643.3.patch, TEZ-1643.4.patch, TEZ-1643.5.patch, TEZ-1643.wip.2.patch, TEZ-1643.wip.patch
>
>
> Scenario:
> 1. Start a long running job
> 2. Kill RM (recovery is enabled in RM. No RM-HA configured)
> 3. AMRMClientAsyncImpl$HeartbeatThread throws error (EOFException) which internally causes the appmaster to kill DAG.
> 2014-10-08 02:24:06,705 INFO [IPC Server handler 6 on 55291] org.apache.tez.dag.app.dag.impl.TaskImpl: TaskAttempt:attempt_1412734988643_0001_1_00_00_0 sent events: (0-1)
> 2014-10-08 02:24:12,255 ERROR [AMRM Heartbeater thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Exception on heartbeat
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "m-tez-uns-try-3":8030;
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy27.allocate(Unknown Source)
>   at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy28.allocate(Unknown Source)
>   at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>   at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>   at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,256 INFO [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
> java.lang.InterruptedException
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
>   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
> 2014-10-08 02:24:12,257 ERROR [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Stopping callback due to:
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "m-tez-uns-try-3":8030;
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy27.allocate(Unknown Source)
>   at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at org.apach
[jira] [Commented] (TEZ-1633) TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>
[ https://issues.apache.org/jira/browse/TEZ-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175381#comment-14175381 ]

Alexander Pivovarov commented on TEZ-1633:
--

+1

> TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>
> --
>
> Key: TEZ-1633
> URL: https://issues.apache.org/jira/browse/TEZ-1633
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Alexander Pivovarov
> Assignee: Alexander Pivovarov
> Attachments: TEZ-1633.1.patch, Tez-1633-2.patch
>
>
> $ mvn clean package
> {code}
> Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.747 sec <<< FAILURE!
> testRecovery_OneTAStarted(org.apache.tez.dag.app.dag.impl.TestTaskRecovery) Time elapsed: 0.051 sec <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at org.apache.tez.dag.app.dag.impl.TestTaskRecovery.testRecovery_OneTAStarted(TestTaskRecovery.java:277)
> Running org.apache.tez.dag.app.dag.impl.TestVertexImpl
> {code}
[jira] [Created] (TEZ-1678) [Umbrella] Improve swimlanes tool usability
Hitesh Shah created TEZ-1678:

    Summary: [Umbrella] Improve swimlanes tool usability
    Key: TEZ-1678
    URL: https://issues.apache.org/jira/browse/TEZ-1678
    Project: Apache Tez
    Issue Type: Bug
    Reporter: Hitesh Shah
[jira] [Created] (TEZ-1679) yarn-swimlanes is not OS X friendly
Hitesh Shah created TEZ-1679:

    Summary: yarn-swimlanes is not OS X friendly
    Key: TEZ-1679
    URL: https://issues.apache.org/jira/browse/TEZ-1679
    Project: Apache Tez
    Issue Type: Sub-task
    Reporter: Hitesh Shah

Use of mktemp requires a template for it to work on OS X
[jira] [Created] (TEZ-1680) Better error handling in swimlanes tool
Hitesh Shah created TEZ-1680:

    Summary: Better error handling in swimlanes tool
    Key: TEZ-1680
    URL: https://issues.apache.org/jira/browse/TEZ-1680
    Project: Apache Tez
    Issue Type: Sub-task
    Reporter: Hitesh Shah

If yarn command is not found on classpath, the script silently fails.
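One way to fail loudly rather than silently is to check for the command up front. The sketch below is written in Python for illustration only (the actual tool is a shell script), assumes "classpath" in the report means the executable search path (PATH), and uses a hypothetical helper name, `require_command`.

```python
import shutil
import sys


def require_command(name: str) -> str:
    """Return the full path of `name`, or raise a clear error if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"required command '{name}' was not found on PATH")
    return path


if __name__ == "__main__":
    try:
        print(require_command("yarn"))
    except RuntimeError as err:
        # Report the problem and exit non-zero instead of failing silently.
        print(f"error: {err}", file=sys.stderr)
        sys.exit(1)
```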
[jira] [Created] (TEZ-1681) Script should be robust enough to be called from outside of its location dir
Hitesh Shah created TEZ-1681:

    Summary: Script should be robust enough to be called from outside of its location dir
    Key: TEZ-1681
    URL: https://issues.apache.org/jira/browse/TEZ-1681
    Project: Apache Tez
    Issue Type: Sub-task
    Reporter: Hitesh Shah

Script does not check for its actual physical location. It assumes other helper python scripts are in the current working dir.
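The usual remedy is to resolve helper files relative to the script's own location instead of the current working directory. A minimal Python sketch of that idea follows; the function names are hypothetical and the real launcher is a shell script, so this only illustrates the technique.

```python
import os


def script_dir(script_file: str) -> str:
    """Return the directory that physically contains script_file (symlinks resolved)."""
    return os.path.dirname(os.path.realpath(script_file))


def helper_path(script_file: str, helper_name: str) -> str:
    """Build the path of a helper that lives next to script_file."""
    return os.path.join(script_dir(script_file), helper_name)
```

A script would call `helper_path(__file__, "swimlane.py")` so that the lookup no longer depends on where the caller happens to be.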
[jira] [Updated] (TEZ-1678) [Umbrella] Improve swimlanes tool usability
[ https://issues.apache.org/jira/browse/TEZ-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1678: - Priority: Minor (was: Major) > [Umbrella] Improve swimlanes tool usability > --- > > Key: TEZ-1678 > URL: https://issues.apache.org/jira/browse/TEZ-1678 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1633) TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>
[ https://issues.apache.org/jira/browse/TEZ-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated TEZ-1633:
-
    Attachment: TEZ-1632.2.rebased.patch

Patch looks fine. Rebased and committing shortly.

> TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>
> --
>
> Key: TEZ-1633
> URL: https://issues.apache.org/jira/browse/TEZ-1633
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Alexander Pivovarov
> Assignee: Alexander Pivovarov
> Attachments: TEZ-1632.2.rebased.patch, TEZ-1633.1.patch, Tez-1633-2.patch
>
>
> $ mvn clean package
> {code}
> Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.747 sec <<< FAILURE!
> testRecovery_OneTAStarted(org.apache.tez.dag.app.dag.impl.TestTaskRecovery) Time elapsed: 0.051 sec <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at org.apache.tez.dag.app.dag.impl.TestTaskRecovery.testRecovery_OneTAStarted(TestTaskRecovery.java:277)
> Running org.apache.tez.dag.app.dag.impl.TestVertexImpl
> {code}
[jira] [Updated] (TEZ-1633) Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted
[ https://issues.apache.org/jira/browse/TEZ-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1633: - Summary: Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted (was: TestTaskRecovery.testRecovery_OneTA - expected:<1> but was:<2>) > Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted > --- > > Key: TEZ-1633 > URL: https://issues.apache.org/jira/browse/TEZ-1633 > Project: Apache Tez > Issue Type: Bug >Reporter: Alexander Pivovarov >Assignee: Alexander Pivovarov > Attachments: TEZ-1632.2.rebased.patch, TEZ-1633.1.patch, > Tez-1633-2.patch > > > $ mvn clean package > {code} > Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.747 sec > <<< FAILURE! > testRecovery_OneTAStarted(org.apache.tez.dag.app.dag.impl.TestTaskRecovery) > Time elapsed: 0.051 sec <<< FAILURE! > java.lang.AssertionError: expected:<1> but was:<2> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.dag.impl.TestTaskRecovery.testRecovery_OneTAStarted(TestTaskRecovery.java:277) > Running org.apache.tez.dag.app.dag.impl.TestVertexImpl > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1669) yarn-swimlanes.sh throws error
[ https://issues.apache.org/jira/browse/TEZ-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175421#comment-14175421 ]

Hitesh Shah commented on TEZ-1669:
--

+1. Works fine after the patch with latest code.

> yarn-swimlanes.sh throws error
> --
>
> Key: TEZ-1669
> URL: https://issues.apache.org/jira/browse/TEZ-1669
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Critical
> Attachments: TEZ-1669.1.patch
>
>
> Traceback (most recent call last):
>   File "swimlane.py", line 201, in <module>
>     sys.exit(main(sys.argv[1:]))
>   File "swimlane.py", line 121, in main
>     log = AMLog(args[0]).structure()
>   File "/yyy/tez-autobuild/tez/tez-tools/swimlanes/amlogparser.py", line 185, in __init__
>     self.events = filter(lambda a:a, [self.parse(l.strip()) for l in fp])
>   File "/yyy/tez-autobuild/tez/tez-tools/swimlanes/amlogparser.py", line 246, in parse
>     ts = m.group("ts")
> AttributeError: 'NoneType' object has no attribute 'group'
> Not sure if it has got anything to do with the recent logging changes introduced in TEZ-1566 (which trims the package name to just 2 levels).
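The traceback in the report shows `m.group("ts")` being called when the regular expression did not match, so `m` was `None`. The sketch below illustrates one defensive pattern for that situation. It is not the attached patch: the pattern `LINE_RE` and the helper `parse_all` are hypothetical, modelled on the AM log lines quoted elsewhere in this thread, and the real expression in amlogparser.py is not shown in the report.

```python
import re

# Hypothetical log-line pattern for illustration only.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) \[(?P<thread>[^\]]+)\] "
    r"(?P<logger>[\w.$]+): (?P<msg>.*)$"
)


def parse(line):
    """Return the named fields of a log line, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        # Unmatched line (for example a stack-trace continuation): skip it
        # instead of calling m.group() on None.
        return None
    return m.groupdict()


def parse_all(lines):
    """Parse every line and keep only the ones that matched."""
    parsed = (parse(line.strip()) for line in lines)
    return [event for event in parsed if event is not None]
```

With this guard, a change in the logger name format produces fewer parsed events rather than an `AttributeError`, although a stricter tool might prefer to report the unmatched lines.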
[jira] [Updated] (TEZ-1669) yarn-swimlanes.sh throws error post TEZ-1556
[ https://issues.apache.org/jira/browse/TEZ-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated TEZ-1669:
-
    Summary: yarn-swimlanes.sh throws error post TEZ-1556 (was: yarn-swimlanes.sh throws error)

> yarn-swimlanes.sh throws error post TEZ-1556
>
>
> Key: TEZ-1669
> URL: https://issues.apache.org/jira/browse/TEZ-1669
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Critical
> Attachments: TEZ-1669.1.patch
>
>
> Traceback (most recent call last):
>   File "swimlane.py", line 201, in <module>
>     sys.exit(main(sys.argv[1:]))
>   File "swimlane.py", line 121, in main
>     log = AMLog(args[0]).structure()
>   File "/yyy/tez-autobuild/tez/tez-tools/swimlanes/amlogparser.py", line 185, in __init__
>     self.events = filter(lambda a:a, [self.parse(l.strip()) for l in fp])
>   File "/yyy/tez-autobuild/tez/tez-tools/swimlanes/amlogparser.py", line 246, in parse
>     ts = m.group("ts")
> AttributeError: 'NoneType' object has no attribute 'group'
> Not sure if it has got anything to do with the recent logging changes introduced in TEZ-1566 (which trims the package name to just 2 levels).
[jira] [Updated] (TEZ-1633) Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted
[ https://issues.apache.org/jira/browse/TEZ-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1633: - Attachment: TEZ-1633.addendum.patch Fix for messed up rebase. > Fixed expected values in TestTaskRecovery.testRecovery_OneTAStarted > --- > > Key: TEZ-1633 > URL: https://issues.apache.org/jira/browse/TEZ-1633 > Project: Apache Tez > Issue Type: Bug >Reporter: Alexander Pivovarov >Assignee: Alexander Pivovarov > Fix For: 0.5.2 > > Attachments: TEZ-1632.2.rebased.patch, TEZ-1633.1.patch, > TEZ-1633.addendum.patch, Tez-1633-2.patch > > > $ mvn clean package > {code} > Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.747 sec > <<< FAILURE! > testRecovery_OneTAStarted(org.apache.tez.dag.app.dag.impl.TestTaskRecovery) > Time elapsed: 0.051 sec <<< FAILURE! > java.lang.AssertionError: expected:<1> but was:<2> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.dag.impl.TestTaskRecovery.testRecovery_OneTAStarted(TestTaskRecovery.java:277) > Running org.apache.tez.dag.app.dag.impl.TestVertexImpl > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1525) BroadcastLoadGen testcase
[ https://issues.apache.org/jira/browse/TEZ-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1525: - Attachment: TEZ-1525.2.patch Rebase after TEZ-1479 > BroadcastLoadGen testcase > - > > Key: TEZ-1525 > URL: https://issues.apache.org/jira/browse/TEZ-1525 > Project: Apache Tez > Issue Type: Test >Affects Versions: 0.6.0 >Reporter: Gopal V >Assignee: Gopal V > Attachments: TEZ-1525.1.patch, TEZ-1525.2.patch > > > Broadcast load generator test example -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1682) Tez AM hangs at times when there are task failures
Siddharth Seth created TEZ-1682: --- Summary: Tez AM hangs at times when there are task failures Key: TEZ-1682 URL: https://issues.apache.org/jira/browse/TEZ-1682 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Blocker Reported by [~karams]. The Task does not move into its final state, and effectively does not send the relevant events to the Vertex. Happens when there are multiple attempts for the task - caused by Node failure for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1682) Tez AM hangs at times when there are task failures
[ https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1682: Attachment: TEZ-1682.1.txt Fairly straightforward patch. task.taskAttemptStatus.clear() on a KillRequest seems incorrect - since it's used to keep track of completed events. Added a test to verify the Task state change. [~hitesh], [~zjffdu] - please review - keeping in mind that multiple Finished events should not be generated. > Tez AM hangs at times when there are task failures > -- > > Key: TEZ-1682 > URL: https://issues.apache.org/jira/browse/TEZ-1682 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Blocker > Attachments: TEZ-1682.1.txt > > > Reported by [~karams]. > The Task does not move into its final state, and effectively does not send > the relevant events to the Vertex. > Happens when there are multiple attempts for the task - caused by Node failure > for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1682) Tez AM hangs at times when there are task failures
[ https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175722#comment-14175722 ] Hitesh Shah commented on TEZ-1682: -- [~sseth] is the "+ taskAttemptStatus.put(attempt.getID().getId(), true);" needed in killUnfinishedAttempt()? Adding it would imply that the attempt has completed even though it has not. > Tez AM hangs at times when there are task failures > -- > > Key: TEZ-1682 > URL: https://issues.apache.org/jira/browse/TEZ-1682 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Blocker > Attachments: TEZ-1682.1.txt > > > Reported by [~karams]. > The Task does not move into its final state, and effectively does not send > the relevant events to the Vertex. > Happens when there are multiple attempts for the task - caused by Node failure > for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1682) Tez AM hangs at times when there are task failures
[ https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175725#comment-14175725 ] Hitesh Shah commented on TEZ-1682: -- I am guessing that the fix should just be to remove the clear() in the kill transition. > Tez AM hangs at times when there are task failures > -- > > Key: TEZ-1682 > URL: https://issues.apache.org/jira/browse/TEZ-1682 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Blocker > Attachments: TEZ-1682.1.txt > > > Reported by [~karams]. > The Task does not move into its final state, and effectively does not send > the relevant events to the Vertex. > Happens when there are multiple attempts for the task - caused by Node failure > for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1682) Tez AM hangs at times when there are task failures
[ https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175732#comment-14175732 ] Hitesh Shah commented on TEZ-1682: -- Good catch on the invalid clear(). My mistake for not catching it in the review of the original change. The patch looks good to commit once previous comments are addressed. > Tez AM hangs at times when there are task failures > -- > > Key: TEZ-1682 > URL: https://issues.apache.org/jira/browse/TEZ-1682 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Blocker > Attachments: TEZ-1682.1.txt > > > Reported by [~karams]. > The Task does not move into its final state, and effectively does not send > the relevant events to the Vertex. > Happens when there are multiple attempts for the task - caused by Node failure > for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1682) Tez AM hangs at times when there are task failures
[ https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175756#comment-14175756 ] Siddharth Seth commented on TEZ-1682: - bq. is the "+ taskAttemptStatus.put(attempt.getID().getId(), true);" needed in killUnfinishedAttempt()? Removing it in the next patch, since it's handled when the TaskAttempt eventually reports back. Good catch. Uploading another patch and committing. Thanks for the review. TaskAttempt itself deals with "FINISHING" vs "FINISHED" states a little differently, where all events are sent out when entering a FINISHING state instead of when reaching FINISHED. That should take care of terminating the DAG fast. > Tez AM hangs at times when there are task failures > -- > > Key: TEZ-1682 > URL: https://issues.apache.org/jira/browse/TEZ-1682 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Blocker > Attachments: TEZ-1682.1.txt > > > Reported by [~karams]. > The Task does not move into its final state, and effectively does not send > the relevant events to the Vertex. > Happens when there are multiple attempts for the task - caused by Node failure > for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1682) Tez AM hangs at times when there are task failures
[ https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1682: Attachment: TEZ-1682.2.txt Updated patch addressing review comments. > Tez AM hangs at times when there are task failures > -- > > Key: TEZ-1682 > URL: https://issues.apache.org/jira/browse/TEZ-1682 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Blocker > Attachments: TEZ-1682.1.txt, TEZ-1682.2.txt > > > Reported by [~karams]. > The Task does not move into its final state, and effectively does not send > the relevant events to the Vertex. > Happens when there are multiple attempts for the task - caused by Node failure > for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1682) Tez AM hangs at times when there are task failures
[ https://issues.apache.org/jira/browse/TEZ-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1682: Affects Version/s: 0.5.2 > Tez AM hangs at times when there are task failures > -- > > Key: TEZ-1682 > URL: https://issues.apache.org/jira/browse/TEZ-1682 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Blocker > Fix For: 0.5.2 > > Attachments: TEZ-1682.1.txt, TEZ-1682.2.txt > > > Reported by [~karams]. > The Task does not move into its final state, and effectively does not send > the relevant events to the Vertex. > Happens when there are multiple attempts for the task - caused by Node failure > for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1683) Do ugi::getGroups only when necessary when checking ACLs
Hitesh Shah created TEZ-1683: Summary: Do ugi::getGroups only when necessary when checking ACLs Key: TEZ-1683 URL: https://issues.apache.org/jira/browse/TEZ-1683 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1141) DAGStatus.Progress should include number of failed attempts
[ https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1141: - Attachment: TEZ-1141.1.patch [~gopalv] [~sseth] mind doing a review? > DAGStatus.Progress should include number of failed attempts > --- > > Key: TEZ-1141 > URL: https://issues.apache.org/jira/browse/TEZ-1141 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.0 >Reporter: Bikas Saha >Assignee: Hitesh Shah > Attachments: TEZ-1141.1.patch > > > Currently it's impossible to know whether a job is seeing a lot of issues and > failures because we only report running tasks. Eventually the job fails but > before that we have no indication that a bunch of task failures have been > happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1683) Do ugi::getGroups only when necessary when checking ACLs
[ https://issues.apache.org/jira/browse/TEZ-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1683: - Attachment: TEZ-1683.1.patch [~gopalv] [~rajesh.balamohan] Mind reviewing the patch to see if it reduces the perf issues with getGroup calls? > Do ugi::getGroups only when necessary when checking ACLs > - > > Key: TEZ-1683 > URL: https://issues.apache.org/jira/browse/TEZ-1683 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-1683.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1344) Combiner counters reported by Tez look wrong
[ https://issues.apache.org/jira/browse/TEZ-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov resolved TEZ-1344. -- Resolution: Cannot Reproduce > Combiner counters reported by Tez look wrong > > > Key: TEZ-1344 > URL: https://issues.apache.org/jira/browse/TEZ-1344 > Project: Apache Tez > Issue Type: Bug >Reporter: Cheolsoo Park >Priority: Minor > > Combiner input/output counters reported by a Tez job seem wrong > {code} > org.apache.hadoop.mapreduce.TaskCounter: > COMBINE_OUTPUT_RECORDS 35,977,263,353 > COMBINE_INPUT_RECORDS 1,000,529,333 > {code} > As can be seen, combiner output records > input records?! > The same counters from an MR job look as follows: > {code} > Map-Reduce Framework: > Combine output records 1,000,316,600 > Combine input records 35,977,049,632 > {code} > Somehow input and output are swapped? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TEZ-1344) Combiner counters reported by Tez look wrong
[ https://issues.apache.org/jira/browse/TEZ-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov closed TEZ-1344. > Combiner counters reported by Tez look wrong > > > Key: TEZ-1344 > URL: https://issues.apache.org/jira/browse/TEZ-1344 > Project: Apache Tez > Issue Type: Bug >Reporter: Cheolsoo Park >Priority: Minor > > Combiner input/output counters reported by a Tez job seem wrong > {code} > org.apache.hadoop.mapreduce.TaskCounter: > COMBINE_OUTPUT_RECORDS 35,977,263,353 > COMBINE_INPUT_RECORDS 1,000,529,333 > {code} > As can be seen, combiner output records > input records?! > The same counters from an MR job look as follows: > {code} > Map-Reduce Framework: > Combine output records 1,000,316,600 > Combine input records 35,977,049,632 > {code} > Somehow input and output are swapped? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1683) Do ugi::getGroups only when necessary when checking ACLs
[ https://issues.apache.org/jira/browse/TEZ-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175865#comment-14175865 ] Gopal V commented on TEZ-1683: -- +1 - confirmed that this does not trigger the shell fork for getGroups. > Do ugi::getGroups only when necessary when checking ACLs > - > > Key: TEZ-1683 > URL: https://issues.apache.org/jira/browse/TEZ-1683 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-1683.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
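Note on the idea behind TEZ-1683: group lookup can be expensive (the comment above mentions a shell fork), so an ACL check can try the cheap user-level tests first and resolve groups only if those do not settle the question. The sketch below illustrates that ordering only. The function `check_access`, its parameters, and the `"*"` wildcard convention are hypothetical and are not taken from TEZ-1683.1.patch or from the Tez ACL code.

```python
def check_access(user, allowed_users, allowed_groups, get_groups):
    """Return True if `user` is permitted.

    `get_groups` is a callable standing in for an expensive group lookup;
    it is invoked only when the user-level checks are inconclusive.
    All names here are illustrative assumptions.
    """
    # Cheap checks first: wildcard or explicit user entry.
    if "*" in allowed_users or user in allowed_users:
        return True
    # No group entries configured, so a group lookup cannot change the answer.
    if not allowed_groups:
        return False
    # Only now pay for the group lookup.
    return any(group in allowed_groups for group in get_groups(user))
```

For a user listed explicitly in `allowed_users`, or when no group ACL is configured, `get_groups` is never called.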
[jira] [Commented] (TEZ-1141) DAGStatus.Progress should include number of failed attempts
[ https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175874#comment-14175874 ] Gopal V commented on TEZ-1141: -- LGTM. I found that this doesn't track NM blacklisting, but that is a completely different problem. I've updated the patch on HIVE-7838 to use this, and it is useful for narrowing down query failures (particularly reducer OOMs). > DAGStatus.Progress should include number of failed attempts > --- > > Key: TEZ-1141 > URL: https://issues.apache.org/jira/browse/TEZ-1141 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.0 >Reporter: Bikas Saha >Assignee: Hitesh Shah > Attachments: TEZ-1141.1.patch > > > Currently it's impossible to know whether a job is seeing a lot of issues and > failures because we only report running tasks. Eventually the job fails but > before that we have no indication that a bunch of task failures have been > happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)