[jira] [Commented] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188050#comment-14188050 ] Siddharth Seth commented on TEZ-1703: - [~zjffdu]

- I don't think we should be changing the StateChangeNotifier at all as part of this patch. It is just a mechanism for notifying interested entities when the state of vertices / tasks changes. The StateChangeNotifier has no context beyond this: it does not know whether the notification was sent to a VMPlugin / EMPlugin / InputInitializer, or to other entities later on, so it can't really decide what needs to be done in case of failure. It's not a public API and is meant to be invoked by Tez internal components, which would have the context needed to handle errors.

onStateUpdated(VertexStateUpdate) in RootInputInitializerManager can just catch the exception from the user code and inform the Vertex via an event indicating ROOT_INPUT_INITIALIZER failures (VertexEventRootInputFailed). It could potentially interrupt the corresponding Initializer thread as well, but that will eventually happen via the state machines in any case. The same applies to handleInputInitializerEvents and onTaskSucceeded (sendEvents). These exceptions should not make it back to the StateChangeNotifier, since it wouldn't know how to handle them.

Eventually, the InputInitializerManager will likely have a separate thread to send the events to the user (instead of using the AsyncDispatcher thread / state notifier thread). It'll be better to use the same mechanism of sending a VertexEventRootInputFailed event IMHO.

- Does DAG have to change to add getAppContext()?

Exception handling for InputInitializer --- Key: TEZ-1703 URL: https://issues.apache.org/jira/browse/TEZ-1703 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Attachments: TEZ-1703.patch For handleInputInitializerEvent - this should be fairly straightforward to handle. At the moment this is an inline call from within the AsyncDispatcher, and will end up causing a RuntimeException. The RuntimeException can be changed to an AMUserCodeException, which will take care of this. For onVertexStateUpdated, this eventually gets invoked from within RootInputInitializerManager. Catching exceptions there and sending a RootInputInitializerFailedEvent should be enough to fix this? May require some state machine changes to handle this event on a few more states. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
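The catch-and-notify approach suggested in this comment could look roughly like the sketch below. All types here (UserInitializerSketch, InputFailedEventSketch, InitializerManagerSketch) are hypothetical stand-ins written for illustration; they are not the Tez classes, and the real VertexEventRootInputFailed event and RootInputInitializerManager have their own signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for user-supplied initializer code; not the real Tez interface. */
interface UserInitializerSketch {
    void onVertexStateUpdated(String stateUpdate) throws Exception;
}

/** Hypothetical failure event, loosely modelled on the idea of VertexEventRootInputFailed. */
final class InputFailedEventSketch {
    final String inputName;
    final Throwable cause;

    InputFailedEventSketch(String inputName, Throwable cause) {
        this.inputName = inputName;
        this.cause = cause;
    }
}

/** Sketch of a manager that keeps user-code exceptions away from the notifier thread. */
final class InitializerManagerSketch {
    private final String inputName;
    private final UserInitializerSketch initializer;
    private final Consumer<InputFailedEventSketch> vertexEventSink;

    InitializerManagerSketch(String inputName,
                             UserInitializerSketch initializer,
                             Consumer<InputFailedEventSketch> vertexEventSink) {
        this.inputName = inputName;
        this.initializer = initializer;
        this.vertexEventSink = vertexEventSink;
    }

    /** Invoked by the notifier; a user-code exception becomes an event instead of propagating. */
    void onStateUpdated(String stateUpdate) {
        try {
            initializer.onVertexStateUpdated(stateUpdate);
        } catch (Exception e) {
            // The notifier has no context to handle this, so report it to the vertex instead.
            vertexEventSink.accept(new InputFailedEventSketch(inputName, e));
        }
    }

    public static void main(String[] args) {
        List<InputFailedEventSketch> events = new ArrayList<>();
        InitializerManagerSketch manager = new InitializerManagerSketch(
            "input1",
            update -> { throw new IllegalStateException("user code failed on " + update); },
            events::add);
        manager.onStateUpdated("RUNNING"); // returns normally; the failure is queued as an event
        System.out.println(events.size() + " failure event(s) for " + events.get(0).inputName);
    }
}
```

The point of the design is that the caller of onStateUpdated never sees the exception; only the component that owns the initializer decides how the failure is reported.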
[jira] [Commented] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188079#comment-14188079 ] Rajesh Balamohan commented on TEZ-1547: --- Issue is already captured in TEZ-1714 Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TEZ-1711) Don't cache outputSpecList in VertexImpl.getOutputSpecList(taskIndex)
[ https://issues.apache.org/jira/browse/TEZ-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang reassigned TEZ-1711: --- Assignee: Jeff Zhang Don't cache outputSpecList in VertexImpl.getOutputSpecList(taskIndex) - Key: TEZ-1711 URL: https://issues.apache.org/jira/browse/TEZ-1711 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1711.patch VertexImpl.getOutputSpecList(taskIndex) caches the outputSpecList, but I don't think we should cache it, because it depends on the taskIndex. In all current EdgeManagerPlugin implementations the value is the same regardless of the taskIndex, but there is a risk that a new EdgeManagerPlugin behaves differently. Or, if that case can never happen, just remove taskIndex from the method parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1711) Don't cache outputSpecList in VertexImpl.getOutputSpecList(taskIndex)
[ https://issues.apache.org/jira/browse/TEZ-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188285#comment-14188285 ] Jeff Zhang commented on TEZ-1711: - Attached patch. * Just removes the cache in getOutputSpecList * One unit test was affected; fixed it. [~sseth], [~bikassaha] please help review. Don't cache outputSpecList in VertexImpl.getOutputSpecList(taskIndex) - Key: TEZ-1711 URL: https://issues.apache.org/jira/browse/TEZ-1711 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1711.patch VertexImpl.getOutputSpecList(taskIndex) caches the outputSpecList, but I don't think we should cache it, because it depends on the taskIndex. In all current EdgeManagerPlugin implementations the value is the same regardless of the taskIndex, but there is a risk that a new EdgeManagerPlugin behaves differently. Or, if that case can never happen, just remove taskIndex from the method parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
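To illustrate the risk described in this issue, here is a minimal sketch of a getter that caches a value which actually depends on its argument. The class and method names are hypothetical and do not mirror VertexImpl; the spec builder is an assumed stand-in for whatever an EdgeManagerPlugin would compute per task.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

/** Sketch only: names are hypothetical and do not mirror VertexImpl. */
final class OutputSpecCacheSketch {
    private final IntFunction<List<String>> specBuilder; // assumed per-task computation
    private List<String> cachedSpecs;

    OutputSpecCacheSketch(IntFunction<List<String>> specBuilder) {
        this.specBuilder = specBuilder;
    }

    /** Cached variant: the first caller's taskIndex decides the result for every later caller. */
    List<String> getOutputSpecListCached(int taskIndex) {
        if (cachedSpecs == null) {
            cachedSpecs = specBuilder.apply(taskIndex);
        }
        return cachedSpecs;
    }

    /** Uncached variant: always recomputed for the requested taskIndex. */
    List<String> getOutputSpecListUncached(int taskIndex) {
        return specBuilder.apply(taskIndex);
    }

    public static void main(String[] args) {
        // A builder whose result depends on the task index, unlike the existing plugins.
        OutputSpecCacheSketch vertex = new OutputSpecCacheSketch(
            taskIndex -> Collections.singletonList("physicalOutputs=" + (taskIndex + 1)));
        vertex.getOutputSpecListCached(0);                         // populates the cache for task 0
        System.out.println(vertex.getOutputSpecListCached(4));     // prints [physicalOutputs=1] (stale)
        System.out.println(vertex.getOutputSpecListUncached(4));   // prints [physicalOutputs=5]
    }
}
```

As long as every plugin ignores the index, both variants agree; the cached one only becomes wrong once a plugin starts using it, which is the scenario the issue warns about.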
[jira] [Updated] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1703: Attachment: TEZ-1703-2.patch Exception handling for InputInitializer --- Key: TEZ-1703 URL: https://issues.apache.org/jira/browse/TEZ-1703 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Attachments: TEZ-1703-2.patch, TEZ-1703.patch For handleInputInitializerEvent - this should be fairly straightforward to handle. At the moment this is an inline call from within the AsyncDispatcher, and will end up causing a RuntimeException. The RuntimeException can be changed to an AMUserCodeException, which will take care of this. For onVertexStateUpdated, this eventually gets invoked from within RootInputInitializerManager. Catching exceptions there and sending a RootInputInitializerFailedEvent should be enough to fix this? May require some state machine changes to handle this event on a few more states. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188297#comment-14188297 ] Jeff Zhang commented on TEZ-1703: - [~sseth] Thanks for your review and suggestion. You are right: catching the exception in RootInputInitializerManager and sending VertexEventRootInputFailed is much simpler and cleaner. I have attached a new patch; please help review. bq. Does DAG have to change to add getAppContext() ? I reverted that change, as we don't need it in the new patch. Exception handling for InputInitializer --- Key: TEZ-1703 URL: https://issues.apache.org/jira/browse/TEZ-1703 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Jeff Zhang Attachments: TEZ-1703-2.patch, TEZ-1703.patch For handleInputInitializerEvent - this should be fairly straightforward to handle. At the moment this is an inline call from within the AsyncDispatcher, and will end up causing a RuntimeException. The RuntimeException can be changed to an AMUserCodeException, which will take care of this. For onVertexStateUpdated, this eventually gets invoked from within RootInputInitializerManager. Catching exceptions there and sending a RootInputInitializerFailedEvent should be enough to fix this? May require some state machine changes to handle this event on a few more states. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1689) Exception handling for EdgeManagerPlugin
[ https://issues.apache.org/jira/browse/TEZ-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1689: Attachment: TEZ-1689-addendum.patch Exception handling for EdgeManagerPlugin Key: TEZ-1689 URL: https://issues.apache.org/jira/browse/TEZ-1689 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Critical Fix For: 0.5.2 Attachments: TEZ-1689-2.patch, TEZ-1689-3.patch, TEZ-1689-4.patch, TEZ-1689-addendum.patch, TEZ-1689.patch EdgeManagerPlugin and InputInitializer are both user code, which can throw exceptions. We should handle them, fail the DAG, and display the exception on the client side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1689) Exception handling for EdgeManagerPlugin
[ https://issues.apache.org/jira/browse/TEZ-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188302#comment-14188302 ] Jeff Zhang commented on TEZ-1689: - Attached the addendum patch to fix the unit test. commit 1ffbc1935646f7c422b551e6e0ffdc001311d074 (HEAD, origin/master, origin/HEAD, master, TEZ-1711, TEZ-1703, TEZ-1689) Author: Jeff Zhang zjf...@apache.org Date: Wed Oct 29 18:58:49 2014 +0800 TEZ-1689. addendum - fix unit test failure. (zjffdu) Exception handling for EdgeManagerPlugin Key: TEZ-1689 URL: https://issues.apache.org/jira/browse/TEZ-1689 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Critical Fix For: 0.5.2 Attachments: TEZ-1689-2.patch, TEZ-1689-3.patch, TEZ-1689-4.patch, TEZ-1689-addendum.patch, TEZ-1689.patch EdgeManagerPlugin and InputInitializer are both user code, which can throw exceptions. We should handle them, fail the DAG, and display the exception on the client side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188306#comment-14188306 ] Jeff Zhang commented on TEZ-1703: - BTW, I think we should rename RootInputInitializerManager to InputInitializerManager, because InputInitializers are not limited to root vertices; a non-root vertex can also have an InputInitializer. Exception handling for InputInitializer --- Key: TEZ-1703 URL: https://issues.apache.org/jira/browse/TEZ-1703 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1703-2.patch, TEZ-1703.patch For handleInputInitializerEvent - this should be fairly straightforward to handle. At the moment this is an inline call from within the AsyncDispatcher, and will end up causing a RuntimeException. The RuntimeException can be changed to an AMUserCodeException, which will take care of this. For onVertexStateUpdated, this eventually gets invoked from within RootInputInitializerManager. Catching exceptions there and sending a RootInputInitializerFailedEvent should be enough to fix this? May require some state machine changes to handle this event on a few more states. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1719) Allow IFile reducer merge-sort to disable crc32 checksums
Gopal V created TEZ-1719: Summary: Allow IFile reducer merge-sort to disable crc32 checksums Key: TEZ-1719 URL: https://issues.apache.org/jira/browse/TEZ-1719 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Next-gen filesystems like BTRFS and ZFS provide their own checksumming for disk data. Using PureJavaCrc32 for data written for temporary spills to such filesystems is a complete waste of CPU resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
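A rough sketch of what an optional checksum on spill data might look like follows. It uses the JDK's java.util.zip.CRC32 as a stand-in for Hadoop's PureJavaCrc32, and the checksumEnabled switch and NO_CHECKSUM sentinel are assumptions made for this example, not existing Tez configuration or IFile behaviour.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch only: java.util.zip.CRC32 stands in for PureJavaCrc32; the flag is hypothetical. */
final class SpillChecksumSketch {
    /** Sentinel for "no checksum computed"; real CRC32 values are never negative as a long. */
    static final long NO_CHECKSUM = -1L;

    /** Computes a CRC32 over the spill bytes only when checksumming is enabled. */
    static long checksum(byte[] data, boolean checksumEnabled) {
        if (!checksumEnabled) {
            return NO_CHECKSUM; // skip the CPU cost when the filesystem already checksums data
        }
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] spill = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Long.toHexString(checksum(spill, true))); // prints cbf43926
        System.out.println(checksum(spill, false));                  // prints -1
    }
}
```

A reader of such data would need to know whether a checksum was written, so in practice the switch would have to be agreed between writer and reader; that part is outside this sketch.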
[jira] [Updated] (TEZ-1719) Allow IFile reducer merge-sort to disable crc32 checksums
[ https://issues.apache.org/jira/browse/TEZ-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1719: - Labels: Performance (was: ) Allow IFile reducer merge-sort to disable crc32 checksums - Key: TEZ-1719 URL: https://issues.apache.org/jira/browse/TEZ-1719 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Labels: Performance Next-gen filesystems like BTRFS and ZFS provide their own checksumming for disk data. Using PureJavaCrc32 for data written for temporary spills to such filesystems is a complete waste of CPU resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1547: Attachment: TEZ-1547.6.patch Patch comments out the done notification for now until TEZ-1714 is fixed. This will not affect functionality. Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1547: Attachment: TEZ-1547.6.patch Rebased Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1547: Attachment: (was: TEZ-1547.6.patch) Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1547: Attachment: (was: TEZ-1547.6.patch) Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1702) Hive : With Auto reduce parallelism enabled TPC-DS query 31 gets stuck in Reducer 12
[ https://issues.apache.org/jira/browse/TEZ-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188758#comment-14188758 ] Mostafa Mokhtar commented on TEZ-1702: -- [~rajesh.balamohan] Query runs fine on latest. Hive : With Auto reduce parallelism enabled TPC-DS query 31 gets stuck in Reducer 12 - Key: TEZ-1702 URL: https://issues.apache.org/jira/browse/TEZ-1702 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Mostafa Mokhtar Priority: Critical Attachments: Logs for container_1414029100044_0150_01_01.zip, json_events.log, query31_logs_stuck.txt.gz, tez-1702-am.log
Issue found in branch-0.5, with latest commit as
{code}
commit 2e65de88af709d30207403fea881b697a4853dd6
Author: Bikas Saha bi...@apache.org
Date: Tue Oct 21 14:59:56 2014 -0700
{code}
Running TPC-DS Query 31 with Auto reduce parallelism enabled the query gets stuck in Reducer 12
Call Stack for stuck thread
{code}
Thread 14575: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame)
 - org.apache.tez.runtime.library.shuffle.common.impl.ShuffleManager.getNextInput() @bci=67, line=663 (Interpreted frame)
 - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.moveToNextInput() @bci=26, line=176 (Interpreted frame)
 - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.next() @bci=30, line=117 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.HashTableLoader.load(org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainer[], org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainerSerDe[]) @bci=259, line=112 (Compiled frame)
 - org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable() @bci=86, line=190 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(java.lang.Object, int) @bci=12, line=244 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.Operator.forward(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector) @bci=63, line=815 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(java.lang.Object, int) @bci=121, line=84 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.Operator.forward(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector) @bci=63, line=815 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(java.lang.Object[], org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AggregationBuffer[]) @bci=97, line=1072 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector, org.apache.hadoop.hive.ql.exec.KeyWrapper) @bci=71, line=881 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector) @bci=34, line=741 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(java.lang.Object, int) @bci=457, line=809 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processKeyValues(java.lang.Iterable, byte) @bci=174, line=308 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord() @bci=218, line=252 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run() @bci=155, line=168 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(java.util.Map, java.util.Map) @bci=224, line=163 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(java.util.Map, java.util.Map) @bci=86, line=138 (Interpreted frame)
 -
[jira] [Resolved] (TEZ-1702) Hive : With Auto reduce parallelism enabled TPC-DS query 31 gets stuck in Reducer 12
[ https://issues.apache.org/jira/browse/TEZ-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mostafa Mokhtar resolved TEZ-1702. -- Resolution: Fixed Hive : With Auto reduce parallelism enabled TPC-DS query 31 gets stuck in Reducer 12 - Key: TEZ-1702 URL: https://issues.apache.org/jira/browse/TEZ-1702 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Mostafa Mokhtar Priority: Critical Attachments: Logs for container_1414029100044_0150_01_01.zip, json_events.log, query31_logs_stuck.txt.gz, tez-1702-am.log
Issue found in branch-0.5, with latest commit as
{code}
commit 2e65de88af709d30207403fea881b697a4853dd6
Author: Bikas Saha bi...@apache.org
Date: Tue Oct 21 14:59:56 2014 -0700
{code}
Running TPC-DS Query 31 with Auto reduce parallelism enabled the query gets stuck in Reducer 12
Call Stack for stuck thread
{code}
Thread 14575: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame)
 - org.apache.tez.runtime.library.shuffle.common.impl.ShuffleManager.getNextInput() @bci=67, line=663 (Interpreted frame)
 - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.moveToNextInput() @bci=26, line=176 (Interpreted frame)
 - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.next() @bci=30, line=117 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.HashTableLoader.load(org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainer[], org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainerSerDe[]) @bci=259, line=112 (Compiled frame)
 - org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable() @bci=86, line=190 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(java.lang.Object, int) @bci=12, line=244 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.Operator.forward(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector) @bci=63, line=815 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(java.lang.Object, int) @bci=121, line=84 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.Operator.forward(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector) @bci=63, line=815 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(java.lang.Object[], org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AggregationBuffer[]) @bci=97, line=1072 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector, org.apache.hadoop.hive.ql.exec.KeyWrapper) @bci=71, line=881 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector) @bci=34, line=741 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(java.lang.Object, int) @bci=457, line=809 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processKeyValues(java.lang.Iterable, byte) @bci=174, line=308 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord() @bci=218, line=252 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run() @bci=155, line=168 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(java.util.Map, java.util.Map) @bci=224, line=163 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(java.util.Map, java.util.Map) @bci=86, line=138 (Interpreted frame)
 - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run() @bci=76, line=324 (Interpreted frame)
 -
[jira] [Created] (TEZ-1720) Allow filters in all tables and also to pass in filters using url params
Prakash Ramachandran created TEZ-1720: - Summary: Allow filters in all tables and also to pass in filters using url params Key: TEZ-1720 URL: https://issues.apache.org/jira/browse/TEZ-1720 Project: Apache Tez Issue Type: Sub-task Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Need to make sure that all the tables in the UI can use filters and allow them to be set through the URL. This is needed, for example, to show the failed tasks for a DAG/vertex and to bookmark searches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1720) Allow filters in all tables and also to pass in filters using url params
[ https://issues.apache.org/jira/browse/TEZ-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-1720: -- Attachment: tez-1720.1.patch - Added filtering to all tables using URL params. Allow filters in all tables and also to pass in filters using url params Key: TEZ-1720 URL: https://issues.apache.org/jira/browse/TEZ-1720 Project: Apache Tez Issue Type: Sub-task Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Attachments: tez-1720.1.patch Need to make sure that all the tables in the UI can use filters and allow them to be set through the URL. This is needed, for example, to show the failed tasks for a DAG/vertex and to bookmark searches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1547: Attachment: TEZ-1547.7.patch Fixing newly added master branch tests to work with new changes. Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch, TEZ-1547.7.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1716: - Summary: Additional ATS data for UI (was: Add failed attempts info to History at Vertex and DAG level.) Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1716: - Description: Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188956#comment-14188956 ] Siddharth Seth commented on TEZ-1703: - Comments on the patch.

{code}
-DAGTerminationCause.VERTEX_FAILURE,
-vertexEvent.getVertexTerminationCause() == null ? VertexTerminationCause.OTHER_VERTEX_FAILURE
-: vertexEvent.getVertexTerminationCause());
+DAGTerminationCause.VERTEX_FAILURE, VertexTerminationCause.OTHER_VERTEX_FAILURE);
{code}
This is required so that all vertices don't get the same termination cause as the first vertex to fail? We should remove getVertexTerminationCause in a follow-up jira, since that seems to be of no use.

{code}
+String diagnosticMsg = Vertex failed/killed due to VertexManagerPlugin/EdgeManagerPlugin failed.
{code}
Will inputInitializer failures never go through this transition? It may be better to set this up based on the SOURCE information available in the exception.

There are some race conditions possible in the InputInitializer.

Prior to the patch
- It's possible for events/notifications to be sent to a completed Initializer, since the initializers / events are handled in separate threads. The setComplete() and isComplete checks aren't sufficient to avoid this.
- Ideally, completed initializers should just handle these events gracefully, but that's not something that Tez can guarantee. We need to handle such situations, likely in a separate jira.

With the patch,
It's possible for an INITIALIZER_FAILED event to go out after an INITIALIZER_SUCCESS goes out. Sequence: T1: initializer running. T2: eventReceived/VertexUpdateReceived, throws Exception. T1: completes (the event could be partially handled, which triggers completion of initialize()).
Similarly, it's possible to get INITIALIZER_SUCCEEDED messages after an INITIALIZER_FAILED message (in a FAILED etc. state). This isn't as harmful.
This means we could end up getting INITIALIZER_FAILED messages in the INITED / RUNNING and possibly other states. The state machine in VertexImpl will need to change to handle INITIALIZER_FAILED in some more states, and fail the vertex.

Exception handling for InputInitializer --- Key: TEZ-1703 URL: https://issues.apache.org/jira/browse/TEZ-1703 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1703-2.patch, TEZ-1703.patch For handleInputInitializerEvent - this should be fairly straightforward to handle. At the moment this is an inline call from within the AsyncDispatcher, and will end up causing a RuntimeException. The RuntimeException can be changed to an AMUserCodeException, which will take care of this. For onVertexStateUpdated, this eventually gets invoked from within RootInputInitializerManager. Catching exceptions there and sending a RootInputInitializerFailedEvent should be enough to fix this? May require some state machine changes to handle this event on a few more states. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
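The ordering problem in this review can be sketched with a tiny state machine that accepts a failure event in more than one state and ignores a late success. The state names and transitions below are a simplified, hypothetical model written for illustration; the actual VertexImpl state machine has more states and events, and which states should accept the failure is exactly what the comment leaves open.

```java
import java.util.EnumSet;
import java.util.Set;

/** Simplified, hypothetical vertex states; the real VertexImpl has more. */
enum VertexStateSketch { INITIALIZING, INITED, RUNNING, FAILED }

/** Sketch of handling an initializer failure that arrives after a success. */
final class VertexStateMachineSketch {
    /** Assumption for this sketch: a late failure should fail the vertex in any of these states. */
    private static final Set<VertexStateSketch> CAN_FAIL_ON_INITIALIZER_ERROR = EnumSet.of(
        VertexStateSketch.INITIALIZING, VertexStateSketch.INITED, VertexStateSketch.RUNNING);

    private VertexStateSketch state = VertexStateSketch.INITIALIZING;

    /** A success only counts while still initializing; a success after a failure is ignored. */
    synchronized void onInitializerSucceeded() {
        if (state == VertexStateSketch.INITIALIZING) {
            state = VertexStateSketch.INITED;
        }
    }

    /** Moves an initialized vertex to RUNNING; no effect in any other state. */
    synchronized void onStart() {
        if (state == VertexStateSketch.INITED) {
            state = VertexStateSketch.RUNNING;
        }
    }

    /** A failure is honoured even if it arrives after the success was processed. */
    synchronized void onInitializerFailed() {
        if (CAN_FAIL_ON_INITIALIZER_ERROR.contains(state)) {
            state = VertexStateSketch.FAILED;
        }
    }

    synchronized VertexStateSketch getState() {
        return state;
    }

    public static void main(String[] args) {
        VertexStateMachineSketch vertex = new VertexStateMachineSketch();
        vertex.onInitializerSucceeded(); // T1 completes first
        vertex.onStart();
        vertex.onInitializerFailed();    // T2's failure arrives late, while RUNNING
        System.out.println(vertex.getState()); // prints FAILED
    }
}
```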
[jira] [Commented] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188974#comment-14188974 ] Siddharth Seth commented on TEZ-1703: - And +1 for renaming the file. Please do that just before the commit though - not as part of iterative patches. Exception handling for InputInitializer --- Key: TEZ-1703 URL: https://issues.apache.org/jira/browse/TEZ-1703 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1703-2.patch, TEZ-1703.patch For handleInputInitializerEvent - this should be fairly straightforward to handle. At the moment this is an inline call from within the AsyncDispatcher, and will end up causing a RuntimeException. The RuntimeException can be changed to an AMUserCodeException, which will take care of this. For onVertexStateUpdated, this eventually gets invoked from within RootInputInitializerManager. Catching exceptions there and sending a RootInputInitializerFailedEvent should be enough to fix this? May require some state machine changes to handle this event on a few more states. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1666) UserPayload should be null if the payload is not specified
[ https://issues.apache.org/jira/browse/TEZ-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188981#comment-14188981 ] Hitesh Shah commented on TEZ-1666: -- +1. With the version check in place, 0.5.1 clients cannot use the 0.5.2 tarball in any case. I think we can assume everyone is using the same version of client jars with the HDFS tarball ( but may need some updates in the INSTALL instructions ). UserPayload should be null if the payload is not specified -- Key: TEZ-1666 URL: https://issues.apache.org/jira/browse/TEZ-1666 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Critical Attachments: TEZ-1666.1.txt, TEZ-1666.2.txt As an example in the ProcessorDescriptor - if no payload is specified, context.getUserPayload should return null. SleepProcessor has an explicit check for a null payload, to enable default sleep value - which fails. Marking as critical since this is an API behaviour inconsistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1721) Update INSTALL instructions for clarifying tez client jars compatibility with runtime tarball on HDFS
Hitesh Shah created TEZ-1721: Summary: Update INSTALL instructions for clarifying tez client jars compatibility with runtime tarball on HDFS Key: TEZ-1721 URL: https://issues.apache.org/jira/browse/TEZ-1721 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1721) Update INSTALL instructions for clarifying tez client jars compatibility with runtime tarball on HDFS
[ https://issues.apache.org/jira/browse/TEZ-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1721: - Priority: Critical (was: Major) Update INSTALL instructions for clarifying tez client jars compatibility with runtime tarball on HDFS - Key: TEZ-1721 URL: https://issues.apache.org/jira/browse/TEZ-1721 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1721) Update INSTALL instructions for clarifying tez client jars compatibility with runtime tarball on HDFS
[ https://issues.apache.org/jira/browse/TEZ-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1721: - Target Version/s: 0.5.2 Update INSTALL instructions for clarifying tez client jars compatibility with runtime tarball on HDFS - Key: TEZ-1721 URL: https://issues.apache.org/jira/browse/TEZ-1721 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189059#comment-14189059 ] Bikas Saha commented on TEZ-1547: - [~rajesh.balamohan] [~sseth] [~hitesh] Please review. Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch, TEZ-1547.7.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189059#comment-14189059 ] Bikas Saha edited comment on TEZ-1547 at 10/29/14 9:50 PM: --- [~rajesh.balamohan] [~sseth] [~hitesh] Please review. The patch has a small issue of the canInit() precondition check to be inside the try block. Will fix in the next iteration. was (Author: bikassaha): [~rajesh.balamohan] [~sseth] [~hitesh] Please review. Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch, TEZ-1547.7.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189085#comment-14189085 ] Siddharth Seth commented on TEZ-1547: - In process. The patch has a bunch of changes unrelated to state notification which we've discussed in the past. Can probably close 2-3 old jiras after this goes in. Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch, TEZ-1547.7.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189115#comment-14189115 ] Bikas Saha commented on TEZ-1547: - Patch 1 has just the state change notification and can be reviewed separately. Subsequent patches add usage for the notifications but I lost my git history due to hard drive loss. So I put up the combined patch next. Make use of state change notifier in VertexManagerPlugins - Key: TEZ-1547 URL: https://issues.apache.org/jira/browse/TEZ-1547 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Bikas Saha Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch, TEZ-1547.7.patch Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1699) Vertex.setParallelism should throw an exception for invalid invocations
[ https://issues.apache.org/jira/browse/TEZ-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1699: Attachment: TEZ-1699.1.patch Removes the boolean return value and throws exceptions instead. Tests added. Marked incompatible. AFAIK no one uses the return value, so it's best to remove it now. [~sseth] [~hitesh] Please review. Vertex.setParallelism should throw an exception for invalid invocations --- Key: TEZ-1699 URL: https://issues.apache.org/jira/browse/TEZ-1699 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Priority: Critical Attachments: TEZ-1699.1.patch There is a return value of false when setParallelism is not successful. However, that may be ignored, and in some cases the invocation is actually incorrect and it's better to throw an exception than return false. Throwing an unchecked exception can allow doing this compatibly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
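To illustrate the change described above, here is a minimal sketch of a setter that throws an unchecked exception instead of returning false. `SketchVertex` is a hypothetical class, not the Tez `Vertex` interface, and the two validity rules (non-negative value, not yet started) are assumptions chosen only to have something to reject.

```java
// Hypothetical stand-in for a vertex; not the real Tez Vertex API.
final class SketchVertex {
    private int parallelism;
    private boolean started;

    SketchVertex(int parallelism) {
        this.parallelism = parallelism;
    }

    void start() {
        started = true;
    }

    int getParallelism() {
        return parallelism;
    }

    // Throws instead of returning false, so an invalid call cannot be silently ignored.
    // Both validity rules below are illustrative assumptions.
    void setParallelism(int newParallelism) {
        if (newParallelism < 0) {
            throw new IllegalArgumentException(
                "Parallelism must be non-negative: " + newParallelism);
        }
        if (started) {
            throw new IllegalStateException(
                "Parallelism cannot be changed after the vertex has started");
        }
        parallelism = newParallelism;
    }
}
```

Because both exceptions are unchecked, existing callers that ignored the old boolean keep compiling once they stop reading the return value; they only see a difference when the call was invalid to begin with.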
[jira] [Commented] (TEZ-1700) Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity
[ https://issues.apache.org/jira/browse/TEZ-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189155#comment-14189155 ] Hitesh Shah commented on TEZ-1700: -- Comments: Not sure if this is really safe from a binary compatibility point of view. Might be worth testing a job that uses TaskLocationHints compiled against 0.5.0 and run using a 0.5.2-SNAPSHOT runtime.
{code}
+if (affinitizedTask != null) {
+  if (affinitizedTask.getTaskIndex() != other.affinitizedTask.getTaskIndex()) {
+    return false;
+  } else if (!affinitizedTask.getVertexName().equals(other.affinitizedTask.getVertexName())) {
     return false;
   }
-} else if (other.containerId != null) {
+} else if (other.affinitizedTask != null) {
   return false;
 }
{code}
- I believe the other.affinitizedTask != null should be done earlier before doing the != comparisons for vertex name and task index
{code}
+ taskScheduler.allocateTask(taskAttempt,
+     event.getCapability(),
+     taskAttempt.getAssignedContainerID(),
+     Priority.newInstance(event.getPriority()),
+     event.getContainerContext(),
+     event);
{code}
- should this be using the affinityAttempt's container id? If the unit tests are not catching this, maybe add one more test?
Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity Key: TEZ-1700 URL: https://issues.apache.org/jira/browse/TEZ-1700 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1700.1.patch Today 1-1 dependencies are affinitized by creating a task location hint with the producer task container id. It can be created by affinitizing to the producer task-index+vertexname combination instead and internally Tez can map it to the container. This also allows this dependency to be specified before the container is assigned. This allows the dependency to be generic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1700) Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity
[ https://issues.apache.org/jira/browse/TEZ-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189155#comment-14189155 ] Hitesh Shah edited comment on TEZ-1700 at 10/29/14 10:32 PM: - Comments: Not sure if this is really safe from a binary compatibility point of view. Might be worth testing a job that uses TaskLocationHints compiled against 0.5.0 and run using a 0.5.2-SNAPSHOT runtime. {code} +if (affinitizedTask != null) { + if (affinitizedTask.getTaskIndex() != other.affinitizedTask.getTaskIndex()) { +return false; + } else if (!affinitizedTask.getVertexName().equals(other.affinitizedTask.getVertexName())) { return false; } -} else if (other.containerId != null) { +} else if (other.affinitizedTask != null) { return false; } {code} - I believe the other.affinitizedTask != null check should be done earlier before doing the != comparisons for vertex name and task index {code} + taskScheduler.allocateTask(taskAttempt, + event.getCapability(), + taskAttempt.getAssignedContainerID(), + Priority.newInstance(event.getPriority()), + event.getContainerContext(), + event); {code} - should this be using the affinityAttempt's container id? If the unit tests are not catching this, maybe add one more test? was (Author: hitesh): Comments: Not sure if this is really safe from a binary compatibility point of view. Might be worth testing a job that uses TaskLocationHints compiled against 0.5.0 and run using a 0.5.2-SNAPSHOT runtime. 
{code} +if (affinitizedTask != null) { + if (affinitizedTask.getTaskIndex() != other.affinitizedTask.getTaskIndex()) { +return false; + } else if (!affinitizedTask.getVertexName().equals(other.affinitizedTask.getVertexName())) { return false; } -} else if (other.containerId != null) { +} else if (other.affinitizedTask != null) { return false; } {code} - I believe the other.affinitizedTask != null should be done earlier before doing the != comparisons for vertex name and task index {code} + taskScheduler.allocateTask(taskAttempt, + event.getCapability(), + taskAttempt.getAssignedContainerID(), + Priority.newInstance(event.getPriority()), + event.getContainerContext(), + event); {code} - should this be using the affinityAttempt's container id? If the unit tests are not catching this, maybe add one more test? Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity Key: TEZ-1700 URL: https://issues.apache.org/jira/browse/TEZ-1700 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1700.1.patch Today 1-1 dependencies are affinitized by creating a task location hint with the producer task container id. It can be created by affinitizing to the producer task-index+vertexname combination instead and internally Tez can map it to the container. This also allows this dependency to be specified before the container is assigned. This allows the dependency to be generic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1716: - Attachment: TEZ-1716.1.patch [~bikassaha] [~sseth] [~gopalv] please review. Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1716.1.patch Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1720) Allow filters in all tables and also to pass in filters using url params
[ https://issues.apache.org/jira/browse/TEZ-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1720: - Fix Version/s: 0.6.0 Allow filters in all tables and also to pass in filters using url params Key: TEZ-1720 URL: https://issues.apache.org/jira/browse/TEZ-1720 Project: Apache Tez Issue Type: Sub-task Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Fix For: 0.6.0 Attachments: tez-1720.1.patch Need to make sure that all the tables in the ui can use filters and allow them to be set through url. this is needed for showing for ex the failed tasks for a dag/vertex etc and to bookmark searches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1720) Allow filters in all tables and also to pass in filters using url params
[ https://issues.apache.org/jira/browse/TEZ-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189211#comment-14189211 ] Hitesh Shah commented on TEZ-1720: -- Committed to branch TEZ-8 Allow filters in all tables and also to pass in filters using url params Key: TEZ-1720 URL: https://issues.apache.org/jira/browse/TEZ-1720 Project: Apache Tez Issue Type: Sub-task Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Fix For: 0.6.0 Attachments: tez-1720.1.patch Need to make sure that all the tables in the ui can use filters and allow them to be set through url. this is needed for showing for ex the failed tasks for a dag/vertex etc and to bookmark searches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1700) Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity
[ https://issues.apache.org/jira/browse/TEZ-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189239#comment-14189239 ] Bikas Saha commented on TEZ-1700: - Binary compatibility should be fine. Would be hard to do since 0.5.1 and 0.5.2 are already incompatible. bq. other.affinitizedTask != null Was just following the existing code flow. Will check this. bq. affinityAttempt's container id? Good catch. Will check why the existing test for affinity did not catch this. Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity Key: TEZ-1700 URL: https://issues.apache.org/jira/browse/TEZ-1700 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1700.1.patch Today 1-1 dependencies are affinitized by creating a task location hint with the producer task container id. It can be created by affinitizing to the producer task-index+vertexname combination instead and internally Tez can map it to the container. This also allows this dependency to be specified before the container is assigned. This allows the dependency to be generic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1700) Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity
[ https://issues.apache.org/jira/browse/TEZ-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189327#comment-14189327 ] Hitesh Shah commented on TEZ-1700: -- bq. Binary compatibility should be fine. Would be hard to do since 0.5.1 and 0.5.2 are already incompatible. They are incompatible in that a 0.5.1 client cannot use a 0.5.2 AM. But a job compiled against either 0.5.0 or 0.5.1 should work when used with the 0.5.2 jars ( both client and AM ). Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity Key: TEZ-1700 URL: https://issues.apache.org/jira/browse/TEZ-1700 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1700.1.patch Today 1-1 dependencies are affinitized by creating a task location hint with the producer task container id. It can be created by affinitizing to the producer task-index+vertexname combination instead and internally Tez can map it to the container. This also allows this dependency to be specified before the container is assigned. This allows the dependency to be generic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1716: - Attachment: TEZ-1716.2.patch Additional changes to push successful attempt id to ATS. Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1716.1.patch, TEZ-1716.2.patch Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1716: - Description: Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Add task's successful attempt id in task data. was: Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1716.1.patch, TEZ-1716.2.patch Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Add task's successful attempt id in task data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1700) Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity
[ https://issues.apache.org/jira/browse/TEZ-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189382#comment-14189382 ] Bikas Saha commented on TEZ-1700: - Made the changes. Tried with broadcastAndOneToOneExample from master branch and it worked fine. The existing test case (broadcastAndOneToOneExample) does not catch this because in the mini cluster there isn't enough parallelism and the preferred container gets matched by chance because there are only 2 containers around. The test might have become flaky after this. I added a new test case in TestTaskSchedulerEventHandler to test that it's doing the translation. Please take another look. Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity Key: TEZ-1700 URL: https://issues.apache.org/jira/browse/TEZ-1700 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1700.1.patch Today 1-1 dependencies are affinitized by creating a task location hint with the producer task container id. It can be created by affinitizing to the producer task-index+vertexname combination instead and internally Tez can map it to the container. This also allows this dependency to be specified before the container is assigned. This allows the dependency to be generic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1700) Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity
[ https://issues.apache.org/jira/browse/TEZ-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1700: Attachment: TEZ-1700.2.patch Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity Key: TEZ-1700 URL: https://issues.apache.org/jira/browse/TEZ-1700 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1700.1.patch, TEZ-1700.2.patch Today 1-1 dependencies are affinitized by creating a task location hint with the producer task container id. It can be created by affinitizing to the producer task-index+vertexname combination instead and internally Tez can map it to the container. This also allows this dependency to be specified before the container is assigned. This allows the dependency to be generic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1700) Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity
[ https://issues.apache.org/jira/browse/TEZ-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189392#comment-14189392 ] Hitesh Shah commented on TEZ-1700: -- Mostly looks good except for the equals() check. It does not handle other being non-null and this.affinity being null. I think the equals() probably deserves a unit test now. Replace containerId from TaskLocationHint with [TaskIndex+Vertex] based affinity Key: TEZ-1700 URL: https://issues.apache.org/jira/browse/TEZ-1700 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1700.1.patch, TEZ-1700.2.patch Today 1-1 dependencies are affinitized by creating a task location hint with the producer task container id. It can be created by affinitizing to the producer task-index+vertexname combination instead and internally Tez can map it to the container. This also allows this dependency to be specified before the container is assigned. This allows the dependency to be generic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
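For reference, a null-safe equals() that covers all four combinations (both null, either side null, both set) can be written by delegating to `java.util.Objects.equals`. This is only a sketch: `AffinitizedTaskSketch` and `LocationHintSketch` are hypothetical stand-ins, not the classes in the patch, and the real hint has further fields that would need the same treatment.

```java
import java.util.Objects;

// Hypothetical stand-in for the affinitized task (vertex name + task index).
final class AffinitizedTaskSketch {
    private final String vertexName;
    private final int taskIndex;

    AffinitizedTaskSketch(String vertexName, int taskIndex) {
        this.vertexName = vertexName;
        this.taskIndex = taskIndex;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof AffinitizedTaskSketch)) return false;
        AffinitizedTaskSketch other = (AffinitizedTaskSketch) o;
        return taskIndex == other.taskIndex
            && Objects.equals(vertexName, other.vertexName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(vertexName, taskIndex);
    }
}

// Hypothetical stand-in for the location hint; the affinity may be absent.
final class LocationHintSketch {
    private final AffinitizedTaskSketch affinitizedTask; // may be null

    LocationHintSketch(AffinitizedTaskSketch affinitizedTask) {
        this.affinitizedTask = affinitizedTask;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LocationHintSketch)) return false;
        LocationHintSketch other = (LocationHintSketch) o;
        // Objects.equals returns true for (null, null), false if exactly one
        // side is null, and otherwise calls affinitizedTask.equals(...).
        return Objects.equals(affinitizedTask, other.affinitizedTask);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(affinitizedTask);
    }
}
```

A unit test for this would check each of the four combinations in both directions, since equals() is expected to be symmetric.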
[jira] [Commented] (TEZ-1699) Vertex.setParallelism should throw an exception for invalid invocations
[ https://issues.apache.org/jira/browse/TEZ-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189395#comment-14189395 ] Hitesh Shah commented on TEZ-1699: -- +1 Vertex.setParallelism should throw an exception for invalid invocations --- Key: TEZ-1699 URL: https://issues.apache.org/jira/browse/TEZ-1699 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Critical Attachments: TEZ-1699.1.patch There is a return value of false when setParallelism is not successful. However, that may be ignored, and in some cases the invocation is actually incorrect and it's better to throw an exception than return false. Throwing an unchecked exception can allow doing this compatibly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189406#comment-14189406 ] Rajesh Balamohan commented on TEZ-1716: --- Minor comments. In DAGImpl.java, taskStats computation can be done in separate method to avoid code duplication? {code} Map<String, Integer> taskStats = new HashMap<String, Integer>(); ProgressBuilder progressBuilder = getDAGProgress(); taskStats.put(ATSConstants.NUM_COMPLETED_TASKS, progressBuilder.getTotalTaskCount()); ... {code} In DAGSubmittedEvent.java, can vertexNameIDMap be removed? Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1716.1.patch, TEZ-1716.2.patch Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Add task's successful attempt id in task data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
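One way to read the suggestion above is to build the stats map in a single helper that every call site shares. A minimal sketch, assuming a hypothetical `ProgressSketch` holder and made-up key names; the real code uses `ATSConstants` and `ProgressBuilder`, whose full set of fields is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical progress holder; not the Tez ProgressBuilder.
final class ProgressSketch {
    final int totalTasks;
    final int succeededTasks;
    final int failedTasks;
    final int killedTasks;

    ProgressSketch(int totalTasks, int succeededTasks, int failedTasks, int killedTasks) {
        this.totalTasks = totalTasks;
        this.succeededTasks = succeededTasks;
        this.failedTasks = failedTasks;
        this.killedTasks = killedTasks;
    }
}

final class TaskStatsSketch {
    // Key names are illustrative assumptions, not the ATSConstants values.
    static final String NUM_TOTAL = "numTotalTasks";
    static final String NUM_SUCCEEDED = "numSucceededTasks";
    static final String NUM_FAILED = "numFailedTasks";
    static final String NUM_KILLED = "numKilledTasks";

    private TaskStatsSketch() {
    }

    // Single place that turns a progress snapshot into the stats map,
    // so callers that publish stats do not repeat the put() calls.
    static Map<String, Integer> computeTaskStats(ProgressSketch progress) {
        Map<String, Integer> taskStats = new HashMap<String, Integer>();
        taskStats.put(NUM_TOTAL, progress.totalTasks);
        taskStats.put(NUM_SUCCEEDED, progress.succeededTasks);
        taskStats.put(NUM_FAILED, progress.failedTasks);
        taskStats.put(NUM_KILLED, progress.killedTasks);
        return taskStats;
    }
}
```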
[jira] [Commented] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189484#comment-14189484 ] Bikas Saha commented on TEZ-1716: - Shouldn't this come from YARN AHS? {code}+AppLaunchedEvent appLaunchedEvent = new AppLaunchedEvent(appAttemptID.getApplicationId(), +startTime, appSubmitTime, appMasterUgi.getShortUserName(), this.amConf); +historyEventHandler.handle({code} Why is vertexName To Id mapping moved to inited from submitted event? Can this mapping be passed in the vertex initialized event instead of via an initial map? Doing it via the vertex initialized event will make it continue to work when we add vertices at runtime. Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1716.1.patch, TEZ-1716.2.patch Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Add task's successful attempt id in task data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189484#comment-14189484 ] Bikas Saha edited comment on TEZ-1716 at 10/30/14 2:04 AM: --- Why is vertexNameToId mapping moved to inited from submitted event? Can this mapping be passed in the vertex initialized event instead of via an initial map? Doing it via the vertex initialized event will make it continue to work when we add vertices at runtime. was (Author: bikassaha): Shouldn't this come from YARN AHS? {code}+AppLaunchedEvent appLaunchedEvent = new AppLaunchedEvent(appAttemptID.getApplicationId(), +startTime, appSubmitTime, appMasterUgi.getShortUserName(), this.amConf); +historyEventHandler.handle({code} Why is vertexName To Id mapping moved to inited from submitted event? Can this mapping be passed in the vertex initialized event instead of via an initial map? Doing it via the vertex initialized event will make it continue to work when we add vertices at runtime. Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1716.1.patch, TEZ-1716.2.patch Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Add task's successful attempt id in task data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189506#comment-14189506 ] Hitesh Shah commented on TEZ-1716: -- [~bikassaha] It is moving to DAGInitialized as that is where the ids are being generated. Cannot use the VertexInit event as that will result in one more additional call to ATS to update the DAG entity. The main objective of this was to get the name to id mapping from the dag entity instead of querying all vertices to do the correlation. [~rajesh.balamohan] Will address the comments in the next patch. Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1716.1.patch, TEZ-1716.2.patch Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Add task's successful attempt id in task data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1711) Don't cache outputSpecList in VertexImpl.getOutputSpecList(taskIndex)
[ https://issues.apache.org/jira/browse/TEZ-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1711: Attachment: TEZ-1711-2.patch Don't cache outputSpecList in VertexImpl.getOutputSpecList(taskIndex) - Key: TEZ-1711 URL: https://issues.apache.org/jira/browse/TEZ-1711 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1711-2.patch, TEZ-1711.patch VertexImpl.getOutputSpecList(taskIndex) caches the outputSpecList, but I don't think we should cache it, as it depends on the taskIndex. In all the current EdgeManagerPlugin implementations the value is the same no matter what the taskIndex is, but there is a risk if a new EdgeManagerPlugin behaves differently. Or, if this case can never happen, then just remove the taskIndex from the method parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1716: - Attachment: TEZ-1716.3.patch Patch with [~rajesh.balamohan]'s comments addressed. Additional ATS data for UI -- Key: TEZ-1716 URL: https://issues.apache.org/jira/browse/TEZ-1716 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1716.1.patch, TEZ-1716.2.patch, TEZ-1716.3.patch Add failed and killed attempt info at DAG and Vertex Level. Add tez-site configuration contents to Tez App Entity Add task's successful attempt id in task data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1711) Don't cache outputSpecList in VertexImpl.getOutputSpecList(taskIndex)
[ https://issues.apache.org/jira/browse/TEZ-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189534#comment-14189534 ] Jeff Zhang commented on TEZ-1711: - [~bikassaha] Attached a new patch. bq. Given this change, should we remove inputSpecList and outputSpecList as member vars of VertexImpl? They are still members of VertexImpl, but they are created in the constructor and cleared in getInputSpecList and getOutputSpecList. This avoids creating a new List on each call, which matters especially for large jobs. Does that make sense to you? bq. Why is this change making the test DAG fail? The affected test case is for TEZ-1689 (Exception handling for EdgeManagerPlugin). Without this patch, only the first task attempt fails on the AM side; the following task attempts are not affected on the AM side (because we cache the outputSpecList), but will throw an exception in TezChild since they don't get the correct outputSpecList (that cannot be simulated in a unit test, which can only simulate behavior on the AM side). So without this patch, the AM would think the DAG is still running, while with this patch all the task attempts fail on the AM side and finally cause the DAG to fail. Don't cache outputSpecList in VertexImpl.getOutputSpecList(taskIndex) - Key: TEZ-1711 URL: https://issues.apache.org/jira/browse/TEZ-1711 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1711-2.patch, TEZ-1711.patch VertexImpl.getOutputSpecList(taskIndex) caches the outputSpecList, but I don't think we should cache it, as it depends on the taskIndex. In all the current EdgeManagerPlugin implementations the value is the same no matter what the taskIndex is, but there is a risk if a new EdgeManagerPlugin behaves differently. Or, if this case can never happen, then just remove the taskIndex from the method parameters.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated TEZ-1703:
Attachment: TEZ-1703-3.patch

Exception handling for InputInitializer

Key: TEZ-1703
URL: https://issues.apache.org/jira/browse/TEZ-1703
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.1
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Attachments: TEZ-1703-2.patch, TEZ-1703-3.patch, TEZ-1703.patch

For handleInputInitializerEvent - this should be fairly straightforward to handle. At the moment this is an inline call from within the AsyncDispatcher, and will end up causing a RuntimeException. The RuntimeException can be changed to an AMUserCodeException which will take care of this.
For onVertexStateUpdated, this eventually gets invoked from within RootInputInitializerManager. Catching exceptions there and sending a RootInputInitializerFailedEvent should be enough to fix this? May require some state machine changes to handle this event on a few more states.
[jira] [Commented] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189576#comment-14189576 ]

Jeff Zhang commented on TEZ-1703:

bq.
{code}
DAGTerminationCause.VERTEX_FAILURE, vertexEvent.getVertexTerminationCause() == null ? VertexTerminationCause.OTHER_VERTEX_FAILURE : vertexEvent.getVertexTerminationCause());
DAGTerminationCause.VERTEX_FAILURE, VertexTerminationCause.OTHER_VERTEX_FAILURE);
{code}
bq. This is required so that all vertices don't get the same termination cause as the first vertex to fail ?
Yes, otherwise all the vertices would have the same termination cause, which doesn't make sense to me. Besides, there would be an issue in VertexImpl.checkVertexForCompletion, which checks the termination cause but does not check for ROOT_INPUT_INIT_FAILURE.

bq. Prior to the patch
bq. It's possible for events/notifications to be sent to a complete Initializer since the initializers / events are handled in separate threads. The setComplete() and isComplete checks aren't sufficient to avoid this.
bq. Ideally, completed initializers should just handle these events gracefully, but that's not something that Tez can guarantee. We need to handle such situations, likely in a separate jira.
After initialization completes, InputInitializerManager is shut down. Will that solve this issue?

bq. Will inputInitializer failures never go through this transition ? It may be better to set this up based on the SOURCE information available in the exception.
InputInitializer will set the TerminationCause to ROOT_INIT_FAILURE rather than AM_USERCODE_EXCEPTION, which is a special cause. Maybe we could still split AMUserCodeException into VertexManagerException/EdgeManagerException; then it would be much clearer and more consistent.

bq. The state machine in VertexImpl will need to change to handle INITIALIER_FAILED in some more states, and fail the vertex.
Added more transitions in the state machine. But there is one tricky case: INIT_SUCCEEDED followed by INIT_FAILURE. Because INIT_SUCCEEDED would shut down InputInitializerManager, in that case the InputInitializer thread would be interrupted, and

bq. And +1 for renaming the file. Please do that just before the commit though - not as part of iterative patches.
Actually it is about more than renaming RootInputInitializerManager. The full set of renames is:
* RootInputInitializerManager -> InputInitializerManager
* TezRootInputInitializerContextImpl -> TezInputInitializerContextImpl
* VertexEventRootInputInitialized -> VertexEventInputInitialized
* VertexEventRootInputFailed -> VertexEventInputFailed
* VertexTerminationCause.ROOT_INPUT_INIT_FAILURE -> VertexTerminationCause.INPUT_INIT_FAILURE
* EventType.ROOT_INPUT_DATA_INFORMATION_EVENT -> EventType.INPUT_DATA_INFORMATION_EVENT
* EventType.ROOT_INPUT_INITIALIZER_EVENT -> EventType.INPUT_INITIALIZER_EVENT
* VertexEventType.V_ROOT_INPUT_INITIALIZED -> VertexEventType.V_INPUT_INITIALIZED
* VertexEventType.V_ROOT_INPUT_FAILED -> VertexEventType.V_INPUT_INIT_FAILED

Exception handling for InputInitializer

Key: TEZ-1703
URL: https://issues.apache.org/jira/browse/TEZ-1703
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.1
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Attachments: TEZ-1703-2.patch, TEZ-1703-3.patch, TEZ-1703.patch

For handleInputInitializerEvent - this should be fairly straightforward to handle. At the moment this is an inline call from within the AsyncDispatcher, and will end up causing a RuntimeException. The RuntimeException can be changed to an AMUserCodeException which will take care of this.
For onVertexStateUpdated, this eventually gets invoked from within RootInputInitializerManager. Catching exceptions there and sending a RootInputInitializerFailedEvent should be enough to fix this? May require some state machine changes to handle this event on a few more states.
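The approach in the issue description (catch exceptions thrown by user code inside the manager and report them to the vertex as a failure event, instead of letting them propagate to the caller) could look roughly like the sketch below. This is a minimal illustration under stated assumptions, not the patch itself: `UserInitializer`, `InputFailedEvent`, and `InitializerManagerSketch` are hypothetical names, and a plain list stands in for the event dispatcher.

```java
import java.util.List;

// Hypothetical user-code callback; stands in for a method on an InputInitializer.
interface UserInitializer {
  void onVertexStateUpdated(String vertexName) throws Exception;
}

// Hypothetical failure event that would be delivered to the owning vertex.
class InputFailedEvent {
  final String inputName;
  final Throwable cause;

  InputFailedEvent(String inputName, Throwable cause) {
    this.inputName = inputName;
    this.cause = cause;
  }
}

// Hypothetical manager that wraps every call into user code.
class InitializerManagerSketch {
  private final String inputName;
  private final UserInitializer initializer;
  private final List<InputFailedEvent> eventQueue; // stand-in for the dispatcher

  InitializerManagerSketch(String inputName, UserInitializer initializer,
                           List<InputFailedEvent> eventQueue) {
    this.inputName = inputName;
    this.initializer = initializer;
    this.eventQueue = eventQueue;
  }

  void onStateUpdated(String vertexName) {
    try {
      initializer.onVertexStateUpdated(vertexName);
    } catch (Exception e) {
      // Do not rethrow to the notifier thread; it has no context to handle this.
      // Report the failure to the vertex and let its state machine decide.
      eventQueue.add(new InputFailedEvent(inputName, e));
    }
  }
}
```

The design choice here is that the notifying thread never sees the user exception, so the decision about how to fail stays with the component that has the context.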
[jira] [Comment Edited] (TEZ-1703) Exception handling for InputInitializer
[ https://issues.apache.org/jira/browse/TEZ-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189576#comment-14189576 ]

Jeff Zhang edited comment on TEZ-1703 at 10/30/14 3:28 AM:

bq.
{code}
DAGTerminationCause.VERTEX_FAILURE, vertexEvent.getVertexTerminationCause() == null ? VertexTerminationCause.OTHER_VERTEX_FAILURE : vertexEvent.getVertexTerminationCause());
DAGTerminationCause.VERTEX_FAILURE, VertexTerminationCause.OTHER_VERTEX_FAILURE);
{code}
bq. This is required so that all vertices don't get the same termination cause as the first vertex to fail ?
Yes, otherwise all the vertices would have the same termination cause, which doesn't make sense to me. Besides, there would be an issue in VertexImpl.checkVertexForCompletion, which checks the termination cause but does not check for ROOT_INPUT_INIT_FAILURE.

bq. Prior to the patch
bq. It's possible for events/notifications to be sent to a complete Initializer since the initializers / events are handled in separate threads. The setComplete() and isComplete checks aren't sufficient to avoid this.
bq. Ideally, completed initializers should just handle these events gracefully, but that's not something that Tez can guarantee. We need to handle such situations, likely in a separate jira.
After initialization completes, InputInitializerManager is shut down. Will that solve this issue?

bq. Will inputInitializer failures never go through this transition ? It may be better to set this up based on the SOURCE information available in the exception.
InputInitializer will set the TerminationCause to ROOT_INIT_FAILURE rather than AM_USERCODE_EXCEPTION, which is a special cause. Maybe we could still split AMUserCodeException into VertexManagerException/EdgeManagerException; then it would be much clearer and more consistent.

bq. The state machine in VertexImpl will need to change to handle INITIALIER_FAILED in some more states, and fail the vertex.
Added more transitions in the state machine.

bq. And +1 for renaming the file. Please do that just before the commit though - not as part of iterative patches.
Actually it is about more than renaming RootInputInitializerManager. The full set of renames is:
* RootInputInitializerManager -> InputInitializerManager
* TezRootInputInitializerContextImpl -> TezInputInitializerContextImpl
* VertexEventRootInputInitialized -> VertexEventInputInitialized
* VertexEventRootInputFailed -> VertexEventInputFailed
* VertexTerminationCause.ROOT_INPUT_INIT_FAILURE -> VertexTerminationCause.INPUT_INIT_FAILURE
* EventType.ROOT_INPUT_DATA_INFORMATION_EVENT -> EventType.INPUT_DATA_INFORMATION_EVENT
* EventType.ROOT_INPUT_INITIALIZER_EVENT -> EventType.INPUT_INITIALIZER_EVENT
* VertexEventType.V_ROOT_INPUT_INITIALIZED -> VertexEventType.V_INPUT_INITIALIZED
* VertexEventType.V_ROOT_INPUT_FAILED -> VertexEventType.V_INPUT_INIT_FAILED
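One possible reading of the termination-cause exchange in this comment, as a minimal sketch: the vertex that actually failed keeps its own cause (with a null-safe fallback like the quoted ternary), while the remaining vertices are terminated with OTHER_VERTEX_FAILURE rather than inheriting the first vertex's cause. `CauseSketch` and `DagSketch` are hypothetical names and do not mirror the real DAG or vertex implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical subset of termination causes, named after those in the discussion.
enum CauseSketch { ROOT_INPUT_INIT_FAILURE, OTHER_VERTEX_FAILURE }

// Hypothetical DAG that records a termination cause per vertex.
class DagSketch {
  // vertex name -> termination cause (null while the vertex is still running)
  private final Map<String, CauseSketch> causes = new LinkedHashMap<>();

  DagSketch(String... vertexNames) {
    for (String name : vertexNames) {
      causes.put(name, null);
    }
  }

  void onVertexFailed(String failedVertex, CauseSketch reportedCause) {
    // Null-safe fallback, as in the ternary quoted in the comment.
    causes.put(failedVertex,
        reportedCause == null ? CauseSketch.OTHER_VERTEX_FAILURE : reportedCause);
    // Every vertex that is still running is terminated because another vertex
    // failed, so it does not inherit the first vertex's specific cause.
    for (Map.Entry<String, CauseSketch> entry : causes.entrySet()) {
      if (entry.getValue() == null) {
        entry.setValue(CauseSketch.OTHER_VERTEX_FAILURE);
      }
    }
  }

  CauseSketch causeOf(String vertexName) {
    return causes.get(vertexName);
  }
}
```

Keeping the specific cause only on the failing vertex means a later completion check can tell the original failure apart from the knock-on terminations.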
[jira] [Commented] (TEZ-1547) Make use of state change notifier in VertexManagerPlugins
[ https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189606#comment-14189606 ]

Rajesh Balamohan commented on TEZ-1547:

Corner case in ImmediateStartVertexManager:
1. onVertexStarted() gets called and is in the middle of populating srcVertexConfigured. Assume it has to populate 2 items in srcVertexConfigured and has populated 1 item in the map so far.
2. In the meantime, onVertexStateUpdated() gets called with COMPLETELY_CONFIGURED for the item already in srcVertexConfigured.
3. In this case, canScheduleTasks() would return true (without being aware of the 2nd item that is yet to be populated in srcVertexConfigured).
4. If the source pertaining to the 2nd item changes its parallelism, the DAG can hang indefinitely.

{code}
e.g. log:
2014-10-29 20:38:46,172 INFO [AsyncDispatcher event handler] impl.ImmediateStartVertexManager: Task count in Map_7: 1
2014-10-29 20:38:46,173 INFO [AsyncDispatcher event handler] impl.ImmediateStartVertexManager: Received configured notification : COMPLETELY_CONFIGURED for vertex: Map_7
2014-10-29 20:38:46,173 INFO [AsyncDispatcher event handler] impl.ImmediateStartVertexManager: Starting 10 in Map_5
2014-10-29 20:38:46,173 INFO [AsyncDispatcher event handler] impl.ImmediateStartVertexManager: Task count in Reducer_3: 2
...
...
2014-10-29 20:39:18,682 INFO [AsyncDispatcher event handler] vertexmanager.ShuffleVertexManager: Reduce auto parallelism for vertex: Reducer_3 to 1 from 2 . Expected output: 0 based on actual output: 0 from 1 vertex manager events. desiredTaskInputSize: 104857600 max slow start tasks:0.1 num sources completed:1
{code}

In short, a check should be added in scheduleTasks() to ensure that srcVertexConfigured has been completely populated by onVertexStarted().

Make use of state change notifier in VertexManagerPlugins

Key: TEZ-1547
URL: https://issues.apache.org/jira/browse/TEZ-1547
Project: Apache Tez
Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Bikas Saha
Attachments: TEZ-1547.1.patch, TEZ-1547.3.patch, TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch, TEZ-1547.7.patch

Instead of the various APIs like onVertexStarted, simple notifications could be sent. Some existing APIs could end up being deprecated.
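The corner case reported in the comment on this issue, together with the suggested check, can be sketched as follows. This is an illustrative model only: `StartManagerSketch` and its methods are hypothetical, and registration is split into separate calls purely so the interleaving can be reproduced deterministically. It assumes the fix is a flag that is set once all sources have been registered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a vertex manager that waits for all sources to be configured.
class StartManagerSketch {
  // source vertex name -> whether a "configured" notification has arrived
  private final Map<String, Boolean> srcVertexConfigured = new HashMap<>();
  private boolean fullyPopulated = false;

  // Called once per source while the manager is starting up.
  synchronized void registerSource(String name) {
    // putIfAbsent so an early notification is not overwritten
    srcVertexConfigured.putIfAbsent(name, false);
  }

  // Called after the last source has been registered.
  synchronized void finishRegistration() {
    fullyPopulated = true;
  }

  // Notification that a source is completely configured.
  synchronized void onVertexStateUpdated(String name) {
    srcVertexConfigured.put(name, true);
  }

  synchronized boolean canScheduleTasks() {
    if (!fullyPopulated) {
      return false; // the suggested guard: the map may still be missing sources
    }
    for (boolean configured : srcVertexConfigured.values()) {
      if (!configured) {
        return false;
      }
    }
    return true;
  }
}
```

Without the `fullyPopulated` guard, a notification for the first source arriving before the second source is registered would make every known entry look configured, which matches the premature scheduling described in steps 2 and 3.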
[jira] [Commented] (TEZ-1716) Additional ATS data for UI
[ https://issues.apache.org/jira/browse/TEZ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189642#comment-14189642 ]

Rajesh Balamohan commented on TEZ-1716:

Looked at the latest patch. lgtm. +1.

Additional ATS data for UI

Key: TEZ-1716
URL: https://issues.apache.org/jira/browse/TEZ-1716
Project: Apache Tez
Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah
Attachments: TEZ-1716.1.patch, TEZ-1716.2.patch, TEZ-1716.3.patch

Add failed and killed attempt info at DAG and Vertex Level.
Add tez-site configuration contents to Tez App Entity.
Add task's successful attempt id in task data.