[jira] [Updated] (TEZ-1524) getDAGStatus seems to fork out the entire JVM
[ https://issues.apache.org/jira/browse/TEZ-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated TEZ-1524:
    Attachment: TEZ-1524.2.patch

getDAGStatus seems to fork out the entire JVM
    Key: TEZ-1524
    URL: https://issues.apache.org/jira/browse/TEZ-1524
    Project: Apache Tez
    Issue Type: Bug
    Affects Versions: 0.5.0
    Reporter: Gopal V
    Assignee: Gopal V
    Attachments: TEZ-1524.1.patch, TEZ-1524.2.patch

Tracked down a consistent fork() call to
{code}
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:83)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52)
at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:50)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:139)
at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1409)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getRPCUserGroups(DAGClientAMProtocolBlockingPBServerImpl.java:75)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375)
{code}
[~hitesh] - would it make sense to cache this at all?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
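One way to read the caching question above is to memoize the per-user group lookup with a time-to-live, so that repeated status calls for the same user do not each shell out. The following is only a minimal sketch under that assumption and is not the attached patch: `CachedGroupLookup`, its `loader` function, the TTL value and the injected clock are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache around an expensive user -> groups lookup. */
final class CachedGroupLookup {
  /** Cached groups plus the time at which they were loaded. */
  private static final class Entry {
    final List<String> groups;
    final long loadedAtMillis;

    Entry(List<String> groups, long loadedAtMillis) {
      this.groups = groups;
      this.loadedAtMillis = loadedAtMillis;
    }
  }

  private final Map<String, Entry> cache = new ConcurrentHashMap<>();
  private final Function<String, List<String>> loader; // the expensive lookup
  private final long ttlMillis;
  private final LongSupplier clock; // injected so tests can control time

  CachedGroupLookup(Function<String, List<String>> loader, long ttlMillis, LongSupplier clock) {
    this.loader = loader;
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Returns cached groups if still fresh, otherwise reloads them. */
  List<String> getGroups(String user) {
    long now = clock.getAsLong();
    Entry entry = cache.get(user);
    if (entry == null || now - entry.loadedAtMillis >= ttlMillis) {
      // Two threads may both reload here; that costs an extra lookup but is harmless.
      List<String> fresh = Collections.unmodifiableList(new ArrayList<>(loader.apply(user)));
      entry = new Entry(fresh, now);
      cache.put(user, entry);
    }
    return entry.groups;
  }
}
```

The trade-off in a sketch like this is staleness: a group membership change is not seen until the entry expires, so the TTL would need to be chosen with ACL checks in mind.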
[jira] [Updated] (TEZ-1534) Make client side configs available to AM and tasks
[ https://issues.apache.org/jira/browse/TEZ-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-1534:
    Attachment: TEZ-1534.3.txt

Updated to remove the extra conf creation. Wasn't sure if the ACLManager modified the conf. Thanks for the reviews. Committing.

Make client side configs available to AM and tasks
    Key: TEZ-1534
    URL: https://issues.apache.org/jira/browse/TEZ-1534
    Project: Apache Tez
    Issue Type: Improvement
    Reporter: Siddharth Seth
    Assignee: Siddharth Seth
    Attachments: TEZ-1534.1.txt, TEZ-1534.2.txt, TEZ-1534.3.txt

Configs from the client (specifically the ones provided to TezClient, along with YARN additions) should be shipped over to the cluster (AM and tasks), instead of AM/tasks depending on configs present on cluster nodes. These configs will primarily be used for Tez components like RPC servers, clients etc - and not by the Processor / Input / Output - which should be sending over fully configured payloads in any case. Tez should continue to run without core-site, hdfs-site, yarn-site etc in the classpath on cluster nodes.
[jira] [Created] (TEZ-1571) Add create method for DataSinkDescriptor
Jeff Zhang created TEZ-1571: --- Summary: Add create method for DataSinkDescriptor Key: TEZ-1571 URL: https://issues.apache.org/jira/browse/TEZ-1571 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Add create method for DataSinkDescriptor, and make the constructor private. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
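The requested shape (a static create method plus a private constructor) can be sketched as below. This is a hedged illustration only: `SinkDescriptorSketch` and its two string fields are hypothetical stand-ins and do not reflect the real DataSinkDescriptor fields or signature.

```java
/** Hypothetical stand-in for a descriptor type; not the real Tez class. */
final class SinkDescriptorSketch {
  private final String outputClassName;
  private final String committerClassName; // optional, may be null

  // Private constructor: callers must go through create().
  private SinkDescriptorSketch(String outputClassName, String committerClassName) {
    this.outputClassName = outputClassName;
    this.committerClassName = committerClassName;
  }

  /** Static factory; the single place where arguments are validated. */
  static SinkDescriptorSketch create(String outputClassName, String committerClassName) {
    if (outputClassName == null || outputClassName.isEmpty()) {
      throw new IllegalArgumentException("outputClassName must be set");
    }
    return new SinkDescriptorSketch(outputClassName, committerClassName);
  }

  String getOutputClassName() {
    return outputClassName;
  }

  String getCommitterClassName() {
    return committerClassName;
  }
}
```

Keeping the constructor private leaves room to change how instances are built later without breaking callers of `create`.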
[jira] [Created] (TEZ-1572) Add throw Exception for method handleEvents of OutputFrameworkInterface
Jeff Zhang created TEZ-1572: --- Summary: Add throw Exception for method handleEvents of OutputFrameworkInterface Key: TEZ-1572 URL: https://issues.apache.org/jira/browse/TEZ-1572 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang align the interface with InputFrameworkInterface -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
[ https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-1539:
    Attachment: TEZ-1539.3.txt

Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
    Key: TEZ-1539
    URL: https://issues.apache.org/jira/browse/TEZ-1539
    Project: Apache Tez
    Issue Type: Improvement
    Reporter: Siddharth Seth
    Assignee: Siddharth Seth
    Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt

Specifically for InputInitializerEvents and VertexManagerEvents. Pasting comment from TEZ-1447:

In a majority of cases, events generated by different attempts of the same task will be identical - in which case just making use of the event generated by the first successful attempt is adequate. Doing something like this means that users don't worry about retries, indices etc - and can just rely on receiving a set of events which are to be processed once the vertex succeeds.

If different attempts of the same workload generate different events - processing is likely to be incorrect, since it's very possible for all data to be processed (VERTEX successful), then a failure and retry - which generates a different event. The initializer doesn't even run at this point, since it's already done its work and is complete. Handling such scenarios likely involves re-running the entire initializer and re-starting the vertex which processed the event from scratch. In situations like this, where data generated may be different, the best bet is for speculation to be disabled (when it's supported), and max-attempts to be set to 1.
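The model described above (only the events of the first successful attempt of a task are delivered) can be illustrated with a small buffer. This is a sketch under stated assumptions, not the code in the attached patches: `FireOnceOnSuccessBuffer` and its method names are hypothetical, tasks are identified by a plain index, and the class is not thread-safe.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical buffer that releases events once per task, on first success. */
final class FireOnceOnSuccessBuffer<E> {
  private final Map<String, List<E>> pendingByAttempt = new HashMap<>();
  private final Set<Integer> firedTasks = new HashSet<>();

  private static String key(int taskIndex, int attemptNumber) {
    return taskIndex + ":" + attemptNumber;
  }

  /** Holds an event back until the outcome of its attempt is known. */
  void onEvent(int taskIndex, int attemptNumber, E event) {
    pendingByAttempt
        .computeIfAbsent(key(taskIndex, attemptNumber), k -> new ArrayList<>())
        .add(event);
  }

  /** Returns the events to deliver; empty if the task has already fired. */
  List<E> onAttemptSucceeded(int taskIndex, int attemptNumber) {
    List<E> buffered = pendingByAttempt.remove(key(taskIndex, attemptNumber));
    if (!firedTasks.add(taskIndex)) {
      return Collections.emptyList(); // a later success (e.g. a retry) is ignored
    }
    return buffered == null ? Collections.<E>emptyList() : buffered;
  }

  /** Events of a failed attempt are dropped and never delivered. */
  void onAttemptFailed(int taskIndex, int attemptNumber) {
    pendingByAttempt.remove(key(taskIndex, attemptNumber));
  }
}
```

Note that this sketch simply ignores events from later successful attempts, which matches the "identical events" case; it does nothing to detect the problematic case where a retry produces different events.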
[jira] [Updated] (TEZ-1573) Exception from InputInitializer is not propogated to client
[ https://issues.apache.org/jira/browse/TEZ-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1573: Issue Type: Sub-task (was: Bug) Parent: TEZ-1240 Exception from InputInitializer is not propogated to client --- Key: TEZ-1573 URL: https://issues.apache.org/jira/browse/TEZ-1573 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1568) Add system test for propagation of diagnostics for errors
[ https://issues.apache.org/jira/browse/TEZ-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated TEZ-1568:
    Description: Design a system test where exceptions come from Input, Output, Processor, InputInitializer and VertexManagerPlugin (was: Design a system test where exceptions come from Input, Output, Processor)

Add system test for propagation of diagnostics for errors
    Key: TEZ-1568
    URL: https://issues.apache.org/jira/browse/TEZ-1568
    Project: Apache Tez
    Issue Type: Sub-task
    Reporter: Jeff Zhang
    Assignee: Jeff Zhang

Design a system test where exceptions come from Input, Output, Processor, InputInitializer and VertexManagerPlugin
[jira] [Updated] (TEZ-853) Support counters recovery
[ https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-853: --- Attachment: Tez-853-3.patch Support counters recovery - Key: TEZ-853 URL: https://issues.apache.org/jira/browse/TEZ-853 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-853) Support counters recovery
[ https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130011#comment-14130011 ]

Jeff Zhang commented on TEZ-853:

Changes in the new patch.
* Fix one small bug in DAGStatus (use method getDAGCounters instead of field dagCounters, because it may not have been deserialized from proto).
* Remove counters from VertexFinishedProto and TaskFinishedProto, but keep them in the Java classes VertexFinishedEvent and TaskFinishedEvent for history events.
* Fix counters recovery issue in TaskAttemptImpl and DAGImpl.
* Move the counter recovery unit test into TestAMRecovery.

Support counters recovery
    Key: TEZ-853
    URL: https://issues.apache.org/jira/browse/TEZ-853
    Project: Apache Tez
    Issue Type: Sub-task
    Reporter: Hitesh Shah
    Assignee: Jeff Zhang
    Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853.patch
[jira] [Updated] (TEZ-1572) Add throw Exception for method handleEvents of OutputFrameworkInterface and ProcessorFrameworkInterface
[ https://issues.apache.org/jira/browse/TEZ-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen He updated TEZ-1572:
    Attachment: TEZ-1572.patch

Add "throws Exception" to OutputFrameworkInterface and ProcessorFrameworkInterface, and to the classes that implement these two interfaces; also fix some warnings in related classes.

Add throw Exception for method handleEvents of OutputFrameworkInterface and ProcessorFrameworkInterface
    Key: TEZ-1572
    URL: https://issues.apache.org/jira/browse/TEZ-1572
    Project: Apache Tez
    Issue Type: Bug
    Reporter: Jeff Zhang
    Attachments: TEZ-1572.patch

align the interface with InputFrameworkInterface
[jira] [Updated] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized
[ https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1345: - Attachment: TEZ-1345.12.rebased.patch Attaching rebased patch - ready to commit. Add checks to guarantee all init events are written to recovery to consider vertex initialized -- Key: TEZ-1345 URL: https://issues.apache.org/jira/browse/TEZ-1345 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-1345.12.rebased.patch, Tez-1345-10.patch, Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch Related to issue discovered in TEZ-1033 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized
[ https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1345: - Attachment: TEZ-1345.12.rebased.2.patch With missing file. Add checks to guarantee all init events are written to recovery to consider vertex initialized -- Key: TEZ-1345 URL: https://issues.apache.org/jira/browse/TEZ-1345 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-1345.12.rebased.2.patch, TEZ-1345.12.rebased.patch, Tez-1345-10.patch, Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch Related to issue discovered in TEZ-1033 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized
[ https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah reopened TEZ-1345: -- Add checks to guarantee all init events are written to recovery to consider vertex initialized -- Key: TEZ-1345 URL: https://issues.apache.org/jira/browse/TEZ-1345 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Fix For: 0.6.0 Attachments: TEZ-1345.12.rebased.2.patch, TEZ-1345.12.rebased.patch, Tez-1345-10.patch, Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch Related to issue discovered in TEZ-1033 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized
[ https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130448#comment-14130448 ]

Hitesh Shah commented on TEZ-1345:

Reverted commit for now as unit tests failed in build https://builds.apache.org/job/Tez-Build/628

Add checks to guarantee all init events are written to recovery to consider vertex initialized
    Key: TEZ-1345
    URL: https://issues.apache.org/jira/browse/TEZ-1345
    Project: Apache Tez
    Issue Type: Sub-task
    Reporter: Hitesh Shah
    Assignee: Jeff Zhang
    Fix For: 0.6.0
    Attachments: TEZ-1345.12.rebased.2.patch, TEZ-1345.12.rebased.patch, Tez-1345-10.patch, Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch

Related to issue discovered in TEZ-1033
[jira] [Commented] (TEZ-853) Support counters recovery
[ https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130500#comment-14130500 ]

Hitesh Shah commented on TEZ-853:

Mostly looks good. With respect to the tests, they rely on timing. Have you done multiple runs of the test to ensure that they are not flaky?

Support counters recovery
    Key: TEZ-853
    URL: https://issues.apache.org/jira/browse/TEZ-853
    Project: Apache Tez
    Issue Type: Sub-task
    Reporter: Hitesh Shah
    Assignee: Jeff Zhang
    Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853.patch
[jira] [Created] (TEZ-1574) Support additional formats for the tez deployed archive
Siddharth Seth created TEZ-1574: --- Summary: Support additional formats for the tez deployed archive Key: TEZ-1574 URL: https://issues.apache.org/jira/browse/TEZ-1574 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Assignee: Siddharth Seth Currently, we only look for .tgz and .tar.gz. Looking at extensions isn't the best method - but for now, this jira is to expand this list. Improving the mechanism to detect an archive will be a separate jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1574) Support additional formats for the tez deployed archive
[ https://issues.apache.org/jira/browse/TEZ-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1574: Attachment: TEZ-1574.1.txt Trivial patch. Additional checks for .zip and .tar. [~hitesh] - please review. Support additional formats for the tez deployed archive --- Key: TEZ-1574 URL: https://issues.apache.org/jira/browse/TEZ-1574 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-1574.1.txt Currently, we only look for .tgz and .tar.gz. Looking at extensions isn't the best method - but for now, this jira is to expand this list. Improving the mechanism to detect an archive will be a separate jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
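The extension check described above, with the existing .tgz and .tar.gz suffixes extended by .zip and .tar, could look roughly like the sketch below. It is not the attached patch: `ArchiveNameCheck` and `looksLikeArchive` are hypothetical names, and the case-insensitive comparison is an assumption rather than something stated in the issue.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that guesses "is this an archive?" from the file name alone. */
final class ArchiveNameCheck {
  // Suffixes named in the issue and in the patch comment.
  private static final List<String> ARCHIVE_SUFFIXES =
      Arrays.asList(".tar.gz", ".tgz", ".zip", ".tar");

  private ArchiveNameCheck() {}

  /** Returns true if the name ends with one of the known archive suffixes. */
  static boolean looksLikeArchive(String fileName) {
    // Assumption: compare case-insensitively so "TEZ.TGZ" is also accepted.
    String lower = fileName.toLowerCase(Locale.ROOT);
    for (String suffix : ARCHIVE_SUFFIXES) {
      if (lower.endsWith(suffix)) {
        return true;
      }
    }
    return false;
  }
}
```

As the issue itself notes, a name-based check is a stopgap; a file such as `notes.zip.txt` is correctly rejected, but a mislabeled or extensionless archive would be missed.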
[jira] [Commented] (TEZ-853) Support counters recovery
[ https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130662#comment-14130662 ]

Hitesh Shah commented on TEZ-853:

After applying the patch, saw unit tests failing:
{code}
Failed tests:
  TestTaskAttemptRecovery.testTARecovery_START:120 eventHandler.handle(any);
Wanted 1 time:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_START(TestTaskAttemptRecovery.java:120)
But was 3 times. Undesired invocation:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)
  TestTaskAttemptRecovery.testTARecovery_FAILED:162 eventHandler.handle(any);
Never wanted here:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_FAILED(TestTaskAttemptRecovery.java:162)
But invoked here:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)
  TestTaskAttemptRecovery.testTARecovery_KIILED:148 eventHandler.handle(any);
Never wanted here:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_KIILED(TestTaskAttemptRecovery.java:148)
But invoked here:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)
  TestTaskAttemptRecovery.testTARecovery_SUCCEED:134 eventHandler.handle(any);
Never wanted here:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_SUCCEED(TestTaskAttemptRecovery.java:134)
But invoked here:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)
  TestTaskAttemptRecovery.testTARecovery_NEW:108 eventHandler.handle(any);
Wanted 1 time:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_NEW(TestTaskAttemptRecovery.java:108)
But was 2 times. Undesired invocation:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)
{code}

Support counters recovery
    Key: TEZ-853
    URL: https://issues.apache.org/jira/browse/TEZ-853
    Project: Apache Tez
    Issue Type: Sub-task
    Reporter: Hitesh Shah
    Assignee: Jeff Zhang
    Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853.patch
[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
[ https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130701#comment-14130701 ]

Bikas Saha commented on TEZ-1539:

TaskStateChangeNotification should be a separate jira. It's not following the VertexStatusUpdateEvent/ENUM pattern that is being followed by VertexStatusChangeNotification. There is no good reason for these similar things to follow different code patterns. Nor are there specific tests for TaskStateChangeNotification.

For the rest of the code related to this jira, there are a lot of changes which overall look fine. But given the number of buffers and if-else conditions, do you think the 2 test cases that are part of this patch are providing sufficient coverage? There aren't any e2e tests covering initializer events being generated, and nothing in our code/test code uses initializer events. Maybe if that had been there then we may have realized the need for the current changes when we had initially put in these new events. So adding them now would be a useful way to ascertain that things work as expected before we put this in the wild and have it used e2e for the first time by a user.

Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
    Key: TEZ-1539
    URL: https://issues.apache.org/jira/browse/TEZ-1539
    Project: Apache Tez
    Issue Type: Improvement
    Reporter: Siddharth Seth
    Assignee: Siddharth Seth
    Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt

Specifically for InputInitializerEvents and VertexManagerEvents. Pasting comment from TEZ-1447:

In a majority of cases, events generated by different attempts of the same task will be identical - in which case just making use of the event generated by the first successful attempt is adequate. Doing something like this means that users don't worry about retries, indices etc - and can just rely on receiving a set of events which are to be processed once the vertex succeeds.

If different attempts of the same workload generate different events - processing is likely to be incorrect, since it's very possible for all data to be processed (VERTEX successful), then a failure and retry - which generates a different event. The initializer doesn't even run at this point, since it's already done its work and is complete. Handling such scenarios likely involves re-running the entire initializer and re-starting the vertex which processed the event from scratch. In situations like this, where data generated may be different, the best bet is for speculation to be disabled (when it's supported), and max-attempts to be set to 1.
[jira] [Created] (TEZ-1575) MRRSleepJob does not pick MR settings for container size and java opts
Hitesh Shah created TEZ-1575: Summary: MRRSleepJob does not pick MR settings for container size and java opts Key: TEZ-1575 URL: https://issues.apache.org/jira/browse/TEZ-1575 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1575) MRRSleepJob does not pick MR settings for container size and java opts
[ https://issues.apache.org/jira/browse/TEZ-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1575: - Attachment: TEZ-1575.1.patch Given that this is an MR-based job, it might be better for it to use MR settings. [~bikassaha] Review please? MRRSleepJob does not pick MR settings for container size and java opts -- Key: TEZ-1575 URL: https://issues.apache.org/jira/browse/TEZ-1575 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1575.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1574) Support additional formats for the tez deployed archive
[ https://issues.apache.org/jira/browse/TEZ-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130811#comment-14130811 ]

Hitesh Shah commented on TEZ-1574:

Sounds good.

Support additional formats for the tez deployed archive
    Key: TEZ-1574
    URL: https://issues.apache.org/jira/browse/TEZ-1574
    Project: Apache Tez
    Issue Type: Bug
    Reporter: Siddharth Seth
    Assignee: Siddharth Seth
    Attachments: TEZ-1574.1.txt

Currently, we only look for .tgz and .tar.gz. Looking at extensions isn't the best method - but for now, this jira is to expand this list. Improving the mechanism to detect an archive will be a separate jira.
[jira] [Updated] (TEZ-1543) Shuffle Errors on heavy load (causing task retries)
[ https://issues.apache.org/jira/browse/TEZ-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1543:
    Fix Version/s: 0.6.0

Shuffle Errors on heavy load (causing task retries)
    Key: TEZ-1543
    URL: https://issues.apache.org/jira/browse/TEZ-1543
    Project: Apache Tez
    Issue Type: Bug
    Reporter: Rajesh Balamohan
    Assignee: Rajesh Balamohan
    Labels: performance
    Fix For: 0.6.0
    Attachments: TEZ-1543.1.patch, syn_app_with_issue.svg, with_patch.svg

org.apache.tez.runtime.library.common.shuffle.impl.Shuffle: ShuffleRunner failed with error
org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$ShuffleError: error in shuffle in fetcher [initialmap] #13
    at org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$RunShuffleCallable.call(Shuffle.java:336)
    at org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$RunShuffleCallable.call(Shuffle.java:318)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
    at org.apache.hadoop.io.WritableUtils.readStringSafely(WritableUtils.java:475)
    at org.apache.tez.runtime.library.common.shuffle.impl.ShuffleHeader.readFields(ShuffleHeader.java:82)
    at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyMapOutput(Fetcher.java:350)
    at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyFromHost(Fetcher.java:294)
    at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.run(Fetcher.java:160)
[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
[ https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130820#comment-14130820 ]

Hitesh Shah commented on TEZ-1539:

[~zjffdu] To clarify this use-case, InputInitializerEvents are similar to DataMovementEvents, i.e. they are generated by a source vertex's task and sent downstream.

[~sseth] For the most part, I believe the recovery logic seems fine as the events are being stored and restored from the log. The only missing piece is injecting the correct logic to handle their routing in routeRecoveredEvents(). This is one of the missing places that needs fixing as it does not use the RouteEventTransition.

Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
    Key: TEZ-1539
    URL: https://issues.apache.org/jira/browse/TEZ-1539
    Project: Apache Tez
    Issue Type: Improvement
    Reporter: Siddharth Seth
    Assignee: Siddharth Seth
    Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt

Specifically for InputInitializerEvents and VertexManagerEvents. Pasting comment from TEZ-1447:

In a majority of cases, events generated by different attempts of the same task will be identical - in which case just making use of the event generated by the first successful attempt is adequate. Doing something like this means that users don't worry about retries, indices etc - and can just rely on receiving a set of events which are to be processed once the vertex succeeds.

If different attempts of the same workload generate different events - processing is likely to be incorrect, since it's very possible for all data to be processed (VERTEX successful), then a failure and retry - which generates a different event. The initializer doesn't even run at this point, since it's already done its work and is complete. Handling such scenarios likely involves re-running the entire initializer and re-starting the vertex which processed the event from scratch. In situations like this, where data generated may be different, the best bet is for speculation to be disabled (when it's supported), and max-attempts to be set to 1.
[jira] [Updated] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
[ https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-1539:
    Attachment: TEZ-1539.4.txt

Updated patch with the recovery fixes. Also adds two new unit tests to handle multiple tasks, multiple sources, and different events. If this looks good, I'll rename the RecoveryEvent just before commit to avoid noise.

Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
    Key: TEZ-1539
    URL: https://issues.apache.org/jira/browse/TEZ-1539
    Project: Apache Tez
    Issue Type: Improvement
    Reporter: Siddharth Seth
    Assignee: Siddharth Seth
    Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, TEZ-1539.4.txt

Specifically for InputInitializerEvents and VertexManagerEvents. Pasting comment from TEZ-1447:

In a majority of cases, events generated by different attempts of the same task will be identical - in which case just making use of the event generated by the first successful attempt is adequate. Doing something like this means that users don't worry about retries, indices etc - and can just rely on receiving a set of events which are to be processed once the vertex succeeds.

If different attempts of the same workload generate different events - processing is likely to be incorrect, since it's very possible for all data to be processed (VERTEX successful), then a failure and retry - which generates a different event. The initializer doesn't even run at this point, since it's already done its work and is complete. Handling such scenarios likely involves re-running the entire initializer and re-starting the vertex which processed the event from scratch. In situations like this, where data generated may be different, the best bet is for speculation to be disabled (when it's supported), and max-attempts to be set to 1.
[jira] [Commented] (TEZ-1575) MRRSleepJob does not pick MR settings for container size and java opts
[ https://issues.apache.org/jira/browse/TEZ-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130956#comment-14130956 ]

Bikas Saha commented on TEZ-1575:

lgtm

MRRSleepJob does not pick MR settings for container size and java opts
    Key: TEZ-1575
    URL: https://issues.apache.org/jira/browse/TEZ-1575
    Project: Apache Tez
    Issue Type: Bug
    Reporter: Hitesh Shah
    Assignee: Hitesh Shah
    Attachments: TEZ-1575.1.patch
[jira] [Updated] (TEZ-1267) Exception handling when Routing Events
[ https://issues.apache.org/jira/browse/TEZ-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1267: Priority: Critical (was: Major) Exception handling when Routing Events -- Key: TEZ-1267 URL: https://issues.apache.org/jira/browse/TEZ-1267 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Priority: Critical Events are generated by user code. In some places they're also handled by user code within the AM. Currently, exceptions which are generated when handling user code will end up killing the AM (and hence leading to a retry). Instead, failure to handle such events, should cause the application to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
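The behaviour asked for above (an exception thrown while user code handles an event should fail the application instead of taking down the AM) amounts to guarding the call into user code. The sketch below illustrates that idea only; `GuardedEventRouter`, `route` and the failure callback are hypothetical names and do not correspond to the Tez dispatcher or vertex classes.

```java
import java.util.function.Consumer;

/** Hypothetical wrapper that isolates failures in user-supplied event handlers. */
final class GuardedEventRouter<E> {
  private final Consumer<E> userHandler;               // user code, may throw
  private final Consumer<RuntimeException> onFailure;  // e.g. mark the application failed
  private boolean failed;

  GuardedEventRouter(Consumer<E> userHandler, Consumer<RuntimeException> onFailure) {
    this.userHandler = userHandler;
    this.onFailure = onFailure;
  }

  /**
   * Routes one event to user code. Returns false if the handler threw now or
   * on an earlier event; the exception is reported, never propagated.
   */
  boolean route(E event) {
    if (failed) {
      return false; // stop feeding events to a handler that has already failed
    }
    try {
      userHandler.accept(event);
      return true;
    } catch (RuntimeException e) {
      failed = true;
      onFailure.accept(e);
      return false;
    }
  }
}
```

In this sketch only `RuntimeException` is caught; whether errors such as `OutOfMemoryError` should also be contained is a separate design question that the issue does not address.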
[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
[ https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130965#comment-14130965 ]

Hitesh Shah commented on TEZ-1539:

[~zjffdu] Could you keep a watch on this, as this may affect your patches?

[~sseth] The change for routeRecoveredEvents looks fine.

Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
    Key: TEZ-1539
    URL: https://issues.apache.org/jira/browse/TEZ-1539
    Project: Apache Tez
    Issue Type: Improvement
    Reporter: Siddharth Seth
    Assignee: Siddharth Seth
    Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, TEZ-1539.4.txt

Specifically for InputInitializerEvents and VertexManagerEvents. Pasting comment from TEZ-1447:

In a majority of cases, events generated by different attempts of the same task will be identical - in which case just making use of the event generated by the first successful attempt is adequate. Doing something like this means that users don't worry about retries, indices etc - and can just rely on receiving a set of events which are to be processed once the vertex succeeds.

If different attempts of the same workload generate different events - processing is likely to be incorrect, since it's very possible for all data to be processed (VERTEX successful), then a failure and retry - which generates a different event. The initializer doesn't even run at this point, since it's already done its work and is complete. Handling such scenarios likely involves re-running the entire initializer and re-starting the vertex which processed the event from scratch. In situations like this, where data generated may be different, the best bet is for speculation to be disabled (when it's supported), and max-attempts to be set to 1.
[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
[ https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130993#comment-14130993 ] Bikas Saha commented on TEZ-1539: - Please mark this as an incompatible change since behavior has changed wrt 0.5.0 and users may need to adjust to it. It's fine to get this in and iterate with an e2e example in a separate jira, but that's important to get done before we can be confident it works e2e. Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code Key: TEZ-1539 URL: https://issues.apache.org/jira/browse/TEZ-1539 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, TEZ-1539.4.txt Specifically for InputInitializerEvents and VertexManagerEvents. Pasting comment from TEZ-1447: In a majority of cases, events generated by different attempts of the same task will be identical - in which case just making use of the event generated by the first successful attempt is adequate. Doing something like this means that users don't have to worry about retries, indices etc - and can just rely on receiving a set of events which are to be processed once the vertex succeeds. If different attempts of the same workload generate different events - processing is likely to be incorrect, since it's very possible for all data to be processed (VERTEX successful), then a failure and retry - which generates a different event. The initializer doesn't even run at this point, since it's already done its work and is complete. Handling such scenarios likely involves re-running the entire initializer and re-starting the vertex which processed the event from scratch. In situations like this, where data generated may be different, the best bet is for speculation to be disabled (when it's supported), and max-attempts to be set to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1571) Add create method for DataSinkDescriptor
[ https://issues.apache.org/jira/browse/TEZ-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1571: Affects Version/s: 0.5.0 Add create method for DataSinkDescriptor Key: TEZ-1571 URL: https://issues.apache.org/jira/browse/TEZ-1571 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Add create method for DataSinkDescriptor, and make the constructor private. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1571) Add create method for DataSinkDescriptor
[ https://issues.apache.org/jira/browse/TEZ-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1571: Priority: Blocker (was: Major) Add create method for DataSinkDescriptor Key: TEZ-1571 URL: https://issues.apache.org/jira/browse/TEZ-1571 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Add create method for DataSinkDescriptor, and make the constructor private. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-853) Support counters recovery
[ https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-853: --- Attachment: Tez-853-4.patch Support counters recovery - Key: TEZ-853 URL: https://issues.apache.org/jira/browse/TEZ-853 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853-4.patch, Tez-853.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized
[ https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1345: Attachment: TEZ-1345-13.patch Attach new patch, fix the test failure. Add checks to guarantee all init events are written to recovery to consider vertex initialized -- Key: TEZ-1345 URL: https://issues.apache.org/jira/browse/TEZ-1345 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Fix For: 0.6.0 Attachments: TEZ-1345-13.patch, TEZ-1345.12.rebased.2.patch, TEZ-1345.12.rebased.patch, Tez-1345-10.patch, Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch Related to issue discovered in TEZ-1033 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
[ https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131046#comment-14131046 ] Jeff Zhang commented on TEZ-1539: - [~hitesh] I will keep watching this jira. Regarding recovery, I think this will have an impact: when the source vertex has succeeded, the InputInitializerEvents won't be regenerated, and we also don't save them in recovery. So I think we should save the InputInitializerEvents in the recovery log and also add a unit test for this. Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code Key: TEZ-1539 URL: https://issues.apache.org/jira/browse/TEZ-1539 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, TEZ-1539.4.txt Specifically for InputInitializerEvents and VertexManagerEvents. Pasting comment from TEZ-1447: In a majority of cases, events generated by different attempts of the same task will be identical - in which case just making use of the event generated by the first successful attempt is adequate. Doing something like this means that users don't have to worry about retries, indices etc - and can just rely on receiving a set of events which are to be processed once the vertex succeeds. If different attempts of the same workload generate different events - processing is likely to be incorrect, since it's very possible for all data to be processed (VERTEX successful), then a failure and retry - which generates a different event. The initializer doesn't even run at this point, since it's already done its work and is complete. Handling such scenarios likely involves re-running the entire initializer and re-starting the vertex which processed the event from scratch. In situations like this, where data generated may be different, the best bet is for speculation to be disabled (when it's supported), and max-attempts to be set to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1571) Add create method for DataSinkDescriptor
[ https://issues.apache.org/jira/browse/TEZ-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1571: Attachment: Tez-1571.patch Attach patch. Add create method for DataSinkDescriptor Key: TEZ-1571 URL: https://issues.apache.org/jira/browse/TEZ-1571 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: Tez-1571.patch Add create method for DataSinkDescriptor, and make the constructor private. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
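The request above (a create method plus a private constructor) is the usual static-factory pattern. A minimal sketch, assuming a made-up class SinkDescriptorSketch that merely stands in for the real descriptor; its field and argument are hypothetical and not taken from the attached patch:

```java
import java.util.Objects;

/** Hypothetical stand-in for a descriptor class; not the Tez DataSinkDescriptor. */
final class SinkDescriptorSketch {
  private final String outputClassName;

  // Private: callers cannot use "new" and must go through create().
  private SinkDescriptorSketch(String outputClassName) {
    this.outputClassName = outputClassName;
  }

  /** Single entry point, so argument checks live in one place. */
  static SinkDescriptorSketch create(String outputClassName) {
    Objects.requireNonNull(outputClassName, "outputClassName");
    return new SinkDescriptorSketch(outputClassName);
  }

  String getOutputClassName() {
    return outputClassName;
  }
}
```

Callers would then write SinkDescriptorSketch.create("MyOutput") rather than invoking a constructor, which leaves room to change how instances are built later without touching call sites.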
[jira] [Updated] (TEZ-1569) Add tests for preemption
[ https://issues.apache.org/jira/browse/TEZ-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1569: Attachment: TEZ-1569.1.patch Add tests for preemption Key: TEZ-1569 URL: https://issues.apache.org/jira/browse/TEZ-1569 Project: Apache Tez Issue Type: Test Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1568) Add system test for propagation of diagnostics for errors
[ https://issues.apache.org/jira/browse/TEZ-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1568: Attachment: TEZ-1568.patch Add system test for propagation of diagnostics for errors - Key: TEZ-1568 URL: https://issues.apache.org/jira/browse/TEZ-1568 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1568.patch Design system tests where exceptions come from Input, Output, Processor, InputInitializer and VertexManagerPlugin -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1568) Add system test for propagation of diagnostics for errors
[ https://issues.apache.org/jira/browse/TEZ-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131068#comment-14131068 ] Jeff Zhang commented on TEZ-1568: - Attached the patch.
* Verifies that exceptions from Input/Output/Processor are propagated to the client side. The II and VM cases are left to [TEZ-1573|https://issues.apache.org/jira/browse/TEZ-1573]
* handleEvents of Output and Processor is not supported, so tests for them are not included. (Only found ConsumeType of Input in Edge)
Add system test for propagation of diagnostics for errors - Key: TEZ-1568 URL: https://issues.apache.org/jira/browse/TEZ-1568 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-1568.patch Design system tests where exceptions come from Input, Output, Processor, InputInitializer and VertexManagerPlugin -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
[ https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131070#comment-14131070 ] Bikas Saha commented on TEZ-1539: -
bq. Agreed. The last patch done by Sid should be saving those events into recovery log within the source vertex itself.
[~hitesh] from the patch it looks like II events are being stored in the destination vertex and not the source vertex, unless I am reading it wrong
{code}
 if (!isEventFromVertex(vertex, tezEvent.getSourceInfo())) {
-  continue;
+  if (tezEvent.getEventType().equals(EventType.ROOT_INPUT_INITIALIZER_EVENT)) {
+    recoveryEvents.add(tezEvent);
+  } else {
+    continue;
+  }
 }
{code}
It might still work, but shouldn't the flow be consistent with all other events, where the events are stored in their source vertex? Might break things down the road. Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code Key: TEZ-1539 URL: https://issues.apache.org/jira/browse/TEZ-1539 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, TEZ-1539.4.txt Specifically for InputInitializerEvents and VertexManagerEvents. Pasting comment from TEZ-1447: In a majority of cases, events generated by different attempts of the same task will be identical - in which case just making use of the event generated by the first successful attempt is adequate. Doing something like this means that users don't have to worry about retries, indices etc - and can just rely on receiving a set of events which are to be processed once the vertex succeeds. If different attempts of the same workload generate different events - processing is likely to be incorrect, since it's very possible for all data to be processed (VERTEX successful), then a failure and retry - which generates a different event. The initializer doesn't even run at this point, since it's already done its work and is complete. Handling such scenarios likely involves re-running the entire initializer and re-starting the vertex which processed the event from scratch. In situations like this, where data generated may be different, the best bet is for speculation to be disabled (when it's supported), and max-attempts to be set to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
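To make the point about where II events end up being stored easier to follow, here is a small self-contained sketch of the filtering behaviour the quoted diff appears to introduce. Everything in it is a hypothetical stand-in (RecoveryFilterSketch, Event, the local EventType enum); only the constant name ROOT_INPUT_INITIALIZER_EVENT is taken from the diff. It also assumes that events originating from the vertex itself are recorded as before, which the quoted fragment does not show.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins; only the branch logic follows the quoted diff. */
final class RecoveryFilterSketch {
  enum EventType { ROOT_INPUT_INITIALIZER_EVENT, OTHER }

  static final class Event {
    final String sourceVertex;
    final EventType type;

    Event(String sourceVertex, EventType type) {
      this.sourceVertex = sourceVertex;
      this.type = type;
    }
  }

  /** Picks the events that the given vertex would write to its recovery log. */
  static List<Event> selectRecoveryEvents(String vertex, List<Event> events) {
    List<Event> recoveryEvents = new ArrayList<>();
    for (Event e : events) {
      if (!e.sourceVertex.equals(vertex)) {
        // Events from other vertices used to be skipped unconditionally.
        // With the change, II events are kept here, i.e. in the destination vertex.
        if (e.type == EventType.ROOT_INPUT_INITIALIZER_EVENT) {
          recoveryEvents.add(e);
        }
        continue;
      }
      // Assumption: events from this vertex are stored as before.
      recoveryEvents.add(e);
    }
    return recoveryEvents;
  }
}
```

Under these assumptions, an II event sent from v0 to v1 shows up in v1's recovery events, which is the inconsistency with other event types raised in the comment.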
[jira] [Commented] (TEZ-1569) Add tests for preemption
[ https://issues.apache.org/jira/browse/TEZ-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131084#comment-14131084 ] Bikas Saha commented on TEZ-1569: - The patch adds e2e tests for preemption by using local mode and a custom container launcher to quickly simulate job execution without launching any tasks. It simulates preemption by sending the preemption event to the engine in exactly the same manner in which YARN sends the preemption event. This enables us to explicitly specify the exact tasks we want to preempt and check for expected behavior. The patch tests:
* In session mode, different combinations of DAG edges and vertices with different numbers of preempted attempts.
* In non-session mode, multiple preemptions for one case of a DAG.
[~tassapola] Please review. Thanks! Add tests for preemption Key: TEZ-1569 URL: https://issues.apache.org/jira/browse/TEZ-1569 Project: Apache Tez Issue Type: Test Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1569) Add tests for preemption
[ https://issues.apache.org/jira/browse/TEZ-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1569: Attachment: (was: TEZ-1569.1.patch) Add tests for preemption Key: TEZ-1569 URL: https://issues.apache.org/jira/browse/TEZ-1569 Project: Apache Tez Issue Type: Test Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1569) Add tests for preemption
[ https://issues.apache.org/jira/browse/TEZ-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1569: Attachment: TEZ-1569.1.patch Attaching patch with more comments. Add tests for preemption Key: TEZ-1569 URL: https://issues.apache.org/jira/browse/TEZ-1569 Project: Apache Tez Issue Type: Test Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-1569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1267) Exception handling when Routing Events
[ https://issues.apache.org/jira/browse/TEZ-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131114#comment-14131114 ] Jeff Zhang commented on TEZ-1267: - [~sseth] If you haven't started on it, I'd like to take it. I'll try to resolve it together with [TEZ-1573|https://issues.apache.org/jira/browse/TEZ-1573] Exception handling when Routing Events -- Key: TEZ-1267 URL: https://issues.apache.org/jira/browse/TEZ-1267 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Priority: Critical Events are generated by user code. In some places they're also handled by user code within the AM. Currently, exceptions thrown while user code handles these events end up killing the AM (and hence lead to a retry). Instead, failure to handle such events should cause the application to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
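The behaviour asked for in the description above (a user-code failure while handling an event should fail the application rather than crash the AM) could look roughly like the sketch below. It is only an illustration: UserCodeGuard, Result and route are made-up names, not Tez classes, and how the failure is then turned into a DAG or application failure is left to the caller.

```java
import java.util.function.Consumer;

/** Hypothetical sketch; these types are illustrative and are not Tez classes. */
final class UserCodeGuard {
  /** Outcome of invoking user code on one event. */
  static final class Result {
    final boolean ok;
    final String diagnostics;

    private Result(boolean ok, String diagnostics) {
      this.ok = ok;
      this.diagnostics = diagnostics;
    }
  }

  /** Runs the user handler and converts any runtime failure into a Result. */
  static <E> Result route(E event, Consumer<E> userHandler) {
    try {
      userHandler.accept(event);
      return new Result(true, "");
    } catch (RuntimeException e) {
      // Contain the failure: report it so the caller can fail the application
      // with diagnostics, instead of letting it propagate and kill the process.
      return new Result(false, "User code failed handling " + event + ": " + e);
    }
  }
}
```

The caller would check Result.ok after each routed event and, on failure, mark the application as failed with Result.diagnostics instead of triggering an AM retry.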