date:20140911


 [ 
https://issues.apache.org/jira/browse/TEZ-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1534:

Attachment: TEZ-1534.3.txt

Updated to remove the extra conf creation. I wasn't sure whether the 
ACLManager modified the conf.
Thanks for the reviews. Committing.

 Make client side configs available to AM and tasks
 --

 Key: TEZ-1534
 URL: https://issues.apache.org/jira/browse/TEZ-1534
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: TEZ-1534.1.txt, TEZ-1534.2.txt, TEZ-1534.3.txt


 Configs from the client (specifically the ones provided to TezClient, along 
 with YARN additions) should be shipped over to the cluster (AM and tasks), 
 instead of AM/tasks depending on configs present on cluster nodes.
 These configs will primarily be used for Tez components like RPC servers, 
 clients etc - and not by the Processor / Input / Output - which should be 
 sending over fully configured payloads in any case.
 Tez should continue to run without core-site, hdfs-site, yarn-site etc in the 
 classpath on cluster nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (TEZ-1571) Add create method for DataSinkDescriptor

Jeff Zhang created TEZ-1571:
---

 Summary: Add create method for DataSinkDescriptor
 Key: TEZ-1571
 URL: https://issues.apache.org/jira/browse/TEZ-1571
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang


Add create method for DataSinkDescriptor, and make the constructor private. 
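
For illustration, the requested change follows the usual static factory pattern. The sketch below uses a hypothetical stand-in class, SinkDescriptor, with a made-up field; it does not reproduce the real DataSinkDescriptor signature.

```java
// Hypothetical stand-in for a descriptor class; the field is an assumption
// made for illustration and does not mirror the real DataSinkDescriptor.
final class SinkDescriptor {
    private final String outputClassName;

    // Private constructor: callers cannot use `new` directly.
    private SinkDescriptor(String outputClassName) {
        this.outputClassName = outputClassName;
    }

    // Static factory: the single entry point for creating instances.
    static SinkDescriptor create(String outputClassName) {
        if (outputClassName == null) {
            throw new IllegalArgumentException("outputClassName must not be null");
        }
        return new SinkDescriptor(outputClassName);
    }

    String getOutputClassName() {
        return outputClassName;
    }
}
```

With the constructor private, validation and any later change in how instances are built stay behind create().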




[jira] [Created] (TEZ-1572) Add throw Exception for method handleEvents of OutputFrameworkInterface

Jeff Zhang created TEZ-1572:
---

 Summary: Add throw Exception for method handleEvents of 
OutputFrameworkInterface
 Key: TEZ-1572
 URL: https://issues.apache.org/jira/browse/TEZ-1572
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang


Align the interface with InputFrameworkInterface.




[jira] [Updated] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code


 [ 
https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1539:

Attachment: TEZ-1539.3.txt

 Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
 

 Key: TEZ-1539
 URL: https://issues.apache.org/jira/browse/TEZ-1539
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt


 Specifically for InputInitializerEvents and VertexManagerEvents.
 Pasting comment from TEZ-1447:
 In a majority of cases, events generated by different attempts of the same 
 task will be identical, in which case just making use of the events generated 
 by the first successful attempt is adequate. Doing this means that users 
 don't need to worry about retries, indices etc., and can just rely on 
 receiving a set of events which are to be processed once the vertex succeeds.
 If different attempts of the same workload generate different events, 
 processing is likely to be incorrect, since it's very possible for all data 
 to be processed (VERTEX successful), followed by a failure and retry which 
 generates a different event. The initializer doesn't even run at this point, 
 since it has already done its work and is complete. Handling such scenarios 
 likely involves re-running the entire initializer and restarting, from 
 scratch, the vertex which processed the event. In situations like this, where 
 the data generated may differ, the best bet is for speculation to be disabled 
 (when it's supported) and max-attempts to be set to 1.
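
The fire-once-on-success idea described above can be sketched roughly as follows. FireOnceEventBuffer and its method names are hypothetical and are not Tez classes; the sketch only assumes that events are buffered per task attempt and that the first successful attempt of a task decides which events get routed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Tez class: holds events per task attempt and
// releases only those of the first attempt that succeeds for each task.
final class FireOnceEventBuffer<E> {
    // Key is "taskId/attemptId"; value is the events that attempt generated.
    private final Map<String, List<E>> pending = new HashMap<>();
    // Tasks for which one attempt's events have already been released.
    private final Set<Integer> firedTasks = new HashSet<>();

    private static String key(int taskId, int attemptId) {
        return taskId + "/" + attemptId;
    }

    // Buffer an event instead of routing it immediately.
    void onEvent(int taskId, int attemptId, E event) {
        if (firedTasks.contains(taskId)) {
            return; // a successful attempt already fired; drop late events
        }
        pending.computeIfAbsent(key(taskId, attemptId), k -> new ArrayList<>()).add(event);
    }

    // Returns the events to route now; empty if this task has already fired.
    List<E> onAttemptSucceeded(int taskId, int attemptId) {
        List<E> events = pending.remove(key(taskId, attemptId));
        if (!firedTasks.add(taskId)) {
            return Collections.<E>emptyList();
        }
        return events == null ? Collections.<E>emptyList() : events;
    }

    // A failed or killed attempt simply discards whatever it generated.
    void onAttemptFailed(int taskId, int attemptId) {
        pending.remove(key(taskId, attemptId));
    }
}
```

A consumer of such a buffer never sees events from failed attempts, nor from later attempts of a task that has already fired, which is why user code would not need to track retries or indices.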




[jira] [Updated] (TEZ-1573) Exception from InputInitializer is not propogated to client


 [ 
https://issues.apache.org/jira/browse/TEZ-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1573:

Issue Type: Sub-task  (was: Bug)
Parent: TEZ-1240

 Exception from InputInitializer is not propogated to client
 ---

 Key: TEZ-1573
 URL: https://issues.apache.org/jira/browse/TEZ-1573
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Jeff Zhang






[jira] [Updated] (TEZ-1568) Add system test for propagation of diagnostics for errors


 [ 
https://issues.apache.org/jira/browse/TEZ-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1568:

Description: Design a system test where exceptions come from Input, Output, 
Processor, InputInitializer and VertexManagerPlugin  (was: Design system test 
where exception come from Input, Output, Processor)

 Add system test for propagation of diagnostics for errors
 -

 Key: TEZ-1568
 URL: https://issues.apache.org/jira/browse/TEZ-1568
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Jeff Zhang
Assignee: Jeff Zhang

 Design a system test where exceptions come from Input, Output, Processor, 
 InputInitializer and VertexManagerPlugin




[jira] [Updated] (TEZ-853) Support counters recovery


 [ 
https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-853:
---
Attachment: Tez-853-3.patch

 Support counters recovery
 -

 Key: TEZ-853
 URL: https://issues.apache.org/jira/browse/TEZ-853
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853.patch







[jira] [Commented] (TEZ-853) Support counters recovery


[ 
https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130011#comment-14130011
 ] 

Jeff Zhang commented on TEZ-853:


Changes in the new patch:

* Fix a small bug in DAGStatus (use the method getDAGCounters instead of the 
field dagCounters, because the field may not have been deserialized from the 
proto yet).
* Remove counters from VertexFinishedProto and TaskFinishedProto, but keep them 
in the Java classes VertexFinishedEvent and TaskFinishedEvent for history events.
* Fix the counters recovery issue in TaskAttemptImpl and DAGImpl.
* Move the counter recovery unit test into TestAMRecovery.
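
The first item in the list above is an instance of a common pitfall with lazily deserialized wrappers. A minimal sketch, using a hypothetical LazyStatus class rather than the real DAGStatus, with a Supplier standing in for the proto:

```java
import java.util.function.Supplier;

// Hypothetical wrapper, not the real DAGStatus: the cached field is filled
// lazily from a serialized form the first time the getter is called.
final class LazyStatus {
    private final Supplier<String> serializedCounters; // stands in for a proto
    private String counters;                           // null until deserialized

    LazyStatus(Supplier<String> serializedCounters) {
        this.serializedCounters = serializedCounters;
    }

    // Always go through the getter: it deserializes on first access.
    String getCounters() {
        if (counters == null) {
            counters = serializedCounters.get();
        }
        return counters;
    }

    // Buggy variant: reads the field directly and may observe null.
    String describeUsingField() {
        return "counters=" + counters;
    }

    // Fixed variant: uses the getter, so the value is always populated.
    String describeUsingGetter() {
        return "counters=" + getCounters();
    }
}
```

Reading the field directly only works after something else has triggered deserialization, which is why routing every access through the getter is the safer choice.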

 Support counters recovery
 -

 Key: TEZ-853
 URL: https://issues.apache.org/jira/browse/TEZ-853
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853.patch







[jira] [Updated] (TEZ-1572) Add throw Exception for method handleEvents of OutputFrameworkInterface and ProcessorFrameworkInterface

2014-09-11 Thread Chen He (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated TEZ-1572:
-
Attachment: TEZ-1572.patch

Add throws Exception to OutputFrameworkInterface and 
ProcessorFrameworkInterface, and to the classes that implement these two 
interfaces. Also fix some warnings in related classes.

 Add throw Exception for method handleEvents of OutputFrameworkInterface and 
 ProcessorFrameworkInterface
 ---

 Key: TEZ-1572
 URL: https://issues.apache.org/jira/browse/TEZ-1572
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
 Attachments: TEZ-1572.patch


 Align the interface with InputFrameworkInterface.




[jira] [Updated] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized


 [ 
https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1345:
-
Attachment: TEZ-1345.12.rebased.patch

Attaching rebased patch - ready to commit. 

 Add checks to guarantee all init events are written to recovery to consider 
 vertex initialized
 --

 Key: TEZ-1345
 URL: https://issues.apache.org/jira/browse/TEZ-1345
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-1345.12.rebased.patch, Tez-1345-10.patch, 
 Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, Tez-1345-3.patch, 
 Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345-7.patch, 
 Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch


 Related to issue discovered in TEZ-1033




[jira] [Updated] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized


 [ 
https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1345:
-
Attachment: TEZ-1345.12.rebased.2.patch

Updated patch, now with the missing file included.

 Add checks to guarantee all init events are written to recovery to consider 
 vertex initialized
 --

 Key: TEZ-1345
 URL: https://issues.apache.org/jira/browse/TEZ-1345
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-1345.12.rebased.2.patch, TEZ-1345.12.rebased.patch, 
 Tez-1345-10.patch, Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, 
 Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, 
 Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch


 Related to issue discovered in TEZ-1033




[jira] [Reopened] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized


 [ 
https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah reopened TEZ-1345:
--

 Add checks to guarantee all init events are written to recovery to consider 
 vertex initialized
 --

 Key: TEZ-1345
 URL: https://issues.apache.org/jira/browse/TEZ-1345
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Fix For: 0.6.0

 Attachments: TEZ-1345.12.rebased.2.patch, TEZ-1345.12.rebased.patch, 
 Tez-1345-10.patch, Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, 
 Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, 
 Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch


 Related to issue discovered in TEZ-1033




[jira] [Commented] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized


[ 
https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130448#comment-14130448
 ] 

Hitesh Shah commented on TEZ-1345:
--

Reverted commit for now as unit tests failed in build 
https://builds.apache.org/job/Tez-Build/628 

 Add checks to guarantee all init events are written to recovery to consider 
 vertex initialized
 --

 Key: TEZ-1345
 URL: https://issues.apache.org/jira/browse/TEZ-1345
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Fix For: 0.6.0

 Attachments: TEZ-1345.12.rebased.2.patch, TEZ-1345.12.rebased.patch, 
 Tez-1345-10.patch, Tez-1345-11.patch, Tez-1345-12.patch, Tez-1345-2.patch, 
 Tez-1345-3.patch, Tez-1345-4.patch, Tez-1345-5.patch, Tez-1345-6.patch, 
 Tez-1345-7.patch, Tez-1345-8.patch, Tez-1345-9.patch, Tez-1345.patch


 Related to issue discovered in TEZ-1033




[jira] [Commented] (TEZ-853) Support counters recovery


[ 
https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130500#comment-14130500
 ] 

Hitesh Shah commented on TEZ-853:
-

Mostly looks good. With respect to the tests, they rely on timing. Have you 
done multiple runs of the test to ensure that they are not flaky? 




 Support counters recovery
 -

 Key: TEZ-853
 URL: https://issues.apache.org/jira/browse/TEZ-853
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853.patch







[jira] [Created] (TEZ-1574) Support additional formats for the tez deployed archive

Siddharth Seth created TEZ-1574:
---

 Summary: Support additional formats for the tez deployed archive
 Key: TEZ-1574
 URL: https://issues.apache.org/jira/browse/TEZ-1574
 Project: Apache Tez
  Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Siddharth Seth


Currently, we only look for .tgz and .tar.gz. Looking at extensions isn't the 
best method - but for now, this jira is to expand this list.
Improving the mechanism to detect an archive will be a separate jira.




[jira] [Updated] (TEZ-1574) Support additional formats for the tez deployed archive


 [ 
https://issues.apache.org/jira/browse/TEZ-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1574:

Attachment: TEZ-1574.1.txt

Trivial patch. Additional checks for .zip and .tar. [~hitesh] - please review.
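
As a rough illustration of extension-based detection, here is a minimal sketch. ArchiveNames and looksLikeArchive are hypothetical names, not the actual Tez code, and the case-insensitive comparison is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the actual Tez code: decides by file-name suffix
// whether a path should be treated as an archive.
final class ArchiveNames {
    // .tgz and .tar.gz are the suffixes the issue says are already checked;
    // .zip and .tar are the additions mentioned for the patch.
    private static final List<String> SUFFIXES =
            Arrays.asList(".tgz", ".tar.gz", ".zip", ".tar");

    private ArchiveNames() {}

    // Case-insensitive suffix match (the case handling is an assumption).
    static boolean looksLikeArchive(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

A suffix list like this is easy to extend, but as the issue notes it is not a robust way to detect archives; inspecting file contents would be a separate improvement.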

 Support additional formats for the tez deployed archive
 ---

 Key: TEZ-1574
 URL: https://issues.apache.org/jira/browse/TEZ-1574
 Project: Apache Tez
  Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: TEZ-1574.1.txt


 Currently, we only look for .tgz and .tar.gz. Looking at extensions isn't the 
 best method - but for now, this jira is to expand this list.
 Improving the mechanism to detect an archive will be a separate jira.




[jira] [Commented] (TEZ-853) Support counters recovery


[ 
https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130662#comment-14130662
 ] 

Hitesh Shah commented on TEZ-853:
-

After applying the patch, saw unit tests failing:

{code}
Failed tests:
  TestTaskAttemptRecovery.testTARecovery_START:120
eventHandler.handle(any);
Wanted 1 time:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_START(TestTaskAttemptRecovery.java:120)
But was 3 times. Undesired invocation:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)

  TestTaskAttemptRecovery.testTARecovery_FAILED:162
eventHandler.handle(any);
Never wanted here:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_FAILED(TestTaskAttemptRecovery.java:162)
But invoked here:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)

  TestTaskAttemptRecovery.testTARecovery_KIILED:148
eventHandler.handle(any);
Never wanted here:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_KIILED(TestTaskAttemptRecovery.java:148)
But invoked here:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)

  TestTaskAttemptRecovery.testTARecovery_SUCCEED:134
eventHandler.handle(any);
Never wanted here:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_SUCCEED(TestTaskAttemptRecovery.java:134)
But invoked here:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)

  TestTaskAttemptRecovery.testTARecovery_NEW:108
eventHandler.handle(any);
Wanted 1 time:
- at org.apache.tez.dag.app.dag.impl.TestTaskAttemptRecovery.testTARecovery_NEW(TestTaskAttemptRecovery.java:108)
But was 2 times. Undesired invocation:
- at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.sendEvent(TaskAttemptImpl.java:822)
{code}


 Support counters recovery
 -

 Key: TEZ-853
 URL: https://issues.apache.org/jira/browse/TEZ-853
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853.patch







[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code

[
https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130701#comment-14130701
]

Bikas Saha commented on TEZ-1539:
-

TaskStateChangeNotification should be a separate jira. It's not following the
VertexStatusUpdateEvent/ENUM pattern that is being followed by
VertexStatusChangeNotification. There is no good reason for these similar
things to follow different code patterns. Nor are there specific tests for
TaskStateChangeNotification.

For the rest of the code related to this jira, there are a lot of changes, which
overall look fine. But given the number of buffers and if/else conditions, do
you think the 2 test cases that are part of this patch provide sufficient
coverage? There aren't any e2e tests covering initializer events being generated,
and nothing in our code/test code uses initializer events. If that had
been there, we might have realized the need for the current changes when we
initially put in these new events. So adding them now would be a useful way
to ascertain that things work as expected before we put this in the wild and
have it used e2e for the first time by a user.

Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code

Key: TEZ-1539
URL: https://issues.apache.org/jira/browse/TEZ-1539
Project: Apache Tez
Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt


[jira] [Created] (TEZ-1575) MRRSleepJob does not pick MR settings for container size and java opts

Hitesh Shah created TEZ-1575:


 Summary: MRRSleepJob does not pick MR settings for container size 
and java opts
 Key: TEZ-1575
 URL: https://issues.apache.org/jira/browse/TEZ-1575
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah







[jira] [Updated] (TEZ-1575) MRRSleepJob does not pick MR settings for container size and java opts


 [ 
https://issues.apache.org/jira/browse/TEZ-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1575:
-
Attachment: TEZ-1575.1.patch

Given that this is an MR-based job, it might be better for it to use MR 
settings. 

[~bikassaha] Review please?

 MRRSleepJob does not pick MR settings for container size and java opts
 --

 Key: TEZ-1575
 URL: https://issues.apache.org/jira/browse/TEZ-1575
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah
 Attachments: TEZ-1575.1.patch







[jira] [Commented] (TEZ-1574) Support additional formats for the tez deployed archive


[ 
https://issues.apache.org/jira/browse/TEZ-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130811#comment-14130811
 ] 

Hitesh Shah commented on TEZ-1574:
--

Sounds good. 

 Support additional formats for the tez deployed archive
 ---

 Key: TEZ-1574
 URL: https://issues.apache.org/jira/browse/TEZ-1574
 Project: Apache Tez
  Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: TEZ-1574.1.txt


 Currently, we only look for .tgz and .tar.gz. Looking at extensions isn't the 
 best method - but for now, this jira is to expand this list.
 Improving the mechanism to detect an archive will be a separate jira.




[jira] [Updated] (TEZ-1543) Shuffle Errors on heavy load (causing task retries)

2014-09-11 Thread Rajesh Balamohan (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1543:
--
Fix Version/s: 0.6.0

 Shuffle Errors on heavy load (causing task retries)
 ---

 Key: TEZ-1543
 URL: https://issues.apache.org/jira/browse/TEZ-1543
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
  Labels: performance
 Fix For: 0.6.0

 Attachments: TEZ-1543.1.patch, syn_app_with_issue.svg, with_patch.svg


 org.apache.tez.runtime.library.common.shuffle.impl.Shuffle: ShuffleRunner failed with error
 org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$ShuffleError: error in shuffle in fetcher [initialmap] #13
 at org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$RunShuffleCallable.call(Shuffle.java:336)
 at org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$RunShuffleCallable.call(Shuffle.java:318)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.lang.NullPointerException
 at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
 at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
 at org.apache.hadoop.io.WritableUtils.readStringSafely(WritableUtils.java:475)
 at org.apache.tez.runtime.library.common.shuffle.impl.ShuffleHeader.readFields(ShuffleHeader.java:82)
 at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyMapOutput(Fetcher.java:350)
 at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyFromHost(Fetcher.java:294)
 at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.run(Fetcher.java:160)




[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code

[
https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130820#comment-14130820
]

Hitesh Shah commented on TEZ-1539:
--

[~zjffdu] To clarify this use-case, InputInitializerEvents are similar to
DataMovementEvents i.e. they are generated by a source vertex's task and sent
downstream.

[~sseth] For the most part, I believe the recovery logic seems fine as the
events are being stored and restored from the log. The only missing piece is
injecting the correct logic to handle their routing in routeRecoveredEvents().
This is one of the missing places that needs fixing as it does not use the
RouteEventTransition.

Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code


[jira] [Updated] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code

[
https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Siddharth Seth updated TEZ-1539:

Attachment: TEZ-1539.4.txt

Updated patch with the recovery fixes. Also adds two new unit tests to handle
multiple tasks, multiple sources, and different events.

If this looks good, I'll rename the RecoveryEvent just before commit to avoid
noise.

Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code


[jira] [Commented] (TEZ-1575) MRRSleepJob does not pick MR settings for container size and java opts


[ 
https://issues.apache.org/jira/browse/TEZ-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130956#comment-14130956
 ] 

Bikas Saha commented on TEZ-1575:
-

lgtm

 MRRSleepJob does not pick MR settings for container size and java opts
 --

 Key: TEZ-1575
 URL: https://issues.apache.org/jira/browse/TEZ-1575
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah
 Attachments: TEZ-1575.1.patch







[jira] [Updated] (TEZ-1267) Exception handling when Routing Events


 [ 
https://issues.apache.org/jira/browse/TEZ-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1267:

Priority: Critical  (was: Major)

 Exception handling when Routing Events
 --

 Key: TEZ-1267
 URL: https://issues.apache.org/jira/browse/TEZ-1267
 Project: Apache Tez
  Issue Type: Bug
Reporter: Siddharth Seth
Priority: Critical

 Events are generated by user code. In some places they're also handled by 
 user code within the AM. Currently, exceptions thrown while user code is 
 handling such events end up killing the AM (and hence lead to a retry).
 Instead, failure to handle such events should cause the application to fail.
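
One way to picture the proposed behaviour is a guard around the user-supplied handler. GuardedRouter below is a hypothetical sketch, not Tez code; it only assumes that a handler exception should be recorded as an application failure instead of propagating into the framework.

```java
import java.util.function.Consumer;

// Hypothetical sketch, not Tez code: wraps a user-supplied event handler so
// that an exception marks the application as failed instead of escaping
// into (and killing) the surrounding framework process.
final class GuardedRouter<E> {
    private final Consumer<E> userHandler;
    private String failureDiagnostics; // null while the application is healthy

    GuardedRouter(Consumer<E> userHandler) {
        this.userHandler = userHandler;
    }

    // Returns true if the event was handled, false if the app is now failed.
    boolean route(E event) {
        if (failureDiagnostics != null) {
            return false; // already failed; stop routing further events
        }
        try {
            userHandler.accept(event);
            return true;
        } catch (RuntimeException e) {
            // Record the failure for reporting rather than rethrowing.
            failureDiagnostics = "Error while handling event " + event + ": " + e;
            return false;
        }
    }

    boolean isFailed() {
        return failureDiagnostics != null;
    }

    String getFailureDiagnostics() {
        return failureDiagnostics;
    }
}
```

The recorded diagnostics could then be surfaced to the client as the reason for the failure, so a bug in user code is not mistaken for a framework crash that is worth retrying.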




[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code


[ 
https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130965#comment-14130965
 ] 

Hitesh Shah commented on TEZ-1539:
--

[~zjffdu] Could you keep a watch on this as this may affect your patches. 

[~sseth] change for routeRecoveredEvents looks fine. 

 Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
 

 Key: TEZ-1539
 URL: https://issues.apache.org/jira/browse/TEZ-1539
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, 
 TEZ-1539.4.txt



[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code


[ 
https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130993#comment-14130993
 ] 

Bikas Saha commented on TEZ-1539:
-

Please mark this as an incompatible change since behavior has changed wrt 0.5.0 
and users may need to adjust to it.
It's fine to get this in and iterate with an e2e example in a separate jira. But 
that's important to get done before we can be confident it works e2e.

 Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
 

 Key: TEZ-1539
 URL: https://issues.apache.org/jira/browse/TEZ-1539
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, 
 TEZ-1539.4.txt



[jira] [Updated] (TEZ-1571) Add create method for DataSinkDescriptor


 [ 
https://issues.apache.org/jira/browse/TEZ-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1571:

Affects Version/s: 0.5.0

 Add create method for DataSinkDescriptor
 

 Key: TEZ-1571
 URL: https://issues.apache.org/jira/browse/TEZ-1571
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Priority: Blocker

 Add create method for DataSinkDescriptor, and make the constructor private. 




[jira] [Updated] (TEZ-1571) Add create method for DataSinkDescriptor


 [ 
https://issues.apache.org/jira/browse/TEZ-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1571:

Priority: Blocker  (was: Major)

 Add create method for DataSinkDescriptor
 

 Key: TEZ-1571
 URL: https://issues.apache.org/jira/browse/TEZ-1571
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Priority: Blocker

 Add create method for DataSinkDescriptor, and make the constructor private. 




[jira] [Updated] (TEZ-853) Support counters recovery


 [ 
https://issues.apache.org/jira/browse/TEZ-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-853:
---
Attachment: Tez-853-4.patch

 Support counters recovery
 -

 Key: TEZ-853
 URL: https://issues.apache.org/jira/browse/TEZ-853
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: Tez-853-2.patch, Tez-853-3.patch, Tez-853-4.patch, 
 Tez-853.patch






--

[jira] [Updated] (TEZ-1345) Add checks to guarantee all init events are written to recovery to consider vertex initialized


 [ 
https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1345:

Attachment: TEZ-1345-13.patch

Attached a new patch that fixes the test failure.

 Add checks to guarantee all init events are written to recovery to consider 
 vertex initialized
 --

 Key: TEZ-1345
 URL: https://issues.apache.org/jira/browse/TEZ-1345
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Fix For: 0.6.0

 Attachments: TEZ-1345-13.patch, TEZ-1345.12.rebased.2.patch, 
 TEZ-1345.12.rebased.patch, Tez-1345-10.patch, Tez-1345-11.patch, 
 Tez-1345-12.patch, Tez-1345-2.patch, Tez-1345-3.patch, Tez-1345-4.patch, 
 Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345-7.patch, Tez-1345-8.patch, 
 Tez-1345-9.patch, Tez-1345.patch


 Related to issue discovered in TEZ-1033



--

[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code


[ 
https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131046#comment-14131046
 ] 

Jeff Zhang commented on TEZ-1539:
-

[~hitesh] I will keep watching this jira. 

Regarding recovery, I think this change does have an impact. When the source 
vertex has already succeeded, the InputInitializerEvent won't be regenerated, 
and we don't currently save these events in the recovery log. So I think we 
should save the InputInitializerEvent in the recovery log and also add a unit 
test for this.

 Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
 

 Key: TEZ-1539
 URL: https://issues.apache.org/jira/browse/TEZ-1539
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, 
 TEZ-1539.4.txt


 Specifically for InputInitializerEvents and VertexManagerEvents.
 Pasting comment from TEZ-1447
 In a majority of cases, events generated by different attempts of the same 
 task will be identical - in which case just making use of the event generated 
 by the first successful attempt is adequate. Doing something like this means 
 that users don't worry about retries, indices etc - and can just rely on 
 receiving a set of events which are to be processed once the vertex succeeds.
 If different attempts of the same workload generate different events - 
 processing is likely to be incorrect, since it's very possible for all data 
 to be processed (VERTEX successful), then a failure and retry - which 
 generates a different event. The initializer doesn't even run at this point, 
 since it's already done its work and is complete. Handling such scenarios 
 likely involves re-running the entire initializer and re-starting the vertex 
 which processed the event from scratch. In situations like this, where data 
 generated may be different, the best bet is for speculation to be disabled 
 (when it's supported), and max-attempts to be set to 1.
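
As a rough illustration of the fire-once-on-success idea described above, the sketch below holds back events per task attempt and releases only those of the first attempt of each task that succeeds. It is a minimal sketch under stated assumptions and not the code in the attached patches: the class name FireOnceOnSuccessBuffer, its methods, and the use of plain strings as events are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: release only the events of the first successful attempt of each task. */
final class FireOnceOnSuccessBuffer {
    // Events held back per attempt until the outcome of that attempt is known.
    private final Map<String, List<String>> pending = new HashMap<>();
    // Tasks whose events have already been released once.
    private final Set<Integer> firedTasks = new HashSet<>();

    private static String key(int taskId, int attemptId) {
        return taskId + "_" + attemptId;
    }

    /** Buffer an event generated by a running attempt. */
    void onEvent(int taskId, int attemptId, String event) {
        pending.computeIfAbsent(key(taskId, attemptId), k -> new ArrayList<String>()).add(event);
    }

    /** A failed attempt's events are dropped and never delivered. */
    void onAttemptFailed(int taskId, int attemptId) {
        pending.remove(key(taskId, attemptId));
    }

    /** Returns the events to deliver, or an empty list if this task has already fired. */
    List<String> onAttemptSucceeded(int taskId, int attemptId) {
        List<String> events = pending.remove(key(taskId, attemptId));
        if (!firedTasks.add(taskId)) {
            // A later successful attempt (for example a retry) is ignored.
            return Collections.emptyList();
        }
        return events == null ? new ArrayList<String>() : events;
    }
}
```

With this shape, the receiver never sees events from failed attempts and sees each task's events at most once. It does not address the case raised in the description where a retry produces different events after the vertex has already succeeded.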



--

[jira] [Updated] (TEZ-1571) Add create method for DataSinkDescriptor


 [ 
https://issues.apache.org/jira/browse/TEZ-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1571:

Attachment: Tez-1571.patch

Attach patch. 

 Add create method for DataSinkDescriptor
 

 Key: TEZ-1571
 URL: https://issues.apache.org/jira/browse/TEZ-1571
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Priority: Blocker
 Attachments: Tez-1571.patch


 Add create method for DataSinkDescriptor, and make the constructor private. 
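
For readers unfamiliar with the pattern being requested, here is a minimal sketch of a static create method paired with a private constructor. The class SinkDescriptorSketch and its two string fields are hypothetical stand-ins for illustration only; they are not the actual DataSinkDescriptor or its real parameters.

```java
import java.util.Objects;

/** Hypothetical stand-in for a descriptor class; not the actual Tez DataSinkDescriptor. */
final class SinkDescriptorSketch {
    private final String outputClassName;
    private final String committerClassName; // may be null when no committer is needed

    // Private constructor: callers cannot use "new" and must go through create().
    private SinkDescriptorSketch(String outputClassName, String committerClassName) {
        this.outputClassName = Objects.requireNonNull(outputClassName, "outputClassName");
        this.committerClassName = committerClassName;
    }

    /** Static factory method; the only way to obtain an instance. */
    static SinkDescriptorSketch create(String outputClassName, String committerClassName) {
        return new SinkDescriptorSketch(outputClassName, committerClassName);
    }

    String getOutputClassName() {
        return outputClassName;
    }

    String getCommitterClassName() {
        return committerClassName;
    }
}
```

A caller would write SinkDescriptorSketch.create("org.example.MyOutput", null) rather than invoking a constructor. Keeping construction behind a factory leaves room to change how instances are built later without breaking callers.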



--

[jira] [Updated] (TEZ-1569) Add tests for preemption


 [ 
https://issues.apache.org/jira/browse/TEZ-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1569:

Attachment: TEZ-1569.1.patch

 Add tests for preemption
 

 Key: TEZ-1569
 URL: https://issues.apache.org/jira/browse/TEZ-1569
 Project: Apache Tez
  Issue Type: Test
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-1569.1.patch






--

[jira] [Updated] (TEZ-1568) Add system test for propagation of diagnostics for errors


 [ 
https://issues.apache.org/jira/browse/TEZ-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1568:

Attachment: TEZ-1568.patch

 Add system test for propagation of diagnostics for errors
 -

 Key: TEZ-1568
 URL: https://issues.apache.org/jira/browse/TEZ-1568
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-1568.patch


 Design a system test where exceptions come from Input, Output, Processor, 
 InputInitializer and VertexManagerPlugin



--

[jira] [Commented] (TEZ-1568) Add system test for propagation of diagnostics for errors


[ 
https://issues.apache.org/jira/browse/TEZ-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131068#comment-14131068
 ] 

Jeff Zhang commented on TEZ-1568:
-

Attached the patch.

* Verifies that exceptions from Input/Output/Processor are propagated to the 
client side. The InputInitializer (II) and VertexManagerPlugin (VM) cases are 
left to [TEZ-1573|https://issues.apache.org/jira/browse/TEZ-1573].
* handleEvents of Output and Processor is not supported, so tests for them are 
not included. (I could only find ConsumeType of Input in Edge.)

 Add system test for propagation of diagnostics for errors
 -

 Key: TEZ-1568
 URL: https://issues.apache.org/jira/browse/TEZ-1568
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-1568.patch


 Design a system test where exceptions come from Input, Output, Processor, 
 InputInitializer and VertexManagerPlugin



--

[jira] [Commented] (TEZ-1539) Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code


[ 
https://issues.apache.org/jira/browse/TEZ-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131070#comment-14131070
 ] 

Bikas Saha commented on TEZ-1539:
-

bq. Agreed. The last patch done by Sid should be saving those events into the 
recovery log within the source vertex itself.
[~hitesh] from the patch it looks like II events are being stored in the 
destination vertex and not the source vertex, unless I am reading it wrong.
{code}
   if (!isEventFromVertex(vertex, tezEvent.getSourceInfo())) {
-    continue;
+    if (tezEvent.getEventType().equals(EventType.ROOT_INPUT_INITIALIZER_EVENT)) {
+      recoveryEvents.add(tezEvent);
+    } else {
+      continue;
+    }
   }
{code}
It might still work but shouldn't the flow be consistent with all other events 
where the events are stored in their source vertex? Might break things down the 
road.
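
To make the concern concrete, the following is a self-contained sketch of the filtering that the diff above appears to introduce. All types here (RecoveryFilterSketch, Event, the two-value EventType enum, and string vertex names) are hypothetical simplifications. It also assumes that events originating from the vertex itself are added to the recovery list further down in the loop, which the quoted snippet does not show.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the patched loop; not the actual Tez classes. */
final class RecoveryFilterSketch {
    enum EventType { ROOT_INPUT_INITIALIZER_EVENT, DATA_MOVEMENT_EVENT }

    static final class Event {
        final EventType type;
        final String sourceVertex;

        Event(EventType type, String sourceVertex) {
            this.type = type;
            this.sourceVertex = sourceVertex;
        }
    }

    /** Returns the events that the given vertex would write to its own recovery log. */
    static List<Event> selectForRecovery(String vertex, List<Event> events) {
        List<Event> recoveryEvents = new ArrayList<>();
        for (Event e : events) {
            boolean fromThisVertex = vertex.equals(e.sourceVertex);
            if (!fromThisVertex && e.type != EventType.ROOT_INPUT_INITIALIZER_EVENT) {
                continue; // events from other vertices are skipped, as before the patch
            }
            // Initializer events from another vertex now land here as well,
            // so they are recorded by the destination vertex.
            recoveryEvents.add(e);
        }
        return recoveryEvents;
    }
}
```

In this model, an initializer event whose source is v1 ends up in the recovery list of v2, which is the inconsistency with other event types that the comment points out.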

 Allow a FIRE_ONCE_ON_SUCCESS model for events generated by user code
 

 Key: TEZ-1539
 URL: https://issues.apache.org/jira/browse/TEZ-1539
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: TEZ-1539.1.wip.txt, TEZ-1539.2.txt, TEZ-1539.3.txt, 
 TEZ-1539.4.txt


 Specifically for InputInitializerEvents and VertexManagerEvents.
 Pasting comment from TEZ-1447
 In a majority of cases, events generated by different attempts of the same 
 task will be identical - in which case just making use of the event generated 
 by the first successful attempt is adequate. Doing something like this means 
 that users don't worry about retries, indices etc - and can just rely on 
 receiving a set of events which are to be processed once the vertex succeeds.
 If different attempts of the same workload generate different events - 
 processing is likely to be incorrect, since it's very possible for all data 
 to be processed (VERTEX successful), then a failure and retry - which 
 generates a different event. The initializer doesn't even run at this point, 
 since it's already done its work and is complete. Handling such scenarios 
 likely involves re-running the entire initializer and re-starting the vertex 
 which processed the event from scratch. In situations like this, where data 
 generated may be different, the best bet is for speculation to be disabled 
 (when it's supported), and max-attempts to be set to 1.



--

[jira] [Commented] (TEZ-1569) Add tests for preemption


[ 
https://issues.apache.org/jira/browse/TEZ-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131084#comment-14131084
 ] 

Bikas Saha commented on TEZ-1569:
-

The patch adds e2e tests for preemption by using local mode and a custom 
container launcher to quickly simulate job execution without launching any 
tasks. It simulates preemption by sending the preemption event to the engine in 
exactly the same manner in which YARN sends it. This enables us to explicitly 
specify the exact tasks we want to preempt and check for expected behavior.
The patch tests:
* In session mode, different combinations of DAG edges and vertices with 
different numbers of preempted attempts.
* In non-session mode, multiple preemptions for one case of a DAG.
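
The general shape of such a test can be sketched as a tiny simulation in which the test names the exact attempt to preempt and then inspects the result. This is only an illustrative sketch and not the code in the patch: PreemptionSimSketch, its methods, and the assumption that a preempted attempt is followed by a replacement attempt are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation: no real tasks run, attempts are just bookkeeping entries. */
final class PreemptionSimSketch {
    enum State { RUNNING, PREEMPTED }

    static final class Attempt {
        final int taskId;
        final int attemptId;
        State state = State.RUNNING;

        Attempt(int taskId, int attemptId) {
            this.taskId = taskId;
            this.attemptId = attemptId;
        }
    }

    private final List<Attempt> attempts = new ArrayList<>();

    /** Record a new running attempt for the task; attempt ids count up from 0 per task. */
    Attempt launch(int taskId) {
        int nextId = 0;
        for (Attempt a : attempts) {
            if (a.taskId == taskId) {
                nextId++;
            }
        }
        Attempt created = new Attempt(taskId, nextId);
        attempts.add(created);
        return created;
    }

    /** Test hook: preempt exactly the named attempt, then schedule a replacement. */
    Attempt preempt(int taskId, int attemptId) {
        Attempt target = null;
        for (Attempt a : attempts) {
            if (a.taskId == taskId && a.attemptId == attemptId && a.state == State.RUNNING) {
                target = a;
            }
        }
        if (target == null) {
            throw new IllegalArgumentException("no running attempt " + taskId + "_" + attemptId);
        }
        target.state = State.PREEMPTED;
        return launch(taskId); // assumption: a preempted attempt is retried
    }
}
```

A test built on this would launch an attempt, call preempt with that attempt's ids, and assert that the original is marked PREEMPTED while a new attempt with the next id is RUNNING.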

[~tassapola] Please review. Thanks!

 Add tests for preemption
 

 Key: TEZ-1569
 URL: https://issues.apache.org/jira/browse/TEZ-1569
 Project: Apache Tez
  Issue Type: Test
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-1569.1.patch






--

[jira] [Updated] (TEZ-1569) Add tests for preemption


 [ 
https://issues.apache.org/jira/browse/TEZ-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1569:

Attachment: (was: TEZ-1569.1.patch)

 Add tests for preemption
 

 Key: TEZ-1569
 URL: https://issues.apache.org/jira/browse/TEZ-1569
 Project: Apache Tez
  Issue Type: Test
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-1569.1.patch






--

[jira] [Updated] (TEZ-1569) Add tests for preemption


 [ 
https://issues.apache.org/jira/browse/TEZ-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1569:

Attachment: TEZ-1569.1.patch

Attaching patch with more comments.

 Add tests for preemption
 

 Key: TEZ-1569
 URL: https://issues.apache.org/jira/browse/TEZ-1569
 Project: Apache Tez
  Issue Type: Test
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-1569.1.patch






--

[jira] [Commented] (TEZ-1267) Exception handling when Routing Events