[jira] [Updated] (TEZ-1757) Column selector for tables.

2014-11-08 Thread Sreenath Somarajapuram (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sreenath Somarajapuram updated TEZ-1757:

Attachment: TEZ-1757.1.patch

Generic column selector for all tables.

> Column selector for tables.
> ---
>
> Key: TEZ-1757
> URL: https://issues.apache.org/jira/browse/TEZ-1757
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Sreenath Somarajapuram
>Assignee: Sreenath Somarajapuram
> Attachments: TEZ-1757.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686

2014-11-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203768#comment-14203768
 ] 

Jeff Zhang commented on TEZ-1761:
-

Yes, I have committed it to branch-0.5.

Changing the target version to 0.5.3.

> TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to 
> TEZ-1686
> 
>
> Key: TEZ-1761
> URL: https://issues.apache.org/jira/browse/TEZ-1761
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1761.patch
>
>
> Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< 
> FAILURE!
> testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser)  Time 
> elapsed: 0.053 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<100>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93)





[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686

2014-11-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203766#comment-14203766
 ] 

Hitesh Shah commented on TEZ-1761:
--

That should be fine for 0.5.3. Has this been committed? The jira is still 
unresolved. 

> TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to 
> TEZ-1686
> 
>
> Key: TEZ-1761
> URL: https://issues.apache.org/jira/browse/TEZ-1761
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1761.patch
>
>
> Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< 
> FAILURE!
> testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser)  Time 
> elapsed: 0.053 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<100>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93)





[jira] [Comment Edited] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686

2014-11-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203766#comment-14203766
 ] 

Hitesh Shah edited comment on TEZ-1761 at 11/9/14 4:41 AM:
---

This should be fine for 0.5.3 as it is a trivial bug fix. Has this been 
committed? The jira is still unresolved. 


was (Author: hitesh):
That should be fine for 0.5.3. Has this been committed? The jira is still 
unresolved. 

> TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to 
> TEZ-1686
> 
>
> Key: TEZ-1761
> URL: https://issues.apache.org/jira/browse/TEZ-1761
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1761.patch
>
>
> Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< 
> FAILURE!
> testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser)  Time 
> elapsed: 0.053 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<100>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93)





[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686

2014-11-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203756#comment-14203756
 ] 

Jeff Zhang commented on TEZ-1761:
-

Sorry, [~hitesh], should it not have been committed to branch-0.5? I see the 
target version is 0.6.

> TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to 
> TEZ-1686
> 
>
> Key: TEZ-1761
> URL: https://issues.apache.org/jira/browse/TEZ-1761
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1761.patch
>
>
> Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< 
> FAILURE!
> testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser)  Time 
> elapsed: 0.053 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<100>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93)





[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686

2014-11-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203755#comment-14203755
 ] 

Jeff Zhang commented on TEZ-1761:
-

Committed to master & branch-0.5

> TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to 
> TEZ-1686
> 
>
> Key: TEZ-1761
> URL: https://issues.apache.org/jira/browse/TEZ-1761
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1761.patch
>
>
> Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< 
> FAILURE!
> testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser)  Time 
> elapsed: 0.053 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<100>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93)





[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686

2014-11-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203748#comment-14203748
 ] 

Hitesh Shah commented on TEZ-1761:
--

+1 

> TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to 
> TEZ-1686
> 
>
> Key: TEZ-1761
> URL: https://issues.apache.org/jira/browse/TEZ-1761
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1761.patch
>
>
> Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< 
> FAILURE!
> testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser)  Time 
> elapsed: 0.053 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<100>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93)





[jira] [Updated] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686

2014-11-08 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1761:

Attachment: TEZ-1761.patch

[~hitesh] Attached the patch; please help review.

> TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to 
> TEZ-1686
> 
>
> Key: TEZ-1761
> URL: https://issues.apache.org/jira/browse/TEZ-1761
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1761.patch
>
>
> Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< 
> FAILURE!
> testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser)  Time 
> elapsed: 0.053 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<100>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93)





[jira] [Commented] (TEZ-1736) Add support for Inputs/Outputs in runtime-library to generate history text data

2014-11-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203730#comment-14203730
 ] 

Hitesh Shah commented on TEZ-1736:
--

[~sseth] Mind doing a review of the latest patch to see if the property 
handling is being done correctly? 

> Add support for Inputs/Outputs in runtime-library to generate history text 
> data
> ---
>
> Key: TEZ-1736
> URL: https://issues.apache.org/jira/browse/TEZ-1736
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: TEZ-1736.1.patch, TEZ-1736.2.patch, TEZ-1736.3.patch, 
> TEZ-1736.4.patch, TEZ-1736.5.patch, TEZ-1736.6.patch, TEZ-1736.7.patch
>
>
> The userpayload related setHistoryText has been available for some time but 
> is not used by the Inputs/Outputs in the run-time library. 





[jira] [Updated] (TEZ-1736) Add support for Inputs/Outputs in runtime-library to generate history text data

2014-11-08 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1736:
-
Attachment: TEZ-1736.7.patch

> Add support for Inputs/Outputs in runtime-library to generate history text 
> data
> ---
>
> Key: TEZ-1736
> URL: https://issues.apache.org/jira/browse/TEZ-1736
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: TEZ-1736.1.patch, TEZ-1736.2.patch, TEZ-1736.3.patch, 
> TEZ-1736.4.patch, TEZ-1736.5.patch, TEZ-1736.6.patch, TEZ-1736.7.patch
>
>
> The userpayload related setHistoryText has been available for some time but 
> is not used by the Inputs/Outputs in the run-time library. 





[jira] [Created] (TEZ-1764) Data generated in SimpleHistoryLogger could be less verbose

2014-11-08 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1764:


 Summary: Data generated in SimpleHistoryLogger could be less 
verbose
 Key: TEZ-1764
 URL: https://issues.apache.org/jira/browse/TEZ-1764
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah


The structure was initially kept quite similar to Timeline entities. However, 
sections such as related entities are in most cases not needed as the data is 
replicated in primary filters and/or other info.





[jira] [Created] (TEZ-1763) Support tee of history events to both a file based logger as well as Timeline

2014-11-08 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1763:


 Summary: Support tee of history events to both a file based logger 
as well as Timeline
 Key: TEZ-1763
 URL: https://issues.apache.org/jira/browse/TEZ-1763
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah








[jira] [Commented] (TEZ-1762) Lots of unit tests do not have timeout parameter set

2014-11-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203660#comment-14203660
 ] 

Hitesh Shah commented on TEZ-1762:
--

A generic 5 second timeout could be added to all of them with modifications if 
needed based on actual run times.
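
As a rough illustration of what such a timeout does: JUnit 4 (which the 
org.junit.Assert frames in the TEZ-1761 trace suggest the tests use) accepts a 
timeout in milliseconds on the annotation, so a 5 second limit would be 
written `@Test(timeout = 5000)`. The plain-Java sketch below imitates that 
behaviour without JUnit. `TimeoutSketch`, `finishesWithin` and `sleepQuietly` 
are hypothetical names for illustration only, not Tez or JUnit APIs.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tez or JUnit: runs a test body under a time limit.
final class TimeoutSketch {

    // Returns true if the body finished within timeoutMillis, false if it timed out.
    static boolean finishesWithin(Runnable body, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> result = executor.submit(body);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("test body failed", e.getCause());
        } finally {
            // Interrupts the body if it is still running so the worker thread can exit.
            executor.shutdownNow();
        }
    }

    // Sleeps, returning early (with the interrupt flag restored) if interrupted.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // A fast body passes under a generic 5 second limit.
        System.out.println(finishesWithin(() -> { }, 5000));
        // A slow body is reported as timed out under a tight limit.
        System.out.println(finishesWithin(() -> sleepQuietly(2000), 100));
    }
}
```

The limit is a parameter here for the same reason the comment above leaves 
room for modifications: tests with longer actual run times would need a value 
larger than the generic default.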

> Lots of unit tests do not have timeout parameter set 
> -
>
> Key: TEZ-1762
> URL: https://issues.apache.org/jira/browse/TEZ-1762
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>






[jira] [Created] (TEZ-1762) Lots of unit tests do not have timeout parameter set

2014-11-08 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1762:


 Summary: Lots of unit tests do not have timeout parameter set 
 Key: TEZ-1762
 URL: https://issues.apache.org/jira/browse/TEZ-1762
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah








[jira] [Created] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686

2014-11-08 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1761:


 Summary: TestRecoveryParser::testGetLastInProgressDAG fails in 
similar manner to TEZ-1686
 Key: TEZ-1761
 URL: https://issues.apache.org/jira/browse/TEZ-1761
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang


Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< 
FAILURE!
testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser)  Time 
elapsed: 0.053 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<100>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93)






[jira] [Created] (TEZ-1760) DAGSchedulerNaturalOrderControlled should consider cluster capacity when waiting for source vertices

2014-11-08 Thread Siddharth Seth (JIRA)
Siddharth Seth created TEZ-1760:
---

 Summary: DAGSchedulerNaturalOrderControlled should consider 
cluster capacity when waiting for source vertices
 Key: TEZ-1760
 URL: https://issues.apache.org/jira/browse/TEZ-1760
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth








[jira] [Created] (TEZ-1759) Add slow start capabilities to DAGSchedulerNaturalOrderControlled

2014-11-08 Thread Siddharth Seth (JIRA)
Siddharth Seth created TEZ-1759:
---

 Summary: Add slow start capabilities to 
DAGSchedulerNaturalOrderControlled
 Key: TEZ-1759
 URL: https://issues.apache.org/jira/browse/TEZ-1759
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth








[jira] [Commented] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled

2014-11-08 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203629#comment-14203629
 ] 

Siddharth Seth commented on TEZ-1750:
-

Thanks for the reviews. Committing.

bq. This will likely need a bunch of experimentation to see benefits/issues. 
That will give more data on this approach. Until important points like 
different edge types etc. are addressed this probably should not be made 
default.
We'll get to this in 0.6. The issue where we end up doing out of order 
scheduling is a bigger one in terms of performance. Pre-emption helps but 
performance would still suffer when multiple DAGs per session come into play.

> Add a DAGScheduler which schedules tasks only when sources have been scheduled
> --
>
> Key: TEZ-1750
> URL: https://issues.apache.org/jira/browse/TEZ-1750
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Critical
> Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt, TEZ-1750.3.txt
>
>
> Splitting out the patch on TEZ-1522 into a separate jira.
> There are several scenarios in which we end up scheduling downstream tasks 
> before their sources have been scheduled - and then get into a situation 
> where the sources are starved. Currently, anywhere a ShuffleVertexManager is 
> used can cause such behaviour - since it starts scheduling its tasks after a 
> certain number of sources are complete, but subsequent non-shuffle 
> VertexManagers will schedule immediately.
> Disabling slow-start is one option to achieve this (or setting slow start on 
> all vertices), but it doesn't work for the situation where dynamic reducer 
> parallelism kicks in - since it has to wait for source tasks to complete.
> The intent here is to add a DAGScheduler, which effectively negates the slow 
> start, and in case of dynamic parallelism determination, waits for upstream 
> tasks to be scheduled before scheduling downstream tasks.
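
The ordering rule in the description above can be sketched as follows. This is 
only an illustration under simplifying assumptions: it works per vertex rather 
than per task, and it ignores cluster capacity, edge types and dynamic 
parallelism. `SourceOrderedSchedulerSketch` and all of its methods are 
hypothetical names; this is not the code from the TEZ-1750 patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: holds back a vertex until all of its sources are scheduled.
final class SourceOrderedSchedulerSketch {
    private final Map<String, List<String>> sourcesOf = new HashMap<>();
    private final Set<String> scheduled = new HashSet<>();
    private final List<String> waiting = new ArrayList<>();
    private final List<String> scheduleOrder = new ArrayList<>();

    // Registers a vertex together with the names of its source vertices.
    void addVertex(String vertex, String... sources) {
        sourcesOf.put(vertex, Arrays.asList(sources));
    }

    // Called when a vertex asks to be scheduled; it may be deferred.
    void requestSchedule(String vertex) {
        waiting.add(vertex);
        drain();
    }

    // Repeatedly schedules every waiting vertex whose sources are all scheduled.
    private void drain() {
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Iterator<String> it = waiting.iterator(); it.hasNext();) {
                String vertex = it.next();
                List<String> sources =
                    sourcesOf.getOrDefault(vertex, Collections.<String>emptyList());
                if (scheduled.containsAll(sources)) {
                    it.remove();
                    scheduled.add(vertex);
                    scheduleOrder.add(vertex);
                    progress = true;
                }
            }
        }
    }

    List<String> scheduleOrder() {
        return new ArrayList<>(scheduleOrder);
    }

    // Requests arrive downstream-first; the returned order is source-first.
    static List<String> demoOrder() {
        SourceOrderedSchedulerSketch scheduler = new SourceOrderedSchedulerSketch();
        scheduler.addVertex("map");
        scheduler.addVertex("reduce", "map");
        scheduler.addVertex("output", "reduce");
        scheduler.requestSchedule("output");
        scheduler.requestSchedule("reduce");
        scheduler.requestSchedule("map");
        return scheduler.scheduleOrder();
    }

    public static void main(String[] args) {
        System.out.println(demoOrder()); // prints [map, reduce, output]
    }
}
```

In the demo the downstream vertices ask first, as can happen when a consumer 
has no slow start, yet nothing is scheduled until "map" is requested, so the 
sources are never starved by their own consumers.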





[jira] [Commented] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled

2014-11-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203598#comment-14203598
 ] 

Bikas Saha commented on TEZ-1750:
-

This will likely need a bunch of experimentation to see benefits/issues. That 
will give more data on this approach. Until important points like different 
edge types etc. are addressed this probably should not be made default. 
Hopefully follow up jiras will address these and/or other issues found with 
more experiments and increase confidence.

Making this pluggable via config would be good to have so that folks can 
experiment with this approach or other kinds of DAG scheduling, without having 
to change the main code. So let's get this into master and also into 0.5.3.


> Add a DAGScheduler which schedules tasks only when sources have been scheduled
> --
>
> Key: TEZ-1750
> URL: https://issues.apache.org/jira/browse/TEZ-1750
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Critical
> Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt, TEZ-1750.3.txt
>
>
> Splitting out the patch on TEZ-1522 into a separate jira.
> There are several scenarios in which we end up scheduling downstream tasks 
> before their sources have been scheduled - and then get into a situation 
> where the sources are starved. Currently, anywhere a ShuffleVertexManager is 
> used can cause such behaviour - since it starts scheduling its tasks after a 
> certain number of sources are complete, but subsequent non-shuffle 
> VertexManagers will schedule immediately.
> Disabling slow-start is one option to achieve this (or setting slow start on 
> all vertices), but it doesn't work for the situation where dynamic reducer 
> parallelism kicks in - since it has to wait for source tasks to complete.
> The intent here is to add a DAGScheduler, which effectively negates the slow 
> start, and in case of dynamic parallelism determination, waits for upstream 
> tasks to be scheduled before scheduling downstream tasks.





[jira] [Comment Edited] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled

2014-11-08 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203590#comment-14203590
 ] 

Siddharth Seth edited comment on TEZ-1750 at 11/8/14 8:52 PM:
--

Updated with expert level setting.

There are comments in there on some future enhancements - considering cluster 
capacity to allow scheduling downstream even if upstream is not scheduled, 
generic slow start.

Not enabling this by default in 0.5.3, because I think it's not a good change 
to have in place on a minor version. On 0.6.0, based on how this performs, it 
could be enabled by default.


was (Author: sseth):
Updated with experimental.

There are comments in there on some future enhancements - considering cluster 
capacity to allow scheduling downstream even if upstream is not scheduled, 
generic slow start.

Not enabling this by default in 0.5.3, because I think it's not a good change 
to have in place on a minor version. On 0.6.0, based on how this performs, it 
could be enabled by default.

> Add a DAGScheduler which schedules tasks only when sources have been scheduled
> --
>
> Key: TEZ-1750
> URL: https://issues.apache.org/jira/browse/TEZ-1750
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Critical
> Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt, TEZ-1750.3.txt
>
>
> Splitting out the patch on TEZ-1522 into a separate jira.
> There are several scenarios in which we end up scheduling downstream tasks 
> before their sources have been scheduled - and then get into a situation 
> where the sources are starved. Currently, anywhere a ShuffleVertexManager is 
> used can cause such behaviour - since it starts scheduling its tasks after a 
> certain number of sources are complete, but subsequent non-shuffle 
> VertexManagers will schedule immediately.
> Disabling slow-start is one option to achieve this (or setting slow start on 
> all vertices), but it doesn't work for the situation where dynamic reducer 
> parallelism kicks in - since it has to wait for source tasks to complete.
> The intent here is to add a DAGScheduler, which effectively negates the slow 
> start, and in case of dynamic parallelism determination, waits for upstream 
> tasks to be scheduled before scheduling downstream tasks.





[jira] [Updated] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled

2014-11-08 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1750:

Attachment: TEZ-1750.3.txt

Updated with experimental.

There are comments in there on some future enhancements - considering cluster 
capacity to allow scheduling downstream even if upstream is not scheduled, 
generic slow start.

Not enabling this by default in 0.5.3, because I think it's not a good change 
to have in place on a minor version. On 0.6.0, based on how this performs, it 
could be enabled by default.

> Add a DAGScheduler which schedules tasks only when sources have been scheduled
> --
>
> Key: TEZ-1750
> URL: https://issues.apache.org/jira/browse/TEZ-1750
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Critical
> Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt, TEZ-1750.3.txt
>
>
> Splitting out the patch on TEZ-1522 into a separate jira.
> There are several scenarios in which we end up scheduling downstream tasks 
> before their sources have been scheduled - and then get into a situation 
> where the sources are starved. Currently, anywhere a ShuffleVertexManager is 
> used can cause such behaviour - since it starts scheduling its tasks after a 
> certain number of sources are complete, but subsequent non-shuffle 
> VertexManagers will schedule immediately.
> Disabling slow-start is one option to achieve this (or setting slow start on 
> all vertices), but it doesn't work for the situation where dynamic reducer 
> parallelism kicks in - since it has to wait for source tasks to complete.
> The intent here is to add a DAGScheduler, which effectively negates the slow 
> start, and in case of dynamic parallelism determination, waits for upstream 
> tasks to be scheduled before scheduling downstream tasks.





[jira] [Commented] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled

2014-11-08 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203589#comment-14203589
 ] 

Siddharth Seth commented on TEZ-1750:
-

Marked as Expert level. I don't think unstable is required though... it allows 
the DAGScheduler to be configured, which is something we do need.

1-1 edges can be problematic in such cases. Realistically, though - if the 
consumer of a 1-1 edge produces final output, at the moment - that isn't read 
till the DAG completes. Otherwise, it's likely to be connected to some other 
source which requires all outputs.
This could be addressed, in a future patch, with additional details being sent 
over in the Task launch request.

The bigger problem this addresses is the slow start (either configured or due 
to Reduce parallelism changes), on a producer, but no slow start on downstream 
vertices - which can cause some bad scheduling behaviour. One intent of slow 
start is to prevent unnecessary cluster utilization - however we can end up 
with situations where we not only end up using the cluster unnecessarily, but 
in the process also harm the currently executing DAG.

> Add a DAGScheduler which schedules tasks only when sources have been scheduled
> --
>
> Key: TEZ-1750
> URL: https://issues.apache.org/jira/browse/TEZ-1750
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Critical
> Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt
>
>
> Splitting out the patch on TEZ-1522 into a separate jira.
> There are several scenarios in which we end up scheduling downstream tasks 
> before their sources have been scheduled - and then get into a situation 
> where the sources are starved. Currently, anywhere a ShuffleVertexManager is 
> used can cause such behaviour - since it starts scheduling its tasks after a 
> certain number of sources are complete, but subsequent non-shuffle 
> VertexManagers will schedule immediately.
> Disabling slow-start is one option to achieve this (or setting slow start on 
> all vertices), but it doesn't work for the situation where dynamic reducer 
> parallelism kicks in - since it has to wait for source tasks to complete.
> The intent here is to add a DAGScheduler, which effectively negates the slow 
> start, and in case of dynamic parallelism determination, waits for upstream 
> tasks to be scheduled before scheduling downstream tasks.





[jira] [Updated] (TEZ-1502) TestDriver cannot be executed via any jar in the tez distribution

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1502:

Target Version/s: 0.6.0  (was: 0.5.2)

> TestDriver cannot be executed via any jar in the tez distribution
> -
>
> Key: TEZ-1502
> URL: https://issues.apache.org/jira/browse/TEZ-1502
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
>
> TestDriver contains the FaultTolerance tests. Ideally, TEZ-1501 should belong 
> here as well.





[jira] [Updated] (TEZ-1526) LoadingCache for TezTaskID slow for large jobs

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1526:

Target Version/s: 0.6.0  (was: 0.5.2)

> LoadingCache for TezTaskID slow for large jobs
> --
>
> Key: TEZ-1526
> URL: https://issues.apache.org/jira/browse/TEZ-1526
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>  Labels: performance
> Attachments: 10-TezTaskIDs.patch, TEZ-1526-v1.patch, 
> TEZ-1526-v2.patch
>
>
> Using the LoadingCache with default builder settings. 100,000 TezTaskIDs are 
> created in 10 seconds on my setup. With a LoadingCache initialCapacity of 
> 10,000 they are created in 300 ms. With no LoadingCache, they are created in 
> 10 ms. A test case is attached to illustrate the condition I would like to be 
> sped up.





[jira] [Updated] (TEZ-1348) Ensure local directories are used when using local-mode

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1348:

Target Version/s: 0.6.0  (was: 0.5.2)

> Ensure local directories are used when using local-mode
> ---
>
> Key: TEZ-1348
> URL: https://issues.apache.org/jira/browse/TEZ-1348
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Priority: Critical
>
> In TEZ-717, I incorrectly thought setting fs.defaultFS programmatically in 
> tez-site would work for local mode.
> Currently the requirement is that tez-site.xml must have fs.defaultFS set to 
> file:///.
> While that works, it doesn't allow for seamless execution in either 
> local-mode or on a cluster.
> The main issue here is that when Inputs / Outputs are configured - they use a 
> version of configuration which reads tez-site, and do not use the 
> configuration from the client itself (which is correct behaviour).
> Not sure what a good way to fix this is 
> 1) It may be possible to override this value each time an instance of 
> Configuration/TezConfiguration is created. One possible way would be to 
> statically add a default resource to Configuration the moment a local client 
> is created.
> 2) Provide information in the contexts on whether this is local or not. This 
> is fairly ugly, and would get in the way of running mixed mode tasks.
> Does anyone have other suggestions?





[jira] [Updated] (TEZ-1576) Class level comment in {{MiniTezCluster}} ends abruptly

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1576:

Target Version/s: 0.6.0  (was: 0.5.2)

> Class level comment in {{MiniTezCluster}} ends abruptly
> ---
>
> Key: TEZ-1576
> URL: https://issues.apache.org/jira/browse/TEZ-1576
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ufuk Celebi
>Assignee: Saurabh Chhajed
>Priority: Trivial
>
> The class level comment in {{MiniTezCluster}} ends abruptly:
> {code}
> /**
>  * Configures and starts the Tez-specific components in the YARN cluster.
>  *
>  * When using this mini cluster, the user is expected to
>  */
> {code}





[jira] [Updated] (TEZ-1564) State machine error: Invalid event: T_SCHEDULE at SCHEDULED

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1564:

Target Version/s: 0.6.0  (was: 0.5.2)

> State machine error: Invalid event: T_SCHEDULE at SCHEDULED
> ---
>
> Key: TEZ-1564
> URL: https://issues.apache.org/jira/browse/TEZ-1564
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Critical
> Attachments: applogs.txt.tar.gz, dag.dot
>
>
> ERROR [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskImpl: Can't handle this event at current 
> state for task_1409722953518_0162_1_07_00
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> T_SCHEDULE at SCHEDULED
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:827)
>   at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:95)
>   at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1604)
>   at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1590)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>   at java.lang.Thread.run(Thread.java:724)
> I will attach the dag + app logs soon.





[jira] [Updated] (TEZ-1315) Make use of cache information added to InputSplit in MAPREDUCE-5896

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1315:

Target Version/s: 0.6.0  (was: 0.5.2)

> Make use of cache information added to InputSplit in MAPREDUCE-5896
> ---
>
> Key: TEZ-1315
> URL: https://issues.apache.org/jira/browse/TEZ-1315
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>






[jira] [Updated] (TEZ-1541) DAGAppMaster can get stuck on shutdown if the RM is no longer around

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1541:

Target Version/s: 0.6.0  (was: 0.5.2)

> DAGAppMaster can get stuck on shutdown if the RM is no longer around
> 
>
> Key: TEZ-1541
> URL: https://issues.apache.org/jira/browse/TEZ-1541
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Siddharth Seth
> Attachments: dagapp.threads.txt
>
>






[jira] [Updated] (TEZ-1444) Wrap interrupted in a specific Exception

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1444:

Target Version/s: 0.6.0  (was: 0.5.2)

> Wrap interrupted in a specific Exception
> 
>
> Key: TEZ-1444
> URL: https://issues.apache.org/jira/browse/TEZ-1444
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Priority: Critical
>
> TEZ-1331 changed some APIs to throw InterruptedException in an IOException 
> wrapper.
> It's useful to be able to determine whether the exception thrown was 
> "interrupted" or not, since there's a good chance the interrupt was caused by 
> user code itself, in which case Interrupts can be caught and ignored by the 
> user code which likely issued the interrupt.
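
The idea in this description can be sketched in Java as follows. This is only an illustration under stated assumptions: the type name IOInterruptedException and both helper methods are hypothetical and are not taken from the Tez codebase. The sketch shows how a dedicated IOException subclass lets a caller tell an interrupt apart from other I/O failures without inspecting the cause chain.

```java
import java.io.IOException;

// Hypothetical wrapper type; the name is illustrative and not part of the Tez API.
class IOInterruptedException extends IOException {
  IOInterruptedException(String message, InterruptedException cause) {
    super(message, cause);
  }
}

class InterruptWrappingSketch {
  // Stands in for a blocking API whose signature only permits IOException.
  static void blockingCall(boolean simulateInterrupt) throws IOException {
    try {
      if (simulateInterrupt) {
        throw new InterruptedException("simulated interrupt");
      }
    } catch (InterruptedException e) {
      // Restore the interrupt flag so code further up the stack can still see it.
      Thread.currentThread().interrupt();
      throw new IOInterruptedException("interrupted while waiting", e);
    }
  }

  // Returns "ok", "interrupted" or "io-error" so the branch taken is easy to observe.
  static String classify(boolean simulateInterrupt) {
    try {
      blockingCall(simulateInterrupt);
      return "ok";
    } catch (IOInterruptedException e) {
      // The caller issued the interrupt itself in this sketch, so it clears the flag and moves on.
      Thread.interrupted();
      return "interrupted";
    } catch (IOException e) {
      return "io-error";
    }
  }
}
```

With this shape, user code that issued the interrupt can catch the specific type and ignore it, while every other IOException still propagates as a real failure.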





[jira] [Updated] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1421:

Target Version/s: 0.6.0  (was: 0.5.2)

> MRCombiner throws NPE in MapredWordCount on master branch
> -
>
> Key: TEZ-1421
> URL: https://issues.apache.org/jira/browse/TEZ-1421
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Tsuyoshi OZAWA
>
> I tested MapredWordCount against 70GB generated by RandomTextWriter. When a 
> Combiner runs, it throws an NPE. It looks like setCombinerClass doesn't work 
> correctly.
> {quote}
> Caused by: java.lang.RuntimeException: java.lang.NullPointerException
> at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
> at 
> org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122)
> at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112)
> at 
> org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472)
> at 
> org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605)
> at 
> org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89)
> {quote}





[jira] [Updated] (TEZ-1721) Update INSTALL instructions for clarifying tez client jars compatibility with runtime tarball on HDFS

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1721:

Target Version/s: 0.6.0  (was: 0.5.2)

> Update INSTALL instructions for clarifying tez client jars compatibility with 
> runtime tarball on HDFS
> -
>
> Key: TEZ-1721
> URL: https://issues.apache.org/jira/browse/TEZ-1721
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Critical
> Attachments: TEZ-1721.1.patch
>
>






[jira] [Updated] (TEZ-1736) Add support for Inputs/Outputs in runtime-library to generate history text data

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1736:

Target Version/s: 0.6.0  (was: 0.5.2)

> Add support for Inputs/Outputs in runtime-library to generate history text 
> data
> ---
>
> Key: TEZ-1736
> URL: https://issues.apache.org/jira/browse/TEZ-1736
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: TEZ-1736.1.patch, TEZ-1736.2.patch, TEZ-1736.3.patch, 
> TEZ-1736.4.patch, TEZ-1736.5.patch, TEZ-1736.6.patch
>
>
> The userpayload related setHistoryText has been available for some time but 
> is not used by the Inputs/Outputs in the run-time library. 





[jira] [Updated] (TEZ-1663) Change diagnostics to a more parseable format for easy consumption by UI

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1663:

Target Version/s: 0.6.0  (was: 0.5.2)

> Change diagnostics to a more parseable format for easy consumption by UI
> 
>
> Key: TEZ-1663
> URL: https://issues.apache.org/jira/browse/TEZ-1663
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>
> A better format for diagnostics will allow the UI to parse it more easily and 
> also provide information back to user in a better manner. 





[jira] [Updated] (TEZ-1697) DAG submission fails if a local resource added is already part of tez.lib.uris

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1697:

Target Version/s: 0.6.0  (was: 0.5.2)

> DAG submission fails if a local resource added is already part of tez.lib.uris
> --
>
> Key: TEZ-1697
> URL: https://issues.apache.org/jira/browse/TEZ-1697
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>






[jira] [Updated] (TEZ-1723) Flaky test: TestVertexImpl::testVertexWithMultipleInitializers1

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1723:

Target Version/s: 0.6.0  (was: 0.5.2)

> Flaky test: TestVertexImpl::testVertexWithMultipleInitializers1
> ---
>
> Key: TEZ-1723
> URL: https://issues.apache.org/jira/browse/TEZ-1723
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>
> https://builds.apache.org/job/Tez-Build/723
> testVertexWithMultipleInitializers1(org.apache.tez.dag.app.dag.impl.TestVertexImpl)
>   Time elapsed: 0.024 sec  <<< FAILURE!
> java.lang.AssertionError: expected: but was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.tez.dag.app.dag.impl.TestVertexImpl.testVertexWithMultipleInitializers1(TestVertexImpl.java:4234)





[jira] [Updated] (TEZ-1734) Vertex's taskNum may be -1 when recovered from NEW to FAILED/KILLED

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1734:

Target Version/s: 0.6.0  (was: 0.5.2)

> Vertex's taskNum may be -1 when recovered from NEW to FAILED/KILLED
> ---
>
> Key: TEZ-1734
> URL: https://issues.apache.org/jira/browse/TEZ-1734
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.1
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-1734-2.patch, TEZ-1734.patch
>
>
> When a vertex is recovered from NEW to FAILED/KILLED, its taskNum may be -1. In 
> this case, we don't need to recover its tasks.





[jira] [Updated] (TEZ-1546) Change InputInitializerContext.registerForVertexStateUpdates to return a list of pending state changes

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1546:

Target Version/s: 0.6.0  (was: 0.5.2)

> Change InputInitializerContext.registerForVertexStateUpdates to return a list 
> of pending state changes
> --
>
> Key: TEZ-1546
> URL: https://issues.apache.org/jira/browse/TEZ-1546
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Critical
>
> Sending pending events via the stateChange on the InputInitializer can be 
> confusing, since multiple calls will be made back to back without knowing 
> how many events are coming in and which one is the last.
> Returning all past state changes via register ensures invocations of 
> onStateChanged are current events.
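
A minimal sketch of the proposed contract, in Java. All type and method names below (StateUpdate, StateListener, StateUpdateRegistry, register, publish) are hypothetical stand-ins and not the Tez API; the sketch also ignores per-vertex registration for brevity. It shows registration returning a snapshot of past state changes, so the callback only ever fires for changes that happen after registration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value type describing one vertex state change.
class StateUpdate {
  final String vertexName;
  final String state;

  StateUpdate(String vertexName, String state) {
    this.vertexName = vertexName;
    this.state = state;
  }
}

// Hypothetical callback, playing the role of onStateChanged in the description.
interface StateListener {
  void onStateUpdated(StateUpdate update);
}

class StateUpdateRegistry {
  private final List<StateUpdate> history = new ArrayList<>();
  private final List<StateListener> listeners = new ArrayList<>();

  // Registers the listener and returns a copy of every update published so far.
  // Past updates are NOT replayed through the callback.
  synchronized List<StateUpdate> register(StateListener listener) {
    listeners.add(listener);
    return new ArrayList<>(history);
  }

  // Records the update and notifies listeners that are already registered.
  synchronized void publish(StateUpdate update) {
    history.add(update);
    for (StateListener listener : listeners) {
      listener.onStateUpdated(update);
    }
  }
}
```

Because register and publish share one lock, an update is either in the returned snapshot or delivered through the callback, never both and never neither.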





[jira] [Updated] (TEZ-1397) Node affinity for tasks processing the same splits

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1397:

Target Version/s: 0.6.0  (was: 0.5.2)

> Node affinity for tasks processing the same splits
> --
>
> Key: TEZ-1397
> URL: https://issues.apache.org/jira/browse/TEZ-1397
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>
> Within a session, if the same set of HDFS blocks are accessed by different 
> tasks - these should ideally be launched on the same node for better buffer 
> cache, etc utilization.
> This will likely end up being another level of requests higher up than 
> NODE_LOCAL for the scheduler.





[jira] [Updated] (TEZ-1730) Add Auto-reducer-parallelism information to the DAG .dot file

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1730:

Target Version/s: 0.6.0  (was: 0.5.2)

> Add Auto-reducer-parallelism information to the DAG .dot file
> -
>
> Key: TEZ-1730
> URL: https://issues.apache.org/jira/browse/TEZ-1730
> Project: Apache Tez
>  Issue Type: Task
>Affects Versions: 0.5.2
>Reporter: Gopal V
>Assignee: Gopal V
>Priority: Trivial
>
> Write out the auto-reducer configuration information (if present) into the 
> Graphviz data being printed out as debug artifact.





[jira] [Updated] (TEZ-925) Tez job does not blacklist node when container failed to launch

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-925:
---
Target Version/s: 0.6.0  (was: 0.5.2)

> Tez job does not blacklist node when container failed to launch
> ---
>
> Key: TEZ-925
> URL: https://issues.apache.org/jira/browse/TEZ-925
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Yesha Vora
>
> A Tez job tries to create a container even if the disk is full. Thus, the job fails 
> with the error below:
>  "Can't create directory application_1393835169637_0138 in 
> /tmp/yarn/local/usercache/user/appcache/application_1393835169637_0138 - No 
> space left on device"
> However, Tez should be able to detect the full disk and blacklist the node. 





[jira] [Updated] (TEZ-1522) Scheduling can result in out of order execution and slowdown of upstream work

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1522:

Target Version/s: 0.6.0  (was: 0.5.2)

> Scheduling can result in out of order execution and slowdown of upstream work
> -
>
> Key: TEZ-1522
> URL: https://issues.apache.org/jira/browse/TEZ-1522
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Critical
>  Labels: performance
> Attachments: TEZ-1522.1.wip.txt, TEZ-1522.2.wip.txt, 
> TEZ-1522.am.log.gz, task_runtime.svg
>
>
> M2 M7
> \  /
> (sg) \/
>R3/ (b)
> \   /
>  (b) \ /
>   \   /
> M5
> |
> R6 
> Please refer to the attachment (task runtime SVG). In this case, M5 got 
> scheduled much earlier than R3 (green color in the diagram) and retained lots 
> of containers.
> R3 got fewer containers to work with. 
> Attaching the output from the status monitor when the job ran; Map_5 has 
> taken up almost all of the cluster resources, whereas Reducer_3 got a fraction of 
> the capacity.
> Map_2: 1/1  Map_5: 0(+373)/1000  Map_7: 1/1  Reducer_3: 0/8000  Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 0/8000  Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 0(+1)/8000  Reducer_6: 0/1
> 
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 14(+7)/8000  Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 63(+14)/8000  Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 159(+22)/8000  Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 308(+29)/8000  Reducer_6: 0/1
> ...
> Creating this JIRA as a placeholder for a scheduler enhancement. One 
> possibility could be to
> schedule fewer tasks in downstream vertices, based on the 
> information available for the upstream vertex.





[jira] [Commented] (TEZ-1742) Improve response time of internal preemption

2014-11-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203573#comment-14203573
 ] 

Bikas Saha commented on TEZ-1742:
-

commit c6c08c18b77870e71298adb2dc51908473434768
Author: Bikas Saha 
Date:   Sat Nov 8 11:55:05 2014 -0800

TEZ-1742 addendum patch for follow up comments


> Improve response time of internal preemption
> 
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, 
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch, 
> TEZ-1742.addendum.3.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a 
> higher priority task is waiting and there are no available resources. When a 
> large number of higher priority tasks are pending then it can take a long 
> time to preempt the required number of lower priority tasks.





[jira] [Resolved] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha resolved TEZ-1758.
-
   Resolution: Fixed
Fix Version/s: 0.5.3
 Hadoop Flags: Reviewed

commit a2d5768bc274684719dff64b46967264f3d18c0e
Author: Bikas Saha 
Date:   Sat Nov 8 11:59:14 2014 -0800

TEZ-1758. TezClient should provide YARN diagnostics when the AM crashes 
(bikas)


> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1758.1.patch, TEZ-1758.2.patch
>
>






[jira] [Updated] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1758:

Attachment: TEZ-1758.2.patch

Fixed review comments. Added test case. Checked for spaces. Uploading commit 
patch.

> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch, TEZ-1758.2.patch
>
>






[jira] [Commented] (TEZ-1742) Improve response time of internal preemption

2014-11-08 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203567#comment-14203567
 ] 

Rajesh Balamohan commented on TEZ-1742:
---

lgtm. +1

> Improve response time of internal preemption
> 
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, 
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch, 
> TEZ-1742.addendum.3.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a 
> higher priority task is waiting and there are no available resources. When a 
> large number of higher priority tasks are pending then it can take a long 
> time to preempt the required number of lower priority tasks.





[jira] [Commented] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203566#comment-14203566
 ] 

Hitesh Shah commented on TEZ-1758:
--

Whitespace example: 

{code}
+DAG dag = DAG.create("DAG").addVertex(vertex);
+
+try {
{code}

The line between DAG and try is not an empty line. 
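
A check of this kind can be automated. The Java sketch below is a hypothetical helper, not part of any Tez or Hadoop tooling: it scans the lines of a unified diff and reports added lines whose content is non-empty but consists only of whitespace, which is the situation described above.

```java
import java.util.ArrayList;
import java.util.List;

class PatchWhitespaceCheck {
  // Returns the 1-based positions of added lines ("+...") that contain only whitespace
  // after the leading "+". A bare "+" (a truly empty added line) is not reported.
  static List<Integer> whitespaceOnlyAddedLines(List<String> patchLines) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < patchLines.size(); i++) {
      String line = patchLines.get(i);
      // Skip context lines, removals and "+++ b/file" headers.
      // Simplification: an added line whose content itself starts with "++" is skipped too.
      if (!line.startsWith("+") || line.startsWith("+++")) {
        continue;
      }
      String content = line.substring(1);
      if (!content.isEmpty() && content.trim().isEmpty()) {
        result.add(i + 1);
      }
    }
    return result;
  }
}
```

Feeding it the three lines quoted in the comment, with spaces after the middle "+", would flag the second line.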

> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch
>
>






[jira] [Comment Edited] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203564#comment-14203564
 ] 

Bikas Saha edited comment on TEZ-1758 at 11/8/14 7:43 PM:
--

Diagnostics are expected to be user-visible and meaningful. Showing the YARN app 
state does not meet that criterion since it leaks internal platform state. 
It's good for debugging and for being logged.

Will make both places consistent.

This is because we simply don't show diagnostics on the client and we have to 
look at the RM UI/logs or NM logs to find out why the AM did not start. I don't know 
if YARN is propagating the errors from the NM to the RM properly, but this will 
show if it's not doing so.

The tabs look fine to me in the diff:
{code}
+  
+  @Test(timeout = 5000)
+  public void testWaitTillReadyAppFailed() throws Exception {
+    final TezClientForTest client = configure();
+    client.start();
+    String msg = "Application Test Failed";
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState())
+        .thenReturn(YarnApplicationState.NEW).thenReturn(YarnApplicationState.FAILED);
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn(
+        msg);
+    try {
+      client.waitTillReady();
+      Assert.fail();
+    } catch (SessionNotRunning e) {
+      Assert.assertTrue(e.getMessage().contains(msg));
+    }
+    client.stop();
+  }
+  
+  @Test(timeout = 5000)
+  public void testSubmitDAGAppFailed() throws Exception {
+    final TezClientForTest client = configure();
+    client.start();
+
+    client.callRealGetSessionAMProxy = true;
+    String msg = "Application Test Failed";
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState())
+        .thenReturn(YarnApplicationState.KILLED);
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn(
+        msg);
+
+    Vertex vertex = Vertex.create("Vertex", ProcessorDescriptor.create("P"), 1,
+        Resource.newInstance(1, 1));
+    DAG dag = DAG.create("DAG").addVertex(vertex);
+
+    try {
+      client.submitDAG(dag);
+      Assert.fail();
+    } catch (SessionNotRunning e) {
+      Assert.assertTrue(e.getMessage().contains(msg));
+    }
+    client.stop();
+  }
 {code}

Thanks for the review!


was (Author: bikassaha):
Diagnostics are expected to be user-visible and meaningful. Showing the YARN app 
state does not meet that criterion.

Will make both places consistent.

This is because we simply don't show diagnostics on the client and we have to 
look at the RM UI/logs or NM logs to find out why the AM did not start. I don't know 
if YARN is propagating the errors from the NM to the RM properly, but this will 
show if it's not doing so.

The tabs look fine to me in the diff:
{code}
+  
+  @Test(timeout = 5000)
+  public void testWaitTillReadyAppFailed() throws Exception {
+    final TezClientForTest client = configure();
+    client.start();
+    String msg = "Application Test Failed";
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState())
+        .thenReturn(YarnApplicationState.NEW).thenReturn(YarnApplicationState.FAILED);
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn(
+        msg);
+    try {
+      client.waitTillReady();
+      Assert.fail();
+    } catch (SessionNotRunning e) {
+      Assert.assertTrue(e.getMessage().contains(msg));
+    }
+    client.stop();
+  }
+  
+  @Test(timeout = 5000)
+  public void testSubmitDAGAppFailed() throws Exception {
+    final TezClientForTest client = configure();
+    client.start();
+
+    client.callRealGetSessionAMProxy = true;
+    String msg = "Application Test Failed";
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState())
+        .thenReturn(YarnApplicationState.KILLED);
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn(
+        msg);
+
+    Vertex vertex = Vertex.create("Vertex", ProcessorDescriptor.create("P"), 1,
+        Resource.newInstance(1, 1));
+    DAG dag = DAG.create("DAG").addVertex(vertex);
+
+    try {
+      client.submitDAG(dag);
+      Assert.fail();
+    } catch (SessionNotRunning e) {
+      Assert.assertTrue(e.getMessage().contains(msg));
+    }
+    client.stop();
+  }
 {code}

Thanks for the review!

> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch

[jira] [Commented] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203564#comment-14203564
 ] 

Bikas Saha commented on TEZ-1758:
-

Diagnostics are expected to be user-visible and meaningful. Showing the YARN app 
state does not meet that criterion.

Will make both places consistent.

This is because we simply don't show diagnostics on the client and we have to 
look at the RM UI/logs or NM logs to find out why the AM did not start. I don't know 
if YARN is propagating the errors from the NM to the RM properly, but this will 
show if it's not doing so.

The tabs look fine to me in the diff:
{code}
+  
+  @Test(timeout = 5000)
+  public void testWaitTillReadyAppFailed() throws Exception {
+    final TezClientForTest client = configure();
+    client.start();
+    String msg = "Application Test Failed";
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState())
+        .thenReturn(YarnApplicationState.NEW).thenReturn(YarnApplicationState.FAILED);
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn(
+        msg);
+    try {
+      client.waitTillReady();
+      Assert.fail();
+    } catch (SessionNotRunning e) {
+      Assert.assertTrue(e.getMessage().contains(msg));
+    }
+    client.stop();
+  }
+  
+  @Test(timeout = 5000)
+  public void testSubmitDAGAppFailed() throws Exception {
+    final TezClientForTest client = configure();
+    client.start();
+
+    client.callRealGetSessionAMProxy = true;
+    String msg = "Application Test Failed";
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState())
+        .thenReturn(YarnApplicationState.KILLED);
+    when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn(
+        msg);
+
+    Vertex vertex = Vertex.create("Vertex", ProcessorDescriptor.create("P"), 1,
+        Resource.newInstance(1, 1));
+    DAG dag = DAG.create("DAG").addVertex(vertex);
+
+    try {
+      client.submitDAG(dag);
+      Assert.fail();
+    } catch (SessionNotRunning e) {
+      Assert.assertTrue(e.getMessage().contains(msg));
+    }
+    client.stop();
+  }
 {code}

Thanks for the review!

> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch
>
>






[jira] [Updated] (TEZ-1749) Increase test timeout for TestLocalMode.testMultipleClientsWithSession

2014-11-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1749:
--
Fix Version/s: (was: 0.6.0)
   0.5.3

> Increase test timeout for TestLocalMode.testMultipleClientsWithSession
> --
>
> Key: TEZ-1749
> URL: https://issues.apache.org/jira/browse/TEZ-1749
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 0.5.3
>
> Attachments: TEZ-1749.1.patch
>
>
> Times out on other platforms (Windows). Verified that it sometimes succeeds and 
> sometimes times out due to slower hardware.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1749) Increase test timeout for TestLocalMode.testMultipleClientsWithSession

2014-11-08 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203561#comment-14203561
 ] 

Rajesh Balamohan commented on TEZ-1749:
---

committed to branch-0.5

> Increase test timeout for TestLocalMode.testMultipleClientsWithSession
> --
>
> Key: TEZ-1749
> URL: https://issues.apache.org/jira/browse/TEZ-1749
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 0.5.3
>
> Attachments: TEZ-1749.1.patch
>
>
> Times out on other platforms (Windows). Verified that it sometimes succeeds and 
> sometimes times out due to slower hardware.





[jira] [Updated] (TEZ-1742) Improve response time of internal preemption

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1742:

Attachment: TEZ-1742.addendum.3.patch

OK. I will try to stop being smart about not casting doubles to int and go read 
my basic math book for integer division. 
Attached patch with fix and more tests for the corner cases.

> Improve response time of internal preemption
> 
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, 
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch, 
> TEZ-1742.addendum.3.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a 
> higher priority task is waiting and there are no available resources. When a 
> large number of higher priority tasks are pending then it can take a long 
> time to preempt the required number of lower priority tasks.





[jira] [Commented] (TEZ-1742) Improve response time of internal preemption

2014-11-08 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203550#comment-14203550
 ] 

Rajesh Balamohan commented on TEZ-1742:
---

Changes related to task priority instead of container priority are fine.

There is a subtle issue in "scaleDownByPreemptionPercentage".  E.g., as per the 
current logic, if original=200 and percent=70, it would try to clear up 200 
containers instead of 140. 
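
To make the expected arithmetic concrete, here is a minimal sketch of a percentage scale-down done purely in integer math. The class and method names are hypothetical and this is not the code from the patch; rounding up is an assumption made here so that a non-zero count never scales down to zero.

```java
/** Hypothetical helper, not the Tez implementation. */
final class PreemptionMath {
  private PreemptionMath() {}

  /**
   * Scales a count down to the given percentage using integer math only.
   * Assumption: round up, so a non-zero count with a non-zero percentage
   * never becomes zero.
   */
  static int scaleDownByPercentage(int original, int percent) {
    if (original < 0 || percent < 0 || percent > 100) {
      throw new IllegalArgumentException(
          "original must be >= 0 and percent must be in [0, 100]");
    }
    // Multiply before dividing (in long, to avoid int overflow), then add
    // 99 so that the integer division by 100 rounds up instead of down.
    return (int) (((long) original * percent + 99) / 100);
  }
}
```

With these assumptions, `PreemptionMath.scaleDownByPercentage(200, 70)` returns 140, the value expected in the comment above.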


> Improve response time of internal preemption
> 
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, 
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a 
> higher priority task is waiting and there are no available resources. When a 
> large number of higher priority tasks are pending then it can take a long 
> time to preempt the required number of lower priority tasks.





[jira] [Commented] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203547#comment-14203547
 ] 

Hitesh Shah commented on TEZ-1758:
--

Comments:

Mostly looks good.

Minor point related to:

{code}
-throw new SessionNotRunning("TezSession has already shutdown");
+throw new SessionNotRunning("TezSession has already shutdown. "
++ ((diagnostics != null) ? diagnostics : ""));
{code}

Would it make sense to have the message contain the yarn app state as well as 
diagnostics? 

Also, there is a minor inconsistency:
   - In the case of the above code, it uses "" when diagnostics is null
   - In TezClientUtils, it says "Cluster diagnostics not found" 

Is this because diagnostics is not set when state is FINISHED?

Also, the newly added unit test code has a lot of spurious whitespace that 
needs cleaning up. 
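
As an illustration of the two points above (including the app state, and using one fallback text when diagnostics are missing), a minimal sketch follows. The `SessionMessages` class, the `AppState` enum, and the message layout are hypothetical stand-ins, not the TezClient code or the YARN types; the fallback string is the one quoted above from TezClientUtils.

```java
/** Hypothetical sketch, not the TezClient implementation. */
final class SessionMessages {
  /** Stand-in for the YARN application state; an assumption for this sketch. */
  enum AppState { FINISHED, FAILED, KILLED }

  /** Same fallback text in every place, to avoid the "" inconsistency. */
  static final String NO_DIAGNOSTICS = "Cluster diagnostics not found";

  private SessionMessages() {}

  /** Builds one message carrying both the app state and the diagnostics. */
  static String shutdownMessage(AppState state, String diagnostics) {
    String detail = (diagnostics == null || diagnostics.trim().isEmpty())
        ? NO_DIAGNOSTICS
        : diagnostics;
    return "TezSession has already shutdown. Application state: " + state
        + ". Diagnostics: " + detail;
  }
}
```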






> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch
>
>






[jira] [Comment Edited] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203547#comment-14203547
 ] 

Hitesh Shah edited comment on TEZ-1758 at 11/8/14 7:17 PM:
---

Comments:

Mostly looks good.

Minor point related to:

{code}
-throw new SessionNotRunning("TezSession has already shutdown");
+throw new SessionNotRunning("TezSession has already shutdown. "
++ ((diagnostics != null) ? diagnostics : ""));
{code}

Would it make sense to have the message contain the final yarn app state as 
well as diagnostics? 

Also, there is a minor inconsistency:
   - In the case of the above code, it uses "" when diagnostics is null
   - In TezClientUtils, it says "Cluster diagnostics not found" 

Is this because diagnostics is not set when state is FINISHED?

Also, the newly added unit test code has a lot of spurious whitespace that 
needs cleaning up. 







was (Author: hitesh):
Comments:

Mostly looks good.

Minor point related to:

{code}
-throw new SessionNotRunning("TezSession has already shutdown");
+throw new SessionNotRunning("TezSession has already shutdown. "
++ ((diagnostics != null) ? diagnostics : ""));
{code}

Would it make sense to have the message contain the yarn app state as well as 
diagnostics? 

Also, there is a minor inconsistency:
   - In the case of the above code, it uses "" when diagnostics is null
   - In TezClientUtils, it says "Cluster diagnostics not found" 

Is this because diagnostics is not set when state is FINISHED?

Also, the newly added unit test code has a lot of spurious whitespace that 
needs cleaning up. 






> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch
>
>






[jira] [Updated] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1758:

Affects Version/s: 0.5.2

> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch
>
>






[jira] [Updated] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1758:

Attachment: TEZ-1758.1.patch

Patch that makes sure appReport diagnostics are reported up to the user so that 
YARN diagnostics (if present) can be shown when the AM does not start or dies. 
[~hitesh] [~pramachandran] Please review.

> TezClient should provide YARN diagnostics when the AM crashes
> -
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch
>
>






[jira] [Created] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes

2014-11-08 Thread Bikas Saha (JIRA)
Bikas Saha created TEZ-1758:
---

 Summary: TezClient should provide YARN diagnostics when the AM 
crashes
 Key: TEZ-1758
 URL: https://issues.apache.org/jira/browse/TEZ-1758
 Project: Apache Tez
  Issue Type: Bug
Reporter: Bikas Saha
Assignee: Bikas Saha








[jira] [Commented] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled

2014-11-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203344#comment-14203344
 ] 

Bikas Saha commented on TEZ-1750:
-

The new configuration should be an expert-level setting and probably marked 
unstable.

Yes. From the naming of the method, it seems that it should already have 
scheduled if the vertex scheduling trigger has been reached. It wasn't clear 
why it was just returning a boolean and letting the caller do the scheduling.

Any ideas about the 1-1 edge handling or custom edge handling? Or are they 
follow-up items to be done after trying this out in experiments?

> Add a DAGScheduler which schedules tasks only when sources have been scheduled
> --
>
> Key: TEZ-1750
> URL: https://issues.apache.org/jira/browse/TEZ-1750
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>Priority: Critical
> Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt
>
>
> Splitting out the patch on TEZ-1522 into a separate jira.
> There are several scenarios in which we end up scheduling downstream tasks 
> before their sources have been scheduled - and then get into a situation 
> where the sources are starved. Currently, anywhere a ShuffleVertexManager is 
> used can cause such behaviour - since it starts scheduling its tasks after a 
> certain number of sources are complete, but subsequent non-shuffle 
> VertexManagers will schedule immediately.
> Disabling slow-start is one option to achieve this (or setting slow start on 
> all vertices), but it doesn't work for the situation where dynamic reducer 
> parallelism kicks in - since it has to wait for source tasks to complete.
> The intent here is to add a DAGScheduler, which effectively negates the slow 
> start, and in case of dynamic parallelism determination, waits for upstream 
> tasks to be scheduled before scheduling downstream tasks.
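
The ordering constraint described above can be sketched at vertex granularity as follows. This is a simplified illustration, not the DAGScheduler from the attached patches: the class name, the string vertex names, and the `request` method are all hypothetical, and real scheduling operates on tasks rather than whole vertices.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: release a vertex only once all its sources are released. */
final class SourceOrderedScheduler {
  private final Map<String, List<String>> sourcesOf;
  private final Set<String> scheduled = new LinkedHashSet<>();
  private final Set<String> pending = new LinkedHashSet<>();

  SourceOrderedScheduler(Map<String, List<String>> sourcesOf) {
    this.sourcesOf = sourcesOf;
  }

  /**
   * A vertex asks to be scheduled. Returns the vertices actually released
   * by this request, in release order; the list is empty while the vertex
   * is still held back waiting for its sources.
   */
  List<String> request(String vertex) {
    pending.add(vertex);
    List<String> released = new ArrayList<>();
    boolean progress = true;
    // Keep sweeping: releasing one vertex may unblock others that were
    // requested earlier and are still waiting.
    while (progress) {
      progress = false;
      for (Iterator<String> it = pending.iterator(); it.hasNext();) {
        String v = it.next();
        List<String> sources =
            sourcesOf.getOrDefault(v, Collections.<String>emptyList());
        if (scheduled.containsAll(sources)) {
          it.remove();
          scheduled.add(v);
          released.add(v);
          progress = true;
        }
      }
    }
    return released;
  }
}
```

In this sketch, a downstream vertex that asks first is simply held back, and it is released in the same call that releases its last outstanding source.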





[jira] [Updated] (TEZ-1742) Improve response time of internal preemption

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1742:

Attachment: (was: TEZ-1742.addendum.2.patch)

> Improve response time of internal preemption
> 
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, 
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a 
> higher priority task is waiting and there are no available resources. When a 
> large number of higher priority tasks are pending then it can take a long 
> time to preempt the required number of lower priority tasks.





[jira] [Updated] (TEZ-1742) Improve response time of internal preemption

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1742:

Attachment: TEZ-1742.addendum.2.patch

> Improve response time of internal preemption
> 
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, 
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a 
> higher priority task is waiting and there are no available resources. When a 
> large number of higher priority tasks are pending then it can take a long 
> time to preempt the required number of lower priority tasks.





[jira] [Commented] (TEZ-1742) Improve response time of internal preemption

2014-11-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203328#comment-14203328
 ] 

Bikas Saha commented on TEZ-1742:
-

Good catch. The latest addendum patch fixes the bug and adds a unit test for 
that method. Please review.

This jira's original patch contains not just the percentage improvement but 
also bug fixes where incorrect priorities were being compared and we could end 
up preempting the wrong tasks. The addendum patch has a test enhancement for 
that bug fix. It also contains a fix to not trigger task preemption in 
back-to-back cycles so that we don't preempt before the RM can react to the 
preemption. The existing behavior of fewer preemptions can be enabled by 
setting a sufficiently low value of the preemptionPercentage, if needed. Hence, 
I feel that this patch is relevant for branch-0.5 and should make its way 
there. Proportional preemption is a general improvement that helps out-of-order 
scheduling but is unrelated to it, and can help in any other case where 
scheduling constraints need to be enforced quickly, e.g. losing many tasks on a 
rack or a re-execution of all reducers because of a correlated outage.
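
The idea of not preempting in back-to-back cycles can be sketched as a small guard around the preemption decision. This is an illustration only, not the logic in the addendum patch; the class name, the `cyclesToSkip` parameter, and the notion of one call per scheduling cycle are assumptions.

```java
/** Hypothetical sketch: leave a gap of scheduling cycles between preemptions. */
final class PreemptionThrottle {
  private final int cyclesToSkip;
  private int cyclesSinceLastPreemption;

  /** @param cyclesToSkip cycles to wait after a preemption; must be >= 1 */
  PreemptionThrottle(int cyclesToSkip) {
    if (cyclesToSkip < 1) {
      throw new IllegalArgumentException("cyclesToSkip must be >= 1");
    }
    this.cyclesToSkip = cyclesToSkip;
    // Start "ready" so the very first preemption request is allowed.
    this.cyclesSinceLastPreemption = cyclesToSkip;
  }

  /**
   * Call once per scheduling cycle. Returns true only when preemption is
   * needed and enough cycles have passed since the previous preemption,
   * giving the resource manager time to react to the earlier one.
   */
  boolean tryPreempt(boolean preemptionNeeded) {
    if (preemptionNeeded && cyclesSinceLastPreemption >= cyclesToSkip) {
      cyclesSinceLastPreemption = 0;
      return true;
    }
    // Saturate at cyclesToSkip so the counter cannot overflow.
    if (cyclesSinceLastPreemption < cyclesToSkip) {
      cyclesSinceLastPreemption++;
    }
    return false;
  }
}
```

With `cyclesToSkip` set to 1, a scheduler that needs preemption on every cycle would preempt on alternate cycles rather than on every one.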

> Improve response time of internal preemption
> 
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, 
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a 
> higher priority task is waiting and there are no available resources. When a 
> large number of higher priority tasks are pending then it can take a long 
> time to preempt the required number of lower priority tasks.





[jira] [Updated] (TEZ-1742) Improve response time of internal preemption

2014-11-08 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1742:

Attachment: TEZ-1742.addendum.2.patch

> Improve response time of internal preemption
> 
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, 
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a 
> higher priority task is waiting and there are no available resources. When a 
> large number of higher priority tasks are pending then it can take a long 
> time to preempt the required number of lower priority tasks.


