[jira] [Updated] (TEZ-1757) Column selector for tables.
[ https://issues.apache.org/jira/browse/TEZ-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreenath Somarajapuram updated TEZ-1757: Attachment: TEZ-1757.1.patch Generic column selector for all tables. > Column selector for tables. > --- > > Key: TEZ-1757 > URL: https://issues.apache.org/jira/browse/TEZ-1757 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Sreenath Somarajapuram >Assignee: Sreenath Somarajapuram > Attachments: TEZ-1757.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686
[ https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203768#comment-14203768 ] Jeff Zhang commented on TEZ-1761: - Yes, I have committed it to branch-0.5. Change the target version to 0.5.3 > TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to > TEZ-1686 > > > Key: TEZ-1761 > URL: https://issues.apache.org/jira/browse/TEZ-1761 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1761.patch > > > Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< > FAILURE! > testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser) Time > elapsed: 0.053 sec <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<100> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686
[ https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203766#comment-14203766 ] Hitesh Shah commented on TEZ-1761: -- That should be fine for 0.5.3. Has this been committed? The jira is still unresolved. > TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to > TEZ-1686 > > > Key: TEZ-1761 > URL: https://issues.apache.org/jira/browse/TEZ-1761 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1761.patch > > > Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< > FAILURE! > testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser) Time > elapsed: 0.053 sec <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<100> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686
[ https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203766#comment-14203766 ] Hitesh Shah edited comment on TEZ-1761 at 11/9/14 4:41 AM: --- This should be fine for 0.5.3 as it is a trivial bug fix. Has this been committed? The jira is still unresolved. was (Author: hitesh): That should be fine for 0.5.3. Has this been committed? The jira is still unresolved. > TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to > TEZ-1686 > > > Key: TEZ-1761 > URL: https://issues.apache.org/jira/browse/TEZ-1761 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1761.patch > > > Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< > FAILURE! > testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser) Time > elapsed: 0.053 sec <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<100> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686
[ https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203756#comment-14203756 ] Jeff Zhang commented on TEZ-1761: - Sorry, [~hitesh], should this not have been committed to branch-0.5? I see the target version is 0.6. > TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to > TEZ-1686 > > > Key: TEZ-1761 > URL: https://issues.apache.org/jira/browse/TEZ-1761 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1761.patch > > > Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< > FAILURE! > testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser) Time > elapsed: 0.053 sec <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<100> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686
[ https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203755#comment-14203755 ] Jeff Zhang commented on TEZ-1761: - Committed to master & branch-0.5 > TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to > TEZ-1686 > > > Key: TEZ-1761 > URL: https://issues.apache.org/jira/browse/TEZ-1761 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1761.patch > > > Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< > FAILURE! > testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser) Time > elapsed: 0.053 sec <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<100> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686
[ https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203748#comment-14203748 ] Hitesh Shah commented on TEZ-1761: -- +1 > TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to > TEZ-1686 > > > Key: TEZ-1761 > URL: https://issues.apache.org/jira/browse/TEZ-1761 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1761.patch > > > Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< > FAILURE! > testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser) Time > elapsed: 0.053 sec <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<100> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686
[ https://issues.apache.org/jira/browse/TEZ-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1761: Attachment: TEZ-1761.patch [~hitesh] I have attached the patch; please help review. > TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to > TEZ-1686 > > > Key: TEZ-1761 > URL: https://issues.apache.org/jira/browse/TEZ-1761 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1761.patch > > > Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< > FAILURE! > testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser) Time > elapsed: 0.053 sec <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<100> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1736) Add support for Inputs/Outputs in runtime-library to generate history text data
[ https://issues.apache.org/jira/browse/TEZ-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203730#comment-14203730 ] Hitesh Shah commented on TEZ-1736: -- [~sseth] Mind doing a review of the latest patch to see if the property handling is being done correctly? > Add support for Inputs/Outputs in runtime-library to generate history text > data > --- > > Key: TEZ-1736 > URL: https://issues.apache.org/jira/browse/TEZ-1736 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-1736.1.patch, TEZ-1736.2.patch, TEZ-1736.3.patch, > TEZ-1736.4.patch, TEZ-1736.5.patch, TEZ-1736.6.patch, TEZ-1736.7.patch > > > The userpayload related setHistoryText has been available for some time but > is not used by the Inputs/Outputs in the run-time library. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1736) Add support for Inputs/Outputs in runtime-library to generate history text data
[ https://issues.apache.org/jira/browse/TEZ-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1736: - Attachment: TEZ-1736.7.patch > Add support for Inputs/Outputs in runtime-library to generate history text > data > --- > > Key: TEZ-1736 > URL: https://issues.apache.org/jira/browse/TEZ-1736 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-1736.1.patch, TEZ-1736.2.patch, TEZ-1736.3.patch, > TEZ-1736.4.patch, TEZ-1736.5.patch, TEZ-1736.6.patch, TEZ-1736.7.patch > > > The userpayload related setHistoryText has been available for some time but > is not used by the Inputs/Outputs in the run-time library. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1764) Data generated in SimpleHistoryLogger could be less verbose
Hitesh Shah created TEZ-1764: Summary: Data generated in SimpleHistoryLogger could be less verbose Key: TEZ-1764 URL: https://issues.apache.org/jira/browse/TEZ-1764 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah The structure was initially kept quite similar to Timeline entities. However, sections such as related entities are in most cases not needed, as the data is replicated in primary filters and/or other info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1763) Support tee of history events to both a file based logger as well as Timeline
Hitesh Shah created TEZ-1763: Summary: Support tee of history events to both a file based logger as well as Timeline Key: TEZ-1763 URL: https://issues.apache.org/jira/browse/TEZ-1763 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1762) Lots of unit tests do not have timeout parameter set
[ https://issues.apache.org/jira/browse/TEZ-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203660#comment-14203660 ] Hitesh Shah commented on TEZ-1762: -- A generic 5 second timeout could be added to all of them with modifications if needed based on actual run times. > Lots of unit tests do not have timeout parameter set > - > > Key: TEZ-1762 > URL: https://issues.apache.org/jira/browse/TEZ-1762 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
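For illustration only: the stack traces elsewhere in this digest show the tests running on JUnit 4, where a per-test limit is normally written as `@Test(timeout = 5000)` (milliseconds). The sketch below shows in plain Java, without the JUnit dependency, roughly what such a limit amounts to: run the body on another thread and give up after a deadline. `TimeoutSketch` and `finishesWithin` are hypothetical names, not Tez or JUnit code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tez or JUnit.
final class TimeoutSketch {

  /** Runs body on a separate thread; returns true if it finished within timeoutMillis. */
  static boolean finishesWithin(Runnable body, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<?> result = executor.submit(body);
      result.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false; // the body overran its budget
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("body failed", e.getCause());
    } finally {
      executor.shutdownNow(); // interrupts a body that is still running
    }
  }
}
```

A 5000 ms budget here corresponds to the generic 5 second default suggested in the comment; individual tests would raise it based on observed run times.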
[jira] [Created] (TEZ-1762) Lots of unit tests do not have timeout parameter set
Hitesh Shah created TEZ-1762: Summary: Lots of unit tests do not have timeout parameter set Key: TEZ-1762 URL: https://issues.apache.org/jira/browse/TEZ-1762 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1761) TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686
Hitesh Shah created TEZ-1761: Summary: TestRecoveryParser::testGetLastInProgressDAG fails in similar manner to TEZ-1686 Key: TEZ-1761 URL: https://issues.apache.org/jira/browse/TEZ-1761 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Jeff Zhang Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.713 sec <<< FAILURE! testGetLastInProgressDAG(org.apache.tez.dag.app.TestRecoveryParser) Time elapsed: 0.053 sec <<< FAILURE! java.lang.AssertionError: expected:<0> but was:<100> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.tez.dag.app.TestRecoveryParser.testGetLastInProgressDAG(TestRecoveryParser.java:93) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1760) DAGSchedulerNaturalOrderControlled should consider cluster capacity when waiting for source vertices
Siddharth Seth created TEZ-1760: --- Summary: DAGSchedulerNaturalOrderControlled should consider cluster capacity when waiting for source vertices Key: TEZ-1760 URL: https://issues.apache.org/jira/browse/TEZ-1760 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1759) Add slow start capabilities to DAGSchedulerNaturalOrderControlled
Siddharth Seth created TEZ-1759: --- Summary: Add slow start capabilities to DAGSchedulerNaturalOrderControlled Key: TEZ-1759 URL: https://issues.apache.org/jira/browse/TEZ-1759 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled
[ https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203629#comment-14203629 ] Siddharth Seth commented on TEZ-1750: - Thanks for the reviews. Committing. bq. This will likely need a bunch of experimentation to see benefits/issues. That will give more data on this approach. Until important points like different edge types etc. are addressed this probably should not be made default. We'll get to this in 0.6. The issue where we end up doing out of order scheduling is a bigger one in terms of performance. Pre-emption helps but performance would still suffer when multiple DAGs per session come into play. > Add a DAGScheduler which schedules tasks only when sources have been scheduled > -- > > Key: TEZ-1750 > URL: https://issues.apache.org/jira/browse/TEZ-1750 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Critical > Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt, TEZ-1750.3.txt > > > Splitting out the patch on TEZ-1522 into a separate jira. > There are several scenarios in which we end up scheduling downstream tasks > before their sources have been scheduled - and then get into a situation > where the sources are starved. Currently, anywhere a ShuffleVertexManager is > used can cause such behaviour - since it starts scheduling its tasks after a > certain number of sources are complete, but subsequent non-shuffle > VertexManagers will schedule immediately. > Disabling slow-start is one option to achieve this (or setting slow start on > all vertices), but it doesn't work for the situation where dynamic reducer > parallelism kicks in - since it has to wait for source tasks to complete. > The intent here is to add a DAGScheduler, which effectively negates the slow > start, and in case of dynamic parallelism determination, waits for upstream > tasks to be scheduled before scheduling downstream tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled
[ https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203598#comment-14203598 ] Bikas Saha commented on TEZ-1750: - This will likely need a bunch of experimentation to see benefits/issues. That will give more data on this approach. Until important points like different edge types etc. are addressed this probably should not be made default. Hopefully follow up jiras will address these and/or other issues found with more experiments and increase confidence. Making this pluggable via config would be good to have so that folks can experiment with this approach or other kinds of DAG scheduling, without having to change the main code. So let's get this into master and also into 0.5.3. > Add a DAGScheduler which schedules tasks only when sources have been scheduled > -- > > Key: TEZ-1750 > URL: https://issues.apache.org/jira/browse/TEZ-1750 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Critical > Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt, TEZ-1750.3.txt > > > Splitting out the patch on TEZ-1522 into a separate jira. > There are several scenarios in which we end up scheduling downstream tasks > before their sources have been scheduled - and then get into a situation > where the sources are starved. Currently, anywhere a ShuffleVertexManager is > used can cause such behaviour - since it starts scheduling its tasks after a > certain number of sources are complete, but subsequent non-shuffle > VertexManagers will schedule immediately. > Disabling slow-start is one option to achieve this (or setting slow start on > all vertices), but it doesn't work for the situation where dynamic reducer > parallelism kicks in - since it has to wait for source tasks to complete. > The intent here is to add a DAGScheduler, which effectively negates the slow > start, and in case of dynamic parallelism determination, waits for upstream > tasks to be scheduled before scheduling downstream tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled
[ https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203590#comment-14203590 ] Siddharth Seth edited comment on TEZ-1750 at 11/8/14 8:52 PM: -- Updated with expert level setting. There are comments in there on some future enhancements - considering cluster capacity to allow scheduling downstream even if upstream is not scheduled, generic slow start. Not enabling this by default in 0.5.3, because I think it's not a good change to have in place on a minor version. On 0.6.0, based on how this performs, it could be enabled by default. was (Author: sseth): Updated with experimental. There are comments in there on some future enhancements - considering cluster capacity to allow scheduling downstream even if upstream is not scheduled, generic slow start. Not enabling this by default in 0.5.3, because I think it's not a good change to have in place on a minor version. On 0.6.0, based on how this performs, it could be enabled by default. > Add a DAGScheduler which schedules tasks only when sources have been scheduled > -- > > Key: TEZ-1750 > URL: https://issues.apache.org/jira/browse/TEZ-1750 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Critical > Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt, TEZ-1750.3.txt > > > Splitting out the patch on TEZ-1522 into a separate jira. > There are several scenarios in which we end up scheduling downstream tasks > before their sources have been scheduled - and then get into a situation > where the sources are starved. Currently, anywhere a ShuffleVertexManager is > used can cause such behaviour - since it starts scheduling its tasks after a > certain number of sources are complete, but subsequent non-shuffle > VertexManagers will schedule immediately. > Disabling slow-start is one option to achieve this (or setting slow start on > all vertices), but it doesn't work for the situation where dynamic reducer > parallelism kicks in - since it has to wait for source tasks to complete. > The intent here is to add a DAGScheduler, which effectively negates the slow > start, and in case of dynamic parallelism determination, waits for upstream > tasks to be scheduled before scheduling downstream tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled
[ https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1750: Attachment: TEZ-1750.3.txt Updated with experimental. There are comments in there on some future enhancements - considering cluster capacity to allow scheduling downstream even if upstream is not scheduled, generic slow start. Not enabling this by default in 0.5.3, because I think it's not a good change to have in place on a minor version. On 0.6.0, based on how this performs, it could be enabled by default. > Add a DAGScheduler which schedules tasks only when sources have been scheduled > -- > > Key: TEZ-1750 > URL: https://issues.apache.org/jira/browse/TEZ-1750 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Critical > Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt, TEZ-1750.3.txt > > > Splitting out the patch on TEZ-1522 into a separate jira. > There are several scenarios in which we end up scheduling downstream tasks > before their sources have been scheduled - and then get into a situation > where the sources are starved. Currently, anywhere a ShuffleVertexManager is > used can cause such behaviour - since it starts scheduling its tasks after a > certain number of sources are complete, but subsequent non-shuffle > VertexManagers will schedule immediately. > Disabling slow-start is one option to achieve this (or setting slow start on > all vertices), but it doesn't work for the situation where dynamic reducer > parallelism kicks in - since it has to wait for source tasks to complete. > The intent here is to add a DAGScheduler, which effectively negates the slow > start, and in case of dynamic parallelism determination, waits for upstream > tasks to be scheduled before scheduling downstream tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled
[ https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203589#comment-14203589 ] Siddharth Seth commented on TEZ-1750: - Marked as Expert level. I don't think unstable is required though... it allows the DAGScheduler to be configured, which is something we do need. 1-1 edges can be problematic in such cases. Realistically, though - if the consumer of a 1-1 edge produces final output, at the moment - that isn't read till the DAG completes. Otherwise, it's likely to be connected to some other source which requires all outputs. This could be addressed, in a future patch, with additional details being sent over in the Task launch request. The bigger problem this addresses is the slow start (either configured or due to Reduce parallelism changes), on a producer, but no slow start on downstream vertices - which can cause some bad scheduling behaviour. One intent of slow start is to prevent unnecessary cluster utilization - however we can end up with situations where we not only end up using the cluster unnecessarily, but in the process also harm the currently executing DAG. > Add a DAGScheduler which schedules tasks only when sources have been scheduled > -- > > Key: TEZ-1750 > URL: https://issues.apache.org/jira/browse/TEZ-1750 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Critical > Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt > > > Splitting out the patch on TEZ-1522 into a separate jira. > There are several scenarios in which we end up scheduling downstream tasks > before their sources have been scheduled - and then get into a situation > where the sources are starved. Currently, anywhere a ShuffleVertexManager is > used can cause such behaviour - since it starts scheduling its tasks after a > certain number of sources are complete, but subsequent non-shuffle > VertexManagers will schedule immediately. > Disabling slow-start is one option to achieve this (or setting slow start on > all vertices), but it doesn't work for the situation where dynamic reducer > parallelism kicks in - since it has to wait for source tasks to complete. > The intent here is to add a DAGScheduler, which effectively negates the slow > start, and in case of dynamic parallelism determination, waits for upstream > tasks to be scheduled before scheduling downstream tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
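As a rough illustration of the ordering rule discussed in the TEZ-1750 messages above (hold a downstream scheduling request until all of its sources have been scheduled), here is a minimal sketch. It is not the code in the attached patches: it works at vertex granularity rather than per task, ignores edge types and cluster capacity, and `SourceFirstSchedulerSketch` and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the DAGScheduler from the TEZ-1750 patches.
final class SourceFirstSchedulerSketch {
  private final Map<String, List<String>> sourcesOf = new LinkedHashMap<>();
  private final Set<String> scheduled = new HashSet<>();
  private final Set<String> pending = new LinkedHashSet<>();
  private final List<String> scheduleOrder = new ArrayList<>();

  /** Registers a vertex together with the names of its source vertices. */
  void addVertex(String vertex, String... sources) {
    sourcesOf.put(vertex, Arrays.asList(sources));
  }

  /** Called when a vertex asks to be scheduled; the request may be held back. */
  void requestSchedule(String vertex) {
    if (!sourcesOf.containsKey(vertex)) {
      throw new IllegalArgumentException("unknown vertex: " + vertex);
    }
    pending.add(vertex);
    releaseReady();
  }

  /** Releases every pending vertex whose sources are all scheduled, repeating until stable. */
  private void releaseReady() {
    boolean progressed = true;
    while (progressed) {
      progressed = false;
      for (String vertex : new ArrayList<>(pending)) {
        if (scheduled.containsAll(sourcesOf.get(vertex))) {
          pending.remove(vertex);
          scheduled.add(vertex);
          scheduleOrder.add(vertex);
          progressed = true;
        }
      }
    }
  }

  /** Order in which vertices were actually released for scheduling. */
  List<String> scheduleOrder() {
    return scheduleOrder;
  }
}
```

With vertices `map`, `reduce` (source `map`) and `sink` (source `reduce`), requests arriving as `sink`, `reduce`, `map` are released as `map`, `reduce`, `sink`, so a downstream vertex cannot take resources before its sources have been scheduled.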
[jira] [Updated] (TEZ-1502) TestDriver cannot be executed via any jar in the tez distribution
[ https://issues.apache.org/jira/browse/TEZ-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1502: Target Version/s: 0.6.0 (was: 0.5.2) > TestDriver cannot be executed via any jar in the tez distribution > - > > Key: TEZ-1502 > URL: https://issues.apache.org/jira/browse/TEZ-1502 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth > > TestDriver contains the FaultTolerance tests. Ideally, TEZ-1501 should belong > here as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1526) LoadingCache for TezTaskID slow for large jobs
[ https://issues.apache.org/jira/browse/TEZ-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1526: Target Version/s: 0.6.0 (was: 0.5.2) > LoadingCache for TezTaskID slow for large jobs > -- > > Key: TEZ-1526 > URL: https://issues.apache.org/jira/browse/TEZ-1526 > Project: Apache Tez > Issue Type: Improvement >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Labels: performance > Attachments: 10-TezTaskIDs.patch, TEZ-1526-v1.patch, > TEZ-1526-v2.patch > > > Using the LoadingCache with default builder settings, 100,000 TezTaskIDs are > created in 10 seconds on my setup. With a LoadingCache initialCapacity of > 10,000 they are created in 300 ms. With no LoadingCache, they are created in > 10 ms. A test case is attached to illustrate the condition I would like to be > sped up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1348) Ensure local directories are used when using local-mode
[ https://issues.apache.org/jira/browse/TEZ-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1348: Target Version/s: 0.6.0 (was: 0.5.2) > Ensure local directories are used when using local-mode > --- > > Key: TEZ-1348 > URL: https://issues.apache.org/jira/browse/TEZ-1348 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Siddharth Seth >Priority: Critical > > In TEZ-717, I incorrectly thought setting fs.defaultFS programmatically in > tez-site would work for local mode. > Currently the requirement is that tez-site.xml must have fs.defaultFS set to > file:///. > While that works, it doesn't allow for seamless execution in either > local-mode or on a cluster. > The main issue here is that when Inputs / Outputs are configured - they use a > version of configuration which reads tez-site, and do not use the > configuration from the client itself (which is correct behaviour). > Not sure what a good way to fix this is > 1) It may be possible to override this value each time an instance of > Configuration/TezConfiguration is created. One possible way would be to > statically add a default resource to Configuration the moment a local client > is created. > 2) Provide information in the contexts on whether this is local or not. This > is fairly ugly, and would get in the way of running mixed mode tasks. > Anyone have other suggestions? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1576) Class level comment in {{MiniTezCluster}} ends abruptly
[ https://issues.apache.org/jira/browse/TEZ-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1576: Target Version/s: 0.6.0 (was: 0.5.2) > Class level comment in {{MiniTezCluster}} ends abruptly > --- > > Key: TEZ-1576 > URL: https://issues.apache.org/jira/browse/TEZ-1576 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ufuk Celebi >Assignee: Saurabh Chhajed >Priority: Trivial > > The class level comment in {{MiniTezCluster}} ends abruptly: > {code} > /** > * Configures and starts the Tez-specific components in the YARN cluster. > * > * When using this mini cluster, the user is expected to > */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1564) State machine error: Invalid event: T_SCHEDULE at SCHEDULED
[ https://issues.apache.org/jira/browse/TEZ-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1564: Target Version/s: 0.6.0 (was: 0.5.2) > State machine error: Invalid event: T_SCHEDULE at SCHEDULED > --- > > Key: TEZ-1564 > URL: https://issues.apache.org/jira/browse/TEZ-1564 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Critical > Attachments: applogs.txt.tar.gz, dag.dot > > > ERROR [AsyncDispatcher event handler] > org.apache.tez.dag.app.dag.impl.TaskImpl: Can't handle this event at current > state for task_1409722953518_0162_1_07_00 > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > T_SCHEDULE at SCHEDULED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:827) > at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:95) > at > org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1604) > at > org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1590) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:724) > I will attach the dag + app logs soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1315) Make use of cache information added to InputSplit in MAPREDUCE-5896
[ https://issues.apache.org/jira/browse/TEZ-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1315: Target Version/s: 0.6.0 (was: 0.5.2) > Make use of cache information added to InputSplit in MAPREDUCE-5896 > --- > > Key: TEZ-1315 > URL: https://issues.apache.org/jira/browse/TEZ-1315 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1541) DAGAppMaster can get stuck on shutdown if the RM is no longer around
[ https://issues.apache.org/jira/browse/TEZ-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1541: Target Version/s: 0.6.0 (was: 0.5.2) > DAGAppMaster can get stuck on shutdown if the RM is no longer around > > > Key: TEZ-1541 > URL: https://issues.apache.org/jira/browse/TEZ-1541 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth > Attachments: dagapp.threads.txt > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1444) Wrap interrupted in a specific Exception
[ https://issues.apache.org/jira/browse/TEZ-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1444: Target Version/s: 0.6.0 (was: 0.5.2) > Wrap interrupted in a specific Exception > > > Key: TEZ-1444 > URL: https://issues.apache.org/jira/browse/TEZ-1444 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Priority: Critical > > TEZ-1331 changed some APIs to throw InterruptedException in an IOException > wrapper. > It's useful to be able to determine whether the exception thrown was > "interrupted" or not, since there's a good chance the interrupt was caused by > user code itself, in which case Interrupts can be caught and ignored by the > user code which likely issued the interrupt. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
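The idea in TEZ-1444 can be sketched as follows. This is only an illustration under stated assumptions, not the Tez API: the class name `IOInterruptedException`, the helper `wasInterrupted` and the stand-in `blockingCall` are all hypothetical. The sketch shows an `IOException` subtype that carries the original `InterruptedException`, so that callers of an API declaring only `IOException` can still tell an interrupt apart from a genuine I/O failure.

```java
import java.io.IOException;

// Hypothetical names throughout; this is not the Tez API.
final class InterruptWrapping {

  /** An IOException subtype that marks "the underlying cause was an interrupt". */
  static final class IOInterruptedException extends IOException {
    IOInterruptedException(InterruptedException cause) {
      super("Interrupted while waiting", cause);
    }
  }

  /** Stand-in for an API method that only declares IOException. */
  static void blockingCall() throws IOException {
    try {
      Thread.sleep(10);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the interrupt flag for callers
      throw new IOInterruptedException(e);
    }
  }

  /** Lets user code distinguish interrupts from real I/O failures. */
  static boolean wasInterrupted(IOException e) {
    return e instanceof IOInterruptedException;
  }
}
```

User code that issued the interrupt itself could then catch `IOException`, check `wasInterrupted(e)`, and ignore that case while still propagating other failures.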
[jira] [Updated] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1421: Target Version/s: 0.6.0 (was: 0.5.2) > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi OZAWA > > I tested MapredWordCount against 70GB of data generated by RandomTextWriter. When a > Combiner runs, it throws an NPE. It looks like setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1721) Update INSTALL instructions for clarifying tez client jars compatibility with runtime tarball on HDFS
[ https://issues.apache.org/jira/browse/TEZ-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1721: Target Version/s: 0.6.0 (was: 0.5.2) > Update INSTALL instructions for clarifying tez client jars compatibility with > runtime tarball on HDFS > - > > Key: TEZ-1721 > URL: https://issues.apache.org/jira/browse/TEZ-1721 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah >Priority: Critical > Attachments: TEZ-1721.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1736) Add support for Inputs/Outputs in runtime-library to generate history text data
[ https://issues.apache.org/jira/browse/TEZ-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1736: Target Version/s: 0.6.0 (was: 0.5.2) > Add support for Inputs/Outputs in runtime-library to generate history text > data > --- > > Key: TEZ-1736 > URL: https://issues.apache.org/jira/browse/TEZ-1736 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-1736.1.patch, TEZ-1736.2.patch, TEZ-1736.3.patch, > TEZ-1736.4.patch, TEZ-1736.5.patch, TEZ-1736.6.patch > > > The userpayload related setHistoryText has been available for some time but > is not used by the Inputs/Outputs in the run-time library. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1663) Change diagnostics to a more parseable format for easy consumption by UI
[ https://issues.apache.org/jira/browse/TEZ-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1663: Target Version/s: 0.6.0 (was: 0.5.2) > Change diagnostics to a more parseable format for easy consumption by UI > > > Key: TEZ-1663 > URL: https://issues.apache.org/jira/browse/TEZ-1663 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah > > A better format for diagnostics will allow the UI to parse it more easily and > also provide information back to user in a better manner. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1697) DAG submission fails if a local resource added is already part of tez.lib.uris
[ https://issues.apache.org/jira/browse/TEZ-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1697: Target Version/s: 0.6.0 (was: 0.5.2) > DAG submission fails if a local resource added is already part of tez.lib.uris > -- > > Key: TEZ-1697 > URL: https://issues.apache.org/jira/browse/TEZ-1697 > Project: Apache Tez > Issue Type: Bug >Reporter: Rohini Palaniswamy > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1723) Flaky test: TestVertexImpl::testVertexWithMultipleInitializers1
[ https://issues.apache.org/jira/browse/TEZ-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1723: Target Version/s: 0.6.0 (was: 0.5.2) > Flaky test: TestVertexImpl::testVertexWithMultipleInitializers1 > --- > > Key: TEZ-1723 > URL: https://issues.apache.org/jira/browse/TEZ-1723 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah > > https://builds.apache.org/job/Tez-Build/723 > testVertexWithMultipleInitializers1(org.apache.tez.dag.app.dag.impl.TestVertexImpl) > Time elapsed: 0.024 sec <<< FAILURE! > java.lang.AssertionError: expected: but was: > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:144) > at > org.apache.tez.dag.app.dag.impl.TestVertexImpl.testVertexWithMultipleInitializers1(TestVertexImpl.java:4234) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1734) Vertex's taskNum may be -1 when recovered from NEW to FAILED/KILLED
[ https://issues.apache.org/jira/browse/TEZ-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1734: Target Version/s: 0.6.0 (was: 0.5.2) > Vertex's taskNum may be -1 when recovered from NEW to FAILED/KILLED > --- > > Key: TEZ-1734 > URL: https://issues.apache.org/jira/browse/TEZ-1734 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.1 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-1734-2.patch, TEZ-1734.patch > > > When vertex recovered from NEW to FAILED/KILLED, the taskNum may be -1, in > this case, we don't need to recover its tasks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1546) Change InputInitializerContext.registerForVertexStateUpdates to return a list of pending state changes
[ https://issues.apache.org/jira/browse/TEZ-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1546: Target Version/s: 0.6.0 (was: 0.5.2) > Change InputInitializerContext.registerForVertexStateUpdates to return a list > of pending state changes > -- > > Key: TEZ-1546 > URL: https://issues.apache.org/jira/browse/TEZ-1546 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Critical > > Sending pending events via the stateChange on the InputInitializer can be > confusing, since multiple calls will be made back to back without the initializer knowing > how many events are coming in or which one is the last. > Returning all past state changes via register ensures that invocations of > onStateChanged are for current events. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1397) Node affinity for tasks processing the same splits
[ https://issues.apache.org/jira/browse/TEZ-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1397: Target Version/s: 0.6.0 (was: 0.5.2) > Node affinity for tasks processing the same splits > -- > > Key: TEZ-1397 > URL: https://issues.apache.org/jira/browse/TEZ-1397 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > > Within a session, if the same set of HDFS blocks are accessed by different > tasks - these should ideally be launched on the same node for better buffer > cache, etc utilization. > This will likely end up being another level of requests higher up than > NODE_LOCAL for the scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1730) Add Auto-reducer-parallelism information to the DAG .dot file
[ https://issues.apache.org/jira/browse/TEZ-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1730: Target Version/s: 0.6.0 (was: 0.5.2) > Add Auto-reducer-parallelism information to the DAG .dot file > - > > Key: TEZ-1730 > URL: https://issues.apache.org/jira/browse/TEZ-1730 > Project: Apache Tez > Issue Type: Task >Affects Versions: 0.5.2 >Reporter: Gopal V >Assignee: Gopal V >Priority: Trivial > > Write out the auto-reducer configuration information (if present) into the > Graphviz data being printed out as debug artifact. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-925) Tez job does not blacklist node when container failed to launch
[ https://issues.apache.org/jira/browse/TEZ-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-925: --- Target Version/s: 0.6.0 (was: 0.5.2) > Tez job does not blacklist node when container failed to launch > --- > > Key: TEZ-925 > URL: https://issues.apache.org/jira/browse/TEZ-925 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Yesha Vora > > A Tez job tries to create a container even if the disk is full. Thus, the job fails > with the error below. > "Can't create directory application_1393835169637_0138 in > /tmp/yarn/local/usercache/user/appcache/application_1393835169637_0138 - No > space left on device" > However, Tez should be able to detect a full disk and blacklist the node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1522) Scheduling can result in out of order execution and slowdown of upstream work
[ https://issues.apache.org/jira/browse/TEZ-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1522: Target Version/s: 0.6.0 (was: 0.5.2) > Scheduling can result in out of order execution and slowdown of upstream work > - > > Key: TEZ-1522 > URL: https://issues.apache.org/jira/browse/TEZ-1522 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan >Priority: Critical > Labels: performance > Attachments: TEZ-1522.1.wip.txt, TEZ-1522.2.wip.txt, > TEZ-1522.am.log.gz, task_runtime.svg > > > M2 M7 > \ / > (sg) \/ >R3/ (b) > \ / > (b) \ / > \ / > M5 > | > R6 > Please refer to the attachment (task runtime SVG). In this case, M5 got > scheduled much earlier than R3 (green color in the diagram) and retained lots > of containers. > R3 got fewer containers to work with. > Attaching the output from the status monitor when the job ran; Map_5 has > taken up almost all of the cluster's resources, whereas Reducer_3 got a fraction of > the capacity. > Map_2: 1/1 Map_5: 0(+373)/1000 Map_7: 1/1 Reducer_3: 0/8000 > Reducer_6: 0/1 > Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: 0/8000 > Reducer_6: 0/1 > Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: 0(+1)/8000 > Reducer_6: 0/1 > > Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: > 14(+7)/8000 Reducer_6: 0/1 > Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: > 63(+14)/8000 Reducer_6: 0/1 > Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: > 159(+22)/8000 Reducer_6: 0/1 > Map_2: 1/1 Map_5: 0(+374)/1000 Map_7: 1/1 Reducer_3: > 308(+29)/8000 Reducer_6: 0/1 > ... > Creating this JIRA as a placeholder for scheduler enhancement. One > possibility could be to > schedule fewer tasks in downstream vertices, based on the > information available for the upstream vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1742) Improve response time of internal preemption
[ https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203573#comment-14203573 ] Bikas Saha commented on TEZ-1742: - commit c6c08c18b77870e71298adb2dc51908473434768 Author: Bikas Saha Date: Sat Nov 8 11:55:05 2014 -0800 TEZ-1742 addendum patch for follow up comments > Improve response time of internal preemption > > > Key: TEZ-1742 > URL: https://issues.apache.org/jira/browse/TEZ-1742 > Project: Apache Tez > Issue Type: Task >Reporter: Bikas Saha >Assignee: Bikas Saha > Fix For: 0.5.3 > > Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, > TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch, > TEZ-1742.addendum.3.patch > > > Tez YARN Task Scheduler currently preempts 1 running task at a time when a > higher priority task is waiting and there are no available resources. When a > large number of higher priority tasks are pending then it can take a long > time to preempt the required number of lower priority tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha resolved TEZ-1758. - Resolution: Fixed Fix Version/s: 0.5.3 Hadoop Flags: Reviewed commit a2d5768bc274684719dff64b46967264f3d18c0e Author: Bikas Saha Date: Sat Nov 8 11:59:14 2014 -0800 TEZ-1758. TezClient should provide YARN diagnostics when the AM crashes (bikas) > TezClient should provide YARN diagnostics when the AM crashes > - > > Key: TEZ-1758 > URL: https://issues.apache.org/jira/browse/TEZ-1758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 >Reporter: Bikas Saha >Assignee: Bikas Saha > Fix For: 0.5.3 > > Attachments: TEZ-1758.1.patch, TEZ-1758.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1758: Attachment: TEZ-1758.2.patch Fixed review comments. Added test case. Checked for spaces. Uploading commit patch. > TezClient should provide YARN diagnostics when the AM crashes > - > > Key: TEZ-1758 > URL: https://issues.apache.org/jira/browse/TEZ-1758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-1758.1.patch, TEZ-1758.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1742) Improve response time of internal preemption
[ https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203567#comment-14203567 ] Rajesh Balamohan commented on TEZ-1742: --- lgtm. +1 > Improve response time of internal preemption > > > Key: TEZ-1742 > URL: https://issues.apache.org/jira/browse/TEZ-1742 > Project: Apache Tez > Issue Type: Task >Reporter: Bikas Saha >Assignee: Bikas Saha > Fix For: 0.5.3 > > Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, > TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch, > TEZ-1742.addendum.3.patch > > > Tez YARN Task Scheduler currently preempts 1 running task at a time when a > higher priority task is waiting and there are no available resources. When a > large number of higher priority tasks are pending then it can take a long > time to preempt the required number of lower priority tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203566#comment-14203566 ] Hitesh Shah commented on TEZ-1758: -- Whitespace example: {code} +DAG dag = DAG.create("DAG").addVertex(vertex); + +try { {code} The line between DAG and try is not an empty line. > TezClient should provide YARN diagnostics when the AM crashes > - > > Key: TEZ-1758 > URL: https://issues.apache.org/jira/browse/TEZ-1758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-1758.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203564#comment-14203564 ] Bikas Saha edited comment on TEZ-1758 at 11/8/14 7:43 PM: -- Diagnostics are expected to be user visible and meaningful. Showing YARN app state does not meet that criteria since its leaking internal platform state. Its good for debugging and it being logged. Will make both places consistent. This is because we simply dont show diagnostics on the client and we have to look at the RM UI/logs or NM logs to find why the AM did not start. Dont know if YARN is propagating the errors from the NM to the RM properly but this will show if its not doing so. The tabs looks fine to me in the diff {code} + + @Test(timeout = 5000) + public void testWaitTillReadyAppFailed() throws Exception { +final TezClientForTest client = configure(); +client.start(); +String msg = "Application Test Failed"; + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState()) + .thenReturn(YarnApplicationState.NEW).thenReturn(YarnApplicationState.FAILED); + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn( +msg); +try { + client.waitTillReady(); + Assert.fail(); +} catch (SessionNotRunning e) { + Assert.assertTrue(e.getMessage().contains(msg)); +} +client.stop(); + } + + @Test(timeout = 5000) + public void testSubmitDAGAppFailed() throws Exception { +final TezClientForTest client = configure(); +client.start(); + +client.callRealGetSessionAMProxy = true; +String msg = "Application Test Failed"; + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState()) +.thenReturn(YarnApplicationState.KILLED); + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn( +msg); + +Vertex vertex = Vertex.create("Vertex", ProcessorDescriptor.create("P"), 1, +Resource.newInstance(1, 1)); +DAG dag = 
DAG.create("DAG").addVertex(vertex); + +try { + client.submitDAG(dag); + Assert.fail(); +} catch (SessionNotRunning e) { + Assert.assertTrue(e.getMessage().contains(msg)); +} +client.stop(); + } {code} Thanks for the review! was (Author: bikassaha): Diagnostics are expected to be user visible and meaningful. Showing YARN app state does not meet that criteria. Will make both places consistent. This is because we simply dont show diagnostics on the client and we have to look at the RM UI/logs or NM logs to find why the AM did not start. Dont know if YARN is propagating the errors from the NM to the RM properly but this will show if its not doing so. The tabs looks fine to me in the diff {code} + + @Test(timeout = 5000) + public void testWaitTillReadyAppFailed() throws Exception { +final TezClientForTest client = configure(); +client.start(); +String msg = "Application Test Failed"; + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState()) + .thenReturn(YarnApplicationState.NEW).thenReturn(YarnApplicationState.FAILED); + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn( +msg); +try { + client.waitTillReady(); + Assert.fail(); +} catch (SessionNotRunning e) { + Assert.assertTrue(e.getMessage().contains(msg)); +} +client.stop(); + } + + @Test(timeout = 5000) + public void testSubmitDAGAppFailed() throws Exception { +final TezClientForTest client = configure(); +client.start(); + +client.callRealGetSessionAMProxy = true; +String msg = "Application Test Failed"; + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState()) +.thenReturn(YarnApplicationState.KILLED); + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn( +msg); + +Vertex vertex = Vertex.create("Vertex", ProcessorDescriptor.create("P"), 1, +Resource.newInstance(1, 1)); +DAG dag = DAG.create("DAG").addVertex(vertex); + +try { + client.submitDAG(dag); + 
Assert.fail(); +} catch (SessionNotRunning e) { + Assert.assertTrue(e.getMessage().contains(msg)); +} +client.stop(); + } {code} Thanks for the review! > TezClient should provide YARN diagnostics when the AM crashes > - > > Key: TEZ-1758 > URL: https://issues.apache.org/jira/browse/TEZ-1758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-1758.1.patch
[jira] [Commented] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203564#comment-14203564 ] Bikas Saha commented on TEZ-1758: - Diagnostics are expected to be user visible and meaningful. Showing YARN app state does not meet that criteria. Will make both places consistent. This is because we simply dont show diagnostics on the client and we have to look at the RM UI/logs or NM logs to find why the AM did not start. Dont know if YARN is propagating the errors from the NM to the RM properly but this will show if its not doing so. The tabs looks fine to me in the diff {code} + + @Test(timeout = 5000) + public void testWaitTillReadyAppFailed() throws Exception { +final TezClientForTest client = configure(); +client.start(); +String msg = "Application Test Failed"; + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState()) + .thenReturn(YarnApplicationState.NEW).thenReturn(YarnApplicationState.FAILED); + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn( +msg); +try { + client.waitTillReady(); + Assert.fail(); +} catch (SessionNotRunning e) { + Assert.assertTrue(e.getMessage().contains(msg)); +} +client.stop(); + } + + @Test(timeout = 5000) + public void testSubmitDAGAppFailed() throws Exception { +final TezClientForTest client = configure(); +client.start(); + +client.callRealGetSessionAMProxy = true; +String msg = "Application Test Failed"; + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getYarnApplicationState()) +.thenReturn(YarnApplicationState.KILLED); + when(client.mockYarnClient.getApplicationReport(client.mockAppId).getDiagnostics()).thenReturn( +msg); + +Vertex vertex = Vertex.create("Vertex", ProcessorDescriptor.create("P"), 1, +Resource.newInstance(1, 1)); +DAG dag = DAG.create("DAG").addVertex(vertex); + +try { + client.submitDAG(dag); + Assert.fail(); +} catch (SessionNotRunning e) { + 
Assert.assertTrue(e.getMessage().contains(msg)); +} +client.stop(); + } {code} Thanks for the review! > TezClient should provide YARN diagnostics when the AM crashes > - > > Key: TEZ-1758 > URL: https://issues.apache.org/jira/browse/TEZ-1758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-1758.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1749) Increase test timeout for TestLocalMode.testMultipleClientsWithSession
[ https://issues.apache.org/jira/browse/TEZ-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1749: -- Fix Version/s: (was: 0.6.0) 0.5.3 > Increase test timeout for TestLocalMode.testMultipleClientsWithSession > -- > > Key: TEZ-1749 > URL: https://issues.apache.org/jira/browse/TEZ-1749 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 0.5.3 > > Attachments: TEZ-1749.1.patch > > > Times out on other platforms (Windows). Verified that it succeeds at > times and times out at other times due to slower hardware. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1749) Increase test timeout for TestLocalMode.testMultipleClientsWithSession
[ https://issues.apache.org/jira/browse/TEZ-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203561#comment-14203561 ] Rajesh Balamohan commented on TEZ-1749: --- Committed to branch-0.5. > Increase test timeout for TestLocalMode.testMultipleClientsWithSession > -- > > Key: TEZ-1749 > URL: https://issues.apache.org/jira/browse/TEZ-1749 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 0.5.3 > > Attachments: TEZ-1749.1.patch > > > Times out on other platforms (Windows). Verified that it succeeds at > times and times out at other times due to slower hardware. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1742) Improve response time of internal preemption
[ https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1742: Attachment: TEZ-1742.addendum.3.patch OK. I will try to stop being smart about not casting doubles to int and go read my basic math book for integer division. Attached patch with fix and more tests for the corner cases. > Improve response time of internal preemption > > > Key: TEZ-1742 > URL: https://issues.apache.org/jira/browse/TEZ-1742 > Project: Apache Tez > Issue Type: Task >Reporter: Bikas Saha >Assignee: Bikas Saha > Fix For: 0.5.3 > > Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, > TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch, > TEZ-1742.addendum.3.patch > > > Tez YARN Task Scheduler currently preempts 1 running task at a time when a > higher priority task is waiting and there are no available resources. When a > large number of higher priority tasks are pending then it can take a long > time to preempt the required number of lower priority tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1742) Improve response time of internal preemption
[ https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203550#comment-14203550 ] Rajesh Balamohan commented on TEZ-1742: --- The changes that use task priority instead of container priority are fine. There is a subtle issue in "scaleDownByPreemptionPercentage". E.g., with the current logic, if original=200 and percent=70, it would try to clear up 200 containers instead of 140. > Improve response time of internal preemption > > > Key: TEZ-1742 > URL: https://issues.apache.org/jira/browse/TEZ-1742 > Project: Apache Tez > Issue Type: Task >Reporter: Bikas Saha >Assignee: Bikas Saha > Fix For: 0.5.3 > > Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch, > TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch > > > Tez YARN Task Scheduler currently preempts 1 running task at a time when a > higher priority task is waiting and there are no available resources. When a > large number of higher priority tasks are pending then it can take a long > time to preempt the required number of lower priority tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
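The arithmetic in the comment above (70 percent of 200 should be 140) can be illustrated with a small sketch. It is not the Tez implementation, which is not shown in this thread: the class `PreemptionMath`, the method `scaleDownByPercentage` and the choice to round up are assumptions made for illustration. It shows one way to scale a count by a percentage using integer arithmetic only, multiplying before dividing so that nothing is truncated early.

```java
// Hypothetical helper; not the actual Tez code for scaleDownByPreemptionPercentage.
final class PreemptionMath {

  /**
   * Returns percent% of original, rounded up (rounding up is an assumption).
   * Multiplying before dividing matters: in integer arithmetic,
   * original * (percent / 100) is 0 for any percent below 100.
   */
  static int scaleDownByPercentage(int original, int percent) {
    if (original < 0 || percent < 0 || percent > 100) {
      throw new IllegalArgumentException("need original >= 0 and 0 <= percent <= 100");
    }
    // long avoids int overflow in the product; adding 99 rounds the division up.
    return (int) (((long) original * percent + 99) / 100);
  }
}
```

With these assumptions, `scaleDownByPercentage(200, 70)` yields 140, and a small non-zero request such as `scaleDownByPercentage(1, 10)` still yields 1 rather than 0.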
[jira] [Commented] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203547#comment-14203547 ] Hitesh Shah commented on TEZ-1758: -- Comments: Mostly looks good. Minor point related to: {code} -throw new SessionNotRunning("TezSession has already shutdown"); +throw new SessionNotRunning("TezSession has already shutdown. " ++ ((diagnostics != null) ? diagnostics : "")); {code} Would it make sense to have the message contain the yarn app state as well as diagnostics? Also, there is a minor inconsistency: - in case for the above code, it uses "" when diagnostics is null - In TezClientUtils, it says "Cluster diagnostics not found" Is this because diagnostics is not set when state is FINISHED? Also, the newly added unit test code has a lot of spurious whitespaces that needs cleaning up. > TezClient should provide YARN diagnostics when the AM crashes > - > > Key: TEZ-1758 > URL: https://issues.apache.org/jira/browse/TEZ-1758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-1758.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
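The inconsistency raised in the review comment above (an empty string in one place, "Cluster diagnostics not found" in another) could be avoided by building the message in a single place. The following is a minimal sketch under stated assumptions: the class `DiagnosticsMessages` and the method `withDiagnostics` are hypothetical and not part of TezClient or TezClientUtils; only the two message strings come from the thread.

```java
// Hypothetical helper; class and method names are illustrative, not Tez code.
final class DiagnosticsMessages {

  // Fallback text quoted in the review comment for TezClientUtils.
  static final String NO_DIAGNOSTICS = "Cluster diagnostics not found";

  /** Appends diagnostics to a prefix, using one shared fallback when they are absent. */
  static String withDiagnostics(String prefix, String diagnostics) {
    boolean missing = diagnostics == null || diagnostics.trim().isEmpty();
    return prefix + " " + (missing ? NO_DIAGNOSTICS : diagnostics);
  }
}
```

A caller could then write `new SessionNotRunning(DiagnosticsMessages.withDiagnostics("TezSession has already shutdown.", diagnostics))` at each site, so the null case reads the same everywhere.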
[jira] [Comment Edited] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203547#comment-14203547 ] Hitesh Shah edited comment on TEZ-1758 at 11/8/14 7:17 PM: --- Comments: Mostly looks good. Minor point related to: {code} -throw new SessionNotRunning("TezSession has already shutdown"); +throw new SessionNotRunning("TezSession has already shutdown. " ++ ((diagnostics != null) ? diagnostics : "")); {code} Would it make sense to have the message contain the final yarn app state as well as diagnostics? Also, there is a minor inconsistency: - in case for the above code, it uses "" when diagnostics is null - In TezClientUtils, it says "Cluster diagnostics not found" Is this because diagnostics is not set when state is FINISHED? Also, the newly added unit test code has a lot of spurious whitespaces that needs cleaning up. was (Author: hitesh): Comments: Mostly looks good. Minor point related to: {code} -throw new SessionNotRunning("TezSession has already shutdown"); +throw new SessionNotRunning("TezSession has already shutdown. " ++ ((diagnostics != null) ? diagnostics : "")); {code} Would it make sense to have the message contain the yarn app state as well as diagnostics? Also, there is a minor inconsistency: - in case for the above code, it uses "" when diagnostics is null - In TezClientUtils, it says "Cluster diagnostics not found" Is this because diagnostics is not set when state is FINISHED? Also, the newly added unit test code has a lot of spurious whitespaces that needs cleaning up. > TezClient should provide YARN diagnostics when the AM crashes > - > > Key: TEZ-1758 > URL: https://issues.apache.org/jira/browse/TEZ-1758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-1758.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-1758:
----------------------------
Affects Version/s: 0.5.2

> TezClient should provide YARN diagnostics when the AM crashes
> -------------------------------------------------------------
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
[ https://issues.apache.org/jira/browse/TEZ-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-1758:
----------------------------
Attachment: TEZ-1758.1.patch

Patch that makes sure appReport diagnostics are reported up to the user so that YARN diagnostics (if present) can be shown when the AM does not start or dies.

[~hitesh] [~pramachandran] Please review.

> TezClient should provide YARN diagnostics when the AM crashes
> -------------------------------------------------------------
>
> Key: TEZ-1758
> URL: https://issues.apache.org/jira/browse/TEZ-1758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-1758.1.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1758) TezClient should provide YARN diagnostics when the AM crashes
Bikas Saha created TEZ-1758:
-------------------------------

Summary: TezClient should provide YARN diagnostics when the AM crashes
Key: TEZ-1758
URL: https://issues.apache.org/jira/browse/TEZ-1758
Project: Apache Tez
Issue Type: Bug
Reporter: Bikas Saha
Assignee: Bikas Saha

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1750) Add a DAGScheduler which schedules tasks only when sources have been scheduled
[ https://issues.apache.org/jira/browse/TEZ-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203344#comment-14203344 ]

Bikas Saha commented on TEZ-1750:
---------------------------------

The new configuration should be an expert-level setting and probably unstable.

Yes. From the naming of the method it seems that it should already have scheduled if the vertex scheduling trigger has been reached. It wasn't clear why it was just returning a boolean and letting the caller do the scheduling.

Any ideas about the 1-1 edge handling or custom edge handling? Or are they follow-up items to be done after trying this out in experiments?

> Add a DAGScheduler which schedules tasks only when sources have been scheduled
> ------------------------------------------------------------------------------
>
> Key: TEZ-1750
> URL: https://issues.apache.org/jira/browse/TEZ-1750
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Siddharth Seth
> Priority: Critical
> Attachments: TEZ-1750.1.txt, TEZ-1750.2.txt
>
>
> Splitting out the patch on TEZ-1522 into a separate jira.
> There are several scenarios in which we end up scheduling downstream tasks
> before their sources have been scheduled - and then get into a situation
> where the sources are starved. Currently, anywhere a ShuffleVertexManager is
> used can cause such behaviour - since it starts scheduling its tasks after a
> certain number of sources are complete, but subsequent non-shuffle
> VertexManagers will schedule immediately.
> Disabling slow-start is one option to achieve this (or setting slow start on
> all vertices), but it doesn't work for the situation where dynamic reducer
> parallelism kicks in - since it has to wait for source tasks to complete.
> The intent here is to add a DAGScheduler, which effectively negates the slow
> start, and in case of dynamic parallelism determination, waits for upstream
> tasks to be scheduled before scheduling downstream tasks.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
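The gating rule described in the issue (hold back downstream tasks until their sources have been scheduled) can be sketched as below. This is a simplified illustration, not the scheduler attached to TEZ-1750: the class `SourceAwareGate` and its methods are hypothetical, vertices are identified by plain strings, and "sources have been scheduled" is assumed to mean that every task of every direct source vertex has been scheduled. Like the method discussed in the comment, `canSchedule` only returns a boolean and leaves the actual scheduling to the caller.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of source-aware scheduling; names are not Tez APIs.
final class SourceAwareGate {
    private final Map<String, List<String>> sources = new HashMap<>();
    private final Map<String, Integer> totalTasks = new HashMap<>();
    private final Map<String, Integer> scheduledTasks = new HashMap<>();

    // Registers a vertex with its task count and its direct source vertices.
    void addVertex(String name, int numTasks, List<String> sourceVertices) {
        sources.put(name, sourceVertices);
        totalTasks.put(name, numTasks);
        scheduledTasks.put(name, 0);
    }

    // Records that one more task of the given vertex has been scheduled.
    void taskScheduled(String vertex) {
        scheduledTasks.merge(vertex, 1, Integer::sum);
    }

    // True only when every direct source vertex has all of its tasks scheduled.
    // Unregistered sources count as having zero tasks, so they never block.
    boolean canSchedule(String vertex) {
        List<String> srcs = sources.getOrDefault(vertex, Collections.<String>emptyList());
        for (String s : srcs) {
            int scheduled = scheduledTasks.getOrDefault(s, 0);
            int total = totalTasks.getOrDefault(s, 0);
            if (scheduled < total) {
                return false;
            }
        }
        return true;
    }
}
```

With a two-task "map" vertex feeding a "reduce" vertex, `canSchedule("reduce")` stays false until `taskScheduled("map")` has been called twice. 1-1 and custom edges, raised in the comment, are not modelled here.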
[jira] [Updated] (TEZ-1742) Improve response time of internal preemption
[ https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-1742:
----------------------------
Attachment: (was: TEZ-1742.addendum.2.patch)

> Improve response time of internal preemption
> --------------------------------------------
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch,
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a
> higher priority task is waiting and there are no available resources. When a
> large number of higher priority tasks are pending then it can take a long
> time to preempt the required number of lower priority tasks.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1742) Improve response time of internal preemption
[ https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-1742:
----------------------------
Attachment: TEZ-1742.addendum.2.patch

> Improve response time of internal preemption
> --------------------------------------------
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch,
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a
> higher priority task is waiting and there are no available resources. When a
> large number of higher priority tasks are pending then it can take a long
> time to preempt the required number of lower priority tasks.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1742) Improve response time of internal preemption
[ https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203328#comment-14203328 ]

Bikas Saha commented on TEZ-1742:
---------------------------------

Good catch. The latest addendum patch fixes the bug and adds a unit test for that method. Please review.

This jira's original patch contains not just the percentage improvement but also bug fixes where incorrect priorities were being compared and we could end up preempting the wrong tasks. The addendum patch has a test enhancement for that bug fix. It also contains a fix to not trigger task preemption in back-to-back cycles so that we don't preempt before the RM can react to the preemption. The existing behavior of fewer preemptions can be enabled by setting a sufficiently low value of the preemptionPercentage, if needed. Hence, I feel that this patch is relevant for branch-0.5 and should make its way there.

Proportional preemption is a general improvement that helps out-of-order scheduling but is not tied to it. It can help in any other case where scheduling constraints need to be enforced quickly, e.g. losing many tasks on a rack or a re-execution of all reducers because of a correlated outage.

> Improve response time of internal preemption
> --------------------------------------------
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch,
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a
> higher priority task is waiting and there are no available resources. When a
> large number of higher priority tasks are pending then it can take a long
> time to preempt the required number of lower priority tasks.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
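To illustrate how a preemption percentage changes the number of tasks preempted per cycle, here is a rough sketch. It is not the logic of the TEZ-1742 patches: the class `PreemptionSketch`, the method `tasksToPreempt`, and the exact formula (percentage applied to the running lower-priority tasks, rounded up, with a floor of one and a cap at the number of pending higher-priority tasks) are assumptions chosen to match the behavior described in the comment, where a sufficiently low percentage falls back to preempting one task at a time.

```java
// Hypothetical sketch of proportional preemption; not the Tez scheduler code.
final class PreemptionSketch {
    // Returns how many lower-priority tasks to preempt in one cycle.
    static int tasksToPreempt(int runningLowerPriority,
                              int pendingHigherPriority,
                              int preemptionPercentage) {
        if (runningLowerPriority <= 0 || pendingHigherPriority <= 0) {
            return 0; // nothing to preempt, or nothing waiting for resources
        }
        // Share of running lower-priority tasks allowed per cycle, rounded up.
        int byPercentage =
            (int) Math.ceil(runningLowerPriority * preemptionPercentage / 100.0);
        // Always make progress: a very low percentage degrades to one task per cycle.
        int atLeastOne = Math.max(1, byPercentage);
        // Never preempt more than is needed or more than is running.
        return Math.min(atLeastOne, Math.min(pendingHigherPriority, runningLowerPriority));
    }
}
```

With 100 running lower-priority tasks and 50 pending higher-priority tasks, a percentage of 10 yields 10 preemptions per cycle, while a percentage of 0 yields 1. The back-to-back-cycle guard mentioned in the comment is not modelled here.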
[jira] [Updated] (TEZ-1742) Improve response time of internal preemption
[ https://issues.apache.org/jira/browse/TEZ-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-1742:
----------------------------
Attachment: TEZ-1742.addendum.2.patch

> Improve response time of internal preemption
> --------------------------------------------
>
> Key: TEZ-1742
> URL: https://issues.apache.org/jira/browse/TEZ-1742
> Project: Apache Tez
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 0.5.3
>
> Attachments: TEZ-1742.1.patch, TEZ-1742.2.patch, TEZ-1742.3.patch,
> TEZ-1742.4.patch, TEZ-1742.4.patch.addendum, TEZ-1742.addendum.2.patch
>
>
> Tez YARN Task Scheduler currently preempts 1 running task at a time when a
> higher priority task is waiting and there are no available resources. When a
> large number of higher priority tasks are pending then it can take a long
> time to preempt the required number of lower priority tasks.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)