[ https://issues.apache.org/jira/browse/TEZ-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-1143:
----------------------------

    Description: 
A one-to-one edge fails when the parallelism of the source vertex changes
dynamically (through a ShuffleVertexManager). Here is the stack:
{code}
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Vertex vertex_1400646157236_0012_1_03 parallelism set to 1 from 20
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000001
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000002
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000003
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000004
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000005
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000006
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000007
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000008
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000009
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000010
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000011
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000012
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000013
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000014
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000015
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000016
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000017
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000018
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000019
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Replacing edge manager for source:scope-41 destination: vertex_1400646157236_0012_1_03
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.history.HistoryEventHandler: [HISTORY][DAG:dag_1400646157236_0012_1][Event:VERTEX_PARALLELISM_UPDATED]: vertexId=vertex_1400646157236_0012_1_03, numTasks=1, vertexLocationHint=null, edgeManagersCount=1
2014-05-21 00:05:55,286 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Vertex vertex_1400646157236_0012_1_02 completed., numCompletedVertices=3, numSuccessfulVertices=3, numFailedVertices=0, numKilledVertices=0, numVertices=7
2014-05-21 00:05:55,287 ERROR [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Can't handle Invalid event V_ONE_TO_ONE_SOURCE_SPLIT on vertex scope-61 with vertexId vertex_1400646157236_0012_1_05 at current state RUNNING
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: V_ONE_TO_ONE_SOURCE_SPLIT at RUNNING
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1263)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:158)
        at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1716)
        at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1702)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
        at java.lang.Thread.run(Thread.java:695)
{code}

The complete AM log is attached. scope-42 is the source vertex and scope-61 is
the destination vertex.

The issue is that the code assumed the split event would always arrive before
the vertex starts. This does not hold in all cases. For example, if the event
can arrive via two different paths in the DAG, the vertex can start after the
first path sets the parallelism, and the second path then sends the event.
Also, if the previous vertex was a shuffle/reduce, its parallelism can change
while it is running, which in turn changes the current vertex's parallelism
while it is running.
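
The failure mode described above can be illustrated with a small sketch. This is not the Tez VertexImpl code or the YARN StateMachineFactory; it is a toy table-driven state machine, and every name in it (SplitEventSketch, Machine, allow, originalTable, fixedTable) is hypothetical. It only shows why an event with no registered transition for the current state is rejected, and how registering a RUNNING transition for the split event avoids that. What the real handler has to do when the event arrives in RUNNING is not shown in this message.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy illustration only: names and structure are assumptions, not Tez/YARN code.
final class SplitEventSketch {
    enum State { INITED, RUNNING }
    enum Event { V_START, V_ONE_TO_ONE_SOURCE_SPLIT }

    /** Minimal transition table: (current state, event) -> next state. */
    static final class Machine {
        private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
        private State current = State.INITED;

        Machine allow(State from, Event on, State to) {
            table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
            return this;
        }

        State handle(Event event) {
            Map<Event, State> row = table.get(current);
            if (row == null || !row.containsKey(event)) {
                // Mirrors the shape of the error in the log: no transition registered.
                throw new IllegalStateException("Invalid event: " + event + " at " + current);
            }
            current = row.get(event);
            return current;
        }
    }

    /** Assumed original behaviour: the split event is only expected before start. */
    static Machine originalTable() {
        return new Machine()
            .allow(State.INITED, Event.V_ONE_TO_ONE_SOURCE_SPLIT, State.INITED)
            .allow(State.INITED, Event.V_START, State.RUNNING);
    }

    /** Same table plus a transition that accepts the split event while RUNNING. */
    static Machine fixedTable() {
        return originalTable()
            .allow(State.RUNNING, Event.V_ONE_TO_ONE_SOURCE_SPLIT, State.RUNNING);
    }

    public static void main(String[] args) {
        Machine before = originalTable();
        before.handle(Event.V_START);
        try {
            before.handle(Event.V_ONE_TO_ONE_SOURCE_SPLIT);
        } catch (IllegalStateException e) {
            System.out.println("original table: " + e.getMessage());
        }

        Machine after = fixedTable();
        after.handle(Event.V_START);
        System.out.println("fixed table: " + after.handle(Event.V_ONE_TO_ONE_SOURCE_SPLIT));
    }
}
```

In the sketch, the late split event is rejected by the original table once the vertex has started, and accepted once a RUNNING transition is registered for it.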

  was:
A one-to-one edge fails when the parallelism of the source vertex changes
dynamically (through a ShuffleVertexManager). Here is the stack:
{code}
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Vertex vertex_1400646157236_0012_1_03 parallelism set to 1 from 20
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000001
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000002
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000003
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000004
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000005
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000006
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000007
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000008
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000009
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000010
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000011
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000012
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000013
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000014
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000015
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000016
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000017
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000018
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000019
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Replacing edge manager for source:scope-41 destination: vertex_1400646157236_0012_1_03
2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.history.HistoryEventHandler: [HISTORY][DAG:dag_1400646157236_0012_1][Event:VERTEX_PARALLELISM_UPDATED]: vertexId=vertex_1400646157236_0012_1_03, numTasks=1, vertexLocationHint=null, edgeManagersCount=1
2014-05-21 00:05:55,286 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Vertex vertex_1400646157236_0012_1_02 completed., numCompletedVertices=3, numSuccessfulVertices=3, numFailedVertices=0, numKilledVertices=0, numVertices=7
2014-05-21 00:05:55,287 ERROR [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Can't handle Invalid event V_ONE_TO_ONE_SOURCE_SPLIT on vertex scope-61 with vertexId vertex_1400646157236_0012_1_05 at current state RUNNING
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: V_ONE_TO_ONE_SOURCE_SPLIT at RUNNING
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1263)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:158)
        at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1716)
        at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1702)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
        at java.lang.Thread.run(Thread.java:695)
{code}

The complete AM log is attached. scope-42 is the source vertex and scope-61 is
the destination vertex.

The issue is that the code assumed the split event would always arrive before
the vertex starts. This does not hold in all cases. For example, if the event
can arrive via two different paths in the DAG, the vertex can start after the
first path sets the parallelism, and the second path then sends the event.


> 1-1 source split event should be handled in Vertex.RUNNING state
> ----------------------------------------------------------------
>
>                 Key: TEZ-1143
>                 URL: https://issues.apache.org/jira/browse/TEZ-1143
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Daniel Dai
>            Assignee: Bikas Saha
>             Fix For: 0.5.0
>
>         Attachments: TEZ-1143.1.patch, TEZ-1143.2.patch, TEZ-1143.3.patch, 
> syslog_dag_1400696568249_0001_1
>
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
