[ 
https://issues.apache.org/jira/browse/TEZ-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854024#comment-17854024
 ] 

Shohei Okumiya commented on TEZ-4569:
-------------------------------------

I first created a test case to reproduce the issue. 
[testTableScanTemporalFailure|https://github.com/okumin/tez/commit/deac035274bd0b958fbfdf3557dc7120c16fddc5#diff-ad65a331fa51a07f3cc5301ca7df09c199e9730a6f889bc8b1859554ccfc0519R199-R217]
 is the most straightforward reproduction. In the client log below, the DAG 
stays RUNNING while the MapJoin vertex remains at TotalTasks: -1.
{code:java}
2024-06-11 20:18:55,719 INFO  [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - DAG: State: RUNNING Progress: 200% TotalTasks: 1 Succeeded: 2 Running: 0 Failed: 0 Killed: 0 KilledTaskAttempts: 1
2024-06-11 20:18:55,720 INFO  [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) -     VertexStatus: VertexName: TableScan Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0 KilledTaskAttempts: 1
2024-06-11 20:18:55,721 INFO  [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) -     VertexStatus: VertexName: Aggregation Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
2024-06-11 20:18:55,721 INFO  [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) -     VertexStatus: VertexName: MapJoin Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
2024-06-11 20:19:00,756 INFO  [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - DAG: State: RUNNING Progress: 200% TotalTasks: 1 Succeeded: 2 Running: 0 Failed: 0 Killed: 0 KilledTaskAttempts: 1
2024-06-11 20:19:00,757 INFO  [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) -     VertexStatus: VertexName: TableScan Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0 KilledTaskAttempts: 1
2024-06-11 20:19:00,757 INFO  [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) -     VertexStatus: VertexName: Aggregation Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
2024-06-11 20:19:00,758 INFO  [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) -     VertexStatus: VertexName: MapJoin Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
{code}
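
The DAG-level line (Progress: 200%, TotalTasks: 1, Succeeded: 2) is consistent with the per-vertex counters being summed as-is, including the -1 that MapJoin reports while it is uninitialized: 1 + 1 + (-1) = 1 total tasks against 2 succeeded. The sketch below only reproduces that arithmetic. The class and method names are hypothetical, and it is an assumption about how the number arises, not Tez's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DagProgressSketch {
    /** Hypothetical holder for the per-vertex counters printed in the log. */
    static final class VertexStatus {
        final int totalTasks; // -1 while the vertex is not yet initialized
        final int succeeded;

        VertexStatus(int totalTasks, int succeeded) {
            this.totalTasks = totalTasks;
            this.succeeded = succeeded;
        }
    }

    /** Assumed naive aggregation: sums every vertex's counters, including the -1 placeholder. */
    static int naiveProgressPercent(Map<String, VertexStatus> vertices) {
        int total = 0;
        int succeeded = 0;
        for (VertexStatus v : vertices.values()) {
            total += v.totalTasks;
            succeeded += v.succeeded;
        }
        return total <= 0 ? 0 : 100 * succeeded / total;
    }

    public static void main(String[] args) {
        // Counters copied from the VertexStatus lines in the log above.
        Map<String, VertexStatus> vertices = new LinkedHashMap<>();
        vertices.put("TableScan", new VertexStatus(1, 1));
        vertices.put("Aggregation", new VertexStatus(1, 1));
        vertices.put("MapJoin", new VertexStatus(-1, 0));
        System.out.println("Progress: " + naiveProgressPercent(vertices) + "%");
    }
}
```

Under this reading, the 200% figure is a side effect of MapJoin never being configured rather than a separate problem.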

> SCATTER_GATHER + BROADCAST hangs on DAG Recovery
> ------------------------------------------------
>
>                 Key: TEZ-4569
>                 URL: https://issues.apache.org/jira/browse/TEZ-4569
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.10.3
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>         Attachments: image-2024-06-11-20-45-12-540.png
>
>
> A Tez DAG fails to initialize when the Application Master is preempted at a 
> particular moment.
>  
> The problem typically happens with Hive's Map Join (Broadcast Hash Join) when 
> the broadcast edge is multi-staged. In the following case, the smaller side 
> includes one aggregation, so the condition is satisfied.
>  
> {code:sql}
> CREATE TABLE small AS SELECT 1 AS id;
> CREATE TABLE big AS SELECT 1 AS id UNION ALL SELECT 2 AS id UNION ALL SELECT 3 AS id;
> SELECT *
> FROM big
> JOIN (SELECT id, count(*) AS num FROM small GROUP BY id) s ON big.id = s.id 
> {code}
> Once it happens, a retried AM fails to configure the Map Join vertex. In the 
> following case, Map 1 never starts.
>  
>  
> {code:java}
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
> ----------------------------------------------------------------------------------------------
> Map 2 .......... container     SUCCEEDED      1          1        0        0       0       1  
> Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0  
> Map 1            container  INITIALIZING     -1          0        0       -1       0       0  
> ----------------------------------------------------------------------------------------------
> {code}
> Tez starts Map 2 and Map 1 once their splits are configured. The hang occurs 
> when the AM is retried before it starts Reducer 3.
> !image-2024-06-11-20-45-12-540.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
