[
https://issues.apache.org/jira/browse/TEZ-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854024#comment-17854024
]
Shohei Okumiya commented on TEZ-4569:
-------------------------------------
First, I created a test case to reproduce the issue.
[testTableScanTemporalFailure|https://github.com/okumin/tez/commit/deac035274bd0b958fbfdf3557dc7120c16fddc5#diff-ad65a331fa51a07f3cc5301ca7df09c199e9730a6f889bc8b1859554ccfc0519R199-R217]
is the most straightforward reproduction.
{code:java}
2024-06-11 20:18:55,719 INFO [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - DAG: State: RUNNING Progress: 200% TotalTasks: 1 Succeeded: 2 Running: 0 Failed: 0 Killed: 0 KilledTaskAttempts: 1
2024-06-11 20:18:55,720 INFO [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - VertexStatus: VertexName: TableScan Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0 KilledTaskAttempts: 1
2024-06-11 20:18:55,721 INFO [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - VertexStatus: VertexName: Aggregation Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
2024-06-11 20:18:55,721 INFO [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - VertexStatus: VertexName: MapJoin Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
2024-06-11 20:19:00,756 INFO [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - DAG: State: RUNNING Progress: 200% TotalTasks: 1 Succeeded: 2 Running: 0 Failed: 0 Killed: 0 KilledTaskAttempts: 1
2024-06-11 20:19:00,757 INFO [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - VertexStatus: VertexName: TableScan Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0 KilledTaskAttempts: 1
2024-06-11 20:19:00,757 INFO [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - VertexStatus: VertexName: Aggregation Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
2024-06-11 20:19:00,758 INFO [Time-limited test] client.DAGClientImpl (DAGClientImpl.java:log(709)) - VertexStatus: VertexName: MapJoin Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
{code}
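For anyone scanning similar client logs, the stuck vertex can be picked out mechanically. The following is only a minimal sketch, not part of Tez: the class {{StuckVertexFinder}} and its method are hypothetical names, it assumes each status entry has been joined into a single line, and it assumes that {{TotalTasks: -1}} marks a vertex that has not been configured yet, as the INITIALIZING row in the description suggests.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading DAGClientImpl status logs; not a Tez API.
final class StuckVertexFinder {
    // Matches "VertexStatus: VertexName: <name> Progress: <n>% TotalTasks: <n>".
    // The DAG-level line has no "VertexStatus:" prefix, so it is skipped.
    private static final Pattern VERTEX_STATUS = Pattern.compile(
        "VertexStatus: VertexName: (.+?)\\s+Progress: \\d+%\\s+TotalTasks: (-?\\d+)");

    private StuckVertexFinder() {
    }

    // Returns the distinct vertex names reported with TotalTasks: -1,
    // in the order they first appear in the log.
    static List<String> findUninitialized(List<String> logLines) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = VERTEX_STATUS.matcher(line);
            // Assumption: -1 means the vertex's task count is still undetermined.
            if (m.find() && Integer.parseInt(m.group(2)) == -1) {
                names.add(m.group(1));
            }
        }
        return new ArrayList<>(names);
    }
}
```

Applied to the two status reports above, this returns only MapJoin, which matches the vertex that never leaves 0% between 20:18:55 and 20:19:00.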
> SCATTER_GATHER + BROADCAST hangs on DAG Recovery
> ------------------------------------------------
>
> Key: TEZ-4569
> URL: https://issues.apache.org/jira/browse/TEZ-4569
> Project: Apache Tez
> Issue Type: Improvement
> Affects Versions: 0.10.3
> Reporter: Shohei Okumiya
> Assignee: Shohei Okumiya
> Priority: Major
> Attachments: image-2024-06-11-20-45-12-540.png
>
>
> A Tez DAG fails to initialize itself when the Application Master is preempted
> at a particular point during execution.
>
> The problem typically happens with Hive's Map Join (Broadcast Hash Join) when
> the broadcast edge is multi-staged. In the following case, the smaller side
> includes one aggregation, which satisfies that condition.
>
> {code:sql}
> CREATE TABLE small AS SELECT 1 AS id;
> CREATE TABLE big AS SELECT 1 AS id UNION ALL SELECT 2 AS id UNION ALL SELECT 3 AS id;
> SELECT *
> FROM big
> JOIN (SELECT id, count(*) AS num FROM small GROUP BY id) s ON big.id = s.id;
> {code}
> Once it happens, a retried AM fails to configure the Map Join vertex. In the
> following case, Map 1 never starts.
>
>
> {code:java}
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 2 .......... container     SUCCEEDED      1          1        0        0       0       1
> Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0
> Map 1            container  INITIALIZING     -1          0        0       -1       0       0
> ----------------------------------------------------------------------------------------------
> {code}
> Tez starts Map 2 and Map 1 once their splits are configured. The hang
> happens when the AM is retried before it starts Reducer 3.
> !image-2024-06-11-20-45-12-540.png!
>