[ 
https://issues.apache.org/jira/browse/TEZ-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855389#comment-17855389
 ] 

Shohei Okumiya commented on TEZ-4569:
-------------------------------------

We have another discussion here.

https://lists.apache.org/thread/q7cnz81k39wzd29hrp08o5vohbrdlhk2

> SCATTER_GATHER + BROADCAST hangs on DAG Recovery
> ------------------------------------------------
>
>                 Key: TEZ-4569
>                 URL: https://issues.apache.org/jira/browse/TEZ-4569
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.9.2, 0.10.3
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>         Attachments: image-2024-06-11-20-45-12-540.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A Tez DAG fails to initialize itself when the Application Master is preempted 
> at a specific point in time.
>  
> The problem typically happens with Hive's Map Join (Broadcast Hash Join) when 
> the broadcast edge is multi-staged. In the following example, the smaller side 
> includes one aggregation, which satisfies that condition.
>  
> {code:java}
> CREATE TABLE small AS SELECT 1 AS id;
> CREATE TABLE big AS
> SELECT 1 AS id UNION ALL SELECT 2 AS id UNION ALL SELECT 3 AS id;
> SELECT *
> FROM big
> JOIN (SELECT id, count(*) AS num FROM small GROUP BY id) s ON big.id = s.id;
> {code}
> Once this happens, the retried AM fails to configure the Map Join vertex. In 
> the following case, Map 1 never starts.
>  
>  
> {code:java}
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 2 .......... container     SUCCEEDED      1          1        0        0       0       1
> Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0
> Map 1            container  INITIALIZING     -1          0        0       -1       0       0
> ----------------------------------------------------------------------------------------------
>  {code}
> Tez starts Map 2 and Map 1 once their splits are configured. The hang occurs 
> when the AM is retried before it starts Reducer 3.
> !image-2024-06-11-20-45-12-540.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
