Shohei Okumiya created TEZ-4569:
-----------------------------------

             Summary: SCATTER_GATHER + BROADCAST hangs on DAG Recovery
                 Key: TEZ-4569
                 URL: https://issues.apache.org/jira/browse/TEZ-4569
             Project: Apache Tez
          Issue Type: Improvement
    Affects Versions: 0.10.3
            Reporter: Shohei Okumiya
            Assignee: Shohei Okumiya
         Attachments: image-2024-06-11-20-45-12-540.png

A Tez DAG can fail to finish initializing after DAG recovery when the 
Application Master (AM) is preempted at a particular moment.

 

The problem typically occurs with a Hive Map Join (Broadcast Hash Join) when 
the broadcast edge is multi-staged, that is, when the broadcast side is itself 
produced by an upstream stage. In the following case, the smaller side 
includes one aggregation, which satisfies the condition.

 
{code:java}
CREATE TABLE small AS SELECT 1 AS id;
CREATE TABLE big AS SELECT 1 AS id UNION ALL SELECT 2 AS id UNION ALL SELECT 3 AS id;
SELECT *
FROM big
JOIN (SELECT id, count(*) AS num FROM small GROUP BY id) s ON big.id = s.id;
{code}
Once this happens, the retried AM fails to configure the Map Join vertex. In 
the following case, Map 1 stays in INITIALIZING and never starts.

 

 
{code:java}
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 2 .......... container     SUCCEEDED      1          1        0        0       0       1
Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0
Map 1            container  INITIALIZING     -1          0        0       -1       0       0
----------------------------------------------------------------------------------------------
 {code}
Tez starts Map 2 and Map 1 once their splits are configured. The hang occurs 
when the AM is retried before it starts Reducer 3.

!image-2024-06-11-20-45-12-540.png!
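
For readers less familiar with the plan shape, here is a minimal sketch of the 
"multi-staged broadcast" condition. It is a hypothetical plain-Java model, not 
the Tez API: the class DagShape, its Edge type, and the helper names are 
invented for illustration. The edge list (Map 2 to Reducer 3 over 
SCATTER_GATHER, Reducer 3 to Map 1 over BROADCAST) is an assumption inferred 
from the summary and the vertex names above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the DAG shape in this report; not the Tez API.
final class DagShape {
    enum Movement { SCATTER_GATHER, BROADCAST }

    // A directed edge between two named vertices.
    static final class Edge {
        final String from;
        final String to;
        final Movement movement;

        Edge(String from, String to, Movement movement) {
            this.from = from;
            this.to = to;
            this.movement = movement;
        }
    }

    // Assumed topology: Map 2 -> Reducer 3 -> Map 1.
    static List<Edge> reportedEdges() {
        return Arrays.asList(
            new Edge("Map 2", "Reducer 3", Movement.SCATTER_GATHER),
            new Edge("Reducer 3", "Map 1", Movement.BROADCAST));
    }

    // True when the vertex has a BROADCAST input whose source vertex
    // itself has at least one upstream edge.
    static boolean hasMultiStagedBroadcast(List<Edge> edges, String vertex) {
        for (Edge in : edges) {
            if (in.to.equals(vertex) && in.movement == Movement.BROADCAST) {
                for (Edge upstream : edges) {
                    if (upstream.to.equals(in.from)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Edge> edges = reportedEdges();
        for (String v : Arrays.asList("Map 2", "Reducer 3", "Map 1")) {
            System.out.println(v + " has multi-staged broadcast input: "
                + hasMultiStagedBroadcast(edges, v));
        }
    }
}
```

Under these assumptions, only Map 1 satisfies the condition, which matches the 
vertex that stays in INITIALIZING in the output above.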

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
