[ 
https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4474:
------------------------------
    Fix Version/s: 0.10.3

> DAG recovery failure leads to AM status SUCCEEDED
> -------------------------------------------------
>
>                 Key: TEZ-4474
>                 URL: https://issues.apache.org/jira/browse/TEZ-4474
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2, 0.10.0, 0.10.1, 0.10.2
>            Reporter: Mudit Sharma
>            Assignee: Mudit Sharma
>            Priority: Blocker
>             Fix For: 0.10.3
>
>         Attachments: 
> 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch
>
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Summary of the issue:
> When Tez DAG recovery fails for some reason during the second retry of a 
> Tez AM, then in a corner case the Tez job sets the DAG state to IDLE.
> Once the DAG state is IDLE, then after checkAndHandleSessionTimeout() the 
> Tez AM tries to shut down the DAG; because recovery failed, there are no 
> running DAGs.
> If there are no RUNNING DAGs and the DAG state is IDLE, the AM sets the 
> status to SUCCEEDED by default, because of this if-else:
> [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]
> {code:java}
> public void shutdownTezAM(String dagKillmessage) throws TezException {
>   if (!sessionStopped.compareAndSet(false, true)) {
>     // No need to shutdown twice.
>     // Return with a no-op if shutdownTezAM has been invoked earlier.
>     return;
>   }
>   synchronized (this) {
>     this.taskSchedulerManager.setShouldUnregisterFlag();
>     if (currentDAG != null && !currentDAG.isComplete()) {
>       // send a DAG_TERMINATE message
>       LOG.info("Sending a kill event to the current DAG"
>           + ", dagId=" + currentDAG.getID());
>       tryKillDAG(currentDAG, dagKillmessage);
>     } else {
>       LOG.info("No current running DAG, shutting down the AM");
>       if (isSession && !state.equals(DAGAppMasterState.ERROR)) {
>         state = DAGAppMasterState.SUCCEEDED;
>       }
>       shutdownHandler.shutdown();
>     }
>   }
> }
> {code}
>  
> This can cause problems in dependent systems such as Hive, which will move 
> ahead with the other tasks in its pipeline on the assumption that the DAG 
> succeeded; this can result in Hive moving empty data.
> As part of this JIRA, we propose a patch to Tez that introduces a config. 
> When the config is set and a shutdown happens with no currently running 
> DAGs, the Tez status will always be marked FAILED instead of SUCCEEDED, 
> provided the state at that time was not ERROR.
>  
> PR: [https://github.com/apache/tez/pull/266] 
> This is the patch; please review and let us know your thoughts: 
> [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]
>  
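
The state decision described above can be sketched in isolation. This is only a minimal model of the else-branch of shutdownTezAM(), not the code from the patch: the State enum is a simplified stand-in for DAGAppMasterState, and the failOnNoRunningDag parameter is a hypothetical name for the proposed config, whose real key is not given in this message.

```java
public class AmShutdownStateSketch {

    /** Simplified stand-in for Tez's DAGAppMasterState; only the values needed here. */
    public enum State { IDLE, RUNNING, SUCCEEDED, FAILED, ERROR }

    /**
     * Models the branch taken when shutdown is requested and no DAG is running.
     *
     * @param isSession          whether the AM runs in session mode
     * @param current            AM state at the time of shutdown
     * @param failOnNoRunningDag hypothetical flag standing in for the proposed config
     * @return the state the AM would report after shutdown
     */
    public static State stateAfterShutdown(boolean isSession, State current,
                                           boolean failOnNoRunningDag) {
        if (isSession && current != State.ERROR) {
            // Existing behaviour reports SUCCEEDED; the proposal reports FAILED instead.
            return failOnNoRunningDag ? State.FAILED : State.SUCCEEDED;
        }
        // Non-session AMs and AMs already in ERROR keep their state.
        return current;
    }

    public static void main(String[] args) {
        // Recovery failed, AM went IDLE, config unset: reported as SUCCEEDED.
        System.out.println(stateAfterShutdown(true, State.IDLE, false));
        // Same situation with the hypothetical flag set: reported as FAILED.
        System.out.println(stateAfterShutdown(true, State.IDLE, true));
        // ERROR is left untouched in both modes.
        System.out.println(stateAfterShutdown(true, State.ERROR, true));
    }
}
```

With the flag unset, the IDLE case reproduces the SUCCEEDED status reported in this issue; with it set, a dependent system such as Hive would see FAILED and could stop its pipeline.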



--
This message was sent by Atlassian Jira
(v8.20.10#820010)