[ 
https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688697#comment-17688697
 ] 

Mudit Sharma commented on TEZ-4474:
-----------------------------------

[~srahman], I debugged this issue further to understand why the summary 
recovery file was not present.

I saw these general observations:
 # Summary files are usually created at the end of an app attempt. Per the code 
they should be created immediately, but even after seeing this log: 
[https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java#L413]
 the file was not getting created on HDFS/GCS. As per my understanding, we are 
only creating an in-memory stream, which will be flushed later; please correct 
me if I am wrong.
 # The jobs where we are seeing the issue all have this error:
 ## 2023-02-13 23:57:08,464 [FATAL] [IPC Server handler 22 on 7063] 
|yarn.YarnUncaughtExceptionHandler|: Thread Thread[IPC Server handler 22 on 
7063,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
 # Also, for all our buggy job attempts, we never see this log: 
[https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java#L222]
 nor this one: 
[https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/history/recovery/RecoveryService.java#L247]
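On point 1 above: a minimal, self-contained sketch (plain java.io, not the 
actual Tez/HDFS stream classes, so the file name and buffer size are 
illustrative assumptions) of why a log line about creating the writer does not 
imply bytes on storage. Buffered data only reaches the file on flush()/close(), 
so a JVM killed earlier (e.g. by an OutOfMemoryError) leaves an empty file:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class UnflushedStreamDemo {
    /** Returns {bytes on disk before flush, bytes on disk after flush}. */
    static long[] sizes() throws IOException {
        Path file = Files.createTempFile("recovery", ".summary");
        // Writes land in the 8 KB in-memory buffer, not in the file
        OutputStream out = new BufferedOutputStream(Files.newOutputStream(file), 8192);
        out.write("DAG summary event".getBytes(StandardCharsets.UTF_8));
        long before = Files.size(file); // still 0: data sits in the buffer
        // If the process died here (e.g. OutOfMemoryError), the file would stay empty
        out.flush();
        long after = Files.size(file);  // bytes reach the file only after flush
        out.close();
        Files.delete(file);
        return new long[] { before, after };
    }

    public static void main(String[] args) throws IOException {
        long[] s = sizes();
        System.out.println("before flush: " + s[0] + " bytes, after flush: " + s[1] + " bytes");
    }
}
```

For the real RecoveryService the same applies even more strongly: it writes 
through a filesystem output stream, where data may not be durable on HDFS/GCS 
until hflush()/hsync() or close() is called.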

So, my speculation is that whenever a FATAL error occurs in the attempt, the 
recovery service is interrupted prematurely and never shuts down properly, and 
because of that those in-memory streams are never written out.

Could you verify this and let me know whether it matches your understanding of 
the flow? This is what I gathered from the logs of the correct and buggy job 
attempts at our end.

 

Also, apart from the PR I raised, I have one more proposal to fix this: if we 
are in a session and our recoveredDag is still NULL, can we simply throw an 
exception and return?

Because right now we already have a check: 
[https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1963]

If we return null there, that also means we were unable to recover the DAG from 
previous attempts; should we fail in that scenario as well?
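To make that proposal concrete, here is a minimal, hedged sketch. The class and 
the nested TezException are stand-ins, not the actual DAGAppMaster code: fail 
fast when session-mode recovery produced no DAG, instead of letting the AM 
drift to IDLE and later be reported SUCCEEDED by shutdownTezAM():

```java
// Stand-in types; in Tez this logic would live in DAGAppMaster and use
// org.apache.tez.dag.api.TezException and the recovered DAG data.
public class RecoveryGuard {
    static class TezException extends Exception {
        TezException(String message) { super(message); }
    }

    /**
     * Throws when we are in session mode but recovery yielded no DAG,
     * so the attempt fails instead of being shut down as SUCCEEDED.
     */
    static void checkRecoveredDag(boolean isSession, Object recoveredDag) throws TezException {
        if (isSession && recoveredDag == null) {
            throw new TezException(
                "Unable to recover DAG from previous attempt; failing the AM attempt");
        }
    }

    public static void main(String[] args) {
        try {
            checkRecoveredDag(true, null);
            System.out.println("no exception");
        } catch (TezException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

Non-session runs are left untouched; only the session + null-recoveredDag 
corner case turns into an explicit failure.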

> DAG recovery failure leads to AM status SUCCEEDED
> -------------------------------------------------
>
>                 Key: TEZ-4474
>                 URL: https://issues.apache.org/jira/browse/TEZ-4474
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2, 0.10.0, 0.10.1, 0.10.2
>            Reporter: Mudit Sharma
>            Priority: Critical
>         Attachments: 
> 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Summary of the Issue:
> When Tez DAG recovery fails for some reason in the second retry of a Tez AM, 
> then in a corner-case scenario the Tez job sets the DAG state to IDLE.
> Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout() 
> the Tez AM tries to shut down the DAG, and since recovery failed there are no 
> running DAGs.
> If there are no RUNNING DAGs and the DAG state is IDLE, the AM by default 
> sets the status to SUCCEEDED, because of this if-else:
> [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]
> {code:java}
> public void shutdownTezAM(String dagKillmessage) throws TezException {
>     if (!sessionStopped.compareAndSet(false, true)) {
>       // No need to shutdown twice.
>       // Return with a no-op if shutdownTezAM has been invoked earlier.
>       return;
>     }
>     synchronized (this) {
>       this.taskSchedulerManager.setShouldUnregisterFlag();
>       if (currentDAG != null
>           && !currentDAG.isComplete()) {
>         //send a DAG_TERMINATE message
>         LOG.info("Sending a kill event to the current DAG"
>             + ", dagId=" + currentDAG.getID());
>         tryKillDAG(currentDAG, dagKillmessage);
>       } else {
>         LOG.info("No current running DAG, shutting down the AM");
>         if (isSession && !state.equals(DAGAppMasterState.ERROR)) {
>           state = DAGAppMasterState.SUCCEEDED;
>         }
>         shutdownHandler.shutdown();
>       }
>     }
>   }
> {code}
>  
> This can result in issues in dependent systems like Hive, which will move 
> ahead with other tasks in the pipeline assuming the DAG succeeded; this can 
> result in moving empty data in Hive.
> As part of this JIRA, we are proposing a patch for Tez which introduces a 
> config that, when set, marks the Tez status as FAILED instead of SUCCEEDED 
> on shutdown with no currently running DAGs, provided the DAG state at that 
> time was not ERROR.
>  
> PR: [https://github.com/apache/tez/pull/266] 
> This is the patch, please review and let us know about your thoughts: 
> [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
