[ https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690784#comment-17690784 ]
Mudit Sharma commented on TEZ-4474: ----------------------------------- [~abstractdog] can you please review the PR: [https://github.com/apache/tez/pull/266], we are facing some critical issues because of this at our end > DAG recovery failure leads to AM status SUCCEEDED > ------------------------------------------------- > > Key: TEZ-4474 > URL: https://issues.apache.org/jira/browse/TEZ-4474 > Project: Apache Tez > Issue Type: Bug > Affects Versions: 0.9.2, 0.10.0, 0.10.1, 0.10.2 > Reporter: Mudit Sharma > Priority: Blocker > Attachments: > 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > Summary of the Issue: > When Tez DAG recovery is failed because of some reason in the second retry of > any Tez AM, then in corner case scenario, Tez Job sets DAG state to IDLE > Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout(), > Tez AM will try to shutdown the DAG, and since recovery was failed so there > will not be any running DAGs > If there are no RUNNING DAGs and state of DAG is IDLE, then by default AM > sets the status to SUCCEEDED, because of this if-else: > [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266] > {code:java} > public void shutdownTezAM(String dagKillmessage) throws TezException { > if (!sessionStopped.compareAndSet(false, true)) > { // No need to shutdown twice. // Return with a no-op if > shutdownTezAM has been invoked earlier. return; } > synchronized (this) { > this.taskSchedulerManager.setShouldUnregisterFlag(); > if (currentDAG != null > && !currentDAG.isComplete()) > { //send a DAG_TERMINATE message LOG.info("Sending a kill > event to the current DAG" + ", dagId=" + currentDAG.getID()); > tryKillDAG(currentDAG, dagKillmessage); } > else { > LOG.info("No current running DAG, shutting down the AM"); > if (isSession && !state.equals(DAGAppMasterState.ERROR)) > { state = DAGAppMasterState.SUCCEEDED; } > shutdownHandler.shutdown(); > } > } > } > {code} > > This can result in issues in dependent systems like Hive which will move > ahead with other tasks in pipeline assuming the DAG was success, this can > result in moving empty data in Hive > As part of this JIRA, we are proposing to introduce a patch in TEZ, which > introduces a config, which when set, then in case of shutdown with no current > running DAGs, Tez status will always be marked as FAILED instead of SUCCEEDED > in case DAG state at that time was not ERROR > > PR: [https://github.com/apache/tez/pull/266] > This is the patch, please review and let us know about your thoughts: > [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch] > -- This message was sent by Atlassian Jira (v8.20.10#820010)