[ 
https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mudit Sharma updated TEZ-4474:
------------------------------
    Description: 
Summary of the Issue:

When Tez DAG recovery is failed because of some reason in the second retry of 
any Tez AM, then in corner case scenario, Tez Job sets DAG state to IDLE

Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout(), 
Tez AM will try to shutdown the DAG, and since recovery was failed so there 
will not be any running DAGs

If there are no RUNNING DAGs and state of DAG is IDLE, then by default AM sets 
the status to SUCCEEDED, because of this if-else:

[https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]

public void shutdownTezAM(String dagKillmessage) throws TezException {
    if (!sessionStopped.compareAndSet(false, true))

{       // No need to shutdown twice.       // Return with a no-op if 
shutdownTezAM has been invoked earlier.       return;     }

    synchronized (this) {
      this.taskSchedulerManager.setShouldUnregisterFlag();
      if (currentDAG != null
          && !currentDAG.isComplete())

{         //send a DAG_TERMINATE message         LOG.info("Sending a kill event 
to the current DAG"             + ", dagId=" + currentDAG.getID());         
tryKillDAG(currentDAG, dagKillmessage);       }

else {
        LOG.info("No current running DAG, shutting down the AM");
        if (isSession && !state.equals(DAGAppMasterState.ERROR))

{           state = DAGAppMasterState.SUCCEEDED;         }

        shutdownHandler.shutdown();
      }
    }
  }

 

This can result in issues in dependent systems like Hive which will move ahead 
with other tasks in pipeline assuming the DAG was success, this can result in 
moving empty data in Hive

As part of this JIRA, we are proposing to introduce a patch in TEZ, which 
introduces a config, which when set, then in case of shutdown with no current 
running DAGs, Tez status will always be marked as FAILED instead of SUCCEEDED 
in case DAG state at that time was not ERROR

 

This is the patch, please review and let us know about your thoughts: 
[^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]

  was:
Summary of the Issue:

When Tez DAG recovery is failed because of some reason in the second retry of 
any Tez AM, then in corner case scenario, Tez Job sets DAG state to IDLE

Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout(), 
Tez AM will try to shutdown the DAG, and since recovery was failed so there 
will not be any running DAGs

If there are no RUNNING DAGs and state of DAG is IDLE, then by default AM sets 
the status to SUCCEEDED, because of this if-else:

[https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]

public void shutdownTezAM(String dagKillmessage) throws TezException {
    if (!sessionStopped.compareAndSet(false, true)) {
      // No need to shutdown twice.
      // Return with a no-op if shutdownTezAM has been invoked earlier.
      return;
    }
    synchronized (this) {
      this.taskSchedulerManager.setShouldUnregisterFlag();
      if (currentDAG != null
          && !currentDAG.isComplete()) {
        //send a DAG_TERMINATE message
        LOG.info("Sending a kill event to the current DAG"
            + ", dagId=" + currentDAG.getID());
        tryKillDAG(currentDAG, dagKillmessage);
      } else {
        LOG.info("No current running DAG, shutting down the AM");
        if (isSession && !state.equals(DAGAppMasterState.ERROR)) {
          state = DAGAppMasterState.SUCCEEDED;
        }
        shutdownHandler.shutdown();
      }
    }
  }

 

This can result in issues in dependent systems like Hive which will move ahead 
with other tasks in pipeline assuming the DAG was success, this can result in 
moving empty data in Hive

As part of this JIRA, we are proposing to introduce a patch in TEZ, which 
introduces a config, which when set, then in case of shutdown with no current 
running DAGs, Tez status will always be marked as FAILED instead of SUCCEEDED 
in case DAG state at that time as not ERROR

 

This is the patch, please review and let us know about your thoughts: 
[^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]


> DAG recovery failure leads to AM status SUCCEEDED
> -------------------------------------------------
>
>                 Key: TEZ-4474
>                 URL: https://issues.apache.org/jira/browse/TEZ-4474
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Mudit Sharma
>            Priority: Major
>         Attachments: 
> 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch
>
>
> Summary of the Issue:
> When Tez DAG recovery is failed because of some reason in the second retry of 
> any Tez AM, then in corner case scenario, Tez Job sets DAG state to IDLE
> Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout(), 
> Tez AM will try to shutdown the DAG, and since recovery was failed so there 
> will not be any running DAGs
> If there are no RUNNING DAGs and state of DAG is IDLE, then by default AM 
> sets the status to SUCCEEDED, because of this if-else:
> [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]
> public void shutdownTezAM(String dagKillmessage) throws TezException {
>     if (!sessionStopped.compareAndSet(false, true))
> {       // No need to shutdown twice.       // Return with a no-op if 
> shutdownTezAM has been invoked earlier.       return;     }
>     synchronized (this) {
>       this.taskSchedulerManager.setShouldUnregisterFlag();
>       if (currentDAG != null
>           && !currentDAG.isComplete())
> {         //send a DAG_TERMINATE message         LOG.info("Sending a kill 
> event to the current DAG"             + ", dagId=" + currentDAG.getID());     
>     tryKillDAG(currentDAG, dagKillmessage);       }
> else {
>         LOG.info("No current running DAG, shutting down the AM");
>         if (isSession && !state.equals(DAGAppMasterState.ERROR))
> {           state = DAGAppMasterState.SUCCEEDED;         }
>         shutdownHandler.shutdown();
>       }
>     }
>   }
>  
> This can result in issues in dependent systems like Hive which will move 
> ahead with other tasks in pipeline assuming the DAG was success, this can 
> result in moving empty data in Hive
> As part of this JIRA, we are proposing to introduce a patch in TEZ, which 
> introduces a config, which when set, then in case of shutdown with no current 
> running DAGs, Tez status will always be marked as FAILED instead of SUCCEEDED 
> in case DAG state at that time was not ERROR
>  
> This is the patch, please review and let us know about your thoughts: 
> [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to