[ 
https://issues.apache.org/jira/browse/FLINK-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lsw_aka_laplace updated FLINK-14268:
------------------------------------
    Description: 
I tried to start a streaming job and restore it from a checkpoint stored in 
HDFS. 

I set a wrong checkpoint path and something unexpected happened: the YARN AM 
restarted again and again. Since we had already configured a restart strategy 
to prevent endless restarts, it should have restarted only a limited number of 
times.

After confirming that the restart strategy itself works, we dug into the source 
code and made a change, mainly in _ClusterEntrypoint_.

 
{code:java}
// before
@Override
public void onFatalError(Throwable exception) {
   LOG.error("Fatal error occurred in the cluster entrypoint.", exception);

   System.exit(RUNTIME_FAILURE_RETURN_CODE);
}

// after
@Override
public void onFatalError(Throwable exception) {
   LOG.error("Fatal error occurred in the cluster entrypoint.", exception);

   // PerJobFatalException is the flag.
   // In per-job mode, some fatal exceptions make the AM restart forever
   // instead of failing.
   if (ExceptionUtils.findThrowable(exception, PerJobFatalException.class).isPresent()) {
      LOG.error("perjob fatal error");
      System.exit(STARTUP_FAILURE_RETURN_CODE);
   }
   System.exit(RUNTIME_FAILURE_RETURN_CODE);
}
{code}
We forced the exit code to be STARTUP_FAILURE_RETURN_CODE rather than 
RUNTIME_FAILURE_RETURN_CODE under certain conditions, and *it DID WORK*.
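
For anyone who wants to try the selection logic outside a cluster, the following is a minimal standalone sketch of the idea, not the actual Flink code. The class PerJobFatalException here is a hypothetical stand-in for the marker exception mentioned above, the findThrowable helper is a simplified local version of a cause-chain search, and the numeric values of the two constants are placeholders chosen for this sketch. It only shows how the exit code is picked; how YARN reacts to that code is outside its scope.

```java
import java.util.Optional;

// Hypothetical stand-in for the marker exception; not a class shipped with Flink.
class PerJobFatalException extends RuntimeException {
    PerJobFatalException(String message, Throwable cause) {
        super(message, cause);
    }
}

class ExitCodeSketch {
    // Placeholder values for this sketch only.
    static final int STARTUP_FAILURE_RETURN_CODE = 1;
    static final int RUNTIME_FAILURE_RETURN_CODE = 2;

    // Simplified local helper: walk the cause chain and return the first
    // throwable that is an instance of the requested type, if any.
    static <T extends Throwable> Optional<T> findThrowable(Throwable throwable, Class<T> type) {
        Throwable current = throwable;
        while (current != null) {
            if (type.isInstance(current)) {
                return Optional.of(type.cast(current));
            }
            current = current.getCause();
        }
        return Optional.empty();
    }

    // Pick the exit code: the startup failure code if the marker exception
    // appears anywhere in the cause chain, the runtime failure code otherwise.
    static int exitCodeFor(Throwable exception) {
        if (findThrowable(exception, PerJobFatalException.class).isPresent()) {
            return STARTUP_FAILURE_RETURN_CODE;
        }
        return RUNTIME_FAILURE_RETURN_CODE;
    }

    public static void main(String[] args) {
        // The marker is wrapped one level deep, as it might be when rethrown.
        Throwable wrapped = new RuntimeException(
            "outer", new PerJobFatalException("wrong checkpoint path", null));
        System.out.println("marker present: " + exitCodeFor(wrapped));
        System.out.println("marker absent:  " + exitCodeFor(new IllegalStateException("other")));
    }
}
```

Returning the code from a separate method, instead of calling System.exit directly, keeps the decision testable without terminating the JVM.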

After discussing with [~Tison], I learned that FAILURE_RETURN_CODE seems to be 
used only for debugging, so I submitted this issue and look forward to ANY 
solution.




> YARN AM endless restarts when using wrong checkpoint path or wrong checkpoint
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-14268
>                 URL: https://issues.apache.org/jira/browse/FLINK-14268
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.7.2
>         Environment: Flink: 1.7.2
> Deployment: YARN Per Job
> YARN: 2.7.2
> State backend: FsStateBackend with HDFS 
>  
>            Reporter: Lsw_aka_laplace
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
