[ 
https://issues.apache.org/jira/browse/FLINK-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Zagrebin resolved FLINK-14268.
-------------------------------------
    Resolution: Abandoned

Please reopen it if you still believe it is an issue and you can provide more 
information.

> YARN AM endless restarts when using wrong checkpoint path or wrong checkpoint
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-14268
>                 URL: https://issues.apache.org/jira/browse/FLINK-14268
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.7.2
>         Environment: Flink: 1.7.2
> Deployment: YARN Per Job
> YARN:2.7.2
> State backend:FSStateBackend with HDFS 
>  
>            Reporter: Lsw_aka_laplace
>            Priority: Major
>
> I tried to start a streaming job and restore it from a checkpoint stored in 
> HDFS. 
> I set a wrong checkpoint path and something unexpected happened: the YARN AM 
> restarted again and again. Since we had already configured a restart strategy 
> to prevent endless restarts, it should have restarted only a limited number 
> of times.
> After confirming that the restart strategy itself works, we dug into the 
> source code and made a change, mainly in _ClusterEntrypoint_.
>  
> {code:java}
> // before
> @Override
> public void onFatalError(Throwable exception) {
>    LOG.error("Fatal error occurred in the cluster entrypoint.", exception);
>    System.exit(RUNTIME_FAILURE_RETURN_CODE);
> }
> // after
> @Override
> public void onFatalError(Throwable exception) {
>    LOG.error("Fatal error occurred in the cluster entrypoint.", exception);
>
>    // PerJobFatalException is the flag: in per-job mode some fatal errors
>    // make the AM restart forever instead of failing.
>    if (ExceptionUtils.findThrowable(exception, PerJobFatalException.class).isPresent()) {
>       LOG.error("perjob fatal error");
>       System.exit(STARTUP_FAILURE_RETURN_CODE);
>    }
>    System.exit(RUNTIME_FAILURE_RETURN_CODE);
> }
> {code}
>  In certain cases we forced the exit code to be STARTUP_FAILURE_RETURN_CODE 
> rather than RUNTIME_FAILURE_RETURN_CODE, and *it DID WORK*.
>  
>  
> After discussing with [~Tison], I learned that the FAILURE_RETURN_CODE 
> values seem to be used only for debugging, so I submitted this issue and 
> look forward to ANY solution.
>  
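
The exit-code selection described in the quoted report can be sketched in isolation as below. This is only an illustration, not Flink's actual code: `PerJobFatalException` is the reporter's own marker class and is not part of Flink, the `findThrowable` helper here is a simplified stand-in for the `ExceptionUtils.findThrowable` call in the report, and the numeric values of the two return codes are assumptions. The sketch returns the code instead of calling `System.exit` so that it can be run and checked safely.

```java
import java.util.Optional;

final class ExitCodeSketch {
    // Assumed values for illustration only; check ClusterEntrypoint in your Flink version.
    static final int STARTUP_FAILURE_RETURN_CODE = 1;
    static final int RUNTIME_FAILURE_RETURN_CODE = 2;

    /** Hypothetical marker exception, named after the one used in the report. */
    static final class PerJobFatalException extends RuntimeException {
        PerJobFatalException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Simplified stand-in for ExceptionUtils.findThrowable: walks the cause chain. */
    static <T extends Throwable> Optional<T> findThrowable(Throwable error, Class<T> type) {
        Throwable current = error;
        while (current != null) {
            if (type.isInstance(current)) {
                return Optional.of(type.cast(current));
            }
            current = current.getCause();
        }
        return Optional.empty();
    }

    /** Picks the exit code the patched onFatalError would pass to System.exit. */
    static int exitCodeFor(Throwable error) {
        return findThrowable(error, PerJobFatalException.class).isPresent()
                ? STARTUP_FAILURE_RETURN_CODE
                : RUNTIME_FAILURE_RETURN_CODE;
    }

    public static void main(String[] args) {
        // A marker exception wrapped inside another exception is still detected.
        Throwable badPath = new RuntimeException("wrapper",
                new PerJobFatalException("checkpoint path not found", null));
        Throwable other = new IllegalStateException("unrelated runtime failure");
        System.out.println(exitCodeFor(badPath));
        System.out.println(exitCodeFor(other));
    }
}
```

Only errors that carry the marker somewhere in their cause chain get the startup-failure code; everything else keeps the original runtime-failure code, which matches the "after" version in the report.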



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
