[ https://issues.apache.org/jira/browse/FLINK-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrey Zagrebin resolved FLINK-14268.
-------------------------------------
    Resolution: Abandoned

Please reopen it if you still believe it is an issue and you can provide more information.

> YARN AM endless restarts when using wrong checkpoint path or wrong checkpoint
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-14268
>                 URL: https://issues.apache.org/jira/browse/FLINK-14268
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.7.2
>         Environment: Flink: 1.7.2
> Deployment: YARN Per Job
> YARN: 2.7.2
> State backend: FSStateBackend with HDFS
>
>            Reporter: Lsw_aka_laplace
>            Priority: Major
>
> I tried to start a streaming task and restore it from a checkpoint that was
> stored in HDFS.
> I set a wrong checkpoint path and something unexpected happened: the YARN AM
> restarted again and again. Since we had already configured a restart strategy
> to prevent endless restarts, it should have restarted only a limited number
> of times.
> Having verified that the restart strategy works, we dug into the source code
> and made a change, mainly in _ClusterEntrypoint_.
>
> {code:java}
> // code placeholder
> // before
> @Override
> public void onFatalError(Throwable exception) {
>     LOG.error("Fatal error occurred in the cluster entrypoint.", exception);
>     System.exit(RUNTIME_FAILURE_RETURN_CODE);
> }
>
> // after
> @Override
> public void onFatalError(Throwable exception) {
>     LOG.error("Fatal error occurred in the cluster entrypoint.", exception);
>
>     if (ExceptionUtils.findThrowable(exception, PerJobFatalException.class).isPresent()) {
>         // PerJobFatalException is the FLAG
>         // In per-job mode, when certain fatal exceptions occur, the AM keeps
>         // restarting and never fails
>         LOG.error("perjob fatal error");
>         System.exit(STARTUP_FAILURE_RETURN_CODE);
>     }
>     System.exit(RUNTIME_FAILURE_RETURN_CODE);
> }
> {code}
> We forced the exit code to be STARTUP_FAILURE_RETURN_CODE rather than
> RUNTIME_FAILURE_RETURN_CODE under certain conditions, and *it DID WORK*.
>
>
> After discussing with [~Tison], I learned that FAILURE_RETURN_CODE seems to
> be used only for debugging, so I submitted this issue and look forward to ANY
> solution.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)