[ https://issues.apache.org/jira/browse/FLINK-12219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
lamber-ken reopened FLINK-12219:
--------------------------------

> Yarn application can't stop when flink job failed in per-job yarn cluster mode
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-12219
>                 URL: https://issues.apache.org/jira/browse/FLINK-12219
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Runtime / REST
>    Affects Versions: 1.6.3, 1.8.0
>            Reporter: lamber-ken
>            Assignee: lamber-ken
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: fix-bug.patch, image-2019-04-17-15-00-40-687.png,
>                      image-2019-04-17-15-02-49-513.png, image-2019-04-23-17-37-00-081.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> h3. *Issue detail info*
> In our Flink (1.6.3) production environment, I often see the YARN application
> fail to stop after a Flink job fails in per-job YARN cluster mode, so I
> analyzed why this happens.
> When a Flink job fails, the system writes an archive file to a FileSystem
> through the +MiniDispatcher#archiveExecutionGraph+ method and then notifies
> YarnJobClusterEntrypoint to shut down. But if
> +MiniDispatcher#archiveExecutionGraph+ throws an exception during execution,
> it affects the calls that follow, including the shutdown notification.
> So I opened
> [FLINK-12247|https://issues.apache.org/jira/projects/FLINK/issues/FLINK-12247]
> to fix the NPE that occurs when the system writes the archive to the
> FileSystem. But we still need to consider other exceptions, so we should
> catch Exception / Throwable, not just IOException.
> h3. *Flink yarn job fail flow*
> !image-2019-04-23-17-37-00-081.png!
> h3. *Flink yarn job fail on yarn*
> !image-2019-04-17-15-00-40-687.png!
>
> h3. *Flink yarn application can't stop*
> !image-2019-04-17-15-02-49-513.png!
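
To illustrate the failure mode described in the issue, here is a minimal, self-contained sketch. It is not Flink's actual code: the `Archiver` interface and the `onJobFailedNarrow` / `onJobFailedBroad` methods are hypothetical stand-ins for the "archive, then notify shutdown" sequence. The point is only that a handler catching `IOException` alone lets an unchecked exception (such as an NPE) escape before the shutdown call, while catching `Throwable` lets the shutdown call run regardless.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ArchiveThenShutdownSketch {

    /** Hypothetical stand-in for the archiving step; not Flink's actual interface. */
    public interface Archiver {
        void archive() throws IOException;
    }

    /** Catches only IOException: any other failure skips the shutdown call. */
    public static void onJobFailedNarrow(Archiver archiver, Runnable shutDown) {
        try {
            archiver.archive();
        } catch (IOException e) {
            System.err.println("Could not archive: " + e);
        }
        shutDown.run();
    }

    /** Catches Throwable: the shutdown call runs whatever the archiver throws. */
    public static void onJobFailedBroad(Archiver archiver, Runnable shutDown) {
        try {
            archiver.archive();
        } catch (Throwable t) {
            System.err.println("Could not archive: " + t);
        }
        shutDown.run();
    }

    public static void main(String[] args) {
        // Simulated archiver that fails with an unchecked exception.
        Archiver failing = () -> { throw new NullPointerException("simulated"); };

        AtomicBoolean narrowStopped = new AtomicBoolean(false);
        try {
            onJobFailedNarrow(failing, () -> narrowStopped.set(true));
        } catch (RuntimeException e) {
            // The NPE escaped, so the shutdown callback was never reached.
        }
        System.out.println("narrow catch, shutdown called: " + narrowStopped.get());

        AtomicBoolean broadStopped = new AtomicBoolean(false);
        onJobFailedBroad(failing, () -> broadStopped.set(true));
        System.out.println("broad catch, shutdown called: " + broadStopped.get());
    }
}
```

Running `main` prints `false` for the narrow handler and `true` for the broad one. Whether the real fix should catch `Exception` or `Throwable`, and where exactly the catch belongs, is a design decision for the actual patch; the sketch only shows why the wider catch keeps the shutdown path reachable.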