[ https://issues.apache.org/jira/browse/FLINK-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15710512#comment-15710512 ]
Xiaogang Shi commented on FLINK-5214:
-------------------------------------
I opened FLINK-5086 to report a similar problem, but I do not yet have a good
idea of how to resolve it.
Because the JM does not know that these checkpoint files exist, it seems that
only the TM can delete them. However, a failed TM may not be recovered by the
JM once the number of retries exceeds the configured limit, so in such cases
these files are never deleted.
One possible solution, I think, is to let each TM hand a handler to the JM when
the TM registers. The JM can then use the handler to clean up the files even
after the TM has failed.
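A minimal Java sketch of the first idea, assuming each TM keeps its checkpoint files under one directory that the JM can also reach (for example on a shared file system). `CheckpointCleanupHandle`, `CleanupRegistry`, `registerTaskManager` and `onTaskManagerFailed` are hypothetical names for illustration, not existing Flink APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical handle a TM passes to the JM at registration time.
// It only records where the TM's checkpoint files live.
final class CheckpointCleanupHandle {
    private final Path checkpointDir;

    CheckpointCleanupHandle(Path checkpointDir) {
        this.checkpointDir = checkpointDir;
    }

    // Recursively deletes the checkpoint directory.
    // Assumes the JM can reach this path (shared storage).
    void cleanUp() throws IOException {
        if (!Files.exists(checkpointDir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(checkpointDir)) {
            // Reverse order puts every child before its parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.deleteIfExists(p);
        }
    }
}

// Hypothetical JM-side registry of cleanup handles, keyed by TM id.
final class CleanupRegistry {
    private final Map<String, CheckpointCleanupHandle> handles = new HashMap<>();

    void registerTaskManager(String tmId, CheckpointCleanupHandle handle) {
        handles.put(tmId, handle);
    }

    // Called once the JM gives up on a TM (e.g. retries exhausted).
    // Returns false if no handle was registered for this TM.
    boolean onTaskManagerFailed(String tmId) throws IOException {
        CheckpointCleanupHandle handle = handles.remove(tmId);
        if (handle == null) {
            return false;
        }
        handle.cleanUp();
        return true;
    }
}
```

The handle carries only the location of the files, so the JM needs no cooperation from the dead TM. The catch is the assumption above: files on the TM's local disk would not be reachable this way.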
Another solution is to recover the TM even when the number of retries exceeds
the limit. Once recovered, the only thing the TM does is clean up the
checkpoint files.
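The second idea could look roughly like this sketch. `RecoveryPolicy`, `TaskManagerMode` and `CleanupOnlyTaskManager` are hypothetical; the sketch only shows the decision to restart a TM past the retry limit in a cleanup-only mode, in which it deletes its local checkpoint directory and does nothing else.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical start-up modes for a recovered TM.
enum TaskManagerMode { NORMAL, CLEANUP_ONLY }

// Hypothetical JM-side policy: within the retry budget restart normally,
// past it restart only so the TM can clean up.
final class RecoveryPolicy {
    private final int maxRetries;

    RecoveryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    TaskManagerMode modeFor(int failedAttempts) {
        return failedAttempts <= maxRetries ? TaskManagerMode.NORMAL : TaskManagerMode.CLEANUP_ONLY;
    }
}

final class CleanupOnlyTaskManager {
    // In CLEANUP_ONLY mode, deletes the whole checkpoint directory and
    // returns the number of regular files removed; otherwise does nothing.
    static int run(TaskManagerMode mode, Path checkpointDir) throws IOException {
        if (mode != TaskManagerMode.CLEANUP_ONLY || !Files.exists(checkpointDir)) {
            return 0;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(checkpointDir)) {
            // Reverse order deletes children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        int deletedFiles = 0;
        for (Path p : paths) {
            boolean regular = Files.isRegularFile(p);
            Files.delete(p);
            if (regular) {
                deletedFiles++;
            }
        }
        return deletedFiles;
    }
}
```

Unlike the handler approach, this also covers files on the TM's local disk, at the cost of bringing a TM back up just to delete them.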
Do you have any better ideas?
> Clean up checkpoint files when failing checkpoint operation on TM
> -----------------------------------------------------------------
>
> Key: FLINK-5214
> URL: https://issues.apache.org/jira/browse/FLINK-5214
> Project: Flink
> Issue Type: Bug
> Components: TaskManager
> Affects Versions: 1.2.0, 1.1.3
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Fix For: 1.2.0, 1.1.4
>
>
> When the {{StreamTask#performCheckpoint}} operation fails on a
> {{TaskManager}}, potentially created checkpoint files are not cleaned up. This
> should be changed.
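For the failure path described in the issue itself, a hedged sketch of cleaning up inside the checkpoint operation: track every file created during the checkpoint, delete them all if any write fails, then rethrow. `CheckpointingTask`, `StateWriter` and the `chk-<id>-op-<n>` file naming are assumptions for illustration, not the actual `StreamTask#performCheckpoint` code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical writer for one operator's state; may fail mid-checkpoint.
interface StateWriter {
    void write(Path target) throws IOException;
}

final class CheckpointingTask {
    private final Path checkpointDir;

    CheckpointingTask(Path checkpointDir) {
        this.checkpointDir = checkpointDir;
    }

    // Writes one file per operator. If any write fails, deletes every file
    // created for this checkpoint so far and rethrows the original error.
    List<Path> performCheckpoint(long checkpointId, List<StateWriter> operators) throws IOException {
        List<Path> created = new ArrayList<>();
        try {
            for (int i = 0; i < operators.size(); i++) {
                Path file = checkpointDir.resolve("chk-" + checkpointId + "-op-" + i);
                // Track before writing so a partially written file is removed too.
                created.add(file);
                operators.get(i).write(file);
            }
            return created;
        } catch (IOException | RuntimeException e) {
            for (Path p : created) {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException cleanupFailure) {
                    // Keep the original cause; attach cleanup problems to it.
                    e.addSuppressed(cleanupFailure);
                }
            }
            throw e;
        }
    }
}
```

Cleaning up in the failing operation itself handles the common case locally; it does not cover a TM that dies before reaching the catch block, which is the gap the comment above is about.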