Hi Roman and Robert,
Thank you.
I have checked the code and the failed checkpoint deletion case. Yes,
Flink first deletes the metadata file and the operator state files, then
deletes the checkpoint directory, which by then is truly empty. The root cause of
the failed checkpoint deletion is the Hadoop
Hi,
I think Robert is right, state handles are deleted first, and then the
directory is deleted non-recursively.
If any exception occurs while removing the files, it is attached to the
other exception as a suppressed exception.
So Flink probably failed to delete some files, and then directory removal
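The order described above (delete the files first, then remove the directory non-recursively, carrying earlier failures along as suppressed exceptions) can be sketched on the local file system with java.nio.file. This is a hypothetical illustration, not Flink's code: `NonRecursiveDiscard`, `discard` and `demo` are made-up names, and a non-empty subdirectory stands in for a file that could not be deleted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class NonRecursiveDiscard {

    // Hypothetical helper (not Flink code): delete every entry first, then
    // the directory itself, non-recursively.
    static void discard(Path dir, List<Path> entries) throws IOException {
        IOException entryFailure = null;
        for (Path entry : entries) {
            try {
                Files.deleteIfExists(entry);
            } catch (IOException e) {
                // Keep going; remember the failure instead of aborting.
                if (entryFailure == null) {
                    entryFailure = e;
                } else {
                    entryFailure.addSuppressed(e);
                }
            }
        }
        try {
            // Non-recursive delete: fails if any entry survived the loop above.
            Files.delete(dir);
        } catch (IOException e) {
            // The earlier failure explains why the directory is not empty.
            if (entryFailure != null) {
                e.addSuppressed(entryFailure);
            }
            throw e;
        }
        if (entryFailure != null) {
            throw entryFailure;
        }
    }

    // Demo: "sub" is a non-empty directory, so deleting it as an entry fails,
    // which in turn leaves the parent directory non-empty.
    static Throwable demo() {
        try {
            Path dir = Files.createTempDirectory("chk-");
            Path sub = Files.createDirectory(dir.resolve("sub"));
            Path nested = Files.createFile(sub.resolve("nested"));
            Path meta = Files.createFile(dir.resolve("_metadata"));
            Throwable result = null;
            try {
                discard(dir, List.of(meta, sub));
            } catch (IOException e) {
                result = e;
            }
            // Clean up whatever the failed discard left behind.
            Files.deleteIfExists(nested);
            Files.deleteIfExists(sub);
            Files.deleteIfExists(dir);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Throwable t = demo();
        System.out.println("top-level: " + t.getClass().getSimpleName());
        for (Throwable s : t.getSuppressed()) {
            System.out.println("suppressed: " + s.getClass().getSimpleName());
        }
    }
}
```

On a POSIX system the top-level exception should be the directory-not-empty failure, with the entry-deletion failure attached as a suppressed exception, which is the detail worth looking for in the logs.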
Hi Robert,
When `delete(Path f, boolean recursive)` is called with `recursive` set to
false, HDFS throws an exception like the one below:
[image: checkpoint-exception.png]
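For comparison, java.nio.file.Files.delete is always non-recursive for directories. The sketch below is a local-disk analogue only: it does not exercise HDFS or Hadoop's FileSystem.delete, and the class and method names are hypothetical. It shows that such a delete succeeds on an empty directory and fails only when something is still inside.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NonRecursiveDeleteProbe {

    // Returns "deleted", or the simple name of the exception thrown by a
    // non-recursive delete of dir.
    static String tryDelete(Path dir) {
        try {
            Files.delete(dir); // never recursive for directories
            return "deleted";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    static String[] demo() {
        try {
            Path empty = Files.createTempDirectory("empty-");
            Path full = Files.createTempDirectory("full-");
            Path file = Files.createFile(full.resolve("state"));
            String emptyResult = tryDelete(empty);
            String fullResult = tryDelete(full);
            // Clean up the directory that could not be removed.
            Files.delete(file);
            Files.delete(full);
            return new String[] {emptyResult, fullResult};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println("empty dir: " + r[0]);
        System.out.println("non-empty dir: " + r[1]);
    }
}
```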
Yours sincerely
Josh
On Thu, Nov 12, 2020 at 4:29 PM Robert Metzger wrote:
Hey Josh,
As far as I understand the code in CompletedCheckpoint.discard(), Flink first
removes all the files in StateUtil.bestEffortDiscardAllStateObjects and then
deletes the directory.
Which files are left over in your case?
Do you see any exceptions on the TaskManagers?
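One way to answer the "which files are left over" question is to list the directory's contents whenever the non-recursive delete fails. A minimal sketch on the local file system, assuming java.nio.file rather than Hadoop's FileSystem API; `LeftoverReport`, `deleteOrListLeftovers` and the `orphaned-state` file name are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LeftoverReport {

    // Hypothetical diagnostic: attempt a non-recursive delete and, if the
    // directory is not empty, return the names of the entries still in it.
    static List<String> deleteOrListLeftovers(Path dir) throws IOException {
        try {
            Files.delete(dir);
            return List.of(); // directory was empty and is now gone
        } catch (DirectoryNotEmptyException e) {
            // Files.list returns a stream that must be closed.
            try (Stream<Path> entries = Files.list(dir)) {
                return entries.map(p -> p.getFileName().toString())
                              .sorted()
                              .collect(Collectors.toList());
            }
        }
    }

    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("chk-");
            Path state = Files.createFile(dir.resolve("orphaned-state"));
            List<String> leftovers = deleteOrListLeftovers(dir);
            // Clean up after the demo.
            Files.delete(state);
            Files.delete(dir);
            return leftovers;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("left over: " + demo());
    }
}
```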
Best,
Robert
On Wed, Nov
Hi
When a checkpoint should be deleted,
FsCompletedCheckpointStorageLocation.disposeStorageLocation is called.
Inside it, fs.delete(exclusiveCheckpointDir, false) performs the deletion.
Why is the recursive parameter set to false, given that
exclusiveCheckpointDir is truly a