Hi,

in my case neither gets deleted. The number of "completedCheckpoint<SomeID>" files in "high-availability.storageDir" keeps growing, and so does the number of directories under "state.backend.fs.checkpointdir/JobId/check-<INTEGER>".
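
For context, the relevant configuration keys look roughly like this (a sketch; the paths below are placeholders, not my actual values):

    # flink-conf.yaml (sketch)
    high-availability: zookeeper
    high-availability.storageDir: file:///mnt/dfs/flink/ha
    state.backend: filesystem
    state.backend.fs.checkpointdir: file:///mnt/dfs/flink/checkpoints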

In my case I have a Windows DFS filesystem mounted on Linux over the CIFS protocol.

Can you give me a hint or a description of which process is responsible for removing those files and directories?

Best regards


On 2018-04-02 at 15:58, Stephan Ewen wrote:
Can you clarify which one does not get deleted? The files in "high-availability.storageDir", the "state.backend.fs.checkpointdir/JobId/check-<INTEGER>" directories, or both?

Could you also tell us which file system you use?

There is a known issue in some versions of Flink where S3 "directories" are not deleted. This means that Hadoop's S3 marker files (the way Hadoop's s3n and s3a imitate directories in S3) are not deleted. This is fixed in Flink 1.5. A workaround for Flink 1.4 is to use "flink-s3-fs-presto", which does not use these marker files.
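
Roughly, that workaround amounts to the following (a sketch; the exact jar name depends on the Flink distribution):

    # copy the bundled Presto S3 filesystem from opt/ into lib/ on all nodes
    cp ./opt/flink-s3-fs-presto-*.jar ./lib/
    # then point the checkpoint/HA paths at s3:// URIs in flink-conf.yaml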


On Thu, Mar 29, 2018 at 8:49 PM, Szymon Szczypiński <simo...@poczta.fm> wrote:

    Thank you for your reply.

    But my problem is not that Flink doesn't remove files/directories
    after an incorrect job cancellation. My problem is different.
    Normally, when everything is fine and the job is in RUNNING state,
    each completed checkpoint creates a file named
    completedCheckpoint<SomeID> in "high-availability.storageDir" and a
    directory under "state.backend.fs.checkpointdir/JobId/check-<INTEGER>".
    When the next checkpoint completes, the previous file and directory
    are deleted, and that is fine, because I always keep only one
    checkpoint. But in my case, when the next checkpoint completes, the
    previous one is not deleted, even though the job is in RUNNING state.
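
    For reference, such a checkpointing setup typically looks roughly
    like this (a minimal sketch; the interval and path are placeholders,
    not my real values):

        import org.apache.flink.runtime.state.filesystem.FsStateBackend;
        import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

        public class CheckpointSetup {
            public static void main(String[] args) throws Exception {
                StreamExecutionEnvironment env =
                        StreamExecutionEnvironment.getExecutionEnvironment();
                // Trigger a checkpoint every 60 seconds; each completed
                // checkpoint writes a completedCheckpoint<SomeID> file to
                // high-availability.storageDir and a new directory under
                // state.backend.fs.checkpointdir/JobId/.
                env.enableCheckpointing(60_000);
                // Keep checkpoint data on the DFS mount (path is a placeholder).
                env.setStateBackend(new FsStateBackend("file:///mnt/dfs/flink/checkpoints"));
                // ... sources/operators/sinks go here, then:
                // env.execute("my-job");
            }
        }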

    Maybe you know why those files/dirs are not deleted.

    Best regards

    Szymon Szczypiński


    On 29.03.2018 11:23, Stephan Ewen wrote:
    Flink removes these files/directories only if you properly
    cancel the job. If you kill the processes (stop-cluster.sh), this
    looks like a failure, and Flink will try to recover. The recovery
    always starts from ZooKeeper, not from the DFS.

    The best way to prevent this is to
      - properly cancel jobs, not just kill the processes
      - use separate cluster IDs for the standalone clusters, so that
    the new cluster knows that it is not supposed to recover the
    previous jobs and checkpoints (see the sketch after this list)
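
    For example (a sketch; the job ID and cluster ID below are
    placeholders, and the exact config key may differ between Flink
    versions):

        # cancel a job cleanly instead of killing the processes
        bin/flink cancel <jobID>

        # flink-conf.yaml of the second standalone cluster
        high-availability.cluster-id: /cluster_two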



    On Wed, Mar 28, 2018 at 11:22 PM, Szymon Szczypiński
    <simo...@poczta.fm> wrote:

        Hi,

        I have a problem with Flink version 1.3.1.

        I have a standalone cluster with two JobManagers and four
        TaskManagers; as the DFS I use Windows highly available storage
        mounted via the CIFS protocol.

        Sometimes I start having a problem where Flink doesn't remove
        the job's checkpoint dirs and the completedCheckpoint files from
        "high-availability.storageDir".

        To bring the cluster back to normal operation I need to remove
        all dirs from the DFS and start everything from the beginning.


        Maybe some Flink users have had the same problem. For now I
        have no idea how to bring the cluster back to normal operation
        without deleting the dirs from the DFS.

        I don't want to delete the dirs from the DFS because then I
        would need to redeploy all the jobs.


        Best regards

        Szymon Szczypiński




