> The job sub-directory will be cleaned up when the job
> finishes, is canceled, or fails.
Since we can submit multiple jobs to a Flink session, what I mean is
that when a job
reaches a terminal state, its sub-node (e.g.
/flink/application_xxxx/running_job_registry/4d255397c7aeb5327adb567238c983c1)
in ZooKeeper will be cleaned up, but the root
directory (/flink/application_xxxx/) still exists.
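
If you want to verify what is left in ZooKeeper, here is a minimal
sketch using the plain ZooKeeper Java client (the connect string and
the /flink/application_xxxx path are placeholders for your setup):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListFlinkHaNodes {
    public static void main(String[] args) throws Exception {
        // Connect to the same ensemble Flink uses for HA.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
        // List the per-job sub-nodes under the cluster's job registry.
        List<String> jobs =
            zk.getChildren("/flink/application_xxxx/running_job_registry", false);
        jobs.forEach(System.out::println);
        zk.close();
    }
}

If the root directory is still there but the job's sub-node is gone,
that matches the behavior described above.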


For your current case, it is different (a per-job cluster). I think we
need to figure out why the only
running job reached a terminal state, for example because the restart
attempts were exhausted. In that case you
will find the following log line in your JobManager log.

"org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy"
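
For reference, that strategy is what the scheduler falls back to when
restarts are disabled or already exhausted. As a sketch (not
necessarily your actual configuration), the following setting would
suppress all restarts and produce exactly that exception on the first
failure:

restart-strategy: none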


Best,
Yang




Cristian <k...@fastmail.fm> wrote on Wed, Sep 9, 2020 at 11:26 AM:

> > The job sub-directory will be cleaned up when the job
> > finishes, is canceled, or fails.
>
> What does this mean?
>
> Also, to clarify: I'm a very sloppy developer. My jobs crash ALL the
> time... and yet, the jobs would ALWAYS resume from the last checkpoint.
>
> The only cases where I expect Flink to clean up the checkpoint data from
> ZK are when I explicitly stop or cancel the job (in those cases the job
> manager takes a savepoint before cleaning up ZK and shutting down the cluster).
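>
> (For reference, that explicit stop-with-savepoint looks like this from
> the CLI; the savepoint path and job id are placeholders:
>
> ./bin/flink stop -p s3://my-bucket/savepoints 4d255397c7aeb5327adb567238c983c1
>
> which triggers a savepoint first and only then releases the job's HA state.)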
>
> Which is not the case here. Flink was on autopilot and decided to
> wipe my poor, good checkpoint metadata, as the logs show.
>
> On Tue, Sep 8, 2020, at 7:59 PM, Yang Wang wrote:
>
> AFAIK, the HA data, including ZooKeeper metadata and the actual data on DFS,
> will only be cleaned up
> when the Flink cluster reaches a terminal state.
>
> So if you are using a session cluster, the root cluster node in ZK will be
> cleaned up after you manually
> stop the session cluster. The job sub-directory will be cleaned up when
> the job finishes, is canceled, or fails.
>
> If you are using a job/application cluster, once the only running job
> finishes or fails, all the HA data will
> be cleaned up. I think you need to check the job restart strategy you have
> set. For example, the following
> configuration will make the Flink cluster terminate after 10 restart attempts.
>
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.attempts: 10
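>
> The same limit can also be set programmatically, in case a job overrides
> the cluster configuration; a minimal sketch (the 10-second delay is an
> assumption, pick whatever fits your job):
>
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
> // After the 10th failed restart attempt the job reaches a terminal FAILED
> // state, and a per-job cluster will then clean up all of its HA data.
> env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Time.seconds(10)));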
>
>
> Best,
> Yang
>
Cristian <k...@fastmail.fm> wrote on Wed, Sep 9, 2020 at 12:28 AM:
>
>
> I'm using the standalone script to start the cluster.
>
> As far as I can tell, it's not easy to reproduce. We found that ZooKeeper
> lost a node around the time this happened, but all of our other 75 Flink
> jobs, which use the same setup, version, and ZooKeeper, didn't have any
> issues. They didn't even restart.
>
> So unfortunately I don't know how to reproduce this. All I know is I can't
> sleep. I have nightmares where my precious state is deleted. I wake up
> crying and quickly start manually savepointing all jobs just in case,
> because I feel the day of reckoning is near. Flinkpocalypse!
>
> On Tue, Sep 8, 2020, at 5:54 AM, Robert Metzger wrote:
>
> Thanks a lot for reporting this problem here, Cristian!
>
> I am not super familiar with the involved components, but the behavior you
> are describing doesn't sound right to me.
> Which entrypoint are you using? This is logged at the beginning, like this:
>
> "2020-09-08 14:45:32,807 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] -
> Starting StandaloneSessionClusterEntrypoint (Version: 1.11.1, Scala: 2.12,
> Rev:7eb514a, Date:2020-07-15T07:02:09+02:00)"
>
> Do you know by chance if this problem is reproducible? With
> the StandaloneSessionClusterEntrypoint I was not able to reproduce the
> problem.
>
>
>
>
> On Tue, Sep 8, 2020 at 4:00 AM Husky Zeng <568793...@qq.com> wrote:
>
> Hi Cristian,
>
>
> I don't know whether it was deliberately designed to behave like this,
> so I have submitted an issue and am waiting for someone to respond:
>
> https://issues.apache.org/jira/browse/FLINK-19154
>
>
>
>
>
>
