/mnt/data is a local disk, so there shouldn’t be any additional latency. I’ll 
provide more information when/if this happens again.

Peter

From: Roman Khachatryan <ro...@apache.org>
Date: Tuesday, May 25, 2021 at 6:54 PM
To: Peter Westermann <no.westerm...@genesys.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Job recovery issues with state restoration
> I am not able to consistently reproduce this issue. It seems to only occur 
> when the failover happens at the wrong time. I have disabled task local 
> recovery and will report back if we see this again.

Thanks, please post any results here.

> The SST files are not the ones for task local recovery, those would be in a 
> different directory (we have configured io.tmp.dirs as /mnt/data/tmp).

Those files on /mnt could still be checked against the ones in the
checkpoint directories (on S3/DFS); the sizes should match.

I'm also curious why you place the local recovery files on a remote FS
(I assume /mnt/data/tmp is a remote FS or a persistent volume).
Currently, if a TM is lost (e.g. the process dies), those files cannot
be used and recovery will fall back to S3/DFS, so this probably
incurs some unnecessary IO/latency.
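
As a rough sketch (the directory paths here are only examples, not your
actual setup), keeping local recovery state on a node-local disk would
look like this in flink-conf.yaml; as far as I know, if
taskmanager.state.local.root-dirs is not set, the local recovery files
end up under the io.tmp.dirs directories:

  state.backend.local-recovery: true
  io.tmp.dirs: /mnt/data/tmp
  # example path on a node-local disk, not a network mount
  taskmanager.state.local.root-dirs: /mnt/data/local-recovery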

Regards,
Roman

On Tue, May 25, 2021 at 2:16 PM Peter Westermann
<no.westerm...@genesys.com> wrote:
>
> Hi Roman,
>
>
>
> I am not able to consistently reproduce this issue. It seems to only occur 
> when the failover happens at the wrong time. I have disabled task local 
> recovery and will report back if we see this again. We need incremental 
> checkpoints for our workload.
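>
> For reference, that change boils down to roughly the following in
> flink-conf.yaml (sketch only, everything else unchanged):
>
>   state.backend.local-recovery: false
>   state.backend.incremental: true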
>
> The SST files are not the ones for task local recovery, those would be in a 
> different directory (we have configured io.tmp.dirs as /mnt/data/tmp).
>
>
>
> Thanks,
>
> Peter
>
>
>
>
>
> From: Roman Khachatryan <ro...@apache.org>
> Date: Thursday, May 20, 2021 at 4:54 PM
> To: Peter Westermann <no.westerm...@genesys.com>
> Cc: user@flink.apache.org <user@flink.apache.org>
> Subject: Re: Job recovery issues with state restoration
>
> Hi Peter,
>
> Do you experience this issue when running without local recovery or
> without incremental checkpoints?
> Or have you maybe compared the local (on TM) and remote (on DFS) SST files?
>
> Regards,
> Roman
>
> On Thu, May 20, 2021 at 5:54 PM Peter Westermann
> <no.westerm...@genesys.com> wrote:
> >
> > Hello,
> >
> >
> >
> > I’ve previously reported issues with checkpoint recovery after a job failure
> > due to zookeeper connection loss, and I am still seeing these issues
> > occasionally.
> >
> > This is for Flink 1.12.3 with zookeeper for HA, S3 as the state backend, 
> > incremental checkpoints, and task-local recovery enabled.
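> >
> > Roughly, the relevant configuration looks like this (the bucket, paths,
> > and zookeeper quorum are placeholders, not the real values):
> >
> >   high-availability: zookeeper
> >   high-availability.zookeeper.quorum: <zk-host-1:2181,...>
> >   high-availability.storageDir: s3://<bucket>/flink/ha
> >   state.backend: rocksdb
> >   state.checkpoints.dir: s3://<bucket>/flink/checkpoints
> >   state.backend.incremental: true
> >   state.backend.local-recovery: true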
> >
> >
> >
> > Here’s what happened: A zookeeper instance was terminated as part of a
> > deployment for our zookeeper service, which triggered a new jobmanager leader
> > election (so far so good). A leader was elected and the job was restarted
> > from the latest checkpoint but never became healthy. The root exception and
> > the logs show issues reading state:
> >
> > o.r.RocksDBException: Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003579.sst. Size recorded in manifest 36718, actual size 2570
> > Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003573.sst. Size recorded in manifest 13756, actual size 1307
> > Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003575.sst. Size recorded in manifest 16278, actual size 1138
> > Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003576.sst. Size recorded in manifest 23108, actual size 1267
> > Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003577.sst. Size recorded in manifest 148089, actual size 1293
> >
> >     at org.rocksdb.RocksDB.open(RocksDB.java)
> >     at org.rocksdb.RocksDB.open(RocksDB.java:286)
> >     at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80)
> >     ... 22 common frames omitted
> > Wrapped by: java.io.IOException: Error while opening RocksDB instance.
> >     at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92)
> >     at o.a.f.c.s.s.r.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:145)
> >     at o.a.f.c.s.s.r.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOper...
> >
> >
> >
> > Since we retain multiple checkpoints, I tried redeploying the job from all
> > checkpoints that were still available. All those attempts led to similar
> > failures. (I eventually had to use an older savepoint to recover the job.)
> >
> > Any guidance for avoiding this would be appreciated.
> >
> >
> >
> > Peter
