Hi Peter,

Do you still experience this issue when running with local recovery or
incremental checkpoints disabled?
Or have you maybe compared the local (on the TM) and remote (on DFS) SST files?
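
For the first question, here is a minimal sketch of how to disable both for a
test run. I'm assuming the 1.12 option names (state.backend.local-recovery and
state.backend.incremental) and using a local environment purely for
illustration; on your cluster these would normally go into flink-conf.yaml:

    import org.apache.flink.configuration.CheckpointingOptions;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class NoIncrementalNoLocalRecoveryJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // equivalent to "state.backend.local-recovery: false" in flink-conf.yaml
            conf.setBoolean(CheckpointingOptions.LOCAL_RECOVERY, false);
            // equivalent to "state.backend.incremental: false" in flink-conf.yaml
            conf.setBoolean(CheckpointingOptions.INCREMENTAL_CHECKPOINTS, false);

            // Local environment only to show where the options plug in;
            // in a real deployment, set the two keys in the cluster config instead.
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironment(1, conf);

            // ... build the same pipeline as the production job and call env.execute()
        }
    }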

Regards,
Roman

On Thu, May 20, 2021 at 5:54 PM Peter Westermann
<no.westerm...@genesys.com> wrote:
>
> Hello,
>
>
>
> I’ve previously reported issues with checkpoint recovery after job failures
> caused by ZooKeeper connection loss. I am still seeing these issues
> occasionally.
>
> This is for Flink 1.12.3 with ZooKeeper for HA, the RocksDB state backend with
> S3 for checkpoint storage, incremental checkpoints, and task-local recovery enabled.
>
>
>
> Here’s what happened: a ZooKeeper instance was terminated as part of a
> deployment of our ZooKeeper service, which triggered a new JobManager leader
> election (so far so good). A leader was elected and the job was restarted
> from the latest checkpoint, but it never became healthy. The root exception
> and the logs show issues reading state:
>
> o.r.RocksDBException: Sst file size mismatch:
> /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003579.sst.
> Size recorded in manifest 36718, actual size 2570
> Sst file size mismatch:
> /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003573.sst.
> Size recorded in manifest 13756, actual size 1307
> Sst file size mismatch:
> /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003575.sst.
> Size recorded in manifest 16278, actual size 1138
> Sst file size mismatch:
> /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003576.sst.
> Size recorded in manifest 23108, actual size 1267
> Sst file size mismatch:
> /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003577.sst.
> Size recorded in manifest 148089, actual size 1293
>
>     at org.rocksdb.RocksDB.open(RocksDB.java)
>     at org.rocksdb.RocksDB.open(RocksDB.java:286)
>     at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80)
>     ... 22 common frames omitted
> Wrapped by: java.io.IOException: Error while opening RocksDB instance.
>     at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92)
>     at o.a.f.c.s.s.r.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:145)
>     at o.a.f.c.s.s.r.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOper...
>
>
>
> Since we retain multiple checkpoints, I tried redeploying the job from each
> checkpoint that was still available. All those attempts led to similar
> failures. (I eventually had to use an older savepoint to recover the job.)
>
> Any guidance on how to avoid this would be appreciated.
>
>
>
> Peter
