Re: RocksDB local snapshot silently disappears and causes checkpoint to fail

Paul Lam Wed, 27 Mar 2019 23:34:27 -0700

Hi,

It turns out that under certain circumstances the RocksDB state backend 
mistakenly uses the default filesystem scheme, which is set to hdfs on the new 
cluster in my case.
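
To illustrate the general mechanism, here is a rough sketch in Python. It is 
not Flink's actual code: the function name `qualify`, the directory, and the 
namenode address are all made up for illustration. It only shows how a path 
with no scheme can end up pointing at a different filesystem depending on the 
configured default.

```python
from urllib.parse import urlparse


def qualify(path: str, default_fs: str) -> str:
    """Hypothetical helper: resolve a path against a default filesystem URI.

    A path that already names a scheme is returned unchanged; a path
    without one inherits the scheme and authority of `default_fs`.
    """
    if urlparse(path).scheme:
        return path
    base = urlparse(default_fs)
    return f"{base.scheme}://{base.netloc}/{path.lstrip('/')}"


# Made-up local snapshot directory, written without a scheme.
local_dir = "/data/yarn/local/flink-rocksdb/chk-42"

# Default scheme is the local filesystem: the path stays local.
print(qualify(local_dir, "file:///"))

# Default scheme is hdfs: the same string now refers to a path on HDFS.
print(qualify(local_dir, "hdfs://namenode:8020"))

# An explicit scheme is not affected by the default.
print(qualify("file://" + local_dir, "hdfs://namenode:8020"))
```

If a local directory is handled through the second form, operations meant for 
local disk are directed at HDFS instead, which would be consistent with a 
local directory appearing to vanish.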


I’ve filed a Jira to track this[1]. 

[1] https://issues.apache.org/jira/browse/FLINK-12042 

Best,
Paul Lam

> On 27 Mar 2019, at 19:06, Paul Lam <paullin3...@gmail.com> wrote:
> 
> Hi,
> 
> I’m using Flink 1.6.4 and recently ran into a weird issue with the RocksDB 
> state backend. A job that runs fine on one YARN cluster keeps failing at 
> checkpoint time after being migrated to a new one 
> (with almost everything the same but better machines), and even a clean 
> restart doesn’t help. 
> 
> The root cause is an IllegalStateException with no error message. The stack 
> trace shows that when the RocksDB state backend runs the async part of the 
> snapshot (runSnapshot), 
> it finds that the local snapshot directory created by RocksDB earlier 
> (takeSnapshot) does not exist. 
> 
> I added more logging to RocksDBKeyedStateBackend (see 
> attachment) and found that the local snapshot was taken as expected and the 
> .sst files were written, 
> but by the time the async task accessed the directory, the whole snapshot 
> directory was gone. 
> 
> What could possibly be the cause? Thanks a lot.
> 
> Best,
> Paul Lam
> 
> <rocksdb_illegal_state.log.md>
>
