Hi,

It turns out that under certain circumstances the RocksDB state backend 
mistakenly uses the default filesystem scheme instead of the local filesystem. 
In my case the default scheme on the new cluster is set to hdfs.
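To illustrate the general failure mode, here is a rough sketch. It is not Flink's actual code. When a path carries no scheme, it is qualified against the default filesystem. If that default is `hdfs://`, a directory that was meant to be local ends up pointing at HDFS. The class `DefaultSchemeDemo`, the helper `qualify` and the `hdfs://namenode:8020/` URI are hypothetical and only stand in for the cluster's default filesystem setting.

```java
import java.net.URI;

public class DefaultSchemeDemo {

    // Hypothetical helper: qualifies a path against a default filesystem URI,
    // roughly the way a scheme-less path inherits the cluster default.
    static URI qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri; // explicit scheme (e.g. file://): left untouched
        }
        // No scheme: inherit scheme and authority from the default filesystem.
        return defaultFs.resolve(uri);
    }

    public static void main(String[] args) {
        // Assumed cluster default, standing in for an hdfs default filesystem.
        URI defaultFs = URI.create("hdfs://namenode:8020/");

        // A "local" directory given without a scheme resolves to HDFS ...
        System.out.println(qualify("/tmp/flink/rocksdb/chk-1", defaultFs));
        // prints hdfs://namenode:8020/tmp/flink/rocksdb/chk-1

        // ... while an explicit file:// URI stays on the local filesystem.
        System.out.println(qualify("file:///tmp/flink/rocksdb/chk-1", defaultFs));
        // prints file:///tmp/flink/rocksdb/chk-1
    }
}
```

The sketch shows why such a bug can be cluster-dependent. The same job behaves correctly wherever the default scheme happens to be `file://`.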

I’ve filed a Jira issue to track this [1]. 

[1] https://issues.apache.org/jira/browse/FLINK-12042 

Best,
Paul Lam

> On 27 March 2019, at 19:06, Paul Lam <paullin3...@gmail.com> wrote:
> 
> Hi,
> 
> I’m using Flink 1.6.4 and recently ran into a weird issue with the RocksDB 
> state backend. A job that runs fine on one YARN cluster keeps failing on 
> checkpoints after being migrated to a new cluster 
> (almost everything is the same, just better machines), and even a clean 
> restart doesn’t help. 
> 
> The root cause is an IllegalStateException with no error message. The stack 
> trace shows that when the RocksDB state backend runs the asynchronous part of 
> the snapshot (runSnapshot), it finds that the local snapshot directory created 
> earlier by RocksDB (takeSnapshot) does not exist. 
> 
> I added more logging to RocksDBKeyedStateBackend (see attachment) and found 
> that the local snapshot was performed as expected and the .sst files were 
> written, but by the time the async task accessed the directory, the whole 
> snapshot directory was gone. 
> 
> What could possibly be the cause? Thanks a lot.
> 
> Best,
> Paul Lam
> 
> <rocksdb_illegal_state.log.md>
> 
