Hi,

It turns out that under certain circumstances the RocksDB state backend mistakenly uses the default filesystem scheme, which is set to hdfs on the new cluster in my case.
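
To illustrate the general effect, here is a rough sketch of how a path without a scheme can end up pointing at the default filesystem rather than the local disk. It is not Flink's code: the `qualify` helper, the `hdfs://namenode:8020` address, and the directory name are all hypothetical, chosen only to show why a directory that exists locally may look missing once the path is resolved against hdfs.

```java
import java.net.URI;

public class DefaultSchemeSketch {

    // Hypothetical helper, not Flink's implementation: a path that carries no
    // scheme borrows the scheme and authority of the default filesystem.
    public static URI qualify(URI path, URI defaultFs) {
        if (path.getScheme() != null) {
            return path; // an explicit scheme is left untouched
        }
        return URI.create(
                defaultFs.getScheme() + "://" + defaultFs.getAuthority() + path.getPath());
    }

    public static void main(String[] args) {
        // Assumed values for illustration only.
        URI defaultFs = URI.create("hdfs://namenode:8020");
        URI bare = URI.create("/data/local-snapshot/chk-1");
        URI explicit = URI.create("file:///data/local-snapshot/chk-1");

        // The bare path is now looked up on HDFS, where the directory was never created.
        System.out.println(qualify(bare, defaultFs));
        // The path with an explicit file scheme still refers to the local disk.
        System.out.println(qualify(explicit, defaultFs));
    }
}
```

Under these assumptions, only the path without a scheme is affected, which would match a job behaving differently on two clusters whose default scheme differs.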

I’ve filed a Jira to track this [1].

[1] https://issues.apache.org/jira/browse/FLINK-12042

Best,
Paul Lam

> On Mar 27, 2019, at 19:06, Paul Lam <paullin3...@gmail.com> wrote:
>
> Hi,
>
> I’m using Flink 1.6.4 and recently ran into a weird issue with the RocksDB state backend. A job that runs fine on one YARN cluster keeps failing on checkpoints after being migrated to a new one (almost everything is the same, but with better machines), and even a clean restart doesn’t help.
>
> The root cause is an IllegalStateException with no error message. The stack trace shows that when the RocksDB state backend is doing the async part of a snapshot (runSnapshot), it finds that the local snapshot directory created by RocksDB earlier (takeSnapshot) does not exist.
>
> I added more logging to RocksDBKeyedStateBackend (see attachment) and found that the local snapshot was performed as expected and the .sst files were written, but when the async task accessed the directory, the whole snapshot directory was gone.
>
> What could possibly be the cause? Thanks a lot.
>
> Best,
> Paul Lam
>
> <rocksdb_illegal_state.log.md>