Re: Missing state in RocksDB checkpoints

Congxian Qiu Tue, 23 Apr 2019 07:54:42 -0700

Hi Ning,
Sorry for the confusion. In my previous email, I only meant that the problem 
is not caused by UUID generation; it is caused by different operators sharing 
the same directory (because Flink currently uses the JobVertex as the directory).
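
To illustrate the collision, here is a minimal, self-contained sketch. It is not Flink code: the class, the helper methods, and the directory naming are all hypothetical stand-ins. It only assumes what is described in this thread, namely that the directory is derived from the job vertex and subtask rather than the operator, and that the directory is deleted before each snapshot if it already exists.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SharedSnapshotDirSketch {

    // Hypothetical path logic: the directory depends only on the job vertex
    // and the subtask index, so chained operators resolve to the same path.
    static Path snapshotDir(Path root, String jobVertexId, int subtaskIndex) {
        return root.resolve("vtx_" + jobVertexId + "_sti_" + subtaskIndex);
    }

    // Hypothetical stand-in for the "delete first if it exists" step.
    static void prepareSnapshotDir(Path dir) throws IOException {
        if (Files.exists(dir)) {
            try (Stream<Path> paths = Files.walk(dir)) {
                // Delete children before parents.
                List<Path> ordered =
                        paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
                for (Path p : ordered) {
                    Files.delete(p);
                }
            }
        }
        Files.createDirectories(dir);
    }

    // Each operator prepares the directory and then writes its own state file.
    static void snapshot(Path dir, String operatorName) throws IOException {
        prepareSnapshotDir(dir);
        Files.write(dir.resolve(operatorName + ".state"),
                operatorName.getBytes(StandardCharsets.UTF_8));
    }

    // Two operators of the same vertex and subtask snapshot one after another;
    // returns the file names left in the shared directory.
    static List<String> runCollision() {
        try {
            Path root = Files.createTempDirectory("local-recovery-sketch");
            Path dir = snapshotDir(root, "vertexA", 0);
            snapshot(dir, "operator-1");
            snapshot(dir, "operator-2"); // wipes operator-1's file first
            try (Stream<Path> files = Files.list(dir)) {
                return files.map(p -> p.getFileName().toString())
                        .sorted()
                        .collect(Collectors.toList());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runCollision()); // prints [operator-2.state]
    }
}
```

In this sketch only the second operator's file survives and no error is raised, which matches the silent state loss you observed.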


Best, Congxian
On Apr 23, 2019, 19:41 +0800, Ning Shi <nings...@gmail.com>, wrote:
> Congxian,
>
> Thank you for creating the ticket and providing the relevant code. I’m 
> curious why you don’t think the directory collision is a problem. What we 
> observe is that one of the operator states is not included in the checkpoint 
> and data is lost on restore. That’s a pretty serious problem, especially when 
> Flink doesn’t generate any error in the log. People could be silently losing 
> state.
>
> Please let me know how I can best help diagnose this issue and drive the 
> ticket forward. I’m happy to collect any relevant information.
>
> Thanks,
>
> —
> Ning
>
> > On Apr 23, 2019, at 2:54 AM, Congxian Qiu <qcx978132...@gmail.com> wrote:
> >
> > From the log message you gave, the two operators share the same directory, 
> > and when a snapshot is taken, the directory is deleted first if it 
> > exists (RocksIncrementalSnapshotStrategy#prepareLocalSnapshotDirectory).
> >
> > I did not find an existing issue for this problem, and I don’t think it is 
> > a UUID generation problem. Please check the path generation logic 
> > in LocalRecoveryDirectoryProviderImpl#subtaskSpecificCheckpointDirectory.
> >
> > I’ve created an issue for this problem.
