Till,

Thank you for escalating this to blocker. I agree that data loss is always a 
serious issue.

For reference, the workaround is to unchain the stateful operators. To allow the 
new job to recover from the previous checkpoint, we also had to change the UID of 
the operator that was missing state and restore with the allowNonRestoredState 
option. Otherwise, the restore would fail with RocksDB errors.
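In case it helps anyone hitting the same problem, here is a rough sketch of what we did. This is illustrative only: the operator names, UIDs, and paths are placeholders, not from our actual job.

```java
// Sketch of the workaround, assuming a job with a stateful mapper.
// All names, UIDs, and paths below are illustrative placeholders.
DataStream<Event> events = env
    .addSource(new MySource())
    .uid("my-source");

events
    .keyBy(e -> e.getKey())
    .map(new StatefulMapper())
    .uid("stateful-mapper-v2")  // new UID so the operator no longer matches the bad state
    .disableChaining();         // unchain the stateful operator

// Then resume from the checkpoint, telling Flink to skip state that no
// longer maps to any operator:
//   flink run -s <checkpointPath> --allowNonRestoredState my-job.jar
```

Without --allowNonRestoredState, the restore fails because the old operator's state has no matching UID in the new job graph.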

—
Ning

> On Apr 24, 2019, at 5:02 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> 
> Thanks for reporting this issue Ning. I think this is actually a blocker for 
> the next release and should be fixed right away. For future reference here is 
> the issue [1].
> 
> I've also pulled in Stefan who knows these components very well.
> 
> [1] https://issues.apache.org/jira/browse/FLINK-12296
> 
> Cheers,
> Till
> 
>> On Tue, Apr 23, 2019 at 5:24 PM Ning Shi <nings...@gmail.com> wrote:
>> On Tue, 23 Apr 2019 10:53:52 -0400,
>> Congxian Qiu wrote:
>> > Sorry for being misleading. In the previous email, I just wanted to say the 
>> > problem is not caused by the UUID generation; it is caused by different 
>> > operators sharing the same directory (because currently Flink uses the 
>> > JobVertex ID as the directory name)
>> 
>> Ah, thank you for the clarification, Congxian. That makes sense.
>> 
>> Ning
