It is nice to see that we converge on the issues we find. Means that this is getting pretty stable :-)
On Tue, Jan 19, 2016 at 8:17 PM, Stephan Ewen <se...@apache.org> wrote: > Yeah, we saw this as well this morning, in a job that triggers checkpoints > super fast (50msecs). > > I think we have a good fix figured out, let's solve this for 1.0... > > On Tue, Jan 19, 2016 at 3:25 PM, Gyula Fóra <gyula.f...@gmail.com> wrote: > >> I just got back to this issue. The problem wasn't with the locking but >> that >> the StreamTask wasn't in running state before the first checkpoint trigger >> message. >> I actually just saw your JIRA as well, funny... :) >> >> Regards, >> Gyula >> >> Stephan Ewen <se...@apache.org> ezt írta (időpont: 2016. jan. 8., P, >> 15:36): >> >> > Hmm, strange issue indeed. >> > >> > So, checkpoints are definitely triggered (log message by coordinator to >> > trigger checkpoint) but are not completing? >> > Can you check which is the first checkpoint to complete? Is it >> Checkpoint >> > 1, or a later one (indicating that checkpoint 1 was somehow subsumed). >> > >> > Can you check in the stacktrace on which lock the checkpoint runables >> are >> > waiting, and who is holding that lock? >> > >> > Two thoughts: >> > >> > 1) What I mistakenly did once in one of my tests is to have the sleep() >> in >> > a downstream task. That would simply prevent the fast generated data >> > elements (and the inline checkpoint barriers) from passing though and >> > completing the checkpoint. >> > >> > 2) Is this another issue with the non-fair lock? Does the checkpoint >> > runnable simply not get the lock before the checkpoint. Not sure why it >> > would suddenly work after the failure. We could try and swap the lock >> > Object by a "ReentrantLock(true)" and see what would happen. >> > >> > >> > Stephan >> > >> > >> > On Fri, Jan 8, 2016 at 11:49 AM, Gyula Fóra <gyf...@apache.org> wrote: >> > >> > > Hey, >> > > >> > > I have encountered a weird issue in a checkpointing test I am trying >> to >> > > write. The logic is the same as with the previous checkpointing tests, >> > > there is a OnceFailingReducer. >> > > >> > > My problem is that before the reducer fails, my job cannot take any >> > > snapshots. The Runnables executing the checkpointing logic in the >> sources >> > > keep waiting on some lock. >> > > >> > > After the failure and the restart, everything is fine and the >> > checkpointing >> > > can succeed properly. >> > > >> > > Also if I remove the failure from the reducer, the job doesnt take any >> > > snapshots (waiting on lock) and the job will finish. >> > > >> > > Here is the code: >> > > >> > > >> > >> https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83 >> > > >> > > I assume there is no problem with the source as the Thread.sleep(..) >> is >> > > outside of the synchronized block. (and as I said after the failure it >> > > works fine). >> > > >> > > Any ideas? >> > > >> > > Thanks, >> > > Gyula >> > > >> > >> > >