Hmm, strange issue indeed.

So, checkpoints are definitely triggered (log message by coordinator to
trigger checkpoint) but are not completing?
Can you check which is the first checkpoint to complete? Is it Checkpoint
1, or a later one (indicating that checkpoint 1 was somehow subsumed).

Can you check in the stacktrace on which lock the checkpoint runables are
waiting, and who is holding that lock?

Two thoughts:

1) What I mistakenly did once in one of my tests is to have the sleep() in
a downstream task. That would simply prevent the fast generated data
elements (and the inline checkpoint barriers) from passing though and
completing the checkpoint.

2) Is this another issue with the non-fair lock? Does the checkpoint
runnable simply not get the lock before the checkpoint. Not sure why it
would suddenly work after the failure. We could try and swap the lock
Object by a "ReentrantLock(true)" and see what would happen.


Stephan


On Fri, Jan 8, 2016 at 11:49 AM, Gyula Fóra <gyf...@apache.org> wrote:

> Hey,
>
> I have encountered a weird issue in a checkpointing test I am trying to
> write. The logic is the same as with the previous checkpointing tests,
> there is a OnceFailingReducer.
>
> My problem is that before the reducer fails, my job cannot take any
> snapshots. The Runnables executing the checkpointing logic in the sources
> keep waiting on some lock.
>
> After the failure and the restart, everything is fine and the checkpointing
> can succeed properly.
>
> Also if I remove the failure from the reducer, the job doesnt take any
> snapshots (waiting on lock) and the job will finish.
>
> Here is the code:
>
> https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83
>
> I assume there is no problem with the source as the Thread.sleep(..) is
> outside of the synchronized block. (and as I said after the failure it
> works fine).
>
> Any ideas?
>
> Thanks,
> Gyula
>

Reply via email to