It is nice to see that we converge on the issues we find.

That means this is getting pretty stable :-)

On Tue, Jan 19, 2016 at 8:17 PM, Stephan Ewen <se...@apache.org> wrote:

> Yeah, we saw this as well this morning, in a job that triggers checkpoints
> very frequently (every 50 ms).
>
> I think we have a good fix figured out, let's solve this for 1.0...
>
> On Tue, Jan 19, 2016 at 3:25 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> I just got back to this issue. The problem wasn't with the locking but that
>> the StreamTask wasn't in the running state before the first checkpoint
>> trigger message.
>> I actually just saw your JIRA as well, funny... :)
>>
>> Regards,
>> Gyula
>>
>> Stephan Ewen <se...@apache.org> wrote (on Fri, Jan 8, 2016, 15:36):
>>
>> > Hmm, strange issue indeed.
>> >
>> > So, checkpoints are definitely triggered (log message by coordinator to
>> > trigger checkpoint) but are not completing?
>> > Can you check which is the first checkpoint to complete? Is it Checkpoint
>> > 1, or a later one (indicating that checkpoint 1 was somehow subsumed)?
>> >
>> > Can you check in the stack trace on which lock the checkpoint runnables
>> > are waiting, and who is holding that lock?
>> >
>> > Two thoughts:
>> >
>> > 1) What I mistakenly did once in one of my tests was to have the sleep()
>> > in a downstream task. That would simply prevent the quickly generated data
>> > elements (and the inline checkpoint barriers) from passing through and
>> > completing the checkpoint.
>> >
>> > 2) Is this another issue with the non-fair lock? Does the checkpoint
>> > runnable simply not get the lock before the checkpoint? Not sure why it
>> > would suddenly work after the failure. We could try replacing the lock
>> > Object with a "ReentrantLock(true)" and see what happens.
>> >
>> >
>> > Stephan
>> >
>> >
>> > On Fri, Jan 8, 2016 at 11:49 AM, Gyula Fóra <gyf...@apache.org> wrote:
>> >
>> > > Hey,
>> > >
>> > > I have encountered a weird issue in a checkpointing test I am trying to
>> > > write. The logic is the same as in the previous checkpointing tests:
>> > > there is a OnceFailingReducer.
>> > >
>> > > My problem is that before the reducer fails, my job cannot take any
>> > > snapshots. The Runnables executing the checkpointing logic in the sources
>> > > keep waiting on some lock.
>> > >
>> > > After the failure and the restart, everything is fine and the
>> > > checkpointing can succeed properly.
>> > >
>> > > Also, if I remove the failure from the reducer, the job doesn't take any
>> > > snapshots (waiting on lock) and then finishes.
>> > >
>> > > Here is the code:
>> > >
>> > >
>> > > https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83
>> > >
>> > > I assume there is no problem with the source, as the Thread.sleep(..) is
>> > > outside of the synchronized block (and, as I said, after the failure it
>> > > works fine).
>> > >
>> > > Any ideas?
>> > >
>> > > Thanks,
>> > > Gyula
>> > >
>> >
>>
>
>
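The fair-lock experiment suggested in the Jan 8 message (swapping the lock Object for a "ReentrantLock(true)") could look roughly like the sketch below. This is plain Java and not Flink code: the class FairLockSketch and the methods emitLoop and runSnapshots are hypothetical names made up for illustration, and the "snapshot" is just a read of a counter. The thread later concludes that locking was not the cause, so this only shows what the experiment itself would involve.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a busy emitter thread and a snapshot loop
// competing for one lock, as in the experiment proposed in the thread.
public class FairLockSketch {
    // 'true' requests a fair lock: the longest-waiting thread acquires next.
    private final ReentrantLock lock;
    private volatile boolean running = true;
    private long emitted = 0;       // guarded by lock
    private long lastSnapshot = 0;  // guarded by lock

    public FairLockSketch(boolean fair) {
        this.lock = new ReentrantLock(fair);
    }

    public boolean isFair() {
        return lock.isFair();
    }

    // Stand-in for a source that emits elements as fast as it can.
    private void emitLoop() {
        while (running) {
            lock.lock();
            try {
                emitted++;
            } finally {
                lock.unlock();
            }
        }
    }

    // Takes 'count' snapshots while the emitter is running and
    // returns how many snapshots were taken.
    public int runSnapshots(int count) {
        Thread emitter = new Thread(this::emitLoop, "emitter");
        emitter.start();
        int taken = 0;
        for (int i = 0; i < count; i++) {
            lock.lock();
            try {
                lastSnapshot = emitted; // consistent read under the lock
                taken++;
            } finally {
                lock.unlock();
            }
        }
        running = false;
        try {
            emitter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return taken;
    }

    public static void main(String[] args) {
        FairLockSketch sketch = new FairLockSketch(true);
        System.out.println("fair=" + sketch.isFair()
                + ", snapshots=" + sketch.runSnapshots(5));
    }
}
```

Passing false to the constructor gives the default non-fair behaviour, so the two variants can be compared by watching how long the snapshot loop waits.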

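For reference, the pattern the original message describes (Thread.sleep(..) placed outside of the synchronized block) can be sketched as follows. This is a minimal stand-alone illustration, not the code from the linked test: ThrottledSourceSketch, checkpointLock, and snapshot() are hypothetical names, and a plain Object stands in for whatever lock the real source uses.

```java
// Hypothetical sketch of a throttled source that only holds the lock
// while it updates its state, and sleeps without holding it.
public class ThrottledSourceSketch {
    private final Object checkpointLock = new Object(); // assumed stand-in lock
    private long counter = 0; // guarded by checkpointLock

    // Emits 'elements' items, pausing 'pauseMillis' between them.
    public void run(int elements, long pauseMillis) {
        for (int i = 0; i < elements; i++) {
            synchronized (checkpointLock) {
                counter++; // state update happens under the lock
            }
            try {
                // Sleeping here, outside the synchronized block, leaves the
                // lock free so a snapshot thread can acquire it in between.
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Stand-in for the checkpointing logic: reads state under the same lock.
    public long snapshot() {
        synchronized (checkpointLock) {
            return counter;
        }
    }

    public static void main(String[] args) {
        ThrottledSourceSketch source = new ThrottledSourceSketch();
        source.run(3, 10);
        System.out.println("snapshot=" + source.snapshot());
    }
}
```

If the sleep were moved inside the synchronized block instead, the lock would be held for almost the entire run and a concurrent snapshot() call would spend most of its time waiting.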