Re: FlinkCEP, circular references and checkpointing failures

Shailesh Jain Wed, 26 Sep 2018 23:23:13 -0700

Hi Dawid,

Yes, it is version 1.4.2. We are running vanilla flink, but have added a
couple of changes in the CEP operator specifically (top 3 commits here:
https://github.com/jainshailesh/flink/commits/poc_on_1.4.2). Changes I've
made to CEP operators do not touch the checkpointing path, just overloading
the operator for a specific way of handling event time.


We are hitting this in production, so I'm not sure it'll be feasible to
move to 1.6.0 immediately, but eventually yes.

Thanks,
Shailesh

On Wed, Sep 26, 2018 at 5:44 PM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:

> Hi Shailesh,
>
> Are you sure you are using version 1.4.2? Do you run a vanilla flink, or
> have you introduced some changes? I am asking cause the lines in stacktrace
> does not align with the source code for 1.4.2.
>
> Also it is a different exception than the one in the issue you've linked,
> so if it is a problem than it is definitely a different one. Last thing I
> would recommend upgrading to the newest version, as we rewritten the
> SharedBuffer implementation in 1.6.0.
>
> Best,
>
> Dawid
>
> On 26/09/18 13:50, Shailesh Jain wrote:
>
> Hi,
>
> I think I've hit this same issue on a 3 node standalone cluster (1.4.2)
> using HDFS (2.8.4) as state backend.
>
> 2018-09-26 17:07:39,370 INFO
> org.apache.flink.runtime.taskmanager.Task                     - Attempting
> to fail task externally SelectCepOperator (1/1)
> (3bec4aa1ef2226c4e0c5ff7b3860d340).
> 2018-09-26 17:07:39,370 INFO
> org.apache.flink.runtime.taskmanager.Task                     -
> SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched from
> RUNNING to FAILED.
> AsynchronousException{java.lang.Exception: Could not materialize
> checkpoint 6 for operator SelectCepOperator (1/1).}
>     at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
>     at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.Exception: Could not materialize checkpoint 6 for
> operator SelectCepOperator (1/1).
>     ... 6 more
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.NullPointerException
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>     at
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>     at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>     ... 5 more
>     Suppressed: java.lang.Exception: Could not properly cancel managed
> keyed state future.
>         at
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91)
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976)
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939)
>         ... 5 more
>     Caused by: java.util.concurrent.ExecutionException:
> java.lang.NullPointerException
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>         at
> org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66)
>         at
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89)
>         ... 7 more
>     Caused by: java.lang.NullPointerException
>         at
> org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954)
>         at
> org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825)
>         at
> org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888)
>         at
> org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820)
>         at
> org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196)
>         at
> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390)
>         at
> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339)
>         at
> org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>         ... 5 more
>     [CIRCULAR REFERENCE:java.lang.NullPointerException]
>
> Any ideas on why I'm hitting this especially when this (
> https://issues.apache.org/jira/browse/FLINK-7756) says it has been fixed
> in 1.4.2 ?
>
> On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio <
> federico.dambro...@smartlab.ws> wrote:
>
>> Thank you very much for your steady response, Kostas!
>>
>> Cheers,
>> Federico
>>
>> 2017-11-03 16:26 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>
>>> Hi Federico,
>>>
>>> Thanks for trying it out!
>>> Great to hear that your problem was fixed!
>>>
>>> The feature freeze for the release is going to be next week, and I would
>>> expect 1 or 2 more weeks testing.
>>> So I would say in 2.5 weeks. But this is of course subject to potential
>>> issues we may find during testing.
>>>
>>> Cheers,
>>> Kostas
>>>
>>> On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio <
>>> federico.dambro...@smartlab.ws> wrote:
>>>
>>> Hi Kostas,
>>>
>>> I just tried running the same job with 1.4-SNAPSHOT for 10 minutes and
>>> it didn't crash, so that was the same underlying issue of the JIRA you
>>> linked.
>>>
>>> Do you happen to know when it's expected the 1.4 stable release?
>>>
>>> Thank you very much,
>>> Federico
>>>
>>> 2017-11-03 15:25 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>
>>>> Perfect! thanks a lot!
>>>>
>>>> Kostas
>>>>
>>>> On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio <
>>>> federico.dambro...@smartlab.ws> wrote:
>>>>
>>>> Hi Kostas,
>>>>
>>>> yes, I'm using 1.3.2. I'll try the current master and I'll get back to
>>>> you.
>>>>
>>>> 2017-11-03 15:21 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>
>>>> :
>>>>
>>>>> Hi Federico,
>>>>>
>>>>> I assume that you are using Flink 1.3, right?
>>>>>
>>>>> In this case, in 1.4 we have fixed a bug that seems similar to your
>>>>> case:
>>>>> https://issues.apache.org/jira/browse/FLINK-7756
>>>>>
>>>>> Could you try the current master to see if it fixes your problem?
>>>>>
>>>>> Thanks,
>>>>> Kostas
>>>>>
>>>>> On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio <
>>>>> federico.dambro...@smartlab.ws> wrote:
>>>>>
>>>>>  Could not find id for
>>>>> entry:
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Federico D'Ambrosio
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Federico D'Ambrosio
>>>
>>>
>>>
>>
>>
>> --
>> Federico D'Ambrosio
>>
>
>

Re: FlinkCEP, circular references and checkpointing failures

Reply via email to