No worries, thanks for the update! It's good to hear that it worked for you.

Best regards,
Piotrek

wt., 13 paź 2020 o 22:43 Binh Nguyen Van <binhn...@gmail.com> napisał(a):

> Hi,
>
> Sorry for the late reply. It took me quite a while to change the JDK
> version to reproduce the issue. I confirmed that if I upgrade to a newer
> JDK version (I tried with JDK 1.8.0_265) the issue doesn’t happen.
>
> Thank you for helping
> -Binh
>
> On Fri, Oct 9, 2020 at 11:36 AM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hi Binh,
>>
>> Could you try upgrading Flink's Java runtime? It was previously reported
>> that upgrading to jdk1.8.0_251 was solving the problem.
>>
>> Piotrek
>>
>> pt., 9 paź 2020 o 19:41 Binh Nguyen Van <binhn...@gmail.com> napisał(a):
>>
>>> Hi,
>>>
>>> Thank you for helping me!
>>> The code is compiled on
>>>
>>> java version "1.8.0_161"
>>> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
>>>
>>> But I just checked our Hadoop and its Java version is
>>>
>>> java version "1.8.0_77"
>>> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
>>>
>>> Thanks
>>> -Binh
>>>
>>> On Fri, Oct 9, 2020 at 10:23 AM Piotr Nowojski <pnowoj...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> One more thing. It looks like it's not a Flink issue, but some JDK bug.
>>>> Others reported that upgrading JDK version (for example to  jdk1.8.0_251)
>>>> seemed to be solving this problem. What JDK version are you using?
>>>>
>>>> Piotrek
>>>>
>>>> pt., 9 paź 2020 o 17:59 Piotr Nowojski <pnowoj...@apache.org>
>>>> napisał(a):
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting the problem. I think this is a known issue [1] on
>>>>> which we are working to fix.
>>>>>
>>>>> Piotrek
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-18196
>>>>>
>>>>> pon., 5 paź 2020 o 08:54 Binh Nguyen Van <binhn...@gmail.com>
>>>>> napisał(a):
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a streaming job that is written in Apache Beam and uses Flink
>>>>>> as its runner. The job is working as expected for about 15 hours and then
>>>>>> it started to have checkpointing error. The error message looks like this
>>>>>>
>>>>>> java.lang.Exception: Could not perform checkpoint 910 for operator 
>>>>>> Source: <source-name> (8/60).
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>>>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: java.lang.NullPointerException
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>>>>>     at 
>>>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>>>>>     ... 11 more
>>>>>>
>>>>>> When this happened, I have to stop the job and then start it again,
>>>>>> and then 15 hours later the issue happens again.
>>>>>>
>>>>>> Here are some additional information
>>>>>>
>>>>>>    - Flink version is 1.10.1
>>>>>>    - Job reads data from Kafka, transform, and then writes to Kafka
>>>>>>    - There are 6 tasks with the parallelism of 60 each (each task
>>>>>>    reads from 1 Kafka topic)
>>>>>>    - The job is deployed to run on YARN with 60 task managers and
>>>>>>    each task manager has 1 slot
>>>>>>    - The State backend is filesystem and HDFS is the storage
>>>>>>    (Doesn’t seem to related to the type of state backend since the issue 
>>>>>> also
>>>>>>    happened when I use memory as the state backend)
>>>>>>    - The checkpointing interval is 60 seconds (The longest duration
>>>>>>    of the normal checkpoint as shown in Flink UI is 14 seconds)
>>>>>>    - The minimum pause between checkpoints is 30 seconds
>>>>>>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and
>>>>>>    principal are set in the Flink configuration file
>>>>>>
>>>>>> Can someone please help?
>>>>>>
>>>>>> Thanks
>>>>>> -Binh
>>>>>>
>>>>>

Reply via email to