Hi,

I have a streaming job that is written in Apache Beam and uses Flink as its
runner. The job is working as expected for about 15 hours and then it
started to have checkpointing error. The error message looks like this

java.lang.Exception: Could not perform checkpoint 910 for operator
Source: <source-name> (8/60).
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
    at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
    at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
    at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
    ... 11 more

When this happened, I have to stop the job and then start it again, and
then 15 hours later the issue happens again.

Here are some additional information

   - Flink version is 1.10.1
   - Job reads data from Kafka, transform, and then writes to Kafka
   - There are 6 tasks with the parallelism of 60 each (each task reads
   from 1 Kafka topic)
   - The job is deployed to run on YARN with 60 task managers and each task
   manager has 1 slot
   - The State backend is filesystem and HDFS is the storage (Doesn’t seem
   to related to the type of state backend since the issue also happened when
   I use memory as the state backend)
   - The checkpointing interval is 60 seconds (The longest duration of the
   normal checkpoint as shown in Flink UI is 14 seconds)
   - The minimum pause between checkpoints is 30 seconds
   - Hadoop cluster is Kerberized but Kafka is not. Keytab and principal
   are set in the Flink configuration file

Can someone please help?

Thanks
-Binh

Reply via email to