[ https://issues.apache.org/jira/browse/KAFKA-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877136#comment-16877136 ]
Di Campo (xmar) edited comment on KAFKA-5998 at 7/2/19 4:52 PM:
---------------------------------------------------------

Just in case it helps: I hit this again today on 2.1.1 (I commented here some months ago).

Setup: 5-broker cluster, 3 Kafka Streams instances (2 `num.stream.threads` each). AMZN Linux. Docker on ECS.

I've seen that, before the task dies, it prints the following WARNs from one task. Please note that of the 64 partitions, only a few fail, starting at 13:17, and the same batch of partitions starts failing again at 13:42. Why are the same partitions failing? Does this match your findings?

{{[2019-07-02 13:17:01,101] WARN task [2_31] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_31/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:17:01,118] WARN task [2_47] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_47/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:17:01,156] WARN task [2_27] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_27/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:20:12,360] WARN task [2_63] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_63/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:20:12,579] WARN task [2_35] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_35/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:20:13,001] WARN task [2_23] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_23/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:23:18,421] WARN task [2_39] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_39/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:23:18,613] WARN task [2_55] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_55/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:42:46,366] WARN task [2_31] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_31/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:42:46,473] WARN task [2_47] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_47/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:42:46,639] WARN task [2_27] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_27/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:46:19,888] WARN task [2_63] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_63/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:46:20,042] WARN task [2_35] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_35/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:46:20,380] WARN task [2_55] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_55/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:46:20,384] WARN task [2_23] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_23/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}
{{[2019-07-02 13:48:07,011] WARN task [2_39] Failed to write offset checkpoint file to /data/kafka-streams/stream-processor-0.0.1/2_39/.checkpoint: {} (org.apache.kafka.streams.processor.internals.ProcessorStateManager)}}

The application died some minutes later, at 13:59:13. In case it is related, it was killed due to OOM.

> /.checkpoint.tmp Not found exception
> ------------------------------------
>
> Key: KAFKA-5998
> URL: https://issues.apache.org/jira/browse/KAFKA-5998
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 0.11.0.0, 0.11.0.1, 2.1.1
> Reporter: Yogesh BG
> Assignee: Bill Bejeck
> Priority: Critical
> Attachments: 5998.v1.txt, 5998.v2.txt, Topology.txt, exc.txt, props.txt, streams.txt
>
>
> I have one kafka broker and one kafka stream running. I have been running it for two days under a load of around 2500 msgs per second.
> On the third day I am getting the exception below for some of the partitions. I have 16 partitions; only 0_0 and 0_1 give this error.
> {{09:43:25.955 [ks_0_inst-StreamThread-6] WARN o.a.k.s.p.i.ProcessorStateManager - Failed to write checkpoint file to /data/kstreams/rtp-kafkastreams/0_1/.checkpoint:
> java.io.FileNotFoundException: /data/kstreams/rtp-kafkastreams/0_1/.checkpoint.tmp (No such file or directory)
>     at java.io.FileOutputStream.open(Native Method) ~[na:1.7.0_111]
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:221) ~[na:1.7.0_111]
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:171) ~[na:1.7.0_111]
>     at org.apache.kafka.streams.state.internals.OffsetCheckpoint.write(OffsetCheckpoint.java:73) ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.ProcessorStateManager.checkpoint(ProcessorStateManager.java:324) ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:267) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:201) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:260) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:254) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:322) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:415) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:314) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:700) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:683) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:523) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:480) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:457) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> 09:43:25.974 [ks_0_inst-StreamThread-15] WARN o.a.k.s.p.i.ProcessorStateManager - Failed to write checkpoint file to /data/kstreams/rtp-kafkastreams/0_0/.checkpoint:
> java.io.FileNotFoundException: /data/kstreams/rtp-kafkastreams/0_0/.checkpoint.tmp (No such file or directory)
>     at java.io.FileOutputStream.open(Native Method) ~[na:1.7.0_111]
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:221) ~[na:1.7.0_111]
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:171) ~[na:1.7.0_111]
>     at org.apache.kafka.streams.state.internals.OffsetCheckpoint.write(OffsetCheckpoint.java:73) ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.ProcessorStateManager.checkpoint(ProcessorStateManager.java:324) ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:267) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:201) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:260) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:254) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:322) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:415) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:314) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:700) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:683) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:523) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:480) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
>     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:457) [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> }}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
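
For anyone trying to reproduce the symptom in the quoted trace: the `FileNotFoundException ... (No such file or directory)` is raised by `java.io.FileOutputStream` while opening `.checkpoint.tmp`, and one way to get that from plain Java is to open a file for writing when its parent directory does not exist. The sketch below shows only that JDK behaviour. It is not Kafka code and makes no claim about why the task directory would be missing in this issue; the class name `CheckpointTmpDemo`, the helper `openFails`, and the temp-directory layout are hypothetical, chosen to mirror the `0_1/.checkpoint.tmp` path from the log.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical demo class; not part of Kafka Streams.
public class CheckpointTmpDemo {

    /** Returns true if opening the file for writing fails with FileNotFoundException. */
    static boolean openFails(File file) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write("0\n".getBytes()); // placeholder content, not the real checkpoint format
            return false;
        } catch (FileNotFoundException e) {
            // Thrown by the FileOutputStream constructor, e.g. when the parent directory is missing.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed layout: <state dir>/<task id>/.checkpoint.tmp, as in the quoted log path.
        File stateDir = Files.createTempDirectory("kstreams-demo").toFile();
        File taskDir = new File(stateDir, "0_1");
        File tmp = new File(taskDir, ".checkpoint.tmp");

        // The task directory has not been created yet, so the open is expected to fail.
        System.out.println("task dir missing -> open fails: " + openFails(tmp));

        // Once the directory exists, the same open is expected to succeed.
        Files.createDirectories(taskDir.toPath());
        System.out.println("task dir present -> open fails: " + openFails(tmp));
    }
}
```

If the task directories under the state dir are present at the time of the WARN, this particular explanation would not apply, so checking whether `2_31`, `2_47`, etc. exist on disk when the warning fires may help narrow things down.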