[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load
[ https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755904#comment-17755904 ] Haruki Okada edited comment on KAFKA-15046 at 8/18/23 12:05 PM:

I submitted a patch: [https://github.com/apache/kafka/pull/14242]

In the meantime, I tested the above patch (ported to 3.3.2, which is the version we use) in our experimental environment:
* Setting:
** num.io.threads = 48
** incoming byte rate: 18 MB/sec
** a 300 ms artificial write delay added to the device using [device-mapper|https://github.com/kawamuray/ddi]
* Without the patch:
** !image-2023-08-18-19-23-36-597.png|width=292,height=164!
** request-handler idle ratio is below 40%
** produce-response time 99.9%ile is over 1 sec
** producer-state snapshotting takes hundreds of milliseconds
*** {code:java}
(snip)
[2023-08-18 13:23:02,552] INFO [ProducerStateManager partition=xxx-3] Wrote producer snapshot at offset 3030259 with 0 producer ids in 777 ms. (kafka.log.ProducerStateManager)
[2023-08-18 13:23:02,852] INFO [ProducerStateManager partition=xxx-10] Wrote producer snapshot at offset 2991767 with 0 producer ids in 678 ms. (kafka.log.ProducerStateManager){code}
* With the patch:
** !image-2023-08-18-19-29-56-377.png|width=297,height=169!
** request-handler idle ratio is kept at 75%
** produce-response time 99.9%ile is around 100 ms
** producer-state snapshotting completes within milliseconds in most cases
*** {code:java}
(snip)
[2023-08-18 13:40:09,383] INFO [ProducerStateManager partition=xxx-3] Wrote producer snapshot at offset 6219284 with 0 producer ids in 0 ms. (kafka.log.ProducerStateManager)
[2023-08-18 13:40:09,818] INFO [ProducerStateManager partition=icbm-2] Wrote producer snapshot at offset 6208459 with 0 producer ids in 0 ms. (kafka.log.ProducerStateManager){code}
> Produce performance issue under high disk load
> -----------------------------------------------
>
>                 Key: KAFKA-15046
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15046
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 3.3.2
>            Reporter: Haruki Okada
>            Assignee: Haruki Okada
>            Priority: Major
>              Labels: performance
>         Attachments: image-2023-06-01-12-46-30-058.png, image-2023-06-01-12-52-40-959.png, image-2023-06-01-12-54-04-211.png, image-2023-06-01-12-56-19-108.png, image-2023-08-18-19-23-36-597.png, image-2023-08-18-19-29-56-377.png
>
> * Phenomenon:
> ** !image-2023-06-01-12-46-30-058.png|width=259,height=236!
> ** Producer response time 99%ile got quite bad when we performed replica reassignment on the cluster
> *** The RequestQueue scope was significant
> ** Also, request-time throttling happened at the time of the incident. This caused producers to delay sending messages in the meantime.
> ** The disk I/O latency was higher than usual due to the high load for replica reassignment.
> *** !image-2023-06-01-12-56-19-108.png|width=255,height=128!
> * Analysis:
> ** The request-handler utilization was much higher than usual.
> *** !image-2023-06-01-12-52-40-959.png|width=278,height=113!
> ** Also, thread time utilization was much higher than usual on almost all users
> *** !image-2023-06-01-12-54-04-211.png|width=276,height=110!
> ** From taking jstack several times, we found that in most of them a request-handler was doing fsync to flush ProducerState while other request-handlers were waiting on Log#lock to append messages.
> *** {code:java}
> "data-plane-kafka-request-handler-14" #166 daemon prio=5 os_prio=0 cpu=51264789.27ms elapsed=599242.76s tid=0x7efdaeba7770 nid=0x1e704 runnable [0x7ef9a12e2000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcherImpl.force0(java.base@11.0.17/Native Method)
>         at sun.nio.ch.FileDispatcherImpl.force(java.base@11.0.17/FileDispatcherImpl.java:82)
>         at sun.nio.ch.FileChannelImpl.force(java.base@11.0.17/FileChannelImpl.java:461)
>         at kafka.log.ProducerStateManager$.kafka$log$ProducerStateManager$$writeSnapshot(ProducerStateManager.scala:451)
>         at kafka.log.ProducerStateManager.takeSnapshot(ProducerStateManager.scala:754)
>         at kafka.log.UnifiedLog.roll(UnifiedLog.scala:1544)
>         - locked <0x00060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.maybeRoll(UnifiedLog.scala:1523)
>         - locked <0x00060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.append(UnifiedLog.scala:919)
>         - locked <0x00060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:760)
>         at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1170)
>         at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1158)
>         at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:956)
>         at kafka.server.ReplicaManager$$Lambda$2379/0x000800b7c040.apply(Unknown Source)
>         at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>         at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>         at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>         at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:944)
>         at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:602)
>         at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:666)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
> {code}
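The jstack above boils down to a single pattern: an fsync is issued while the log's global lock is held, so every other request handler that wants to append to the same partition queues up behind Log#lock for the whole duration of the disk flush. As a minimal, self-contained illustration of that contention shape and of the general fix direction discussed in this ticket (plain java.nio with hypothetical names; this is not Kafka code), the write can stay inside the critical section while the expensive force() call moves outside it:
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockVsFsync {
    private final Object lock = new Object();   // stands in for Log#lock
    private final FileChannel channel;

    LockVsFsync(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // Problematic shape: force() (fsync) runs while the lock is held,
    // so every appender blocks for the whole disk-flush duration.
    void appendAndFlushInsideLock(byte[] payload) throws IOException {
        synchronized (lock) {
            channel.write(ByteBuffer.wrap(payload));
            channel.force(true);                     // can take hundreds of ms on a loaded disk
        }
    }

    // Direction discussed in this ticket: mutate state under the lock, fsync after releasing it.
    void appendThenFlushOutsideLock(byte[] payload) throws IOException {
        synchronized (lock) {
            channel.write(ByteBuffer.wrap(payload)); // cheap: goes to the page cache only
        }
        channel.force(true);                         // the slow part no longer serializes appends
    }

    public static void main(String[] args) throws IOException {
        LockVsFsync log = new LockVsFsync(Path.of("/tmp/lock-vs-fsync.demo"));
        log.appendThenFlushOutsideLock("hello".getBytes(StandardCharsets.UTF_8));
    }
}
{code}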
[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load
[ https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755351#comment-17755351 ] Haruki Okada edited comment on KAFKA-15046 at 8/17/23 4:09 AM:
---

[~junrao] Hi, sorry for the late response. Thanks for your suggestion.

> Another way to improve this is to move the LeaderEpochFile flushing logic to be part of the flushing of rolled segments

Yeah, that makes sense. I think the ProducerState snapshot should then also be unified into the existing flushing logic, instead of fsync-ing the ProducerState separately in log.roll (i.e. the current Kafka behavior) or submitting it to the scheduler separately (i.e. like the ongoing patch [https://github.com/apache/kafka/pull/13782] does).
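To make the "unify with the existing flushing logic" idea concrete, here is a minimal sketch under the assumption that it would look roughly like the following (hypothetical class and method names; this is not the actual UnifiedLog/ProducerStateManager code): roll() writes the snapshot file without forcing it, and the fsync piggybacks on the later flush of the rolled segment.
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: fsync the producer-state snapshot together with the rolled segment. */
public class RollThenFlushSketch {
    private final Object logLock = new Object();   // stands in for Log#lock
    private volatile Path pendingSnapshot;         // snapshot written but not yet fsync'ed

    /** Called on roll while holding the log lock: write the snapshot, but do NOT fsync here. */
    void takeSnapshotOnRoll(Path snapshotFile, byte[] snapshotBytes) throws IOException {
        synchronized (logLock) {
            try (FileChannel ch = FileChannel.open(snapshotFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                ch.write(ByteBuffer.wrap(snapshotBytes));   // page cache only, returns quickly
            }
            pendingSnapshot = snapshotFile;
        }
    }

    /** Called from the existing flush of the rolled segment: make segment and snapshot durable together. */
    void flushRolledSegment(FileChannel segmentChannel) throws IOException {
        segmentChannel.force(true);                         // existing segment flush
        Path snapshot = pendingSnapshot;
        if (snapshot != null) {
            try (FileChannel ch = FileChannel.open(snapshot, StandardOpenOption.WRITE)) {
                ch.force(true);                             // snapshot durability piggybacks on the segment flush
            }
            pendingSnapshot = null;
        }
    }
}
{code}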
[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load
[ https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728606#comment-17728606 ] Luke Chen edited comment on KAFKA-15046 at 6/2/23 7:30 AM:
---

Currently, the RW lock in LeaderEpochFileCache is there to protect the epoch entries in the cache, so not every `flush` call holds the write lock. For example, in LeaderEpochFileCache#assign we only take the lock while updating the cache, not for the flush: [https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/server/epoch/LeaderEpochFileCache.scala#L59]

If we used the RW lock for the log flush, I'm afraid cache updates would be blocked when disk IO is bad. I think doing the same thing as [https://github.com/apache/kafka/pull/13782], i.e. running the [fileDescriptor sync call|https://github.com/apache/kafka/blob/3.3.2/server-common/src/main/java/org/apache/kafka/server/common/CheckpointFile.java#L95] on a background thread, should be fine. WDYT?
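As a rough illustration of the locking shape described above (a stripped-down, hypothetical class; not the real LeaderEpochFileCache or CheckpointFile), the write lock covers only the in-memory update, while the checkpoint-file sync is handed to a background thread, so a slow disk cannot block cache updates:
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: lock only the in-memory update, fsync the checkpoint in the background. */
public class EpochCheckpointSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final TreeMap<Integer, Long> epochs = new TreeMap<>();           // epoch -> start offset
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();
    private final Path checkpointFile;

    public EpochCheckpointSketch(Path checkpointFile) {
        this.checkpointFile = checkpointFile;
    }

    public void assign(int epoch, long startOffset) {
        lock.writeLock().lock();
        try {
            epochs.put(epoch, startOffset);        // only the cache mutation is under the write lock
        } finally {
            lock.writeLock().unlock();
        }
        flusher.submit(this::syncCheckpoint);      // the disk sync happens off the caller thread
    }

    private void syncCheckpoint() {
        // Real code would rewrite the checkpoint contents before syncing; only the deferred force() is shown here.
        try (FileChannel ch = FileChannel.open(checkpointFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.force(true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
{code}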
[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load
[ https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728551#comment-17728551 ] Haruki Okada edited comment on KAFKA-15046 at 6/2/23 7:19 AM:
--

[~showuon] Maybe I linked the wrong file. What I had in mind is to make any LeaderEpochFileCache method that needs flush() be called outside of the Log's global lock. LeaderEpochFileCache already does its own exclusive control via its RW lock, so I don't think we need to call it while holding the Log's global lock.

[https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/server/epoch/LeaderEpochFileCache.scala#L44]
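A minimal sketch of the caller-side change being suggested (hypothetical names; not the actual UnifiedLog code): the log mutates its own state under its global lock and invokes the flushing cache method only after releasing it, relying on the cache's internal RW lock for thread safety.
{code:java}
/**
 * Hypothetical sketch of moving a flushing LeaderEpochFileCache call out of the Log's global lock.
 * EpochCache stands in for LeaderEpochFileCache and is assumed to be internally thread-safe (RW lock).
 */
public class MoveFlushOutsideLogLock {

    interface EpochCache {
        /** Updates the cache and flushes the checkpoint file; internally synchronized. */
        void assign(int epoch, long startOffset);
    }

    private final Object logLock = new Object();   // stands in for Log#lock
    private final EpochCache epochCache;
    private long logEndOffset;

    MoveFlushOutsideLogLock(EpochCache epochCache) {
        this.epochCache = epochCache;
    }

    /** Before: the cache update (and its fsync) happens while holding the log lock. */
    void appendHoldingLockForFlush(int epoch, int recordCount) {
        synchronized (logLock) {
            logEndOffset += recordCount;
            epochCache.assign(epoch, logEndOffset);   // fsync under logLock: other appenders pile up
        }
    }

    /** After: only the in-memory log state is guarded by the log lock; the cache guards itself. */
    void appendFlushOutsideLock(int epoch, int recordCount) {
        long assignedOffset;
        synchronized (logLock) {
            logEndOffset += recordCount;
            assignedOffset = logEndOffset;
        }
        epochCache.assign(epoch, assignedOffset);     // the cache's own RW lock keeps this safe
    }
}
{code}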
[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load
[ https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728515#comment-17728515 ] Haruki Okada edited comment on KAFKA-15046 at 6/2/23 4:01 AM:
--

Yeah, io_uring is promising. However, it only works with newer kernels (which may not be easy to update to for some on-premises Kafka users) and would require rewriting large parts of the code base.

-For leader-epoch cache, the checkpointing is already done in scheduler thread so we should adopt solution2 I think-

For the leader epoch cache, some paths already do the checkpointing asynchronously (e.g. UnifiedLog.deleteOldSegments => UnifiedLog.maybeIncrementLogStartOffset => LeaderEpochFileCache.truncateFromStart on the kafka scheduler), so I think we have to make the fsync happen outside of the lock (i.e. solution 2) anyway.

Writing to CheckpointFile is already synchronized, so can't we just move the checkpointing outside of the lock?

[https://github.com/apache/kafka/blob/3.3.2/server-common/src/main/java/org/apache/kafka/server/common/CheckpointFile.java#L76]
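For context, the pattern behind that CheckpointFile.write call is roughly the following (a simplified, hypothetical re-implementation; not the actual Kafka class): writers are serialized on the checkpoint's own lock, and durability comes from a temp file plus fsync plus atomic rename, which is why the call is self-contained enough to be made without holding the log lock.
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Simplified, hypothetical checkpoint writer: temp file + fsync + atomic rename, serialized on its own lock. */
public class SimpleCheckpoint {
    private final Object writeLock = new Object();   // serializes concurrent writers, like CheckpointFile's lock
    private final Path target;

    public SimpleCheckpoint(Path target) {
        this.target = target.toAbsolutePath();
    }

    public void write(List<String> lines) throws IOException {
        synchronized (writeLock) {
            Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
                ch.write(ByteBuffer.wrap(String.join("\n", lines).getBytes(StandardCharsets.UTF_8)));
                ch.force(true);                                   // flush the temp file's data
            }
            // Atomic rename so readers never observe a partially written checkpoint
            // (the real code falls back to a non-atomic replace if the filesystem doesn't support it).
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            // Fsync the parent directory so the rename itself is persisted (works on Linux).
            try (FileChannel dir = FileChannel.open(target.getParent(), StandardOpenOption.READ)) {
                dir.force(true);
            }
        }
    }
}
{code}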
[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load
[ https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728266#comment-17728266 ] Haruki Okada edited comment on KAFKA-15046 at 6/1/23 8:39 AM:
--

Hm, when I dug into this further, I noticed there's another path that causes essentially the same phenomenon.
{code:java}
"data-plane-kafka-request-handler-17" #169 daemon prio=5 os_prio=0 cpu=50994542.49ms elapsed=595635.65s tid=0x7efdaebabe30 nid=0x1e707 runnable [0x7ef9a0fdf000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.force0(java.base@11.0.17/Native Method)
        at sun.nio.ch.FileDispatcherImpl.force(java.base@11.0.17/FileDispatcherImpl.java:82)
        at sun.nio.ch.FileChannelImpl.force(java.base@11.0.17/FileChannelImpl.java:461)
        at org.apache.kafka.common.utils.Utils.flushDir(Utils.java:966)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:951)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:925)
        at org.apache.kafka.server.common.CheckpointFile.write(CheckpointFile.java:98)
        - locked <0x000680fc4930> (a java.lang.Object)
        at kafka.server.checkpoints.CheckpointFileWithFailureHandler.write(CheckpointFileWithFailureHandler.scala:37)
        at kafka.server.checkpoints.LeaderEpochCheckpointFile.write(LeaderEpochCheckpointFile.scala:71)
        at kafka.server.epoch.LeaderEpochFileCache.flush(LeaderEpochFileCache.scala:291)
        at kafka.server.epoch.LeaderEpochFileCache.$anonfun$truncateFromStart$3(LeaderEpochFileCache.scala:263)
        at kafka.server.epoch.LeaderEpochFileCache.$anonfun$truncateFromStart$3$adapted(LeaderEpochFileCache.scala:259)
        at kafka.server.epoch.LeaderEpochFileCache$$Lambda$571/0x00080045f040.apply(Unknown Source)
        at scala.Option.foreach(Option.scala:437)
        at kafka.server.epoch.LeaderEpochFileCache.$anonfun$truncateFromStart$1(LeaderEpochFileCache.scala:259)
        at kafka.server.epoch.LeaderEpochFileCache.truncateFromStart(LeaderEpochFileCache.scala:254)
        at kafka.log.UnifiedLog.$anonfun$maybeIncrementLogStartOffset$4(UnifiedLog.scala:1043)
        at kafka.log.UnifiedLog.$anonfun$maybeIncrementLogStartOffset$4$adapted(UnifiedLog.scala:1043)
        at kafka.log.UnifiedLog$$Lambda$2324/0x000800b59040.apply(Unknown Source)
        at scala.Option.foreach(Option.scala:437)
        at kafka.log.UnifiedLog.maybeIncrementLogStartOffset(UnifiedLog.scala:1043)
        - locked <0x000680fc5080> (a java.lang.Object)
        at kafka.cluster.Partition.$anonfun$deleteRecordsOnLeader$1(Partition.scala:1476)
        at kafka.cluster.Partition.deleteRecordsOnLeader(Partition.scala:1463)
        at kafka.server.ReplicaManager.$anonfun$deleteRecordsOnLocalLog$2(ReplicaManager.scala:687)
        at kafka.server.ReplicaManager$$Lambda$3156/0x000800d7c840.apply(Unknown Source)
        at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
        at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
        at scala.collection.mutable.HashMap.map(HashMap.scala:35)
        at kafka.server.ReplicaManager.deleteRecordsOnLocalLog(ReplicaManager.scala:680)
        at kafka.server.ReplicaManager.deleteRecords(ReplicaManager.scala:875)
        at kafka.server.KafkaApis.handleDeleteRecordsRequest(KafkaApis.scala:2216)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:196)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
        at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)
{code}
LeaderEpoch checkpointing also calls fsync while holding Log#lock, blocking request-handler threads from appending in the meantime.
This is called by the scheduler thread on log-segment breaching, so it might be less frequent than log rolls, though. Does it make sense to also move the LeaderEpochCheckpointFile flush outside of the lock?
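A possible shape for that on the DeleteRecords path, sketched with hypothetical names (the real classes are UnifiedLog and LeaderEpochFileCache): under Log#lock only the in-memory log start offset is advanced, and the flushing truncateFromStart call is submitted to the background scheduler instead of running on the request-handler thread.
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

/** Hypothetical sketch: keep the leader-epoch checkpoint fsync off the request-handler thread. */
public class DeferredEpochTruncation {

    interface EpochCache {
        /** Drops entries below startOffset and flushes the checkpoint file (may fsync). */
        void truncateFromStart(long startOffset);
    }

    private final Object logLock = new Object();   // stands in for Log#lock
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final EpochCache epochCache;
    private long logStartOffset;

    DeferredEpochTruncation(EpochCache epochCache) {
        this.epochCache = epochCache;
    }

    void maybeIncrementLogStartOffset(long newStartOffset) {
        boolean advanced = false;
        synchronized (logLock) {
            if (newStartOffset > logStartOffset) {
                logStartOffset = newStartOffset;   // cheap in-memory update under the lock
                advanced = true;
            }
        }
        if (advanced) {
            // The checkpoint flush (and its fsync) runs on the scheduler, not under Log#lock.
            scheduler.execute(() -> epochCache.truncateFromStart(newStartOffset));
        }
    }
}
{code}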