[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load

2023-08-18 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755904#comment-17755904
 ] 

Haruki Okada edited comment on KAFKA-15046 at 8/18/23 12:05 PM:


I submitted a patch: [https://github.com/apache/kafka/pull/14242].

In the meantime, I tested the above patch (ported to 3.3.2, the version we run) in our 
experimental environment:
 * Setting:
 ** num.io.threads = 48
 ** incoming byte-rate: 18MB/sec
 ** Injected a 300 ms artificial write delay into the device using 
[device-mapper|https://github.com/kawamuray/ddi]
 * Without patch:
 ** !image-2023-08-18-19-23-36-597.png|width=292,height=164!
 ** request-handler idle ratio is below 40%
 ** produce-response time 99.9%ile is over 1 sec
 ** Producer-state snapshotting takes hundreds of milliseconds
 *** 
{code:java}
(snip)
[2023-08-18 13:23:02,552] INFO [ProducerStateManager partition=xxx-3] Wrote 
producer snapshot at offset 3030259 with 0 producer ids in 777 ms. 
(kafka.log.ProducerStateManager)
[2023-08-18 13:23:02,852] INFO [ProducerStateManager partition=xxx-10] Wrote 
producer snapshot at offset 2991767 with 0 producer ids in 678 ms. 
(kafka.log.ProducerStateManager){code}

 * With patch:
 ** !image-2023-08-18-19-29-56-377.png|width=297,height=169!
 ** request-handler idle ratio stays around 75%
 ** produce-response time 99.9%ile is around 100ms
 ** producer-state snapshotting completes within a few milliseconds in most cases
 *** 
{code:java}
(snip)
[2023-08-18 13:40:09,383] INFO [ProducerStateManager partition=xxx-3] Wrote 
producer snapshot at offset 6219284 with 0 producer ids in 0 ms. 
(kafka.log.ProducerStateManager) 
[2023-08-18 13:40:09,818] INFO [ProducerStateManager partition=icbm-2] Wrote 
producer snapshot at offset 6208459 with 0 producer ids in 0 ms. 
(kafka.log.ProducerStateManager){code}



> Produce performance issue under high disk load
> --
>
> Key: KAFKA-15046
> URL: https://issues.apache.org/jira/browse/KAFKA-15046
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.3.2
>Reporter: Haruki Okada
>Assignee: Haruki Okada
>Priority: Major
>  Labels: performance
> Attachments: image-2023-06-01-12-46-30-058.png, 
> image-2023-06-01-12-52-40-959.png, image-2023-06-01-12-54-04-211.png, 
> image-2023-06-01-12-56-19-108.png, image-2023-08-18-19-23-36-597.png, 
> image-2023-08-18-19-29-56-377.png
>
>
> * Phenomenon:
>  ** !image-2023-06-01-12-46-30-058.png|width=259,height=236!
>  ** Producer response time 99%ile got quite bad when we performed replica 
> reassignment on the cluster
>  *** RequestQueue scope was significant
>  ** Request-time throttling also kicked in at the same time, which caused 
> producers to delay sending messages in the meantime.
>  ** The disk I/O latency was higher than usual due to the high load for 
> replica reassignment.
>  *** !image-2023-06-01-12-56-19-108.png|width=255,height=128!
>  * Analysis:
>  ** The request-handler utilization was much higher than usual.
>  *** !image-2023-06-01-12-52-40-959.png|width=278,height=113!
>  ** Also, thread time utilization was much higher than usual on almost all 
> users
>  *** !image-2023-06-01-12-54-04-211.png|width=276,height=110!
>  ** From taking jstack several times, we found that in most of them a 
> request-handler was doing fsync to flush ProducerState while other 
> request-handlers were waiting on Log#lock to append messages.
>  * 
>  ** 
>  *** 
> {code:java}
> "data-plane-kafka-request-handler-14" #166 daemon prio=5 os_prio=0 cpu=51264789.27ms elapsed=599242.76s tid=0x7efdaeba7770 nid=0x1e704 runnable  [0x7ef9a12e2000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcherImpl.force0(java.base@11.0.17/Native Method)
>         at sun.nio.ch.FileDispatcherImpl.force(java.base@11.0.17/FileDispatcherImpl.java:82)
>         at sun.nio.ch.FileChannelImpl.force(java.base@11.0.17/FileChannelImpl.java:461)
>         at kafka.log.ProducerStateManager$.kafka$log$ProducerStateManager$$writeSnapshot(ProducerStateManager.scala:451)
>         at kafka.log.ProducerStateManager.takeSnapshot(ProducerStateManager.scala:754)
>         at kafka.log.UnifiedLog.roll(UnifiedLog.scala:1544)
>         - locked <0x00060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.maybeRoll(UnifiedLog.scala:1523)
>         - locked <0x00060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.append(UnifiedLog.scala:919)
>         - locked <0x00060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:760)
>         at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1170)
>         at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1158)
>         at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:956)
>         at kafka.server.ReplicaManager$$Lambda$2379/0x000800b7c040.apply(Unknown Source)
>         at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>         at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>         at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>         at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:944)
>         at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:602)
>         at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:666)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75){code}

[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load

2023-08-16 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755351#comment-17755351
 ] 

Haruki Okada edited comment on KAFKA-15046 at 8/17/23 4:09 AM:
---

[~junrao] Hi, sorry for the late response.

Thanks for your suggestion.

 

> Another way to improve this is to move the LeaderEpochFile flushing logic to 
> be part of the flushing of rolled segments

 

Yeah, that makes sense.

I think the ProducerState snapshot should then also be unified with the existing 
flushing logic, instead of fsync-ing the ProducerState separately in log.roll (the 
current Kafka behavior) or submitting it to the scheduler separately (as the ongoing 
patch [https://github.com/apache/kafka/pull/13782] does).

 

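To illustrate the idea, here is a rough, hypothetical sketch of what unifying the snapshot flush with the rolled-segment flush could look like (the Segment and RolledSegmentFlusher names are made up; this is not the actual Kafka code nor the PR's implementation). The snapshot file is only written at roll time, and its fsync piggybacks on the existing background flush of the rolled segment.

{code:java}
// Hypothetical sketch: fsync the producer-state snapshot together with the rolled
// segment on the existing background flush path, instead of inside log.roll() (which
// holds the log lock) or via a separate scheduler task.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

interface Segment {
    void flush() throws IOException;      // existing flush of segment data + index
    Path producerSnapshotFile();          // snapshot written (but not fsynced) at roll time
}

public class RolledSegmentFlusher {

    /** Runs on the existing async flush path, not under the log's global lock. */
    public void flushRolledSegment(Segment rolled) throws IOException {
        rolled.flush();                              // segment data and indexes
        fsync(rolled.producerSnapshotFile());        // snapshot durability piggybacks here
        // The LeaderEpochFile flush could be folded into the same place, as suggested above.
    }

    private void fsync(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true);
        }
    }
}
{code}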


[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load

2023-08-16 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755351#comment-17755351
 ] 

Haruki Okada edited comment on KAFKA-15046 at 8/17/23 4:08 AM:
---

[~junrao] Hi, sorry for the late response.

Thanks for your suggestion.

 

> Another way to improve this is to move the LeaderEpochFile flushing logic to 
> be part of the flushing of rolled segments

 

Yeah, that makes sense.

I think the ProducerState snapshot should then also be unified with the existing 
flushing logic, instead of fsync-ing the ProducerState separately in log.roll (the 
current Kafka behavior) or submitting it to the scheduler separately (as the ongoing 
patch [https://github.com/apache/kafka/pull/13782] does).

 



[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load

2023-06-02 Thread Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728606#comment-17728606
 ] 

Luke Chen edited comment on KAFKA-15046 at 6/2/23 7:30 AM:
---

Currently, the RW lock in LeaderEpochFileCache is there to protect the epoch entries 
in the cache, so not every `flush` call is made under the write lock. For example, in 
LeaderEpochFileCache#assign we only take the lock while updating the cache, not while 
flushing.

[https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/server/epoch/LeaderEpochFileCache.scala#L59]

 

If we used the RW lock for the log flush, I'm afraid cache updates would be blocked 
when disk I/O is slow.

 

I think taking the same approach as [https://github.com/apache/kafka/pull/13782], 
i.e. running the [fileDescriptor sync 
call|https://github.com/apache/kafka/blob/3.3.2/server-common/src/main/java/org/apache/kafka/server/common/CheckpointFile.java#L95]
 on a background thread, should be fine. WDYT?

 

 

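For illustration, here is a minimal sketch of the background-thread approach described above, assuming the checkpoint content is still written and atomically renamed by the calling thread and only the expensive force() call is handed to a single-threaded executor. The class and field names are hypothetical and this is not the code from PR 13782.

{code:java}
// Hypothetical sketch: keep the buffered write and atomic rename on the caller,
// but run the fsync on a background thread so a slow disk does not stall the caller.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncFsyncCheckpoint {
    private final Path file;
    private final Path tmp;
    private final ExecutorService fsyncExecutor = Executors.newSingleThreadExecutor();

    public AsyncFsyncCheckpoint(Path file) {
        this.file = file;
        this.tmp = file.resolveSibling(file.getFileName() + ".tmp");
    }

    public synchronized void write(List<String> lines) throws IOException {
        Files.write(tmp, lines);                               // buffered write of the entries
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE); // atomic rename (replaces on POSIX)
        fsyncExecutor.execute(this::forceQuietly);             // fsync happens off the caller thread
    }

    private void forceQuietly() {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true);
            // A real implementation would also fsync the parent directory after the rename.
        } catch (IOException e) {
            // A real broker would route this into the log-dir failure handling.
        }
    }
}
{code}

The trade-off is that durability of the latest checkpoint is slightly delayed, which is what this thread weighs against blocking request-handler threads on a slow disk.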


[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load

2023-06-02 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728551#comment-17728551
 ] 

Haruki Okada edited comment on KAFKA-15046 at 6/2/23 7:19 AM:
--

[~showuon] Maybe I linked the wrong file.

What I had in mind is to call any LeaderEpochFileCache method that needs flush() 
outside of the Log's global lock.

LeaderEpochFileCache already does its own exclusive control with its RW lock, so I 
don't think we need to call it inside the Log's global lock.

[https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/server/epoch/LeaderEpochFileCache.scala#L44]

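A hedged, caller-side sketch of that idea (simplified, hypothetical names; not the actual UnifiedLog code): the cheap in-memory update happens under the log's global lock, and the flush()-ing LeaderEpochFileCache call is made only after the lock is released, relying on the cache's own read-write lock for its consistency.

{code:java}
// Hypothetical sketch: keep the critical section short and call the cache method that
// may fsync only after the log's global lock has been released.
public class LogStartOffsetUpdater {
    /** Minimal stand-in for kafka.server.epoch.LeaderEpochFileCache (has its own RW lock). */
    public interface LeaderEpochCache {
        void truncateFromStart(long startOffset);   // may flush (fsync) internally
    }

    private final Object logLock = new Object();    // stands in for the Log's global lock
    private final LeaderEpochCache leaderEpochCache;
    private volatile long logStartOffset;

    public LogStartOffsetUpdater(LeaderEpochCache cache, long startOffset) {
        this.leaderEpochCache = cache;
        this.logStartOffset = startOffset;
    }

    public void maybeIncrementLogStartOffset(long newStartOffset) {
        boolean updated = false;
        synchronized (logLock) {                    // short critical section, no disk I/O
            if (newStartOffset > logStartOffset) {
                logStartOffset = newStartOffset;
                updated = true;
            }
        }
        if (updated) {
            // Outside the log lock: a slow fsync here no longer blocks appends, and the
            // cache's own read-write lock keeps its entries consistent.
            leaderEpochCache.truncateFromStart(newStartOffset);
        }
    }
}
{code}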


[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load

2023-06-01 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728515#comment-17728515
 ] 

Haruki Okada edited comment on KAFKA-15046 at 6/2/23 4:01 AM:
--

Yeah, io_uring is promising.

However, it only works with newer kernels (which some on-premises Kafka users may not 
find easy to update) and would require rewriting large parts of the code base.

-For leader-epoch cache, the checkpointing is already done in scheduler thread 
so we should adopt solution2 I think-

For the leader epoch cache, some paths already do checkpointing asynchronously 
(e.g. UnifiedLog.deleteOldSegments => UnifiedLog.maybeIncrementLogStartOffset 
=> LeaderEpochFileCache.truncateFromStart on the Kafka scheduler), so I think we have 
to make the fsync happen outside of the lock (i.e. solution-2) anyway.

 

Writing to CheckpointFile is already synchronized, so can't we just move the 
checkpointing outside of the lock? 
[https://github.com/apache/kafka/blob/3.3.2/server-common/src/main/java/org/apache/kafka/server/common/CheckpointFile.java#L76]

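For context, the linked CheckpointFile serializes writers on a private lock, roughly as in the simplified, hypothetical model below (not the real implementation). That internal synchronization is why calling it from outside the Log's lock would still keep the checkpoint file consistent under concurrent write attempts.

{code:java}
// Hypothetical, simplified model of CheckpointFile-style internal synchronization:
// write() serializes concurrent callers on a private lock, so callers do not need to
// hold any outer (Log) lock just to keep the checkpoint file consistent.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SynchronizedCheckpoint<T> {
    private final Object lock = new Object();   // mirrors CheckpointFile's private lock object

    public void write(List<T> entries) throws IOException {
        // Copy first so the caller's collection can keep changing under its own locking.
        List<T> snapshot = new ArrayList<>(entries);
        synchronized (lock) {
            // The buffered write, fsync and atomic rename of the checkpoint file would
            // happen here (details elided); only one writer runs at a time.
            persist(snapshot);
        }
    }

    private void persist(List<T> snapshot) throws IOException {
        // placeholder for the actual temp-file write + force() + atomic move
    }
}
{code}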


[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load

2023-06-01 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728515#comment-17728515
 ] 

Haruki Okada edited comment on KAFKA-15046 at 6/2/23 12:01 AM:
---

Yeah, io_uring is promising.

However, it only works with newer kernels (which some on-premises Kafka users may not 
find easy to update) and would require rewriting large parts of the code base.

For the leader-epoch cache, the checkpointing is already done in a scheduler thread, 
so I think we should adopt solution 2.

Writing to CheckpointFile is already synchronized, so can't we just move the 
checkpointing outside of the lock? 
[https://github.com/apache/kafka/blob/3.3.2/server-common/src/main/java/org/apache/kafka/server/common/CheckpointFile.java#L76]



[jira] [Comment Edited] (KAFKA-15046) Produce performance issue under high disk load

2023-06-01 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728266#comment-17728266
 ] 

Haruki Okada edited comment on KAFKA-15046 at 6/1/23 8:39 AM:
--

Hm, when I dug into this further, I noticed there's another path that causes 
essentially the same phenomenon.

 
{code:java}
"data-plane-kafka-request-handler-17" #169 daemon prio=5 os_prio=0 
cpu=50994542.49ms elapsed=595635.65s tid=0x7efdaebabe30 nid=0x1e707 
runnable  [0x7ef9a0fdf000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.force0(java.base@11.0.17/Native Method)
        at 
sun.nio.ch.FileDispatcherImpl.force(java.base@11.0.17/FileDispatcherImpl.java:82)
        at 
sun.nio.ch.FileChannelImpl.force(java.base@11.0.17/FileChannelImpl.java:461)
        at org.apache.kafka.common.utils.Utils.flushDir(Utils.java:966)
        at 
org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:951)
        at 
org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:925)
        at 
org.apache.kafka.server.common.CheckpointFile.write(CheckpointFile.java:98)
        - locked <0x000680fc4930> (a java.lang.Object)
        at 
kafka.server.checkpoints.CheckpointFileWithFailureHandler.write(CheckpointFileWithFailureHandler.scala:37)
        at 
kafka.server.checkpoints.LeaderEpochCheckpointFile.write(LeaderEpochCheckpointFile.scala:71)
        at 
kafka.server.epoch.LeaderEpochFileCache.flush(LeaderEpochFileCache.scala:291)
        at 
kafka.server.epoch.LeaderEpochFileCache.$anonfun$truncateFromStart$3(LeaderEpochFileCache.scala:263)
        at 
kafka.server.epoch.LeaderEpochFileCache.$anonfun$truncateFromStart$3$adapted(LeaderEpochFileCache.scala:259)
        at 
kafka.server.epoch.LeaderEpochFileCache$$Lambda$571/0x00080045f040.apply(Unknown
 Source)
        at scala.Option.foreach(Option.scala:437)
        at 
kafka.server.epoch.LeaderEpochFileCache.$anonfun$truncateFromStart$1(LeaderEpochFileCache.scala:259)
        at 
kafka.server.epoch.LeaderEpochFileCache.truncateFromStart(LeaderEpochFileCache.scala:254)
        at 
kafka.log.UnifiedLog.$anonfun$maybeIncrementLogStartOffset$4(UnifiedLog.scala:1043)
        at 
kafka.log.UnifiedLog.$anonfun$maybeIncrementLogStartOffset$4$adapted(UnifiedLog.scala:1043)
        at kafka.log.UnifiedLog$$Lambda$2324/0x000800b59040.apply(Unknown 
Source)
        at scala.Option.foreach(Option.scala:437)
        at 
kafka.log.UnifiedLog.maybeIncrementLogStartOffset(UnifiedLog.scala:1043)
        - locked <0x000680fc5080> (a java.lang.Object)
        at 
kafka.cluster.Partition.$anonfun$deleteRecordsOnLeader$1(Partition.scala:1476)
        at kafka.cluster.Partition.deleteRecordsOnLeader(Partition.scala:1463)
        at 
kafka.server.ReplicaManager.$anonfun$deleteRecordsOnLocalLog$2(ReplicaManager.scala:687)
        at 
kafka.server.ReplicaManager$$Lambda$3156/0x000800d7c840.apply(Unknown 
Source)
        at 
scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
        at 
scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
        at scala.collection.mutable.HashMap.map(HashMap.scala:35)
        at 
kafka.server.ReplicaManager.deleteRecordsOnLocalLog(ReplicaManager.scala:680)
        at kafka.server.ReplicaManager.deleteRecords(ReplicaManager.scala:875)
        at 
kafka.server.KafkaApis.handleDeleteRecordsRequest(KafkaApis.scala:2216)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:196)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
        at java.lang.Thread.run(java.base@11.0.17/Thread.java:829) {code}
 

LeaderEpoch checkpointing also calls fsync while holding Log#lock, blocking 
request-handler threads from appending in the meantime.

 

This is called by the scheduler thread when a log segment breaches retention, so it 
might be less frequent than log rolls, though.

Does it make sense to also move the LeaderEpochCheckpointFile flush outside of the 
lock?

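As a hedged sketch of what that could look like on the cache side (hypothetical names and simplified truncation semantics; not the actual LeaderEpochFileCache code), the in-memory truncation stays under the cache's write lock while the checkpoint flush runs after the lock is released, so neither the cache lock nor the outer Log lock is held across the fsync.

{code:java}
// Hypothetical sketch: split the in-memory truncation (under the cache's own write lock)
// from the checkpoint flush, which runs after the lock is released so a slow fsync does
// not block other users of the cache or of the Log lock.
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class EpochCacheSketch {
    /** Minimal stand-in for the checkpoint file; flush() is where the fsync would happen. */
    public interface CheckpointSink {
        void flush(Map<Integer, Long> epochToStartOffset);
    }

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final TreeMap<Integer, Long> epochToStartOffset = new TreeMap<>();
    private final CheckpointSink checkpoint;

    public EpochCacheSketch(CheckpointSink checkpoint) {
        this.checkpoint = checkpoint;
    }

    public void truncateFromStart(long newLogStartOffset) {
        Map<Integer, Long> snapshot;
        lock.writeLock().lock();
        try {
            // Simplified truncation: drop entries that start before the new log start offset.
            epochToStartOffset.values().removeIf(offset -> offset < newLogStartOffset);
            snapshot = new TreeMap<>(epochToStartOffset);   // copy to checkpoint outside the lock
        } finally {
            lock.writeLock().unlock();
        }
        // The fsync-ing flush happens here, outside both the cache's write lock and Log#lock.
        checkpoint.flush(snapshot);
    }
}
{code}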
