[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755898#comment-17755898
 ] 

Haruki Okada commented on KAFKA-15046:
--------------------------------------

After digging into the fsync call paths in detail, I've summarized the problem and the 
solutions as below:
h2. Problem
 * Any blocking operation performed while holding UnifiedLog.lock can lead to 
serious performance (or even availability) issues, yet there are currently several 
paths that call fsync(2) inside the lock (a simplified sketch of the pattern 
follows this list)
 ** While the lock is held, all subsequent produce requests against the 
partition may block
 ** This easily causes all request handlers to become busy when disk performance is bad
 ** Even worse, when a disk experiences a glitch lasting tens of seconds (not 
rare with spinning drives), the broker becomes unable to process any requests 
while remaining unfenced from the cluster (i.e. a "zombie"-like state)
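
To make the contention concrete, here is a minimal, self-contained sketch of the pattern (plain Java, not Kafka code; the class, field, and method names are illustrative only):
{code:java}
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: shows why an fsync inside the log lock stalls all produces.
public class FsyncUnderLockSketch {
    private final Object lock = new Object();   // stands in for UnifiedLog.lock
    private final FileChannel snapshotChannel;

    public FsyncUnderLockSketch(Path snapshotFile) throws Exception {
        this.snapshotChannel = FileChannel.open(
            snapshotFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // Runs on a request-handler thread when the active segment rolls.
    public void roll() throws Exception {
        synchronized (lock) {
            // fsync(2) inside the lock: on a slow disk this can take hundreds of
            // milliseconds, or tens of seconds during a disk glitch.
            snapshotChannel.force(true);
        }
    }

    // Every produce request to the same partition also needs the lock ...
    public void append(ByteBuffer records) throws Exception {
        synchronized (lock) {
            // ... so appends queue up behind the fsync above, pinning
            // request-handler threads and inflating thread-time utilization.
            snapshotChannel.write(records);
        }
    }
}
{code}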

h2. Analysis of fsync(2) inside UnifiedLog.lock

First, fsyncs at start-up/shutdown time aren't a problem, since the broker 
isn't processing requests at those points.

Given that, there are essentially 4 problematic call paths listed below:
h3. 1. [ProducerStateManager.takeSnapshot at 
UnifiedLog.roll|https://github.com/apache/kafka/blob/3f4816dd3eafaf1a0636d3ee689069f897c99e28/core/src/main/scala/kafka/log/UnifiedLog.scala#L2133]
 * Here, the solution is simply to move the fsync(2) call to the scheduler thread as part 
of the existing "flush-log" job (before incrementing the recovery point); see the sketch below

h3. 2. [ProducerStateManager.removeAndMarkSnapshotForDeletion as part of log 
segment 
deletion|https://github.com/apache/kafka/blob/3f4816dd3eafaf1a0636d3ee689069f897c99e28/core/src/main/scala/kafka/log/UnifiedLog.scala#L2133]
 * removeAndMarkSnapshotForDeletion calls Utils.atomicMoveWithFallback with 
parent-dir flushing when renaming the snapshot to add the .deleted suffix
 * Here, I suppose we don't need to flush the parent dir (see the sketch below)
 * In the worst case, a few producer snapshots that should have been deleted 
remain, lacking the .deleted suffix, after an unclean shutdown
 ** In that case, these files will eventually be deleted anyway, so it shouldn't be a big 
problem.
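
A minimal sketch of what skipping the parent-dir flush means, written against plain java.nio rather than the real Utils.atomicMoveWithFallback path (names are illustrative):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative only: rename a producer snapshot to "<name>.deleted" without
// flushing the parent directory afterwards.
public class MarkSnapshotDeletedSketch {

    public static Path markForDeletion(Path snapshot) throws IOException {
        Path target = snapshot.resolveSibling(snapshot.getFileName() + ".deleted");
        // Atomic rename; no fsync of the parent directory follows, so the rename
        // itself is not guaranteed to be durable.
        Files.move(snapshot, target, StandardCopyOption.ATOMIC_MOVE);
        // If the rename is lost in an unclean shutdown, the snapshot merely keeps
        // its original name and is cleaned up later, so skipping the flush is safe.
        return target;
    }
}
{code}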

h3. 3. [LeaderEpochFileCache.truncateFromStart when incrementing 
log-start-offset|https://github.com/apache/kafka/blob/3f4816dd3eafaf1a0636d3ee689069f897c99e28/core/src/main/scala/kafka/log/UnifiedLog.scala#L986]
 * This path is called from deleteRecords on request-handler threads.
 * Here, we actually don't need fsync(2) either (see the sketch after path 4).
 * Upon unclean shutdown, a few stale leader epochs might remain in the file, 
but they will be [handled by 
LogLoader|https://github.com/apache/kafka/blob/3f4816dd3eafaf1a0636d3ee689069f897c99e28/core/src/main/scala/kafka/log/LogLoader.scala#L185]
 on start-up, so this is not a problem

h3. 4. [LeaderEpochFileCache.truncateFromEnd as part of log 
truncation|https://github.com/apache/kafka/blob/3f4816dd3eafaf1a0636d3ee689069f897c99e28/core/src/main/scala/kafka/log/UnifiedLog.scala#L1663]
 
 * Though this path is called mainly on replica fetcher threads, blocking 
replica fetchers isn't ideal either, since it could cause remote-scope produce 
performance degradation on the leader side
 * Likewise, we don't need fsync(2) here, since any epochs left untruncated 
will be handled by the log loading procedure on start-up (a sketch covering 
both this path and path 3 follows below)
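
For both truncation paths (3 and 4), the change boils down to rewriting the leader-epoch checkpoint without forcing it to disk and relying on the start-up reconciliation. A minimal, illustrative sketch under those assumptions (not the actual LeaderEpochFileCache/checkpoint code):
{code:java}
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Illustrative only: rewrite a leader-epoch checkpoint file without fsync.
public class EpochCheckpointSketch {

    // One "epoch startOffset" entry, mirroring the checkpoint file layout.
    public static final class EpochEntry {
        final int epoch;
        final long startOffset;

        public EpochEntry(int epoch, long startOffset) {
            this.epoch = epoch;
            this.startOffset = startOffset;
        }
    }

    public static void writeWithoutFsync(Path checkpointFile, List<EpochEntry> entries)
            throws IOException {
        Path tmp = checkpointFile.resolveSibling(checkpointFile.getFileName() + ".tmp");
        try (BufferedWriter writer = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            for (EpochEntry e : entries) {
                writer.write(e.epoch + " " + e.startOffset);
                writer.newLine();
            }
            // No FileChannel.force(true) here: after an unclean shutdown the file
            // may contain stale epochs, but the start-up log-loading procedure
            // truncates the cache against the recovered log start/end offsets.
        }
        Files.move(tmp, checkpointFile, StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}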

> Produce performance issue under high disk load
> ----------------------------------------------
>
>                 Key: KAFKA-15046
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15046
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 3.3.2
>            Reporter: Haruki Okada
>            Assignee: Haruki Okada
>            Priority: Major
>              Labels: performance
>         Attachments: image-2023-06-01-12-46-30-058.png, 
> image-2023-06-01-12-52-40-959.png, image-2023-06-01-12-54-04-211.png, 
> image-2023-06-01-12-56-19-108.png
>
>
> * Phenomenon:
>  ** !image-2023-06-01-12-46-30-058.png|width=259,height=236!
>  ** Producer response time 99%ile got quite bad when we performed replica 
> reassignment on the cluster
>  *** RequestQueue scope was significant
>  ** Request-time throttling also happened during the incident. This caused 
> producers to delay sending messages in the meantime.
>  ** The disk I/O latency was higher than usual due to the high load for 
> replica reassignment.
>  *** !image-2023-06-01-12-56-19-108.png|width=255,height=128!
>  * Analysis:
>  ** The request-handler utilization was much higher than usual.
>  *** !image-2023-06-01-12-52-40-959.png|width=278,height=113!
>  ** Also, thread time utilization was much higher than usual on almost all 
> users
>  *** !image-2023-06-01-12-54-04-211.png|width=276,height=110!
>  ** From taking jstack several times, we found that in most of them a 
> request handler was doing fsync to flush the ProducerState while other 
> request handlers were waiting on Log#lock to append messages.
>  * 
>  ** 
>  *** 
> {code:java}
> "data-plane-kafka-request-handler-14" #166 daemon prio=5 os_prio=0 
> cpu=51264789.27ms elapsed=599242.76s tid=0x00007efdaeba7770 nid=0x1e704 
> runnable  [0x00007ef9a12e2000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcherImpl.force0(java.base@11.0.17/Native 
> Method)
>         at 
> sun.nio.ch.FileDispatcherImpl.force(java.base@11.0.17/FileDispatcherImpl.java:82)
>         at 
> sun.nio.ch.FileChannelImpl.force(java.base@11.0.17/FileChannelImpl.java:461)
>         at 
> kafka.log.ProducerStateManager$.kafka$log$ProducerStateManager$$writeSnapshot(ProducerStateManager.scala:451)
>         at 
> kafka.log.ProducerStateManager.takeSnapshot(ProducerStateManager.scala:754)
>         at kafka.log.UnifiedLog.roll(UnifiedLog.scala:1544)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.maybeRoll(UnifiedLog.scala:1523)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.append(UnifiedLog.scala:919)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:760)
>         at 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1170)
>         at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1158)
>         at 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:956)
>         at 
> kafka.server.ReplicaManager$$Lambda$2379/0x0000000800b7c040.apply(Unknown 
> Source)
>         at 
> scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>         at 
> scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>         at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>         at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:944)
>         at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:602)
>         at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:666)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
>         at java.lang.Thread.run(java.base@11.0.17/Thread.java:829) {code}
>  * 
>  ** There were also a bunch of log entries showing that writing producer 
> snapshots took hundreds of milliseconds.
>  *** 
> {code:java}
> ...
> [2023-05-01 11:08:36,689] INFO [ProducerStateManager partition=xxx-4] Wrote 
> producer snapshot at offset 1748817854 with 8 producer ids in 809 ms. 
> (kafka.log.ProducerStateManager)
> [2023-05-01 11:08:37,319] INFO [ProducerStateManager partition=yyy-34] Wrote 
> producer snapshot at offset 247996937813 with 0 producer ids in 547 ms. 
> (kafka.log.ProducerStateManager)
> [2023-05-01 11:08:38,887] INFO [ProducerStateManager partition=zzz-9] Wrote 
> producer snapshot at offset 226222355404 with 0 producer ids in 576 ms. 
> (kafka.log.ProducerStateManager)
> ... {code}
>  * From the analysis, we summarized the issue as below:
>  * 
>  ** 1. Disk write latency got worse due to the replica reassignment
>  *** We already use a replication quota, and lowering the quota further may not 
> be acceptable because the reassignment would then take too long
>  ** 2. ProducerStateManager#takeSnapshot started to take time due to fsync 
> latency
>  *** This is done at every log segment roll.
>  *** In our case, the broker hosts high-load partitions, so log rolls occur 
> very frequently.
>  ** 3. While ProducerStateManager#takeSnapshot is doing fsync, all 
> subsequent produce requests to the partition are blocked on Log#lock
>  ** 4. While produce requests wait for the lock, they consume request-handler 
> thread time, so it is accounted as thread-time utilization and caused 
> throttling
>  * Suggestion:
>  ** We didn't see this phenomenon when we used Kafka 2.4.1.
>  *** ProducerState fsync was introduced in 2.8.0 by this: 
> https://issues.apache.org/jira/browse/KAFKA-9892
>  ** Though the reason why ProducerState needs to be fsynced is not well 
> described in the above ticket, we think fsync is not really necessary here, 
> because:
>  * 
>  ** 
>  *** If the ProducerState snapshot file was not written to disk before a power 
> failure, it will simply be rebuilt from the logs.
>  *** Also, even if the ProducerState snapshot was corrupted by a power failure, 
> it will be rebuilt from the logs thanks to the CRC check



