[ https://issues.apache.org/jira/browse/KAFKA-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802074#comment-16802074 ]
Jun Rao commented on KAFKA-7504:
--------------------------------
[~cwaldrop], thanks for the results. It does seem to be a nice improvement. The 99% fetch response send time can still be 1000ms, which is kind of high. Was that because, after the log segment chunk is pre-touched, it gets evicted quickly by the time it's being sent to the network?

> Broker performance degradation caused by call of sendfile reading disk in network thread
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7504
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7504
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.10.2.1
>            Reporter: Yuto Kawamura
>            Assignee: Yuto Kawamura
>            Priority: Major
>              Labels: latency, performance
>         Attachments: Network_Request_Idle_After_Patch.png, Network_Request_Idle_Per_Before_Patch.png, Response_Times_After_Patch.png, Response_Times_Before_Patch.png, image-2018-10-14-14-18-38-149.png, image-2018-10-14-14-18-57-429.png, image-2018-10-14-14-19-17-395.png, image-2018-10-14-14-19-27-059.png, image-2018-10-14-14-19-41-397.png, image-2018-10-14-14-19-51-823.png, image-2018-10-14-14-20-09-822.png, image-2018-10-14-14-20-19-217.png, image-2018-10-14-14-20-33-500.png, image-2018-10-14-14-20-46-566.png, image-2018-10-14-14-20-57-233.png
>
> h2. Environment
> OS: CentOS6
> Kernel version: 2.6.32-XX
> Kafka version: 0.10.2.1, 0.11.1.2 (but also reproduces with the latest build from trunk (2.2.0-SNAPSHOT))
> h2. Phenomenon
> Response time of Produce requests (99th ~ 99.9th %ile) degrades to 50x ~ 100x the usual value.
> Normally the 99th %ile is lower than 20ms, but when this issue occurs it reaches 50ms to 200ms.
> At the same time we could see two more things in the metrics:
> 1. Coinciding disk reads from the volume assigned to log.dirs.
> 2. A rise in network thread utilization (by `kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`)
> As we didn't see an increase in requests in the metrics, we suspected blocking in the event loop run by the network thread as the cause of the rising network thread utilization.
> Reading through the Kafka broker source code, we understood that the only disk IO performed in the network thread is reading log data through calls to sendfile(2) (via FileChannel#transferTo).
> To prove that calls to sendfile(2) block the network thread for some moments, I ran the following SystemTap script to inspect the duration of sendfile syscalls.
> {code:java}
> # SystemTap script to measure syscall duration
> global s
> global records
>
> probe syscall.$1 {
>   s[tid()] = gettimeofday_us()
> }
>
> probe syscall.$1.return {
>   elapsed = gettimeofday_us() - s[tid()]
>   delete s[tid()]
>   records <<< elapsed
> }
>
> probe end {
>   print(@hist_log(records))
> }{code}
> {code:java}
> $ stap -v syscall-duration.stp sendfile
> # value (us)
> value |---------------------------------------- count
>     0 |                                             0
>     1 |                                            71
>     2 |@@@                                       6171
>    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  29472
>    32 |@@@                                       3418
>  2048 |                                             0
>   ...
>  8192 |                                             3{code}
> As you can see, there were some cases taking more than a few milliseconds, which implies that sendfile blocks the network thread for that long and applies the same latency to all other request/response processing.
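> For illustration, here is a minimal sketch (hypothetical class and method names, not Kafka's actual code) of the zero-copy send path described above: the network thread pushes a log-segment region to the client socket with FileChannel#transferTo, which the JDK maps to sendfile(2) on Linux, so a page-cache miss in this single call can stall the whole network thread until the disk read completes.
> {code:java}
> import java.io.IOException;
> import java.nio.channels.FileChannel;
> import java.nio.channels.SocketChannel;
> import java.nio.file.Path;
> import java.nio.file.StandardOpenOption;
>
> // Simplified sketch of sending a fetched log-segment range; names are illustrative only.
> final class LogSegmentSend {
>     private final FileChannel segment; // the log segment file being fetched
>     private final long startOffset;    // first byte of the requested range
>     private final long size;           // number of bytes to send
>     private long written = 0;
>
>     LogSegmentSend(Path segmentFile, long startOffset, long size) throws IOException {
>         this.segment = FileChannel.open(segmentFile, StandardOpenOption.READ);
>         this.startOffset = startOffset;
>         this.size = size;
>     }
>
>     // Called from the network thread's event loop whenever the socket is writable.
>     long writeTo(SocketChannel socket) throws IOException {
>         // transferTo -> sendfile(2): zero-copy, but it can block on a page-cache miss.
>         long n = segment.transferTo(startOffset + written, size - written, socket);
>         written += n;
>         return n;
>     }
>
>     boolean completed() {
>         return written >= size;
>     }
> }
> {code}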
> h2. Hypothesis
> Gathering the above observations, I made the following hypothesis.
> Let's say network-thread-1 is multiplexing 3 connections:
> - producer-A
> - follower-B (broker replica fetch)
> - consumer-C
> The broker receives requests from each of those clients: [Produce, FetchFollower, FetchConsumer].
> They are processed by the request handler threads, and now the response queue of the network thread contains 3 responses in the following order: [FetchConsumer, Produce, FetchFollower].
> network-thread-1 takes the 3 responses and processes them sequentially ([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/SocketServer.scala#L632]).
> Ideally, processing these 3 responses completes within microseconds, since it just copies ready responses into the client sockets' buffers in a non-blocking manner.
> However, Kafka uses sendfile(2) for transferring log data to client sockets. The target data might be in the page cache, but old data that was written a while ago and has never been read since is likely not.
> If the target data isn't in the page cache, the kernel first needs to load the target pages into the cache. This takes from a few milliseconds to tens of milliseconds, depending on the disk hardware and the load currently applied to it. The Linux kernel does not treat waiting for data to be loaded from disk into the page cache as a condition that would block, so sendfile(2) waits for the load to complete rather than returning EAGAIN.
> Thus, if this occurs while processing the first FetchConsumer response, the latency from reading disk applies to all following responses as well: not just those already picked up by the network thread, but also those still in the response queue of that network thread.
> h2. Experiment
> To reproduce this issue, I ran an experiment on our test cluster.
> The Kafka cluster consists of 3 broker machines with the following spec:
> CPU: Intel(R) Xeon(R) 2.20GHz x 20 cores (HT) * 2
> Memory: 256GiB
> Network: 10Gbps
> Disk: HDD x 12, RAID 1+0
> Essential broker configs are below:
> {code:java}
> num.network.threads=10
> num.io.threads=30
> num.replica.fetchers=3{code}
> I created two topics, both with replicas=3, retention.ms=2days and min.insync.replicas=2:
> - {{large-topic}}: partitions=60
> - {{regular-topic}}: partitions=30
>
> I ran a producer which produces 1KB records into {{large-topic}} until around 3GiB of data had accumulated for each partition (60GiB in total per broker), then stopped the producer.
> To simulate the situation where old data has been dropped from the page cache, I ran {{echo 1 > /proc/sys/vm/drop_caches}} on each broker machine.
> I then started another producer which produces 512-byte records into {{regular-topic}}.
> Throughput is 56k messages/sec. The 99.9th %ile response time for Produce is very good:
> !image-2018-10-14-14-18-38-149.png!
> !image-2018-10-14-14-18-57-429.png!
> Then I started a consumer which reads from the head of {{large-topic}}, which is no longer in the page cache.
> Response time of Produce degrades and throughput decreases:
> !image-2018-10-14-14-19-17-395.png!
> !image-2018-10-14-14-19-27-059.png!
> Disk reads occur and the network threads get busier:
> !image-2018-10-14-14-19-41-397.png! !image-2018-10-14-14-19-51-823.png!
> h2. Solution
> To fix this issue we considered several ideas, but finally decided to take the following approach.
> The essential problem is blocking in the middle of the event loop. If the loop does something that is not purely computation (like blocking IO), the resulting latency directly applies to other, irrelevant requests, and CPU cores that could be doing work for other requests sit idle while waiting for the disk IO to complete.
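> To make this head-of-line blocking concrete, here is a toy, purely illustrative single-threaded send loop (not Kafka's actual SocketServer): one response whose send blocks on disk delays every response queued behind it on the same thread.
> {code:java}
> import java.util.Queue;
> import java.util.concurrent.ConcurrentLinkedQueue;
>
> // Toy model of a network thread; names are illustrative, not Kafka's code.
> final class ToyNetworkThread implements Runnable {
>     interface Response {
>         // For a fetch response this ends up in FileChannel#transferTo, i.e.
>         // sendfile(2), which may block for milliseconds on a page-cache miss.
>         void writeToSocket();
>     }
>
>     // Filled by request handler threads once a response is ready.
>     private final Queue<Response> responseQueue = new ConcurrentLinkedQueue<>();
>
>     void enqueue(Response r) {
>         responseQueue.add(r);
>     }
>
>     @Override
>     public void run() {
>         while (true) {
>             // Drain the response queue sequentially, as the single event loop does.
>             Response r;
>             while ((r = responseQueue.poll()) != null) {
>                 // If this send blocks on disk for 10 ms, the Produce and
>                 // FetchFollower responses behind it are delayed by the same
>                 // 10 ms, even though their data is already in memory.
>                 r.writeToSocket();
>             }
>             // ... then poll the selector for new requests and writable sockets.
>         }
>     }
> }
> {code}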
> To fix this, there are two possible approaches:
> A. Make the call to sendfile(2) not wait, even when pages have to be loaded from disk into memory.
> B. Make sure that the data is always available in memory by the time the response object is passed to the network thread.
> For A, we would need support from the kernel. FreeBSD recently added such a non-blocking style sendfile system call in cooperation with the community: [https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/]
> It might be useful for a case like this, but this time it wasn't an option because:
> 1. Linux has no such support in its syscalls.
> 2. Even with kernel support, we can't use it until the JDK supports it; otherwise we would have to introduce JNI code.
> So we pursued approach B and tried several ways.
> The basic idea is to read the target data once in the request handler thread, so that the page cache for the target data gets populated and stays warm until the data is transferred to the client socket through sendfile(2) called from the network thread.
> The easiest way is to call read(2) on the target data. However, this is risky from an overhead point of view, since it incurs the cost of copying data from kernel to userland (exactly what Kafka avoids by leveraging sendfile(2)). The buffer allocated for read()ing the data would also cause heap pressure unless implemented carefully (perhaps using off-heap memory).
> We tried several other ways (such as calling the readahead(2) syscall via JNI and checking load completion with mincore(2) in a loop), but finally found that calling sendfile on the target data with the destination set to /dev/null does exactly what we want.
> When the destination fd points to /dev/null, the Linux kernel loads the data from disk into the page cache but skips (not exactly the whole part of) copying the loaded data to the destination device.
> To implement this we only need Java code, since we can expect FileChannel#transferTo to call sendfile internally; a sketch is given below.
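> The sketch below uses illustrative names only and is not the actual patch; it assumes a Linux kernel that accepts a non-socket destination fd for sendfile(2) and a JDK whose FileChannel#transferTo is backed by sendfile for such a destination (on other platforms transferTo may fall back to an ordinary copy, which still populates the page cache but loses the zero-copy property).
> {code:java}
> import java.io.IOException;
> import java.nio.channels.FileChannel;
> import java.nio.file.Paths;
> import java.nio.file.StandardOpenOption;
>
> // Hypothetical helper, not the actual patch: warm the page cache for the
> // fetched range by "sending" it to /dev/null before the network thread
> // later sends the same range to the client socket.
> final class PageCacheWarmup {
>     private static final FileChannel DEV_NULL;
>
>     static {
>         try {
>             DEV_NULL = FileChannel.open(Paths.get("/dev/null"), StandardOpenOption.WRITE);
>         } catch (IOException e) {
>             throw new ExceptionInInitializerError(e);
>         }
>     }
>
>     // Called from a request handler thread before the fetch response is handed
>     // to the network thread.
>     static void warmUp(FileChannel segment, long offset, long length) throws IOException {
>         long position = offset;
>         long remaining = length;
>         while (remaining > 0) {
>             // transferTo -> sendfile(2) with a /dev/null destination: the kernel
>             // reads the pages from disk into the page cache, but the data never
>             // reaches user space.
>             long transferred = segment.transferTo(position, remaining, DEV_NULL);
>             if (transferred <= 0) {
>                 break; // end of file or nothing transferable
>             }
>             position += transferred;
>             remaining -= transferred;
>         }
>     }
> }
> {code}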
> We patched our Kafka broker to do this warmup, and it completely fixed the problem. The patched version is based on 0.10.2.1, but I'm now testing the same patch on the latest build from trunk (905f8135078127e08633400277c5829b10971d42) and it seems to work effectively so far.
> To avoid the overhead of an extra syscall (sendfile to /dev/null) for every Fetch request, I added a small optimization: skip the last log segment of each topic partition, assuming it contains only newly written data which is still hot and resident in the page cache.
> With this optimization, since most Fetch requests read the tail of a topic-partition in our case, only a small fraction of Fetch requests (less than 1%) are subject to the warmup. Hence we haven't seen any overhead in CPU utilization since we deployed this patch to our production system.
>
> To test its effect in the same experiment scenario, I deployed the patch to the experiment cluster and tested again.
> When the catch-up read consumer started, the network threads got busier again, but utilization was a bit lower. The per-client request-time metric showed that this time the producer (normal client) was not affected:
> !image-2018-10-14-14-20-09-822.png! !image-2018-10-14-14-20-19-217.png!
> The same amount of disk reads was occurring:
> !image-2018-10-14-14-20-33-500.png!
> However, there was no Produce time degradation and throughput was stable:
> !image-2018-10-14-14-20-46-566.png! !image-2018-10-14-14-20-57-233.png!
> h2. Suggestion
> The performance degradation explained in this issue can happen very easily in any Kafka deployment. Any of the following clients can cause a Fetch for old data, and almost all clients connected to the same network thread which processes that Fetch can be affected:
> - Consumers lagging behind and processing offsets far behind the latest offset.
> - Consumers which start periodically and consume accumulated topic data at once.
> - A follower broker which is attempting to restore its replica from the current leaders.
> This patch can fix the performance degradation in this scenario. However, we still have one concern: while the whole patch is pure Java code, its semantics heavily depend on the underlying kernel implementation. We haven't tested this on any OS other than Linux, so we are not sure whether it is as effective on other platforms.
> Hence I think it's better to add a configuration option which controls whether the Kafka broker is allowed to do this or not.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)