[ 
https://issues.apache.org/jira/browse/KAFKA-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801935#comment-16801935
 ] 

Cabe Waldrop commented on KAFKA-7504:
-------------------------------------

We have been working on recreating this behavior in our lab environment before 
moving to staging, and we were able to finish our testing there yesterday. It 
was a bit harder to recreate the issue in our lab cluster, but we took the 
following steps:

 
 * We reduced the ZFS ARC cache maximum to 128 MB to ensure that the cache was 
rapidly turning over and forcing disk reads.
 * We created several poorly behaved producers and consumers along with good 
ones. We had 310 producers producing small messages across 10 topics and 100 
consumers reading from the beginning of those topics. In addition, we populated 
another topic with messages of approximately 1 KB using 10 producers, and we 
had a small pool of another 50 consumers reading from the end offset of all 
topics.
 * Our poorly behaved producers produced 1 message at a time, and our poorly 
behaved consumers slept for nearly the max poll interval so that they would 
constantly be lagging (a sketch of such a consumer is shown after this list).
 * We reduced the number of I/O and network threads and shrank the request 
queue with the following configs: num.io.threads=1, num.network.threads=1, 
queued.max.requests=5

 

Since this patch specifically addresses how request serving is impacted when a 
cluster is strapped for resources, we deliberately reduced the resources 
available to the cluster in a way that made it easier for us to max them out 
with clients. This simulates an overburdened cluster without compromising the 
integrity of the experiment.

Our lab machines had the following specs:
 * 5 brokers
 * 16 CPU cores per broker
 * 36 GB memory per broker (ZFS ARC limited to 128 MB)

 

We saw dramatic improvement in both 99^th^ percentile fetch and produce request 
times following the implementation of this patch.

 

Before:

99^th^ percentile fetch response time: 1,000 – 8,000 ms

99^th^ percentile produce response time: 1,000 – 2,200 ms

Network Thread Idle Percentage: 5% – 7.5%

Request Handler Idle Percentage: 75% – 81%

 

After:

99^th^ percentile fetch response time: 0 – 1,000 ms

99^th^ percentile produce response time: 0 – 12 ms

Network Thread Idle Percentage: 55% – 60%

Request Handler Idle Percentage: 30% – 40%

 

We intend to continue this testing in our staging environment, and if it 
performs well there, we will move this to production, where we are seeing the 
most severe symptoms. We will continue to post our findings here.

!Network_Request_Idle_After_Patch.png! !Response_Times_After_Patch.png!

 

> Broker performance degradation caused by call of sendfile reading disk in 
> network thread
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7504
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7504
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.10.2.1
>            Reporter: Yuto Kawamura
>            Assignee: Yuto Kawamura
>            Priority: Major
>              Labels: latency, performance
>         Attachments: Network_Request_Idle_After_Patch.png, 
> Network_Request_Idle_Per_Before_Patch.png, Response_Times_After_Patch.png, 
> Response_Times_Before_Patch.png, image-2018-10-14-14-18-38-149.png, 
> image-2018-10-14-14-18-57-429.png, image-2018-10-14-14-19-17-395.png, 
> image-2018-10-14-14-19-27-059.png, image-2018-10-14-14-19-41-397.png, 
> image-2018-10-14-14-19-51-823.png, image-2018-10-14-14-20-09-822.png, 
> image-2018-10-14-14-20-19-217.png, image-2018-10-14-14-20-33-500.png, 
> image-2018-10-14-14-20-46-566.png, image-2018-10-14-14-20-57-233.png
>
>
> h2. Environment
> OS: CentOS6
> Kernel version: 2.6.32-XX
>  Kafka version: 0.10.2.1, 0.11.1.2 (but also reproduces with the latest 
> build from trunk (2.2.0-SNAPSHOT))
> h2. Phenomenon
> Response time of Produce requests (99th ~ 99.9th %ile) degrades to 50x ~ 100x 
> the usual value.
>  Normally the 99th %ile is lower than 20ms, but when this issue occurs it 
> reaches 50ms to 200ms.
> At the same time we could see two more things in the metrics:
> 1. Disk reads coinciding on the volume assigned to log.dirs.
>  2. A rise in network thread utilization (via 
> `kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`)
> As we didn't see an increase of requests in the metrics, we suspected 
> blocking in the event loop run by the network thread as the cause of the 
> rising network thread utilization.
>  Reading through the Kafka broker source code, we understood that the only 
> disk IO performed in the network thread is reading log data by calling 
> sendfile(2) (via FileChannel#transferTo).
>  To prove that calls of sendfile(2) block the network thread for some 
> moments, I ran the following SystemTap script to inspect the duration of 
> sendfile syscalls.
> {code:java}
> # Systemtap script to measure syscall duration
> global s
> global records
> probe syscall.$1 {
>     s[tid()] = gettimeofday_us()
> }
> probe syscall.$1.return {
>     elapsed = gettimeofday_us() - s[tid()]
>     delete s[tid()]
>     records <<< elapsed
> }
> probe end {
>     print(@hist_log(records))
> }{code}
> {code:java}
> $ stap -v syscall-duration.stp sendfile
> # value (us)
> value |---------------------------------------- count
>     0 |                                             0
>     1 |                                            71
>     2 |@@@                                       6171
>    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  29472
>    32 |@@@                                       3418
>  2048 |                                             0
> ...
>  8192 |                                             3{code}
> As you can see, there were some calls taking more than a few milliseconds, 
> which implies that sendfile(2) blocks the network thread for that long, 
> applying the same latency to all other request/response processing.
> h2. Hypothesis
> Gathering the above observations, I made the following hypothesis.
> Let's say network-thread-1 is multiplexing 3 connections:
>  - producer-A
>  - follower-B (broker replica fetch)
>  - consumer-C
> The broker receives requests from each of those clients: [Produce, 
> FetchFollower, FetchConsumer].
> They are processed by request handler threads, and now the response queue of 
> the network thread contains 3 responses in the following order: 
> [FetchConsumer, Produce, FetchFollower].
> network-thread-1 takes the 3 responses and processes them sequentially 
> ([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/SocketServer.scala#L632]).
>  Ideally, processing these 3 responses completes in microseconds, since it 
> just copies the ready responses into the client sockets' buffers in a 
> non-blocking manner.
>  However, Kafka uses sendfile(2) for transferring log data to client sockets. 
> The target data might be in page cache, but old data which was written a 
> while ago and never read since then is likely not.
>  If the target data isn't in page cache, the kernel first needs to load the 
> target pages into the cache. This takes from a few milliseconds up to tens of 
> milliseconds, depending on the disk hardware and the load currently applied 
> to it.
>  The Linux kernel doesn't count the time spent loading data from disk into 
> page cache as "blocking", hence sendfile(2) awaits completion of the load 
> rather than returning EAGAIN.
>  Thus, if this occurs while processing the first FetchConsumer response, the 
> latency from reading disk applies to all following responses as well: not 
> just the one currently being written by the network thread, but also those 
> still in the response queue of that network thread.
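> To make the effect concrete, here is a simplified, self-contained sketch (not 
> the actual SocketServer code; the Response interface and timings are made up 
> for illustration) of how one blocking write delays every response queued on 
> the same network thread:
> {code:java}
> // Simplified model of a network thread draining its response queue.
> import java.util.ArrayDeque;
> import java.util.Queue;
> 
> public class NetworkThreadSketch {
>     interface Response { void writeToSocket(); } // stand-in for a real response
> 
>     static void processNewResponses(Queue<Response> responseQueue) {
>         Response response;
>         while ((response = responseQueue.poll()) != null) {
>             // For a Fetch response this ends up in FileChannel#transferTo,
>             // i.e. sendfile(2). If the log data is not in page cache, the
>             // call blocks until the kernel has read the pages from disk, and
>             // every later response in the queue waits behind that disk read.
>             response.writeToSocket();
>         }
>     }
> 
>     public static void main(String[] args) {
>         Queue<Response> queue = new ArrayDeque<>();
>         queue.add(() -> sleep(100)); // "cold" FetchConsumer response: blocks on disk
>         queue.add(() -> sleep(0));   // Produce response: should be instant, but waits
>         queue.add(() -> sleep(0));   // FetchFollower response: also waits
>         long start = System.nanoTime();
>         processNewResponses(queue);
>         System.out.printf("all responses flushed after %d ms%n",
>                 (System.nanoTime() - start) / 1_000_000);
>     }
> 
>     private static void sleep(long ms) {
>         try { Thread.sleep(ms); } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>         }
>     }
> }
> {code}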
> h2. Experiment
> For reproducing this issue, I did an experiment on our test cluster.
> The Kafka cluster consists of 3 broker machines with the following spec:
>  CPU: Intel(R) Xeon(R) 2.20GHz x 20 cores (HT) * 2
>  Memory: 256GiB
>  Network: 10Gbps
>  Disk: HDD x 12 RAID 1+0
> Essential broker configs are below:
> {code:java}
> num.network.threads=10
> num.io.threads=30
> num.replica.fetchers=3{code}
> And created two topics, both with replicas=3, {{retention.ms}}=2 days and 
> min.insync.replicas=2:
>  - {{large-topic}}: partitions=60
>  - {{regular-topic}}: partitions=30
>  
> Ran a producer which produces 1kb records into {{large-topic}} until around 
> 3GiB of data accumulated for each partition (60GiB in total per broker), 
> then stopped the producer.
> To simulate the situation where old data has been dropped from page cache, 
> ran {{echo 1 > /proc/sys/vm/drop_caches}} on each broker machine.
> Started another producer which produces 512-byte records into the topic 
> {{regular-topic}}.
>  Throughput is 56k messages/sec. The 99.9th %ile response time for Produce 
> is very good:
> !image-2018-10-14-14-18-38-149.png!  
> !image-2018-10-14-14-18-57-429.png!
> Then started a consumer which reads from the head of {{large-topic}}, which 
> is no longer in page cache.
> Response time of Produce degrades and throughput decreases:
> !image-2018-10-14-14-19-17-395.png!
> !image-2018-10-14-14-19-27-059.png!
> Disk reads are occurring and the network threads get busier:
> !image-2018-10-14-14-19-41-397.png!   !image-2018-10-14-14-19-51-823.png!
> h2. Solution
> To fix this issue we considered several ideas, but finally decided to take 
> the following approach.
> The essential problem is blocking in the middle of the event loop. If the 
> loop does something which is not pure computation (like blocking IO), the 
> resulting latency directly applies to other, unrelated requests, and CPU 
> cores that could be doing work for those requests sit idle while awaiting 
> completion of the disk IO.
>  There are two possible ways to fix this:
>  A. Make the call of sendfile(2) not wait, even when pages have to be loaded 
> from disk into memory.
>  B. Make sure that the data is always available in memory when the response 
> object is passed to the network thread.
> For A (making the call of sendfile(2) not wait even when pages have to be 
> loaded from disk into memory), we would need support from the kernel.
>  FreeBSD recently added such a non-blocking style of the sendfile system 
> call, in cooperation with the community: 
> [https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/]
>  
>  It might be useful for cases like this, but this time it wasn't an option 
> because:
>  1. Linux doesn't have such support at the syscall level.
>  2. Even with kernel support, we can't use it until the JDK supports it; 
> otherwise we would have to introduce JNI code.
> So we pursued approach B and tried several ways.
>  The basic idea is to read the target data once in the request handler 
> thread, so that the page cache for the target data gets populated and stays 
> warm until the data is transferred to the client socket through the 
> sendfile(2) called from the network thread.
> The easiest way is to do read(2) on the target data. However, this is risky 
> from an overhead point of view, as it would introduce the overhead of copying 
> data from kernel to userland (which is exactly what Kafka avoids by 
> leveraging sendfile(2)). Also, the buffer allocated for read()ing the data 
> would cause heap pressure unless we implemented it in an optimal way 
> (perhaps using off-heap buffers).
> We tried several other ways (like calling the readahead(2) syscall and 
> checking completion of the load by calling mincore(2) in a loop through 
> JNI), but we finally found that calling sendfile for the target data with 
> the destination set to /dev/null does exactly what we want.
>  When the destination fd points to /dev/null, the Linux kernel loads the 
> data from disk into page cache, but skips (most of) the copying of the 
> loaded data to the destination device.
>  To implement this, we just need Java code, since we can expect 
> FileChannel#transferTo to call sendfile internally.
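> A minimal sketch of that warmup idea (illustrative only, not the actual 
> patch; the PageCacheWarmer class and its parameters are hypothetical):
> {code:java}
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.nio.channels.FileChannel;
> import java.nio.channels.WritableByteChannel;
> 
> public class PageCacheWarmer {
>     /**
>      * Transfer the requested range of the log segment to /dev/null from the
>      * request handler thread. On Linux, FileChannel#transferTo is expected
>      * to use sendfile(2), so the kernel pulls the pages into page cache
>      * while skipping most of the copy to the destination device.
>      */
>     public static void warmUp(FileChannel logSegment, long position, long count)
>             throws IOException {
>         try (FileOutputStream devNull = new FileOutputStream("/dev/null")) {
>             WritableByteChannel sink = devNull.getChannel();
>             long transferred = 0;
>             while (transferred < count) {
>                 long n = logSegment.transferTo(position + transferred,
>                         count - transferred, sink);
>                 if (n <= 0) {
>                     break; // nothing more to transfer
>                 }
>                 transferred += n;
>             }
>         }
>     }
> }
> {code}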
> We patched our Kafka broker to do this warmup, and it completely fixed this 
> problem. The patched version is based on 0.10.2.1, but I'm now testing the 
> same patch on the latest build from 
> trunk (905f8135078127e08633400277c5829b10971d42) and it seems to work 
> effectively so far.
> To avoid the additional overhead of calling one more syscall (sendfile to 
> /dev/null) for every Fetch request, I added a small optimization that skips 
> the last log segment of a topic partition, assuming it contains only newly 
> written data which is still hot and present in page cache.
>  With this optimization, and since most Fetch requests read from the tail of 
> the topic-partition in our case, fewer than 1% of Fetch requests are subject 
> to this warmup operation. Hence we haven't seen any overhead in CPU 
> utilization since we deployed this patch to our production system.
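> The skip itself can be thought of as a simple guard like the following 
> (hypothetical helper, not the actual patch code):
> {code:java}
> public class WarmupPolicy {
>     /**
>      * Only reads that fall before the active (last) segment need the
>      * /dev/null warmup; data in the active segment was written recently and
>      * is assumed to still be in page cache.
>      */
>     public static boolean needsWarmup(long fetchSegmentBaseOffset,
>                                       long activeSegmentBaseOffset) {
>         return fetchSegmentBaseOffset < activeSegmentBaseOffset;
>     }
> }
> {code}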
>  
> To test its effect in the same experiment scenario, I deployed our patch to 
> the experiment cluster and tested again.
> When the catch-up read consumer started, the network threads got busier 
> again, but utilization was a bit lower. And the per-client request-time 
> metric showed that this time the producer (the normal client) wasn't 
> affected:
> !image-2018-10-14-14-20-09-822.png!   !image-2018-10-14-14-20-19-217.png!
> The same amount of disk read was occurring:
> !image-2018-10-14-14-20-33-500.png!
> However, there was no Produce time degradation and throughput was stable:
> !image-2018-10-14-14-20-46-566.png!   !image-2018-10-14-14-20-57-233.png!
> h2. Suggestion
> The performance degradation explained in this issue can happen very easily 
> in any Kafka deployment. Any one of the following clients can cause a Fetch 
> for old data, and almost all clients connected to the same network thread 
> which processes that Fetch can be affected:
>  - Consumers that are delayed and processing offsets far behind the latest 
> offset.
>  - Consumers which start periodically and consume the accumulated topic data 
> at once.
>  - A follower broker which is attempting to restore its replica from the 
> current leaders.
> This patch can fix the performance degradation in this scenario. However, we 
> still have one concern. While the whole patch is pure Java code, its 
> semantics heavily depend on the underlying kernel implementation. We haven't 
> tested this on any OS other than Linux, so we are not sure whether it works 
> as well on other platforms.
>  Hence I think it's better to add a configuration option which controls 
> whether the Kafka broker is allowed to do this or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
