[jira] [Commented] (KAFKA-9648) kafka server should resize backlog when create serversocket
[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385313#comment-17385313 ] Haruki Okada commented on KAFKA-9648: - [~flashmouse] Hi, I proposed a new KIP regarding this. https://cwiki.apache.org/confluence/display/KAFKA/KIP-764%3A+Configurable+backlog+size+for+creating+Acceptor Please take a look if you're interested ! > kafka server should resize backlog when create serversocket > --- > > Key: KAFKA-9648 > URL: https://issues.apache.org/jira/browse/KAFKA-9648 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 0.10.0.1 >Reporter: li xiangyuan >Priority: Minor > > I have describe a mystery problem > (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka > server will trigger tcp Congestion Control in some condition. finally we > found the root cause. > when kafka server restart for any reason and then execute preferred replica > leader, lots of replica leader will give back to it & trigger cluster > metadata update. then all clients will establish connection to this server. > at the monment many tcp estable request are waiting in the tcp sync queue , > and then to accept queue. > kafka create serversocket in SocketServer.scala > > {code:java} > serverChannel.socket.bind(socketAddress);{code} > this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) > will decide the queue length.beacues kafka haven't set ,it is default value > 50. > if this queue is full, and tcp_syncookies = 0, then new connection request > will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie > mechanism. this mechanism could allow linux handle more tcp sync request, but > it would lose many tcp external parameter, include "wscale", the one that > allow tcp connection to send much more bytes per tcp package. because > syncookie triggerd, wscale has lost, and this tcp connection will handle > network very slow, forever,until this connection is closed and establish > another tcp connection. > so after a preferred repilca executed, lots of new tcp connection will > establish without set wscale,and many network traffic to this server will > have a very slow speed. > i'm not sure whether new linux version have resolved this problem, but kafka > also should set backlog a larger value. we now have modify this to 512, seems > everything is ok. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9648) kafka server should resize backlog when create serversocket
[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379515#comment-17379515 ] Haruki Okada commented on KAFKA-9648: - Hi. We operate a Kafka cluster in our company which has 130+ brokers, 1.3M+ total socket server connections and 25K+ partitions. We faced very similar issue to https://issues.apache.org/jira/browse/KAFKA-9211 (i.e. producer slowing down mysteriously with small TCP packets) recently, and we figured out the cause. We hope our knowledge could help. h2. Environment: * Kafka broker version: 2.4.1 (but we suppose the version doesn't matter) * Kafka broker OS: CentOS7 (kernel 3.10.X) h2. Phenomenon: * Restart a broker process, and execute preferred leader election after the broker became in-sync * Some producers's node-request-latency to the broker got insanely higher than usual ** However no such high produce response latency is observed on broker-side metrics * As the result, producer batches couldn't be sent out in sufficient pace, then caused batch expiration h2. Analysis: * We observed TCP SYN cookies metric was increased at incidental time, with following dmesg message: ** {code:java} TCP: request_sock_TCP: Possible SYN flooding on port 22991. Sending cookies. Check SNMP counters.{code} * So we also suspected the phenomenon is due to `wscale` drop (as like described in this issue), but we doubt it because: ** Even with TCP SYN cookies, `wscale` should be available as long as TCP timestamp is enabled. (refs: [https://blog.cloudflare.com/syn-packet-handling-in-the-wild/]) ** In our environment, TCP timestamp is enabled. ** Also, generally, `wscale` (window scaling factor) is used for extending window beyond 65535 (max window size in TCP spec) on large-network round trip environment such as internet *** Our typical produce request size is smaller than that *** So it's hard to imagine that `wscale` drop causes such significant request-latency degradation * After several attempts to reproduce, we found out that receiver (i.e. broker in this context)'s `wscale` is inconsistent between producer and broker at incidental time. ** receiver's `wscale` advertised from broker -> producer along with SYN+ACK packet: 7 *** {code:java} 17:32:05.161213 IP BROKER.HOST. > CLIENT.: Flags [S.], seq 29292755, ack 17469019, win 28960, options [mss 1460,sackOK,TS val 25589601 ecr 9055245,nop,wscale 1], length 0{code} *** (seq numbers are substituted with random value) ** actual receiver's `wscale` after established: 1 *** {code:java} [user@BROKER ~]$ /sbin/ss -e -i -t | less ... ESTAB 0 0 BROKER. CLIENT. timer:(keepalive,21min,0) uid:503 ino:15143860 sk:9ba25dc4f440 <-> ts sack cubic wscale:7,7 rto:201 rtt:0.179/0.006 ato:140 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:97054 bytes_received:18950537 segs_out:15094 segs_in:13941 send 647.2Mbps lastsnd:217 lastrcv:17 lastack:217 pacing_rate 1288.0Mbps rcv_rtt:1.875 rcv_space:29200{code} *** `wscale:7,7` means that broker's receiver window scaling factor is 7 * Okay, then this inconsistency could explain the phenomenon as below: ** Premise: When `wscale` is enabled, TCP window size is calculated by `window_size * 2^wscale` ** When broker calculates advertised window size, it's bit-shifted to the right by `wscale` (== 7) *** [https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_output.c#L290] ** On the other hand, producer multiplies advertised window size by 1, which is conveyed through SYN+ACK ** As the result, window size became 64 times smaller than expected ** Then producer splits TCP packets to much smaller size than usual (possibly under MSS) *** TCP acks are delayed due to the conditions are not met ([https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_input.c#L4760]) * Last remaining question is "why such wscale inconsistency happened?" ** Read through the kernel source code, then we found that there's an issue in the logic to calculate wscale if TCP connection is established through syncookies *** It's fixed in this commit: [https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417] h2. Solution: * 1. Upgrade linux kernel to at least v5.10 (which the patch is committed) * 2. Disable SYN cookies ** i.e. setting `net.ipv4.tcp_syncookies` kernel parameter to 0 ** With SYN cookies disabled, some SYN's are dropped on same situations but typically it's not be a serious problem thanks to SYN retries. Clients should reconnect soon. * 3. Adjust backlog size (as this ticket suggests) ** Even disabling SYN cookies work thanks to SYN retries, it will cause certain delay in TCP establishment if reconnect happened. ** So generally it's preferable to adjust backlog size as well, to avoid SYN drops in the first place. I think we should add new config
[jira] [Commented] (KAFKA-9648) kafka server should resize backlog when create serversocket
[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053195#comment-17053195 ] li xiangyuan commented on KAFKA-9648: - [~junrao]: Yes 1412 is quite small, I have noticed that but couldn't understand why. In tcpdump file it shows after connection established, server's windows size suddenly drop down from x to 394, and producer's window size keep 31088. I think it's beyond my ability & knowledge to explain the reason ... > kafka server should resize backlog when create serversocket > --- > > Key: KAFKA-9648 > URL: https://issues.apache.org/jira/browse/KAFKA-9648 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 0.10.0.1 >Reporter: li xiangyuan >Priority: Minor > > I have describe a mystery problem > (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka > server will trigger tcp Congestion Control in some condition. finally we > found the root cause. > when kafka server restart for any reason and then execute preferred replica > leader, lots of replica leader will give back to it & trigger cluster > metadata update. then all clients will establish connection to this server. > at the monment many tcp estable request are waiting in the tcp sync queue , > and then to accept queue. > kafka create serversocket in SocketServer.scala > > {code:java} > serverChannel.socket.bind(socketAddress);{code} > this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) > will decide the queue length.beacues kafka haven't set ,it is default value > 50. > if this queue is full, and tcp_syncookies = 0, then new connection request > will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie > mechanism. this mechanism could allow linux handle more tcp sync request, but > it would lose many tcp external parameter, include "wscale", the one that > allow tcp connection to send much more bytes per tcp package. because > syncookie triggerd, wscale has lost, and this tcp connection will handle > network very slow, forever,until this connection is closed and establish > another tcp connection. > so after a preferred repilca executed, lots of new tcp connection will > establish without set wscale,and many network traffic to this server will > have a very slow speed. > i'm not sure whether new linux version have resolved this problem, but kafka > also should set backlog a larger value. we now have modify this to 512, seems > everything is ok. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9648) kafka server should resize backlog when create serversocket
[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052419#comment-17052419 ] Jun Rao commented on KAFKA-9648: [~flashmouse] : Thanks for the reply. Just to clarify. TCP window scaling allows receiving window size more than 64KB. But it seems that the 1412 bytes used by the producer is much smaller than 64KB. So, even with window scaling turned off, it seems that the producer should be able to send more bytes before being ack-ed? > kafka server should resize backlog when create serversocket > --- > > Key: KAFKA-9648 > URL: https://issues.apache.org/jira/browse/KAFKA-9648 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 0.10.0.1 >Reporter: li xiangyuan >Priority: Minor > > I have describe a mystery problem > (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka > server will trigger tcp Congestion Control in some condition. finally we > found the root cause. > when kafka server restart for any reason and then execute preferred replica > leader, lots of replica leader will give back to it & trigger cluster > metadata update. then all clients will establish connection to this server. > at the monment many tcp estable request are waiting in the tcp sync queue , > and then to accept queue. > kafka create serversocket in SocketServer.scala > > {code:java} > serverChannel.socket.bind(socketAddress);{code} > this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) > will decide the queue length.beacues kafka haven't set ,it is default value > 50. > if this queue is full, and tcp_syncookies = 0, then new connection request > will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie > mechanism. this mechanism could allow linux handle more tcp sync request, but > it would lose many tcp external parameter, include "wscale", the one that > allow tcp connection to send much more bytes per tcp package. because > syncookie triggerd, wscale has lost, and this tcp connection will handle > network very slow, forever,until this connection is closed and establish > another tcp connection. > so after a preferred repilca executed, lots of new tcp connection will > establish without set wscale,and many network traffic to this server will > have a very slow speed. > i'm not sure whether new linux version have resolved this problem, but kafka > also should set backlog a larger value. we now have modify this to 512, seems > everything is ok. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9648) kafka server should resize backlog when create serversocket
[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051780#comment-17051780 ] li xiangyuan commented on KAFKA-9648: - [~junrao]: all our kafka server have the same tcp configuration below, left are default value: {code:java} socket.send.buffer.bytes=102400 socket.receive.buffer.bytes=102400 socket.request.max.bytes=104857600 {code} I have upload tcpdump file (https://issues.apache.org/jira/browse/KAFKA-9211) when this problem occur. it mentioned that every producer tcp packet size are small than 1412 bytes and producer won't send next tcp package until get server's tcp ack that always delay 40ms(sometimes 200ms). it means a kafka produce request larger than 1412 bytes will split to 2 tcp packages and need at least 40ms to complete send all applcation request to kafka server, the request larger , it cost multi 40ms to complete. [~ijuma] glad do it! just change {code:java} serverChannel.socket.bind(socketAddress); {code} to {code:java} serverChannel.socket.bind(socketAddress, 512); {code} or anything else? > kafka server should resize backlog when create serversocket > --- > > Key: KAFKA-9648 > URL: https://issues.apache.org/jira/browse/KAFKA-9648 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 0.10.0.1 >Reporter: li xiangyuan >Priority: Minor > > I have describe a mystery problem > (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka > server will trigger tcp Congestion Control in some condition. finally we > found the root cause. > when kafka server restart for any reason and then execute preferred replica > leader, lots of replica leader will give back to it & trigger cluster > metadata update. then all clients will establish connection to this server. > at the monment many tcp estable request are waiting in the tcp sync queue , > and then to accept queue. > kafka create serversocket in SocketServer.scala > > {code:java} > serverChannel.socket.bind(socketAddress);{code} > this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) > will decide the queue length.beacues kafka haven't set ,it is default value > 50. > if this queue is full, and tcp_syncookies = 0, then new connection request > will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie > mechanism. this mechanism could allow linux handle more tcp sync request, but > it would lose many tcp external parameter, include "wscale", the one that > allow tcp connection to send much more bytes per tcp package. because > syncookie triggerd, wscale has lost, and this tcp connection will handle > network very slow, forever,until this connection is closed and establish > another tcp connection. > so after a preferred repilca executed, lots of new tcp connection will > establish without set wscale,and many network traffic to this server will > have a very slow speed. > i'm not sure whether new linux version have resolved this problem, but kafka > also should set backlog a larger value. we now have modify this to 512, seems > everything is ok. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9648) kafka server should resize backlog when create serversocket
[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051699#comment-17051699 ] Jun Rao commented on KAFKA-9648: [~flashmouse]: Thanks for the analysis. Great finding! The impact seems on the producer. Did you send a large tcp socket buffer for the producer? Was the network latency between the producer and the broker high? > kafka server should resize backlog when create serversocket > --- > > Key: KAFKA-9648 > URL: https://issues.apache.org/jira/browse/KAFKA-9648 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 0.10.0.1 >Reporter: li xiangyuan >Priority: Minor > > I have describe a mystery problem > (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka > server will trigger tcp Congestion Control in some condition. finally we > found the root cause. > when kafka server restart for any reason and then execute preferred replica > leader, lots of replica leader will give back to it & trigger cluster > metadata update. then all clients will establish connection to this server. > at the monment many tcp estable request are waiting in the tcp sync queue , > and then to accept queue. > kafka create serversocket in SocketServer.scala > > {code:java} > serverChannel.socket.bind(socketAddress);{code} > this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) > will decide the queue length.beacues kafka haven't set ,it is default value > 50. > if this queue is full, and tcp_syncookies = 0, then new connection request > will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie > mechanism. this mechanism could allow linux handle more tcp sync request, but > it would lose many tcp external parameter, include "wscale", the one that > allow tcp connection to send much more bytes per tcp package. because > syncookie triggerd, wscale has lost, and this tcp connection will handle > network very slow, forever,until this connection is closed and establish > another tcp connection. > so after a preferred repilca executed, lots of new tcp connection will > establish without set wscale,and many network traffic to this server will > have a very slow speed. > i'm not sure whether new linux version have resolved this problem, but kafka > also should set backlog a larger value. we now have modify this to 512, seems > everything is ok. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9648) kafka server should resize backlog when create serversocket
[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051211#comment-17051211 ] Ismael Juma commented on KAFKA-9648: Would you be interested in submitting a pull request? > kafka server should resize backlog when create serversocket > --- > > Key: KAFKA-9648 > URL: https://issues.apache.org/jira/browse/KAFKA-9648 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 0.10.0.1 >Reporter: li xiangyuan >Priority: Minor > > I have describe a mystery problem > (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka > server will trigger tcp Congestion Control in some condition. finally we > found the root cause. > when kafka server restart for any reason and then execute preferred replica > leader, lots of replica leader will give back to it & trigger cluster > metadata update. then all clients will establish connection to this server. > at the monment many tcp estable request are waiting in the tcp sync queue , > and then to accept queue. > kafka create serversocket in SocketServer.scala > > {code:java} > serverChannel.socket.bind(socketAddress);{code} > this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) > will decide the queue length.beacues kafka haven't set ,it is default value > 50. > if this queue is full, and tcp_syncookies = 0, then new connection request > will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie > mechanism. this mechanism could allow linux handle more tcp sync request, but > it would lose many tcp external parameter, include "wscale", the one that > allow tcp connection to send much more bytes per tcp package. because > syncookie triggerd, wscale has lost, and this tcp connection will handle > network very slow, forever,until this connection is closed and establish > another tcp connection. > so after a preferred repilca executed, lots of new tcp connection will > establish without set wscale,and many network traffic to this server will > have a very slow speed. > i'm not sure whether new linux version have resolved this problem, but kafka > also should set backlog a larger value. we now have modify this to 512, seems > everything is ok. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9648) kafka server should resize backlog when create serversocket
[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051209#comment-17051209 ] Ismael Juma commented on KAFKA-9648: Thanks for the ticket, interesting discovery. cc [~junrao] [~rajinisiva...@gmail.com] > kafka server should resize backlog when create serversocket > --- > > Key: KAFKA-9648 > URL: https://issues.apache.org/jira/browse/KAFKA-9648 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 0.10.0.1 >Reporter: li xiangyuan >Priority: Minor > > I have describe a mystery problem > (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka > server will trigger tcp Congestion Control in some condition. finally we > found the root cause. > when kafka server restart for any reason and then execute preferred replica > leader, lots of replica leader will give back to it & trigger cluster > metadata update. then all clients will establish connection to this server. > at the monment many tcp estable request are waiting in the tcp sync queue , > and then to accept queue. > kafka create serversocket in SocketServer.scala > > {code:java} > serverChannel.socket.bind(socketAddress);{code} > this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) > will decide the queue length.beacues kafka haven't set ,it is default value > 50. > if this queue is full, and tcp_syncookies = 0, then new connection request > will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie > mechanism. this mechanism could allow linux handle more tcp sync request, but > it would lose many tcp external parameter, include "wscale", the one that > allow tcp connection to send much more bytes per tcp package. because > syncookie triggerd, wscale has lost, and this tcp connection will handle > network very slow, forever,until this connection is closed and establish > another tcp connection. > so after a preferred repilca executed, lots of new tcp connection will > establish without set wscale,and many network traffic to this server will > have a very slow speed. > i'm not sure whether new linux version have resolved this problem, but kafka > also should set backlog a larger value. we now have modify this to 512, seems > everything is ok. > -- This message was sent by Atlassian Jira (v8.3.4#803005)