[ 
https://issues.apache.org/jira/browse/KAFKA-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kafka updated KAFKA-7349:
------------------------------
    Description: 
We run our Kafka cluster on a cloud service provider. As a consequence, we see 
high tail latency on disk writes that is outside of our control: some writes 
take on the order of seconds. We have noticed that these long writes are often 
correlated with subsequent Zookeeper disconnects on the brokers. It appears 
that during the long write, the Zookeeper heartbeat thread does not get 
scheduled CPU time, resulting in a long gap between heartbeats. Once the write 
completes, the ZK thread is scheduled again, but it detects that it has not 
heard from Zookeeper in a while, so it drops its connection and then rejoins 
the cluster.

Note that the timeout reported in the logs is inconsistent with the timeout 
set in the configuration ({{zookeeper.session.timeout.ms}}, left at its 
default of 6 seconds). We have seen a range of values reported here, including 
5950ms (just under the threshold), 12032ms (double the threshold), and 25999ms 
(far above the threshold). Presumably the timeout check itself runs on the 
same starved thread, so the value logged is whatever interval had actually 
elapsed by the time the thread was finally scheduled.
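
If we end up having to work around this on our side, the only knob we see is 
raising the session timeout. For reference, a sketch of the broker override in 
server.properties (the 18000ms value is purely illustrative, not something we 
have validated):
{code}
# server.properties (excerpt) -- illustrative values only
# Default is 6000 ms; raising it only papers over the scheduling problem.
zookeeper.session.timeout.ms=18000
# Timeout for (re)establishing the Zookeeper connection.
zookeeper.connection.timeout.ms=18000
{code}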

We also noticed that during a degradation of our cloud provider's storage 
service, these Zookeeper disconnects increased drastically in frequency.

We are hoping there is a way to decouple the broker's disk writes from its 
Zookeeper session handling. Do you agree with our diagnosis that the ZK 
disconnects are occurring due to thread contention caused by long disk writes? 
Perhaps the ZK heartbeat thread could be scheduled at a higher priority? Do 
you have any suggestions for how to avoid the ZK disconnects?
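
To illustrate the kind of scheduling change we have in mind (purely a 
hypothetical sketch, not a patch against the actual Zookeeper client code; the 
class and method names below are made up):
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: run the session ping on a dedicated,
// higher-priority daemon thread so it still gets scheduled while other
// broker threads are blocked on slow disk writes.
public class HeartbeatPinger {

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread t = new Thread(runnable, "zk-heartbeat");
            t.setDaemon(true);
            // MAX_PRIORITY is only a hint to the OS scheduler; it does not
            // guarantee the thread runs during heavy IO contention.
            t.setPriority(Thread.MAX_PRIORITY);
            return t;
        });

    // sendPing would wrap whatever actually writes the ping to the wire.
    public void start(Runnable sendPing, long sessionTimeoutMs) {
        // Ping well within the session timeout; one third of the timeout
        // matches the ~2s spacing visible in the DEBUG log below.
        long intervalMs = sessionTimeoutMs / 3;
        scheduler.scheduleAtFixedRate(sendPing, intervalMs, intervalMs,
                                      TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
{code}
Even then, thread priority alone may not help if the stall is the whole 
process blocking on IO rather than CPU contention, which is part of why we are 
asking here.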

Here is an example of one of these events:
Logs on the Broker:
{code}
[2018-08-25 04:10:19,695] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:21,697] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:23,700] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:25,701] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:27,702] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:29,704] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:31,707] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:33,709] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:35,712] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:37,714] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:39,716] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:41,719] DEBUG Got ping response for sessionid: 
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
...
[2018-08-25 04:10:53,752] WARN Client session timed out, have not heard from 
server in 12032ms for sessionid 0x36202ab4337002c 
(org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:53,754] INFO Client session timed out, have not heard from 
server in 12032ms for sessionid 0x36202ab4337002c, closing socket connection 
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:53,920] INFO zookeeper state changed (Disconnected) 
(org.I0Itec.zkclient.ZkClient)
[2018-08-25 04:10:53,920] INFO Waiting for keeper state SyncConnected 
(org.I0Itec.zkclient.ZkClient)
...
{code}
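
For reference on the arithmetic: the last ping response above is logged at 
04:10:41,719 and the WARN fires at 04:10:53,752, about 12,033ms later, which 
matches the 12032ms the client reports. The earlier DEBUG lines also show the 
expected ~2 second ping spacing (roughly one third of the 6 second session 
timeout).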

GC logs during the same time (demonstrating this is not just a long GC):
{code}
2018-08-25T04:10:36.434+0000: 35150.779: [GC (Allocation Failure)  
3074119K->2529089K(6223360K), 0.0137342 secs]
2018-08-25T04:10:37.367+0000: 35151.713: [GC (Allocation Failure)  
3074433K->2528524K(6223360K), 0.0127938 secs]
2018-08-25T04:10:38.274+0000: 35152.620: [GC (Allocation Failure)  
3073868K->2528357K(6223360K), 0.0131040 secs]
2018-08-25T04:10:39.220+0000: 35153.566: [GC (Allocation Failure)  
3073701K->2528885K(6223360K), 0.0133247 secs]
2018-08-25T04:10:40.175+0000: 35154.520: [GC (Allocation Failure)  
3074229K->2528639K(6223360K), 0.0127870 secs]
2018-08-25T04:10:41.084+0000: 35155.429: [GC (Allocation Failure)  
3073983K->2530769K(6223360K), 0.0135058 secs]
2018-08-25T04:10:42.012+0000: 35156.358: [GC (Allocation Failure)  
3076113K->2531772K(6223360K), 0.0165919 secs]
2018-08-25T04:10:53.737+0000: 35168.083: [GC (Allocation Failure)  
3077116K->2529630K(6223360K), 0.0135440 secs]
2018-08-25T04:10:58.515+0000: 35172.860: [GC (Allocation Failure)  
3074974K->2531144K(6223360K), 0.0164601 secs]
2018-08-25T04:11:03.149+0000: 35177.495: [GC (Allocation Failure)  
3076488K->2531751K(6223360K), 0.0146978 secs]
{code}
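
Note the gap in the GC log itself: minor collections land roughly once per 
second up to 04:10:42.012, then nothing until 04:10:53.737, an ~11.7 second 
gap that lines up with the missed heartbeats above, while no individual pause 
exceeds ~17ms.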

The attached screenshot shows a spike in write time to our data volume during 
the same second. Note that this is an averaged value, so it does not capture 
the maximum write time, only a sample of write times.

 !SpikeInWriteTime.png! 

> Long Disk Writes cause Zookeeper Disconnects
> --------------------------------------------
>
>                 Key: KAFKA-7349
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7349
>             Project: Kafka
>          Issue Type: Bug
>          Components: zkclient
>    Affects Versions: 0.11.0.1
>            Reporter: Adam Kafka
>            Priority: Minor
>         Attachments: SpikeInWriteTime.png
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
