[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204367#comment-16204367
 ] 

Jun Rao commented on KAFKA-2729:
--------------------------------

[~fravigotti], sorry to hear that. A couple of quick suggestions.

(1) Do you see any ZK session expiration in the log (e.g., INFO zookeeper state 
changed (Expired) (org.I0Itec.zkclient.ZkClient))? There are known bugs in 
Kafka in handling ZK session expiration. So, it would be useful to avoid it in 
the first place. Typical causes of ZK session expiration are long GC in the 
broker or network glitches. So you can either tune the broker or increase 
zookeeper.session.timeout.ms.

(2) Do you have lots of partitions (say a few thousands) per broker? If so, you 
want to check if the controlled shutdown succeeds when shutting down a broker. 
If not, restarting the broker too soon could also lead the cluster to a weird 
state. To address this issue, you can increase request.timeout.ms on the broker.

We are fixing the known issue in (1) and improving the performance with lots of 
partitions in (2) in KAFKA-5642 and we expect the fix to be included in the 
1.1.0 release in Feb.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to