[ https://issues.apache.org/jira/browse/KAFKA-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758638#comment-16758638 ]
Jun Rao commented on KAFKA-7888:
--------------------------------

[~kemalerden], from the log, the "Cached zkVersion" messages started around 23:53:35.
{code:java}
server_b13.log.2019-01-26-22:[2019-01-26 23:53:35,041] INFO [Partition ucTrade-6 broker=13] Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
The controller log shows that broker 13 was never able to re-register itself in ZK after 23:53:16.
{code:java}
controller_b14.log.2019-01-26-23:[2019-01-26 23:53:16,267] INFO [Controller id=14] Newly added brokers: , deleted brokers: 13, all live brokers: 14,15 (kafka.controller.KafkaController)
controller_b14.log.2019-01-26-23:[2019-01-26 23:53:42,281] INFO [Controller id=14] Newly added brokers: , deleted brokers: 15, all live brokers: 14 (kafka.controller.KafkaController)
controller_b14.log.2019-01-26-23:[2019-01-26 23:53:46,809] INFO [Controller id=14] Newly added brokers: 15, deleted brokers: , all live brokers: 14,15 (kafka.controller.KafkaController)
{code}
From broker 13's log, it failed to re-register itself in ZK around 23:53:11.
{code:java}
server_b13.log.2019-01-26-22:[2019-01-26 23:53:11,841] ERROR Error while creating ephemeral at /brokers/ids/13, node already exists and owner '937991457960493056' does not match current session '1010049473220837376' (kafka.zk.KafkaZkClient$CheckedEphemeral)
server_b13.log.2019-01-26-22:[2019-01-26 23:53:11,841] INFO Result of znode creation at /brokers/ids/13 is: NODEEXISTS (kafka.zk.KafkaZkClient)
{code}
We recently fixed KAFKA-7165, which could lead to the above. Perhaps you could try 2.2.0 when it's released.
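For context on the "Cached zkVersion" message: Kafka updates the partition-state znode with ZooKeeper's conditional setData, which succeeds only if the caller's expected version matches the znode's current version. Once the controller has bumped the znode behind the broker's back, the broker's cached zkVersion is stale and every ISR update is skipped. A minimal sketch of that compare-and-set behavior, using a hypothetical in-memory `Znode` class rather than a real ZooKeeper client:
{code:java}
// Hypothetical in-memory stand-in for a znode with a version counter.
// Illustrates why a stale cached version makes every ISR write a no-op.
final class Znode {
    byte[] data;
    int version = 0; // ZooKeeper bumps this on every successful setData

    Znode(byte[] data) { this.data = data; }

    // Mirrors ZooKeeper's setData(path, data, expectedVersion) semantics:
    // the write succeeds only if expectedVersion matches the current version.
    synchronized boolean conditionalUpdate(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false; // caller logs "Cached zkVersion [...] not equal ..."
        }
        data = newData;
        version++;
        return true;
    }
}

public class CachedZkVersionDemo {
    public static void main(String[] args) {
        Znode state = new Znode("isr=[13,14]".getBytes());
        int cachedVersion = state.version; // broker caches version 0

        // The controller rewrites the znode (e.g. after a leader change),
        // bumping the version behind the broker's back.
        state.conditionalUpdate("isr=[14]".getBytes(), state.version);

        // The broker now tries to shrink the ISR with its stale cached version:
        boolean updated = state.conditionalUpdate("isr=[13]".getBytes(), cachedVersion);
        System.out.println(updated ? "ISR updated" : "skip updating ISR");
        // Keeps failing until the broker refreshes its cached zkVersion.
    }
}
{code}
This is only a simulation of the version check, not Kafka's actual code path; in the real cluster the broker normally recovers by re-reading the znode, which is why the endless repetition here points at the registration problem described above.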
> kafka cluster not recovering - Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition) continuously
> ----------------------------------------------------------------------------------------------------
>
> Key: KAFKA-7888
> URL: https://issues.apache.org/jira/browse/KAFKA-7888
> Project: Kafka
> Issue Type: Bug
> Components: controller, replication, zkclient
> Affects Versions: 2.1.0
> Environment: using kafka_2.12-2.1.0
> 3 ZKs, 3-broker cluster, using 3 boxes (1 ZK and 1 broker on each box),
> default.replication.factor: 2,
> offset replication factor was 1 when the error happened; increased to 2 after seeing this error by reassigning partitions.
> compression: default (producer) on broker but sending gzip from producers.
> Linux (Red Hat), ext4, kafka logs on a single local disk
> Reporter: Kemal ERDEN
> Priority: Major
> Attachments: combined.log, producer.log
>
> We're seeing the following repeating logs on our kafka cluster from time to time, which seem to cause messages expiring on producers and the cluster going into a non-recoverable state. The only fix seems to be to restart the brokers.
> {{Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition)}}
> {{Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)}}
> and later on the following log is repeated:
> {{Got user-level KeeperException when processing sessionid:0xe046aa4f8e60000 type:setData cxid:0x2df zxid:0xa000001fd txntype:-1 reqpath:n/a Error Path:/brokers/topics/ucTrade/partitions/6/state Error:KeeperErrorCode = BadVersion for /brokers/topics/ucTrade/partitions/6/state}}
> We haven't interfered with any of the brokers/zookeepers whilst this happened.
> I've attached a combined log which merges the controller, server and state-change logs from each broker (ids 13, 14 and 15; the log files have the suffixes b13, b14 and b15 respectively).
> We have increased the heaps from 1g to 6g for the brokers and from 512m to 4g for the zookeepers since this happened, but we're not sure if that is relevant. The ZK logs are unfortunately overwritten, so we can't provide those.
> We produce varying message sizes, and some messages are relatively large (6mb), but we use compression on the producers (set to gzip).
> I've attached some logs from one of our producers as well.
> producer.properties that we've changed:
> spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
> spring.kafka.producer.compression-type=gzip
> spring.kafka.producer.retries=5
> spring.kafka.producer.acks=-1
> spring.kafka.producer.batch-size=1048576
> spring.kafka.producer.properties.linger.ms=200
> spring.kafka.producer.properties.request.timeout.ms=600000
> spring.kafka.producer.properties.max.block.ms=240000
> spring.kafka.producer.properties.max.request.size=104857600

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)