[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203198#comment-16203198 ]
Francesco Vigotti commented on KAFKA-2729:
------------------------------------------

I'm having the same issue, and I'm definitely losing trust in Kafka: every couple of months something forces me to reset the whole cluster. I've been searching for a good distributed, persistent, fast queue for a while and have yet to find an alternative that gives me a good feeling. Anyway, I'm facing this same issue with some small differences: restarting all the brokers (both all at once and as a rolling restart) didn't fix it, and every broker in the cluster logs errors like the following.

--- broker 5:
{code:java}
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,17] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,23] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,47] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,29] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
{code}

--- broker 3:
{code:java}
[2017-10-13 08:13:58,547] INFO Partition [__consumer_offsets,20] on broker 3: Expanding ISR for partition __consumer_offsets-20 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,551] INFO Partition [__consumer_offsets,44] on broker 3: Expanding ISR for partition __consumer_offsets-44 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,554] INFO Partition [__consumer_offsets,5] on broker 3: Expanding ISR for partition __consumer_offsets-5 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,557] INFO Partition [__consumer_offsets,26] on broker 3: Expanding ISR for partition __consumer_offsets-26 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,563] INFO Partition [__consumer_offsets,29] on broker 3: Expanding ISR for partition __consumer_offsets-29 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,566] INFO Partition [__consumer_offsets,32] on broker 3: Expanding ISR for partition __consumer_offsets-32 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,570] INFO Partition [legacyJavaVarT,2] on broker 3: Expanding ISR for partition legacyJavaVarT-2 from 3 to 3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,573] INFO Partition [test4,3] on broker 3: Expanding ISR for partition test4-3 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,577] INFO Partition [test4,0] on broker 3: Expanding ISR for partition test4-0 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,582] INFO Partition [test3,5] on broker 3: Expanding ISR for partition test3-5 from 3 to 3,5 (kafka.cluster.Partition)
{code}

--- broker 2:
{code:java}
[2017-10-13 08:13:36,289] INFO Partition [__consumer_offsets,11] on broker 2: Expanding ISR for partition __consumer_offsets-11 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,293] INFO Partition [__consumer_offsets,41] on broker 2: Expanding ISR for partition __consumer_offsets-41 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,296] INFO Partition [test3,2] on broker 2: Expanding ISR for partition test3-2 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,300] INFO Partition [__consumer_offsets,23] on broker 2: Expanding ISR for partition __consumer_offsets-23 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,304] INFO Partition [__consumer_offsets,5] on broker 2: Expanding ISR for partition __consumer_offsets-5 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,337] INFO Partition [__consumer_offsets,35] on broker 2: Expanding ISR for partition __consumer_offsets-35 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,372] INFO Partition [test_mainlog,24] on broker 2: Expanding ISR for partition test_mainlog-24 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,375] INFO Partition [test_mainlog,6] on broker 2: Expanding ISR for partition test_mainlog-6 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,379] INFO Partition [test_mainlog,18] on broker 2: Expanding ISR for partition test_mainlog-18 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,384] INFO Partition [test_mainlog,0] on broker 2: Expanding ISR for partition test_mainlog-0 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,388] INFO Partition [test_mainlog,12] on broker 2: Expanding ISR for partition test_mainlog-12 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:40,367] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions __consumer_offsets-47 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,367] INFO Truncating log __consumer_offsets-47 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,374] INFO [ReplicaFetcherThread-0-3], Starting (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,374] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([__consumer_offsets-47, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,376] ERROR [ReplicaFetcherThread-0-3], Error for partition [__consumer_offsets,47] to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,393] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions __consumer_offsets-29 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,393] INFO Truncating log __consumer_offsets-29 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,402] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([__consumer_offsets-29, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,403] ERROR [ReplicaFetcherThread-0-3], Error for partition [__consumer_offsets,29] to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,407] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions __consumer_offsets-41 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,407] INFO Truncating log __consumer_offsets-41 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,413] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([__consumer_offsets-41, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,414] ERROR [ReplicaFetcherThread-0-3], Error for partition [__consumer_offsets,41] to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,419] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions test_mainlog-6 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,419] INFO Truncating log test_mainlog-6 to offset 4997933406. (kafka.log.Log)
[2017-10-13 08:13:40,425] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([test_mainlog-6, initOffset 4997933406 to broker BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,432] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions __consumer_offsets-17 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,432] INFO Truncating log __consumer_offsets-17 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,438] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([__consumer_offsets-17, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,443] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions test_mainlog-0 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,443] INFO Truncating log test_mainlog-0 to offset 5704085814. (kafka.log.Log)
[2017-10-13 08:13:40,449] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([test_mainlog-0, initOffset 5704085814 to broker BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,464] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions __consumer_offsets-14 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,464] INFO Truncating log __consumer_offsets-14 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,472] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([__consumer_offsets-14, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
{code}

These logs go on for hours and the cluster never recovers. The only thing that changes anything is when I repeatedly run {{delete /controller}} from ZooKeeper until the controller gets assigned to the kafka3 node. At that point all the errors stop (no more error logs) and Kafka seems to work: Kafka Manager shows offsets for all partitions (whereas before some offsets were missing), and data ingestion and consumption work. The only thing that suggests something is still wrong is that one topic with 30 partitions and replication factor 2 shows a broker skew (one broker has one partition more than expected and another has one partition less), and the cluster stays stable with this small anomaly for hours: nodes delete indexes, delete segments, and roll new segments normally.

If I then delete the controller again, or restart the kafka3 node, everything goes back to the previous broken state (all the errors are logged again), and at that point I don't even know how to recover. The only "fix" I'm left with is to wipe the whole cluster's data and restart :( But what should I do if this happens again in the future? I don't know why two of the nodes seem to have a "broken controller" (??), leaving the cluster in this inconsistent state forever. If you have any suggestions on what to inspect, or how to try to fix it, those are very welcome.

Thank you,
Francesco

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
> Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, we started seeing a large number of under-replicated partitions. The zookeeper cluster recovered; however, we continued to see a large number of under-replicated partitions. Two brokers in the kafka cluster were showing this in the logs, for all of the topics on the affected brokers:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> Both brokers only recovered after a restart. Our own investigation yielded nothing; I was hoping you could shed some light on this issue. Possibly it's related to https://issues.apache.org/jira/browse/KAFKA-1382; however, we're using 0.8.2.1.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
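For readers unfamiliar with the failure mode named in the issue title, here is a minimal toy model (not Kafka's actual code, and not a real ZooKeeper client) of ZooKeeper's versioned compare-and-set writes. It shows why a broker holding a stale cached zkVersion fails every ISR update attempt until it re-reads the znode, which is the loop behind the "Cached zkVersion not equal to that in zookeeper, skip updating ISR" messages. All class and variable names here are illustrative assumptions.

```python
# Toy model of a ZooKeeper znode with versioned conditional writes.
# Every successful update bumps the version; a conditional write only
# succeeds if the caller's expected version matches the current one.
class Znode:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        """Mimics ZooKeeper setData(path, data, version) semantics."""
        if expected_version != self.version:
            return False  # analogous to ZooKeeper's BadVersionException
        self.data = data
        self.version += 1
        return True

# The ISR state znode for one partition; broker 5 caches the version it saw.
isr_znode = Znode("ISR: 6,5")
cached_zk_version = isr_znode.version

# Another actor (e.g. a newly elected controller after a network wobble)
# updates the znode behind broker 5's back, bumping the version.
isr_znode.conditional_set("ISR: 5", isr_znode.version)

# Broker 5 now tries to shrink the ISR using its stale cached version.
ok = isr_znode.conditional_set("ISR: 5", cached_zk_version)
print(ok)  # False -> "Cached zkVersion not equal to that in zookeeper"

# Only after refreshing its cache from ZooKeeper can the update succeed.
cached_zk_version = isr_znode.version
ok = isr_znode.conditional_set("ISR: 5,6", cached_zk_version)
print(ok)  # True
```

In the real bug, the broker keeps retrying with the stale cached version instead of refreshing it, so the skip repeats indefinitely until a broker restart (or, as described in the comment above, a controller re-election) resets the cached state.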