[
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203198#comment-16203198
]
Francesco vigotti commented on KAFKA-2729:
------------------------------------------
I'm having the same issue and am definitely losing trust in Kafka: every two
months something forces me to reset the whole cluster. I've been searching for
a good alternative for a distributed, persistent, fast queue for a while, but
have yet to find something that gives me a good vibe.
Anyway, I'm facing this same issue with some small differences:
- restarting all brokers (both all together and as a rolling restart) didn't fix it
All brokers in the cluster log errors like the following:
--- broker 5
{code:java}
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition
[__consumer_offsets,17] to broker
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition
[__consumer_offsets,23] to broker
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition
[__consumer_offsets,47] to broker
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition
[__consumer_offsets,29] to broker
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
{code}
--- broker 3
{code:java}
[2017-10-13 08:13:58,547] INFO Partition [__consumer_offsets,20] on broker 3:
Expanding ISR for partition __consumer_offsets-20 from 3,2 to 3,2,5
(kafka.cluster.Partition)
[2017-10-13 08:13:58,551] INFO Partition [__consumer_offsets,44] on broker 3:
Expanding ISR for partition __consumer_offsets-44 from 3,2 to 3,2,5
(kafka.cluster.Partition)
[2017-10-13 08:13:58,554] INFO Partition [__consumer_offsets,5] on broker 3:
Expanding ISR for partition __consumer_offsets-5 from 2,3 to 2,3,5
(kafka.cluster.Partition)
[2017-10-13 08:13:58,557] INFO Partition [__consumer_offsets,26] on broker 3:
Expanding ISR for partition __consumer_offsets-26 from 3,2 to 3,2,5
(kafka.cluster.Partition)
[2017-10-13 08:13:58,563] INFO Partition [__consumer_offsets,29] on broker 3:
Expanding ISR for partition __consumer_offsets-29 from 2,3 to 2,3,5
(kafka.cluster.Partition)
[2017-10-13 08:13:58,566] INFO Partition [__consumer_offsets,32] on broker 3:
Expanding ISR for partition __consumer_offsets-32 from 3,2 to 3,2,5
(kafka.cluster.Partition)
[2017-10-13 08:13:58,570] INFO Partition [legacyJavaVarT,2] on broker 3:
Expanding ISR for partition legacyJavaVarT-2 from 3 to 3,5
(kafka.cluster.Partition)
[2017-10-13 08:13:58,573] INFO Partition [test4,3] on broker 3: Expanding ISR
for partition test4-3 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,577] INFO Partition [test4,0] on broker 3: Expanding ISR
for partition test4-0 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,582] INFO Partition [test3,5] on broker 3: Expanding ISR
for partition test3-5 from 3 to 3,5 (kafka.cluster.Partition)
{code}
--- broker 2
{code:java}
[2017-10-13 08:13:36,289] INFO Partition [__consumer_offsets,11] on broker 2:
Expanding ISR for partition __consumer_offsets-11 from 2,5 to 2,5,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,293] INFO Partition [__consumer_offsets,41] on broker 2:
Expanding ISR for partition __consumer_offsets-41 from 2,5 to 2,5,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,296] INFO Partition [test3,2] on broker 2: Expanding ISR
for partition test3-2 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,300] INFO Partition [__consumer_offsets,23] on broker 2:
Expanding ISR for partition __consumer_offsets-23 from 2,5 to 2,5,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,304] INFO Partition [__consumer_offsets,5] on broker 2:
Expanding ISR for partition __consumer_offsets-5 from 2,5 to 2,5,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,337] INFO Partition [__consumer_offsets,35] on broker 2:
Expanding ISR for partition __consumer_offsets-35 from 2,5 to 2,5,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,372] INFO Partition [test_mainlog,24] on broker 2:
Expanding ISR for partition test_mainlog-24 from 2 to 2,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,375] INFO Partition [test_mainlog,6] on broker 2:
Expanding ISR for partition test_mainlog-6 from 2 to 2,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,379] INFO Partition [test_mainlog,18] on broker 2:
Expanding ISR for partition test_mainlog-18 from 2 to 2,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,384] INFO Partition [test_mainlog,0] on broker 2:
Expanding ISR for partition test_mainlog-0 from 2 to 2,3
(kafka.cluster.Partition)
[2017-10-13 08:13:36,388] INFO Partition [test_mainlog,12] on broker 2:
Expanding ISR for partition test_mainlog-12 from 2 to 2,3
(kafka.cluster.Partition)
[2017-10-13 08:13:40,367] INFO [ReplicaFetcherManager on broker 2] Removed
fetcher for partitions __consumer_offsets-47
(kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,367] INFO Truncating log __consumer_offsets-47 to offset
0. (kafka.log.Log)
[2017-10-13 08:13:40,374] INFO [ReplicaFetcherThread-0-3], Starting
(kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,374] INFO [ReplicaFetcherManager on broker 2] Added
fetcher for partitions List([__consumer_offsets-47, initOffset 0 to broker
BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,376] ERROR [ReplicaFetcherThread-0-3], Error for partition
[__consumer_offsets,47] to broker
3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,393] INFO [ReplicaFetcherManager on broker 2] Removed
fetcher for partitions __consumer_offsets-29
(kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,393] INFO Truncating log __consumer_offsets-29 to offset
0. (kafka.log.Log)
[2017-10-13 08:13:40,402] INFO [ReplicaFetcherManager on broker 2] Added
fetcher for partitions List([__consumer_offsets-29, initOffset 0 to broker
BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,403] ERROR [ReplicaFetcherThread-0-3], Error for partition
[__consumer_offsets,29] to broker
3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,407] INFO [ReplicaFetcherManager on broker 2] Removed
fetcher for partitions __consumer_offsets-41
(kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,407] INFO Truncating log __consumer_offsets-41 to offset
0. (kafka.log.Log)
[2017-10-13 08:13:40,413] INFO [ReplicaFetcherManager on broker 2] Added
fetcher for partitions List([__consumer_offsets-41, initOffset 0 to broker
BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,414] ERROR [ReplicaFetcherThread-0-3], Error for partition
[__consumer_offsets,41] to broker
3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,419] INFO [ReplicaFetcherManager on broker 2] Removed
fetcher for partitions test_mainlog-6 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,419] INFO Truncating log test_mainlog-6 to offset
4997933406. (kafka.log.Log)
[2017-10-13 08:13:40,425] INFO [ReplicaFetcherManager on broker 2] Added
fetcher for partitions List([test_mainlog-6, initOffset 4997933406 to broker
BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,432] INFO [ReplicaFetcherManager on broker 2] Removed
fetcher for partitions __consumer_offsets-17
(kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,432] INFO Truncating log __consumer_offsets-17 to offset
0. (kafka.log.Log)
[2017-10-13 08:13:40,438] INFO [ReplicaFetcherManager on broker 2] Added
fetcher for partitions List([__consumer_offsets-17, initOffset 0 to broker
BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,443] INFO [ReplicaFetcherManager on broker 2] Removed
fetcher for partitions test_mainlog-0 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,443] INFO Truncating log test_mainlog-0 to offset
5704085814. (kafka.log.Log)
[2017-10-13 08:13:40,449] INFO [ReplicaFetcherManager on broker 2] Added
fetcher for partitions List([test_mainlog-0, initOffset 5704085814 to broker
BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,464] INFO [ReplicaFetcherManager on broker 2] Removed
fetcher for partitions __consumer_offsets-14
(kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,464] INFO Truncating log __consumer_offsets-14 to offset
0. (kafka.log.Log)
[2017-10-13 08:13:40,472] INFO [ReplicaFetcherManager on broker 2] Added
fetcher for partitions List([__consumer_offsets-14, initOffset 0 to broker
BrokerEndPoint(3,--hidden----.73,9092)] ) (kafka.server.ReplicaFetcherManager)
{code}
These logs go on for hours and the cluster never recovers. The only thing that
changes anything is when I repeatedly delete the /controller znode from
ZooKeeper (delete /controller, repeated until the controller role ends up on
the kafka3 node). At that point all the errors stop (no more error logs),
Kafka seems to work, and kafka-manager shows offsets for all partitions
(whereas before some offsets were missing).
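For reference, this is roughly what that workaround amounts to, written against
the plain ZooKeeper Java client instead of the zkCli shell; the connect string
zk1:2181 and the class name are placeholders I made up for the sketch, not
values from my cluster:
{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.ZooKeeper;

// Reads the ephemeral /controller znode to see which broker currently claims the
// controller role, then deletes it to force a new controller election.
public class ForceControllerElection {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { });
        try {
            byte[] data = zk.getData("/controller", false, null);
            // Payload is JSON such as {"version":1,"brokerid":3,"timestamp":"..."}
            System.out.println("current controller znode: "
                    + new String(data, StandardCharsets.UTF_8));
            // -1 means "any version"; removing the znode triggers a controller re-election
            zk.delete("/controller", -1);
        } finally {
            zk.close();
        }
    }
}
{code}
Deleting /controller only forces a new election; it doesn't explain why the
cluster behaves only when the role lands on kafka3.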
Data ingestion and consumption work. The only thing that hints something is
still wrong is that on one topic with 30 partitions and replication factor 2
there is a one-broker skew (one broker holds one partition more than expected
and another holds one less), and the situation stays stable with this small
anomaly for hours; the nodes delete indexes, delete segments, and roll new
segments normally.
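For the skew, here is a minimal sketch of how the per-partition
leader/replica/ISR assignment can be checked with the Kafka Java AdminClient
(available from 0.11.0.0; the bootstrap address and topic name are
placeholders, and this is only a suggestion of what to look at):
{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

// Prints leader, replicas and ISR for every partition of a topic, which makes a
// one-broker skew (or a replica that never rejoins the ISR) easy to spot.
public class DescribePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singletonList("test_mainlog")).all().get();
            for (TopicDescription td : topics.values()) {
                for (TopicPartitionInfo p : td.partitions()) {
                    System.out.printf("%s-%d leader=%s replicas=%s isr=%s%n",
                            td.name(), p.partition(), p.leader(), p.replicas(), p.isr());
                }
            }
        }
    }
}
{code}
Comparing this output with the leaders the fetcher threads complain about above
might help narrow down which brokers hold stale leadership metadata.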
If I now delete the controller again, or restart the kafka3 node, everything
goes back to the previous situation (all the errors logged above), and at this
point I don't even know how to recover; the only "fix" I'm left to try is to
wipe the whole cluster's data and restart :( But what do I do if this happens
again in the future?
I don't know why two nodes seem to have a "broken controller" (??) and why the
cluster remains in this inconsistent state forever.
If you have any suggestions on what to inspect or how to try to fix this, they
are very welcome.
Thank you,
Francesco
> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
> Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other,
> we started seeing a large number of undereplicated partitions. The zookeeper
> cluster recovered, however we continued to see a large number of
> undereplicated partitions. Two brokers in the kafka cluster were showing this
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66]
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the affected brokers. Both brokers only recovered
> after a restart. Our own investigation yielded nothing, so I was hoping you
> could shed some light on this issue, possibly whether it's related to
> https://issues.apache.org/jira/browse/KAFKA-1382 ; however, we're using
> 0.8.2.1.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)