[
https://issues.apache.org/jira/browse/KAFKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395504#comment-14395504
]
Sriharsha Chintalapani commented on KAFKA-2082:
-----------------------------------------------
[~eapache] I am trying to understand the "TestReliableProducing" test better. Right
now I have a patch which adds leaderFinder logic to ReplicaFetcherThread. With it the
brokers no longer end up in a bad state, but the test still fails because the
many_partition topic has 32 partitions with 3 replicas.
The test disables two of the zookeepers:
{noformat}
[sarama] 2015/04/03 19:59:40 zk1 disabled
[sarama] 2015/04/03 19:59:43 zk3 disabled
{noformat}
and later re-enables only one of them:
{noformat}
[sarama] 2015/04/03 19:59:58 zk3 enabled
{noformat}
Later the test fails with:
{noformat}
--- FAIL: TestReliableProducing (34.67s)
    functional_producer_test.go:217: kafka server: Tried to send a message
    to a replica that is not the leader for some partition. Your metadata is out of
    date.
{noformat}
This exception should be handled by the producer code, right?
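For context, this is roughly how I would expect a producer to handle it. Below is a
minimal Go sketch (every name in it is hypothetical, this is not actual sarama code):
treat the "not leader" error as retriable, refresh metadata to learn the new leader,
and resend instead of surfacing the error to the test.
{code:go}
// Hypothetical sketch of producer-side handling for a "not leader" error.
// None of these names are from sarama; they only illustrate the pattern.
package producer

import (
	"errors"
	"fmt"
	"time"
)

// errNotLeader stands in for the broker's NotLeaderForPartition error code.
var errNotLeader = errors.New("not the leader for this partition")

// Client is a hypothetical minimal producer client.
type Client interface {
	Send(topic string, partition int32, msg []byte) error
	RefreshMetadata(topic string) error
}

// SendWithRetry treats leadership errors as retriable, refreshing metadata
// between attempts so the retry goes to the new leader.
func SendWithRetry(c Client, topic string, partition int32, msg []byte, retries int) error {
	var err error
	for i := 0; i <= retries; i++ {
		if err = c.Send(topic, partition, msg); err == nil {
			return nil
		}
		if !errors.Is(err, errNotLeader) {
			return err // only leadership errors are retriable here
		}
		// Leadership moved (e.g. during a zookeeper outage); learn the new
		// leader before the next attempt instead of hammering the old one.
		if mErr := c.RefreshMetadata(topic); mErr != nil {
			return mErr
		}
		time.Sleep(250 * time.Millisecond) // brief backoff between attempts
	}
	return fmt.Errorf("giving up after %d retries: %w", retries, err)
}
{code}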
[~junrao] [~guozhang] is there any reason for not having a leaderFinderThread for
ReplicaFetcherManager, similar to the one in ConsumerFetcherManager? I have a patch
on top of KAFKA-1461 which adds this to ReplicaFetcherManager.
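To illustrate the idea (Go only for brevity, the broker code is of course Scala, and
every name below is hypothetical): a fetch that fails because leadership moved parks
the partition in a "leaderless" set, and a single background finder polls metadata
until a new leader appears, then re-adds the fetcher, instead of each fetcher
retrying blindly against a stale leader and spamming the logs.
{code:go}
// Hypothetical Go sketch of the leaderFinderThread idea (names are
// illustrative, not Kafka's actual Scala API): partitions whose leader is
// unknown are parked in a set, and a background loop polls metadata and
// re-registers a fetcher once a new leader is found.
package leaderfinder

import (
	"sync"
	"time"
)

type TopicPartition struct {
	Topic     string
	Partition int32
}

// MetadataClient answers "who leads this partition right now?".
type MetadataClient interface {
	LeaderFor(tp TopicPartition) (brokerID int32, ok bool)
}

type FetcherManager struct {
	mu         sync.Mutex
	leaderless map[TopicPartition]struct{} // partitions waiting for a leader
	metadata   MetadataClient
	addFetcher func(tp TopicPartition, leader int32) // resumes replication
}

// MarkLeaderless is called when a fetch fails with "leader not local":
// instead of retrying blindly (and spamming the logs), park the partition.
func (m *FetcherManager) MarkLeaderless(tp TopicPartition) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.leaderless[tp] = struct{}{}
}

// FindLeaders runs in the background, resolving parked partitions.
func (m *FetcherManager) FindLeaders(interval time.Duration) {
	for {
		m.mu.Lock()
		for tp := range m.leaderless {
			if leader, ok := m.metadata.LeaderFor(tp); ok {
				delete(m.leaderless, tp)
				m.addFetcher(tp, leader) // fetch from the new leader
			}
		}
		m.mu.Unlock()
		time.Sleep(interval) // poll instead of spinning
	}
}
{code}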
> Kafka Replication ends up in a bad state
> ----------------------------------------
>
> Key: KAFKA-2082
> URL: https://issues.apache.org/jira/browse/KAFKA-2082
> Project: Kafka
> Issue Type: Bug
> Components: replication
> Affects Versions: 0.8.2.1
> Reporter: Evan Huus
> Assignee: Sriharsha Chintalapani
> Priority: Critical
>
> While running integration tests for Sarama (the go client) we came across a
> pattern of connection losses that reliably puts kafka into a bad state:
> several of the brokers start spinning, chewing ~30% CPU and spamming the logs
> with hundreds of thousands of lines like:
> {noformat}
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch
> request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on
> partition [many_partition,1] failed due to Leader not local for partition
> [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch
> request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on
> partition [many_partition,6] failed due to Leader not local for partition
> [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch
> request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on
> partition [many_partition,21] failed due to Leader not local for partition
> [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch
> request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on
> partition [many_partition,26] failed due to Leader not local for partition
> [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch
> request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on
> partition [many_partition,1] failed due to Leader not local for partition
> [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch
> request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on
> partition [many_partition,6] failed due to Leader not local for partition
> [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch
> request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on
> partition [many_partition,21] failed due to Leader not local for partition
> [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch
> request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on
> partition [many_partition,26] failed due to Leader not local for partition
> [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
> {noformat}
> This can be easily and reliably reproduced using the {{toxiproxy-final}}
> branch of https://github.com/Shopify/sarama which includes a vagrant script
> for provisioning the appropriate cluster:
> - {{git clone https://github.com/Shopify/sarama.git}}
> - {{git checkout test-jira-kafka-2082}}
> - {{vagrant up}}
> - {{TEST_SEED=1427917826425719059 DEBUG=true go test -v}}
> After the test finishes (it fails because the cluster ends up in a bad
> state), you can log into the cluster machine with {{vagrant ssh}} and inspect
> the bad nodes. The vagrant script provisions five zookeepers and five brokers
> in {{/opt/kafka-9091/}} through {{/opt/kafka-9095/}}.
> Additional context: the test produces continually to the cluster while
> randomly cutting and restoring zookeeper connections (all connections to
> zookeeper are run through a simple proxy on the same vm to make this easy).
> The majority of the time this works very well and does a good job exercising
> our producer's retry and failover code. However, under certain patterns of
> connection loss (the {{TEST_SEED}} in the instructions is important), kafka
> gets confused. The test never cuts more than two connections at a time, so
> zookeeper should always have quorum, and the topic (with three replicas)
> should always be writable.
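> To make this concrete, the pattern is roughly the following (a rough Go sketch
> with hypothetical helpers, not the actual test code):
> {code:go}
> // Rough sketch of the test's connection-chaos loop (hypothetical helpers,
> // not the actual sarama test). A seeded RNG decides which zookeeper proxies
> // to cut; at most two are down at once, so quorum (3 of 5) is preserved.
> package main
>
> import (
> 	"fmt"
> 	"math/rand"
> 	"time"
> )
>
> // toggleProxy stands in for the proxy control call; the real test routes
> // every zookeeper connection through a simple proxy on the same vm.
> func toggleProxy(name string, enabled bool) {
> 	fmt.Printf("%s enabled=%v\n", name, enabled)
> }
>
> func main() {
> 	rng := rand.New(rand.NewSource(1427917826425719059)) // the TEST_SEED above
> 	proxies := []string{"zk1", "zk2", "zk3", "zk4", "zk5"}
>
> 	for i := 0; i < 20; i++ {
> 		// Never cut more than two connections at a time.
> 		a, b := rng.Intn(len(proxies)), rng.Intn(len(proxies))
> 		toggleProxy(proxies[a], false)
> 		if b != a {
> 			toggleProxy(proxies[b], false)
> 		}
> 		time.Sleep(time.Duration(rng.Intn(5)) * time.Second)
> 		// Restore the connections; the producer keeps writing throughout.
> 		toggleProxy(proxies[a], true)
> 		if b != a {
> 			toggleProxy(proxies[b], true)
> 		}
> 	}
> }
> {code}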
> Completely restarting the cluster via {{vagrant reload}} seems to put it back
> into a sane state.