[ https://issues.apache.org/jira/browse/KAFKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Evan Huus updated KAFKA-2082:
-----------------------------
    Description: 
While running integration tests for Sarama (the Go client), we came across a
pattern of connection losses that reliably puts Kafka into a bad state: several
of the brokers start spinning, chewing ~30% CPU and spamming the logs with
hundreds of thousands of lines like:

{noformat}
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
{noformat}

This can be easily and reliably reproduced using the {{test-jira-kafka-2082}}
branch of https://github.com/Shopify/sarama, which includes a Vagrant script for
provisioning the appropriate cluster:

- {{git clone https://github.com/Shopify/sarama.git}}
- {{git checkout test-jira-kafka-2082}}
- {{vagrant up}}
- {{TEST_SEED=1427917826425719059 DEBUG=true go test -v}}

After the test finishes (it fails because the cluster ends up in a bad state),
you can log into the cluster machine with {{vagrant ssh}} and inspect the bad
nodes. The Vagrant script provisions five ZooKeeper nodes and five brokers in
{{/opt/kafka-9091/}} through {{/opt/kafka-9095/}}.

Additional context: the test produces continuously to the cluster while randomly
cutting and restoring ZooKeeper connections (all connections to ZooKeeper are
run through a simple proxy on the same VM to make this easy). The majority of
the time this works very well and does a good job of exercising our producer's
retry and failover code. However, under certain patterns of connection loss
(the {{TEST_SEED}} in the instructions is important), Kafka gets confused. The
test never cuts more than two connections at a time, so ZooKeeper should always
have quorum, and the topic (with three replicas) should always be writable.
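
For illustration, here is a rough Go sketch of the pattern the test follows. It
is not the actual Sarama integration test: the {{zkProxy}} interface and its
{{Enable}}/{{Disable}} methods are hypothetical stand-ins for the simple proxy
the harness drives, and the produce loop is simplified.

{noformat}
package repro

import (
	"log"
	"math/rand"
	"time"

	"github.com/Shopify/sarama"
)

// zkProxy is a hypothetical handle to one proxied ZooKeeper connection; the
// real test drives a simple proxy running on the same VM.
type zkProxy interface {
	Disable() error // cut the connection
	Enable() error  // restore the connection
}

// produceWhileCuttingZK produces continuously while randomly cutting and
// restoring ZooKeeper connections, roughly mirroring the failing test.
func produceWhileCuttingZK(brokers []string, proxies []zkProxy, seed int64) {
	rng := rand.New(rand.NewSource(seed)) // the TEST_SEED value matters

	config := sarama.NewConfig()
	config.Producer.Return.Errors = true
	producer, err := sarama.NewAsyncProducer(brokers, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Cut and restore one connection at a time (the real test never cuts more
	// than two), so the ZooKeeper ensemble always keeps quorum.
	go func() {
		for {
			p := proxies[rng.Intn(len(proxies))]
			_ = p.Disable()
			time.Sleep(time.Duration(rng.Intn(500)) * time.Millisecond)
			_ = p.Enable()
		}
	}()

	// Produce continuously; the producer's retry/failover code is expected to
	// cope with any leadership moves caused by the connection churn.
	for i := 0; i < 100000; i++ {
		producer.Input() <- &sarama.ProducerMessage{
			Topic: "many_partition",
			Value: sarama.StringEncoder("payload"),
		}
		select {
		case err := <-producer.Errors():
			log.Println("produce error:", err)
		default:
		}
	}
}
{noformat}

The toggling goroutine disables only one proxy at a time, which is the property
that should keep the cluster writable throughout; the surprise is that the
brokers nonetheless end up spinning on "Leader not local" fetch failures.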

Completely restarting the cluster via {{vagrant reload}} seems to put it back 
into a sane state.

  was:
While running integration tests for Sarama (the go client) we came across a 
pattern of connection losses that reliably puts kafka into a bad state: several 
of the brokers start spinning, chewing ~30% CPU and spamming the logs with 
hundreds of thousands of lines like:

{noformat}
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request 
with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition 
[many_partition,1] failed due to Leader not local for partition 
[many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request 
with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition 
[many_partition,6] failed due to Leader not local for partition 
[many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request 
with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition 
[many_partition,21] failed due to Leader not local for partition 
[many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request 
with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition 
[many_partition,26] failed due to Leader not local for partition 
[many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request 
with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition 
[many_partition,1] failed due to Leader not local for partition 
[many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request 
with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition 
[many_partition,6] failed due to Leader not local for partition 
[many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request 
with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition 
[many_partition,21] failed due to Leader not local for partition 
[many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request 
with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition 
[many_partition,26] failed due to Leader not local for partition 
[many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
{noformat}

This can be easily and reliably reproduced using the {{toxiproxy-final}} branch 
of https://github.com/Shopify/sarama which includes a vagrant script for 
provisioning the appropriate cluster: 

- {{git clone https://github.com/Shopify/sarama.git}}
- {{git checkout toxiproxy-final}}
- {{vagrant up}}
- {{TEST_SEED=1427917826425719059 DEBUG=true go test -v}}

After the test finishes (it fails because the cluster ends up in a bad state), 
you can log into the cluster machine with {{vagrant ssh}} and inspect the bad 
nodes. The vagrant script provisions five zookeepers and five brokers in 
{{/opt/kafka-9091/}} through {{/opt/kafka-9095/}}.

Additional context: the test produces continually to the cluster while randomly 
cutting and restoring zookeeper connections (all connections to zookeeper are 
run through a simple proxy on the same vm to make this easy). The majority of 
the time this works very well and does a good job exercising our producer's 
retry and failover code. However, under certain patterns of connection loss 
(the {{TEST_SEED}} in the instructions is important), kafka gets confused. The 
test never cuts more than two connections at a time, so zookeeper should always 
have quorum, and the topic (with three replicas) should always be writable.

Completely restarting the cluster via {{vagrant reload}} seems to put it back 
into a sane state.


> Kafka Replication ends up in a bad state
> ----------------------------------------
>
>                 Key: KAFKA-2082
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2082
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.8.2.1
>            Reporter: Evan Huus
>            Assignee: Neha Narkhede
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
