Victor Garcia created KAFKA-4940:
------------------------------------

             Summary: Cluster partially working if broker blocked with IO
                 Key: KAFKA-4940
                 URL: https://issues.apache.org/jira/browse/KAFKA-4940
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.10.2.0, 0.10.1.1
            Reporter: Victor Garcia


A cluster can end up only partially working if an IO issue blocks a broker 
process and leaves it in a state of uninterruptible sleep.

All the threads connected to this bad broker hang, and the cluster ends up 
partially working.

I reproduced it and this is what happened:

Let's say we have brokers 1, 2 and 3, and broker 2 is blocked on IO and 
unresponsive; the process cannot even be killed except with kill -9.
Let's say we have a topic with replication factor 3. The partitions whose 
leader is 1 or 3 will see that broker 2 has issues and will take it out of the 
ISR. That's fine.

But for the partitions whose leader is 2, broker 2 thinks the problematic 
brokers are 1 and 3 and takes those replicas out of the ISR. And this is the 
problem.

The consumers and producers will only work with the partitions that don't have 
broker 2 in their ISR.

This is example output for two topics after provoking the issue:

{code}
./kafka-topics.sh --describe --zookeeper 127.0.0.1:2181 --unavailable-partitions

        Topic: agent_ping       Partition: 0    Leader: 2       Replicas: 2,1,3 Isr: 2
        Topic: agent_ping       Partition: 1    Leader: 3       Replicas: 3,2,1 Isr: 1,3
        Topic: agent_ping       Partition: 2    Leader: 1       Replicas: 1,3,2 Isr: 3,1
        Topic: agent_ping       Partition: 3    Leader: 2       Replicas: 2,3,1 Isr: 2
        Topic: agent_ping       Partition: 4    Leader: 3       Replicas: 3,1,2 Isr: 1,3
        Topic: agent_ping       Partition: 5    Leader: 1       Replicas: 1,2,3 Isr: 3,1
        Topic: agent_ping       Partition: 6    Leader: 2       Replicas: 2,1,3 Isr: 2
        Topic: agent_ping       Partition: 9    Leader: 2       Replicas: 2,3,1 Isr: 2
        Topic: agent_ping       Partition: 12   Leader: 2       Replicas: 2,1,3 Isr: 2
        Topic: agent_ping       Partition: 13   Leader: 3       Replicas: 3,2,1 Isr: 1,3
        Topic: agent_ping       Partition: 14   Leader: 1       Replicas: 1,3,2 Isr: 3,1
        Topic: agent_ping       Partition: 15   Leader: 2       Replicas: 2,3,1 Isr: 2
        Topic: agent_ping       Partition: 16   Leader: 3       Replicas: 3,1,2 Isr: 1,3
        Topic: agent_ping       Partition: 17   Leader: 1       Replicas: 1,2,3 Isr: 3,1
        Topic: agent_ping       Partition: 18   Leader: 2       Replicas: 2,1,3 Isr: 2
        Topic: imback   Partition: 0    Leader: 3       Replicas: 3,1,2 Isr: 1,3
        Topic: imback   Partition: 1    Leader: 1       Replicas: 1,2,3 Isr: 3,1
        Topic: imback   Partition: 2    Leader: 2       Replicas: 2,3,1 Isr: 2
        Topic: imback   Partition: 3    Leader: 3       Replicas: 3,2,1 Isr: 1,3
        Topic: imback   Partition: 4    Leader: 1       Replicas: 1,3,2 Isr: 3,1
{code}
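A quick way to spot the affected partitions in that output is to filter for an ISR that has shrunk to a single replica. This is only a hypothetical helper (the awk filter is not part of Kafka; the script path and ZooKeeper address are taken from the command above):

```shell
# Hypothetical helper: print only partitions whose ISR contains a
# single replica, by filtering the kafka-topics.sh describe output.
./kafka-topics.sh --describe --zookeeper 127.0.0.1:2181 \
  | awk -F'Isr: ' 'NF > 1 && $2 !~ /,/ { print }'
```

In the output above, this would list exactly the partitions whose leader is the stuck broker 2.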

Kafka should be able to handle this better: determine which broker is actually 
problematic and remove its replicas accordingly.
IO problems can be caused by hardware failures, kernel misconfiguration or 
other issues, and are not that infrequent.

Kafka is meant to be highly available, but in this case it is not.

Reproducing this by creating IO that blocks a process is not easy, but the 
same symptoms are easily reproducible using NFS.
Set up a simple NFS server 
(https://help.ubuntu.com/community/SettingUpNFSHowTo), mount an NFS partition 
at the broker's log.dirs, and once the cluster is working, stop NFS on the 
server (service nfs-kernel-server stop).

This will make the broker hang waiting for IO.
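The reproduction steps can be sketched as a shell session. The export and mount paths are assumptions for illustration; only the NFS setup and the nfs-kernel-server stop come from the steps above:

```shell
# On a helper host: install and export a directory over NFS
# (export path /srv/kafka-logs is an assumption for this sketch).
sudo apt-get install -y nfs-kernel-server
echo '/srv/kafka-logs *(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# On the broker host: mount the export where the broker keeps its logs
# (mount point /var/kafka-logs is an assumption), then point log.dirs
# in server.properties at it, e.g. log.dirs=/var/kafka-logs
sudo mount -t nfs nfs-server:/srv/kafka-logs /var/kafka-logs

# Once the cluster is healthy, stop NFS on the server. The broker's IO
# threads then block in uninterruptible sleep, reproducing the
# partially-working cluster described above.
sudo service nfs-kernel-server stop
```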



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
