[jira] [Updated] (KAFKA-6042) Kafka Request Handler deadlocks and brings down the cluster.
[ https://issues.apache.org/jira/browse/KAFKA-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma updated KAFKA-6042:
-------------------------------
    Affects Version/s:     (was: 1.0.0)

> Kafka Request Handler deadlocks and brings down the cluster.
> -------------------------------------------------------------
>
>                 Key: KAFKA-6042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.11.0.0, 0.11.0.1
>         Environment: kafka version: 0.11.0.1
>                      client versions: 0.8.2.1-0.10.2.1
>                      platform: aws (eu-west-1a)
>                      nodes: 36 x r4.xlarge
>                      disk storage: 2.5 tb per node (~73% usage per node)
>                      topics: 250
>                      number of partitions: 48k (approx)
>                      os: ubuntu 14.04
>                      jvm: Java(TM) SE Runtime Environment (build 1.8.0_131-b11), Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
>            Reporter: Ben Corlett
>            Assignee: Rajini Sivaram
>            Priority: Blocker
>             Fix For: 0.11.0.2, 1.0.0
>
>         Attachments: heapusage.png, thread_dump.txt.gz
>
>
> We have been experiencing a deadlock that consistently happens on the same server within our cluster, currently multiple times a week. It first started when we upgraded to 0.11.0.0; sadly, 0.11.0.1 failed to resolve the issue.
>
> Sequence of events:
>
> At a seemingly random time, broker 125 goes into a deadlock. As soon as it is deadlocked, it shrinks the ISR down to just itself for every partition it is the leader for:
>
> [2017-10-10 00:06:10,061] INFO Partition [XX,24] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,073] INFO Partition [XX,974] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,079] INFO Partition [XX,64] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,081] INFO Partition [XX,21] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,084] INFO Partition [XX,12] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,085] INFO Partition [XX,61] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,086] INFO Partition [XX,53] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,088] INFO Partition [XX,27] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,090] INFO Partition [XX,182] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
> [2017-10-10 00:06:10,091] INFO Partition [XX,16] on broker 125: Shrinking ISR from 117,125 to 125 (kafka.cluster.Partition)
>
> The other nodes then fail to connect to node 125:
>
> [2017-10-10 00:08:42,318] WARN [ReplicaFetcherThread-0-125]: Error in fetch to broker 125, request (type=FetchRequest, replicaId=101, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={XX-94=(offset=0, logStartOffset=0, maxBytes=1048576), XX-22=(offset=0, logStartOffset=0, maxBytes=1048576), XX-58=(offset=0, logStartOffset=0, maxBytes=1048576), XX-11=(offset=78932482, logStartOffset=50881481, maxBytes=1048576), XX-55=(offset=0, logStartOffset=0, maxBytes=1048576), XX-19=(offset=0, logStartOffset=0, maxBytes=1048576), XX-91=(offset=0, logStartOffset=0, maxBytes=1048576), XX-5=(offset=903857106, logStartOffset=0, maxBytes=1048576), XX-80=(offset=0, logStartOffset=0, maxBytes=1048576), XX-88=(offset=0, logStartOffset=0, maxBytes=1048576), XX-34=(offset=308, logStartOffset=308, maxBytes=1048576), XX-7=(offset=369990, logStartOffset=369990, maxBytes=1048576), XX-0=(offset=57965795, logStartOffset=0, maxBytes=1048576)}) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 125 was disconnected before the response was read
>         at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
>         at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
>
> Since node 125 removed all the other replicas from the ISR as it was locking up, a failover for any partition without an unclean leader election is not possible.
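For context on the failure mode: the classic shape of a deadlock like the one described above is two threads taking the same pair of locks in opposite order, so each ends up blocked forever on the lock the other holds. The sketch below is a minimal, generic Java reproduction of that pattern; the lock and thread names are hypothetical and chosen for illustration only, not taken from the Kafka source.

{code:java}
// Minimal, generic lock-ordering deadlock. Running this reliably hangs:
// each thread holds one lock and blocks forever waiting for the other.
// All names here are hypothetical; this is NOT the actual KAFKA-6042 code path.
public class DeadlockSketch {
    private static final Object isrLock = new Object();     // hypothetical lock A
    private static final Object handlerLock = new Object(); // hypothetical lock B

    public static void main(String[] args) {
        Thread t1 = new Thread(() -> {
            synchronized (isrLock) {
                pause();                        // widen the race window
                synchronized (handlerLock) { }  // never acquired
            }
        }, "request-handler-0");

        Thread t2 = new Thread(() -> {
            synchronized (handlerLock) {
                pause();
                synchronized (isrLock) { }      // never acquired
            }
        }, "request-handler-1");

        t1.start();
        t2.start();
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
    }
}
{code}

A jstack (or kill -3) against a process stuck this way shows each handler thread parked on a monitor owned by the other, which is the signature to look for in the attached thread_dump.txt.gz.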
[jira] [Updated] (KAFKA-6042) Kafka Request Handler deadlocks and brings down the cluster.
[ https://issues.apache.org/jira/browse/KAFKA-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Corlett updated KAFKA-6042:
-------------------------------
    Attachment: heapusage.png
[jira] [Updated] (KAFKA-6042) Kafka Request Handler deadlocks and brings down the cluster.
[ https://issues.apache.org/jira/browse/KAFKA-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma updated KAFKA-6042:
-------------------------------
    Fix Version/s: 0.11.0.2
[jira] [Updated] (KAFKA-6042) Kafka Request Handler deadlocks and brings down the cluster.
[ https://issues.apache.org/jira/browse/KAFKA-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram updated KAFKA-6042:
----------------------------------
    Fix Version/s: 0.11.0.2
                   1.0.0
[jira] [Updated] (KAFKA-6042) Kafka Request Handler deadlocks and brings down the cluster.
[ https://issues.apache.org/jira/browse/KAFKA-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram updated KAFKA-6042:
----------------------------------
    Affects Version/s: 1.0.0
             Priority: Blocker  (was: Critical)
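One practical note on diagnosing this class of hang: a thread dump such as the attached thread_dump.txt.gz is the key artifact, since jstack prints a "Found one Java-level deadlock" section whenever the JVM can prove a monitor cycle. The same check can be run in-process with the standard JDK ThreadMXBean API; a minimal sketch follows (plain java.lang.management, nothing Kafka-specific).

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints any JVM-detected deadlock cycle: which thread waits on which lock,
// and which thread owns it. findDeadlockedThreads() covers both intrinsic
// monitors and java.util.concurrent ownable synchronizers.
public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            System.out.println("No deadlocked threads.");
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.printf("%s is waiting on %s held by %s%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("        at " + frame);
            }
        }
    }
}
{code}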