Re: One Kafka Broker Went Rogue

2016-12-06 Thread Thomas DeVoe
Hi All,

This happened again on our Kafka cluster - a single broker seems to
"forget" that the rest of the cluster exists and shrinks all of its ISRs
down to just that node. The other two nodes get stuck in a loop trying to
connect to this rogue node and never register that it is no longer part of
the cluster. Strangely, the network connection between all of these nodes
is fine at the time, and restarting the rogue node resolves the issue
(though with some data loss due to unclean leader elections).
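
For reference, a minimal sketch of one way to check, while this is
happening, which broker IDs ZooKeeper still lists as live and which broker
holds the controller role (the connect string below is a placeholder for
our ensemble, and this assumes the plain org.apache.zookeeper client on the
classpath):

import java.util.concurrent.CountDownLatch
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
import org.apache.zookeeper.Watcher.Event.KeeperState
import scala.collection.JavaConverters._

// Sketch only: list the broker ids registered in ZooKeeper and the current
// controller znode. Ephemeral znodes under /brokers/ids disappear when a
// broker's ZooKeeper session expires, so this shows whether the "rogue"
// broker is still considered alive by ZooKeeper.
object BrokerRegistrationCheck {
  def main(args: Array[String]): Unit = {
    val connected = new CountDownLatch(1)
    // Placeholder connect string; use the same ensemble the brokers point at.
    val zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, new Watcher {
      override def process(event: WatchedEvent): Unit =
        if (event.getState == KeeperState.SyncConnected) connected.countDown()
    })
    connected.await()
    try {
      val liveBrokers = zk.getChildren("/brokers/ids", false).asScala.sorted
      val controller = new String(zk.getData("/controller", false, null), "UTF-8")
      println(s"registered broker ids: ${liveBrokers.mkString(", ")}")
      println(s"controller znode: $controller")
    } finally zk.close()
  }
}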

Anyone have any ideas? Help would be greatly appreciated.

Thanks,





*Tom DeVoe*
Software Engineer, Data

6 East 32nd Street, 2nd Floor
New York, NY 10016





One Kafka Broker Went Rogue

2016-11-29 Thread Thomas DeVoe
Hi,

I encountered a strange issue in our Kafka cluster, where a single broker
randomly entered a state in which it seemed to think it was the only broker
in the cluster (it shrank all of its ISRs down to just itself). Some
details about the Kafka cluster:

- running in an EC2 VPC on AWS
- 3 nodes (d2.xlarge)
- Kafka version: 0.10.1.0

More information about the incident:

Around 19:57 yesterday, one of the nodes somehow lost its connection to the
cluster and started reporting messages like this for what seemed to be all
of its hosted topic partitions:

[2016-11-28 19:57:05,426] INFO Partition [arches_stage,0] on broker 1002:
> Shrinking ISR for partition [arches_stage,0] from 1003,1002,1001 to 1002
> (kafka.cluster.Partition)
> [2016-11-28 19:57:05,466] INFO Partition [connect-offsets,13] on broker
> 1002: Shrinking ISR for partition [connect-offsets,13] from 1003,1002,1001
> to 1002 (kafka.cluster.Partition)
> [2016-11-28 19:57:05,489] INFO Partition [lasagna_prod_memstore,2] on
> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
> 1003,1002,1001 to 1002 (kafka.cluster.Partition)
> ...
>
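
(For context, my understanding of what drives these messages - and I may be
wrong here - is that the leader drops a follower from the ISR once that
follower has not caught up within replica.lag.time.max.ms, which defaults
to 10000 ms. A standalone sketch of that rule, not Kafka's actual code,
with made-up types:)

// Standalone sketch of the ISR-shrink rule, not Kafka's actual code.
// FollowerState and the numbers below are made up for illustration.
object IsrShrinkSketch {
  case class FollowerState(brokerId: Int, lastCaughtUpTimeMs: Long)

  val replicaLagTimeMaxMs = 10000L // broker config replica.lag.time.max.ms

  // The leader keeps itself plus every follower that caught up recently enough.
  def shrinkIsr(isr: Set[Int], followers: Seq[FollowerState],
                nowMs: Long, leaderId: Int): Set[Int] = {
    val lagging = followers
      .filter(f => nowMs - f.lastCaughtUpTimeMs > replicaLagTimeMaxMs)
      .map(_.brokerId)
      .toSet
    (isr -- lagging) + leaderId
  }

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    // Both followers last caught up 15 s ago, past the 10 s threshold, so the
    // ISR collapses to just the leader (1002), matching the log lines above.
    val followers = Seq(FollowerState(1001, now - 15000), FollowerState(1003, now - 15000))
    println(shrinkIsr(Set(1001, 1002, 1003), followers, now, leaderId = 1002))
  }
}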

It then added the other brokers back into the ISRs:

[2016-11-28 19:57:18,013] INFO Partition [arches_stage,0] on broker 1002:
> Expanding ISR for partition [arches_stage,0] from 1002 to 1002,1003
> (kafka.cluster.Partition)
> [2016-11-28 19:57:18,015] INFO Partition [connect-offsets,13] on broker
> 1002: Expanding ISR for partition [connect-offsets,13] from 1002 to
> 1002,1003 (kafka.cluster.Partition)
> [2016-11-28 19:57:18,018] INFO Partition [lasagna_prod_memstore,2] on
> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
> 1002 to 1002,1003 (kafka.cluster.Partition)
> ...
> [2016-11-28 19:57:18,222] INFO Partition [arches_stage,0] on broker 1002:
> Expanding ISR for partition [arches_stage,0] from 1002,1003 to
> 1002,1003,1001 (kafka.cluster.Partition)
> [2016-11-28 19:57:18,224] INFO Partition [connect-offsets,13] on broker
> 1002: Expanding ISR for partition [connect-offsets,13] from 1002,1003 to
> 1002,1003,1001 (kafka.cluster.Partition)
> [2016-11-28 19:57:18,227] INFO Partition [lasagna_prod_memstore,2] on
> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
> 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)


and eventually removed them again before going on its merry way:

[2016-11-28 19:58:05,408] INFO Partition [arches_stage,0] on broker 1002:
> Shrinking ISR for partition [arches_stage,0] from 1002,1003,1001 to 1002
> (kafka.cluster.Partition)
> [2016-11-28 19:58:05,415] INFO Partition [connect-offsets,13] on broker
> 1002: Shrinking ISR for partition [connect-offsets,13] from 1002,1003,1001
> to 1002 (kafka.cluster.Partition)
> [2016-11-28 19:58:05,416] INFO Partition [lasagna_prod_memstore,2] on
> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
> 1002,1003,1001 to 1002 (kafka.cluster.Partition)


Node 1002 continued running normally from that point on (aside from the
fact that all of its partitions were under-replicated). There were also no
WARN/ERROR log messages before or after this.
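
(In case it helps: one way to confirm the under-replicated count on 1002 is
to read the broker's UnderReplicatedPartitions gauge over JMX. A minimal
sketch, assuming JMX is enabled on the broker; the host and port below are
placeholders:)

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

// Sketch only: read kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
// from a broker's JMX endpoint. Host/port are placeholders.
object UnderReplicatedCheck {
  def main(args: Array[String]): Unit = {
    val url = new JMXServiceURL(
      "service:jmx:rmi:///jndi/rmi://broker-1002.example.internal:9999/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    try {
      val conn = connector.getMBeanServerConnection
      val mbean = new ObjectName(
        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions")
      val value = conn.getAttribute(mbean, "Value")
      println(s"UnderReplicatedPartitions = $value")
    } finally connector.close()
  }
}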


The other two nodes were not so happy, however, with both failing to connect
to the node in question via their ReplicaFetcherThreads. They reported errors
like this around the same time:

[2016-11-28 19:57:16,087] WARN [ReplicaFetcherThread-0-1002], Error in
> fetch kafka.server.ReplicaFetcherThread$FetchRequest@6eb44718
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 1002 was disconnected before the
> response was read
> at
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
> at
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
> at scala.Option.foreach(Option.scala:257)
> at
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
> at
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
> at
> kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
> at
> kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
> at
> kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
> at
> kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
> at
> kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
> at
> kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> at
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> at
> kafka.server