Re: One Kafka Broker Went Rogue

2016-12-29 Thread Tony Liu
​Hi,

I just found a reported issue,
https://issues.apache.org/jira/browse/KAFKA-4477, hopefully, it's useful
for you.​

On Thu, Dec 29, 2016 at 12:08 PM, Tony Liu  wrote:

> Hi Thomas or Anyone,
>
> I also encountered the same issue like you reported, the only workaround
> is to restart that broken node, but I did not find root cause how to solve
> it right now, so I wonder do you have some progress how to solve that issue
> right now?
>
> i.e, at the beginning, I thought this issue was caused by `ulimit`, after
> I increase to 100k, that random error still happen after several days, so I
> think this issue may be related with kafka itself.
>
> thanks.
>
> On Tue, Dec 6, 2016 at 12:59 PM, Thomas DeVoe  wrote:
>
>> Hi All,
>>
>> This happened again on our kafka cluster - a single kafka broker seems to
>> "forget" the existence of the rest of the cluster and shrinks all of its
>> ISRs to only exist on that node. The other two nodes get stuck in a loop
>> trying to connect to this rogue node and never even register that it is no
>> longer part of the cluster. Strangely network connection between all of
>> these nodes is fine at that time and restarting the node resolves it
>> (though with some data loss due to unclean leader elections)
>>
>> Anyone have any ideas? Help would be greatly appreciated.
>>
>> Thanks,
>>
>>
>>
>> 
>>
>>
>> *Tom DeVoe*
>> Sofware Engineer, Data
>>
>> 6 East 32nd Street, 2nd Floor
>> New York, NY 10016
>>
>>
>>
>> Dataminr is a Twitter Official Partner.
>> Dataminr in the news: The Economist
>> > -data-firms-are-shedding-new-light-corporate-performance-watchers>
>>  | International Business Times
>> > haystack-problem-hedge-funds-banks-1576692>
>>  | Equities.com
>> > dataminr-and-the-alternative-data-industry-is-becoming-mainstream>
>>  | SIA
>> 
>>
>>
>> On Tue, Nov 29, 2016 at 1:29 PM, Thomas DeVoe 
>> wrote:
>>
>> > Hi,
>> >
>> > I encountered a strange issue in our kafka cluster, where randomly a
>> > single broker entered a state where it seemed to think it was the only
>> > broker in the cluster (it shrank all of its ISRs to just existing on
>> > itself). Some details about the kafka cluster:
>> >
>> > - running in an EC2 VPC on AWS
>> > - 3 nodes (d2.xlarge)
>> > - Kafka version : 0.10.1.0
>> >
>> > More information about the incident:
>> >
>> > Around 19:57 yesterday, one of the nodes somehow lost its connection to
>> > the cluster and started reporting messages like this for what seemed to
>> be
>> > all of its hosted topic partitions:
>> >
>> > [2016-11-28 19:57:05,426] INFO Partition [arches_stage,0] on broker
>> 1002:
>> >> Shrinking ISR for partition [arches_stage,0] from 1003,1002,1001 to
>> 1002
>> >> (kafka.cluster.Partition)
>> >> [2016-11-28 19:57:05,466] INFO Partition [connect-offsets,13] on broker
>> >> 1002: Shrinking ISR for partition [connect-offsets,13] from
>> 1003,1002,1001
>> >> to 1002 (kafka.cluster.Partition)
>> >> [2016-11-28 19:57:05,489] INFO Partition [lasagna_prod_memstore,2] on
>> >> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
>> >> 1003,1002,1001 to 1002 (kafka.cluster.Partition)
>> >> ...
>> >>
>> >
>> > It then added the ISRs from the other machines back in:
>> >
>> > [2016-11-28 19:57:18,013] INFO Partition [arches_stage,0] on broker
>> 1002:
>> >> Expanding ISR for partition [arches_stage,0] from 1002 to 1002,1003
>> >> (kafka.cluster.Partition)
>> >> [2016-11-28 19:57:18,015] INFO Partition [connect-offsets,13] on broker
>> >> 1002: Expanding ISR for partition [connect-offsets,13] from 1002 to
>> >> 1002,1003 (kafka.cluster.Partition)
>> >> [2016-11-28 19:57:18,018] INFO Partition [lasagna_prod_memstore,2] on
>> >> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
>> >> 1002 to 1002,1003 (kafka.cluster.Partition)
>> >> ...
>> >> [2016-11-28 19:57:18,222] INFO Partition [arches_stage,0] on broker
>> 1002:
>> >> Expanding ISR for partition [arches_stage,0] from 1002,1003 to
>> >> 1002,1003,1001 (kafka.cluster.Partition)
>> >> [2016-11-28 19:57:18,224] INFO Partition [connect-offsets,13] on broker
>> >> 1002: Expanding ISR for partition [connect-offsets,13] from 1002,1003
>> to
>> >> 1002,1003,1001 (kafka.cluster.Partition)
>> >> [2016-11-28 19:57:18,227] INFO Partition [lasagna_prod_memstore,2] on
>> >> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
>> >> 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)
>> >
>> >
>> > and eventually removed them again before going on its merry way:
>> >
>> > [2016-11-28 19:58:05,408] INFO Partition [arches_stage,0] on broker
>> 1002:
>> >> Shrinking ISR for partition [arches_stage,0] from 1002,1003,1001 to
>> 1002
>> >> (kafka.cluster.Partition)
>> >> [2016-

Re: One Kafka Broker Went Rogue

2016-12-29 Thread Tony Liu
Hi Thomas or Anyone,

I also encountered the same issue like you reported, the only workaround is
to restart that broken node, but I did not find root cause how to solve it
right now, so I wonder do you have some progress how to solve that issue
right now?

i.e, at the beginning, I thought this issue was caused by `ulimit`, after I
increase to 100k, that random error still happen after several days, so I
think this issue may be related with kafka itself.

thanks.

On Tue, Dec 6, 2016 at 12:59 PM, Thomas DeVoe  wrote:

> Hi All,
>
> This happened again on our kafka cluster - a single kafka broker seems to
> "forget" the existence of the rest of the cluster and shrinks all of its
> ISRs to only exist on that node. The other two nodes get stuck in a loop
> trying to connect to this rogue node and never even register that it is no
> longer part of the cluster. Strangely network connection between all of
> these nodes is fine at that time and restarting the node resolves it
> (though with some data loss due to unclean leader elections)
>
> Anyone have any ideas? Help would be greatly appreciated.
>
> Thanks,
>
>
>
> 
>
>
> *Tom DeVoe*
> Sofware Engineer, Data
>
> 6 East 32nd Street, 2nd Floor
> New York, NY 10016
>
>
>
> Dataminr is a Twitter Official Partner.
> Dataminr in the news: The Economist
>  alternative-data-firms-are-shedding-new-light-corporate-
> performance-watchers>
>  | International Business Times
>  needle-haystack-problem-hedge-funds-banks-1576692>
>  | Equities.com
>  how-dataminr-and-the-alternative-data-industry-is-becoming-mainstream>
>  | SIA
> 
>
>
> On Tue, Nov 29, 2016 at 1:29 PM, Thomas DeVoe  wrote:
>
> > Hi,
> >
> > I encountered a strange issue in our kafka cluster, where randomly a
> > single broker entered a state where it seemed to think it was the only
> > broker in the cluster (it shrank all of its ISRs to just existing on
> > itself). Some details about the kafka cluster:
> >
> > - running in an EC2 VPC on AWS
> > - 3 nodes (d2.xlarge)
> > - Kafka version : 0.10.1.0
> >
> > More information about the incident:
> >
> > Around 19:57 yesterday, one of the nodes somehow lost its connection to
> > the cluster and started reporting messages like this for what seemed to
> be
> > all of its hosted topic partitions:
> >
> > [2016-11-28 19:57:05,426] INFO Partition [arches_stage,0] on broker 1002:
> >> Shrinking ISR for partition [arches_stage,0] from 1003,1002,1001 to 1002
> >> (kafka.cluster.Partition)
> >> [2016-11-28 19:57:05,466] INFO Partition [connect-offsets,13] on broker
> >> 1002: Shrinking ISR for partition [connect-offsets,13] from
> 1003,1002,1001
> >> to 1002 (kafka.cluster.Partition)
> >> [2016-11-28 19:57:05,489] INFO Partition [lasagna_prod_memstore,2] on
> >> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
> >> 1003,1002,1001 to 1002 (kafka.cluster.Partition)
> >> ...
> >>
> >
> > It then added the ISRs from the other machines back in:
> >
> > [2016-11-28 19:57:18,013] INFO Partition [arches_stage,0] on broker 1002:
> >> Expanding ISR for partition [arches_stage,0] from 1002 to 1002,1003
> >> (kafka.cluster.Partition)
> >> [2016-11-28 19:57:18,015] INFO Partition [connect-offsets,13] on broker
> >> 1002: Expanding ISR for partition [connect-offsets,13] from 1002 to
> >> 1002,1003 (kafka.cluster.Partition)
> >> [2016-11-28 19:57:18,018] INFO Partition [lasagna_prod_memstore,2] on
> >> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
> >> 1002 to 1002,1003 (kafka.cluster.Partition)
> >> ...
> >> [2016-11-28 19:57:18,222] INFO Partition [arches_stage,0] on broker
> 1002:
> >> Expanding ISR for partition [arches_stage,0] from 1002,1003 to
> >> 1002,1003,1001 (kafka.cluster.Partition)
> >> [2016-11-28 19:57:18,224] INFO Partition [connect-offsets,13] on broker
> >> 1002: Expanding ISR for partition [connect-offsets,13] from 1002,1003 to
> >> 1002,1003,1001 (kafka.cluster.Partition)
> >> [2016-11-28 19:57:18,227] INFO Partition [lasagna_prod_memstore,2] on
> >> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
> >> 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)
> >
> >
> > and eventually removed them again before going on its merry way:
> >
> > [2016-11-28 19:58:05,408] INFO Partition [arches_stage,0] on broker 1002:
> >> Shrinking ISR for partition [arches_stage,0] from 1002,1003,1001 to 1002
> >> (kafka.cluster.Partition)
> >> [2016-11-28 19:58:05,415] INFO Partition [connect-offsets,13] on broker
> >> 1002: Shrinking ISR for partition [connect-offsets,13] from
> 1002,1003,1001
> >> to 1002 (kafka.cluster.Partition)
> >> [2016-11-28 19:58:05,416] INFO Partition [lasagna_prod_memstore,2] on
> >> broker 1002: Shrinking ISR for partition [lasagna_pro

Re: One Kafka Broker Went Rogue

2016-12-06 Thread Thomas DeVoe
Hi All,

This happened again on our kafka cluster - a single kafka broker seems to
"forget" the existence of the rest of the cluster and shrinks all of its
ISRs to only exist on that node. The other two nodes get stuck in a loop
trying to connect to this rogue node and never even register that it is no
longer part of the cluster. Strangely network connection between all of
these nodes is fine at that time and restarting the node resolves it
(though with some data loss due to unclean leader elections)

Anyone have any ideas? Help would be greatly appreciated.

Thanks,






*Tom DeVoe*
Sofware Engineer, Data

6 East 32nd Street, 2nd Floor
New York, NY 10016



Dataminr is a Twitter Official Partner.
Dataminr in the news: The Economist

 | International Business Times

 | Equities.com

 | SIA



On Tue, Nov 29, 2016 at 1:29 PM, Thomas DeVoe  wrote:

> Hi,
>
> I encountered a strange issue in our kafka cluster, where randomly a
> single broker entered a state where it seemed to think it was the only
> broker in the cluster (it shrank all of its ISRs to just existing on
> itself). Some details about the kafka cluster:
>
> - running in an EC2 VPC on AWS
> - 3 nodes (d2.xlarge)
> - Kafka version : 0.10.1.0
>
> More information about the incident:
>
> Around 19:57 yesterday, one of the nodes somehow lost its connection to
> the cluster and started reporting messages like this for what seemed to be
> all of its hosted topic partitions:
>
> [2016-11-28 19:57:05,426] INFO Partition [arches_stage,0] on broker 1002:
>> Shrinking ISR for partition [arches_stage,0] from 1003,1002,1001 to 1002
>> (kafka.cluster.Partition)
>> [2016-11-28 19:57:05,466] INFO Partition [connect-offsets,13] on broker
>> 1002: Shrinking ISR for partition [connect-offsets,13] from 1003,1002,1001
>> to 1002 (kafka.cluster.Partition)
>> [2016-11-28 19:57:05,489] INFO Partition [lasagna_prod_memstore,2] on
>> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
>> 1003,1002,1001 to 1002 (kafka.cluster.Partition)
>> ...
>>
>
> It then added the ISRs from the other machines back in:
>
> [2016-11-28 19:57:18,013] INFO Partition [arches_stage,0] on broker 1002:
>> Expanding ISR for partition [arches_stage,0] from 1002 to 1002,1003
>> (kafka.cluster.Partition)
>> [2016-11-28 19:57:18,015] INFO Partition [connect-offsets,13] on broker
>> 1002: Expanding ISR for partition [connect-offsets,13] from 1002 to
>> 1002,1003 (kafka.cluster.Partition)
>> [2016-11-28 19:57:18,018] INFO Partition [lasagna_prod_memstore,2] on
>> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
>> 1002 to 1002,1003 (kafka.cluster.Partition)
>> ...
>> [2016-11-28 19:57:18,222] INFO Partition [arches_stage,0] on broker 1002:
>> Expanding ISR for partition [arches_stage,0] from 1002,1003 to
>> 1002,1003,1001 (kafka.cluster.Partition)
>> [2016-11-28 19:57:18,224] INFO Partition [connect-offsets,13] on broker
>> 1002: Expanding ISR for partition [connect-offsets,13] from 1002,1003 to
>> 1002,1003,1001 (kafka.cluster.Partition)
>> [2016-11-28 19:57:18,227] INFO Partition [lasagna_prod_memstore,2] on
>> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
>> 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)
>
>
> and eventually removed them again before going on its merry way:
>
> [2016-11-28 19:58:05,408] INFO Partition [arches_stage,0] on broker 1002:
>> Shrinking ISR for partition [arches_stage,0] from 1002,1003,1001 to 1002
>> (kafka.cluster.Partition)
>> [2016-11-28 19:58:05,415] INFO Partition [connect-offsets,13] on broker
>> 1002: Shrinking ISR for partition [connect-offsets,13] from 1002,1003,1001
>> to 1002 (kafka.cluster.Partition)
>> [2016-11-28 19:58:05,416] INFO Partition [lasagna_prod_memstore,2] on
>> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
>> 1002,1003,1001 to 1002 (kafka.cluster.Partition)
>
>
> Node 1002 continued running from that point on normally (outside of the
> fact that all of it's partitions were under replicated). Also there were no
> WARN/ERROR before/after this.
>
>
> The other two nodes were not so happy however, with both failing to
> connect to via the ReplicaFetcherThread to the node in question. The
> reported this around the same time as that error:
>
> [2016-11-28 19:57:16,087] WARN [ReplicaFetcherThread-0-1002], Error in
>> fetch kafka.server.ReplicaFetcherThread$FetchRequest@6eb44718
>> (kafka.server.ReplicaFetcherThread)
>> java.io.IOException: Connection to 1002 was disconnected before the
>> response was read
>>

One Kafka Broker Went Rogue

2016-11-29 Thread Thomas DeVoe
Hi,

I encountered a strange issue in our kafka cluster, where randomly a single
broker entered a state where it seemed to think it was the only broker in
the cluster (it shrank all of its ISRs to just existing on itself). Some
details about the kafka cluster:

- running in an EC2 VPC on AWS
- 3 nodes (d2.xlarge)
- Kafka version : 0.10.1.0

More information about the incident:

Around 19:57 yesterday, one of the nodes somehow lost its connection to the
cluster and started reporting messages like this for what seemed to be all
of its hosted topic partitions:

[2016-11-28 19:57:05,426] INFO Partition [arches_stage,0] on broker 1002:
> Shrinking ISR for partition [arches_stage,0] from 1003,1002,1001 to 1002
> (kafka.cluster.Partition)
> [2016-11-28 19:57:05,466] INFO Partition [connect-offsets,13] on broker
> 1002: Shrinking ISR for partition [connect-offsets,13] from 1003,1002,1001
> to 1002 (kafka.cluster.Partition)
> [2016-11-28 19:57:05,489] INFO Partition [lasagna_prod_memstore,2] on
> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
> 1003,1002,1001 to 1002 (kafka.cluster.Partition)
> ...
>

It then added the ISRs from the other machines back in:

[2016-11-28 19:57:18,013] INFO Partition [arches_stage,0] on broker 1002:
> Expanding ISR for partition [arches_stage,0] from 1002 to 1002,1003
> (kafka.cluster.Partition)
> [2016-11-28 19:57:18,015] INFO Partition [connect-offsets,13] on broker
> 1002: Expanding ISR for partition [connect-offsets,13] from 1002 to
> 1002,1003 (kafka.cluster.Partition)
> [2016-11-28 19:57:18,018] INFO Partition [lasagna_prod_memstore,2] on
> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
> 1002 to 1002,1003 (kafka.cluster.Partition)
> ...
> [2016-11-28 19:57:18,222] INFO Partition [arches_stage,0] on broker 1002:
> Expanding ISR for partition [arches_stage,0] from 1002,1003 to
> 1002,1003,1001 (kafka.cluster.Partition)
> [2016-11-28 19:57:18,224] INFO Partition [connect-offsets,13] on broker
> 1002: Expanding ISR for partition [connect-offsets,13] from 1002,1003 to
> 1002,1003,1001 (kafka.cluster.Partition)
> [2016-11-28 19:57:18,227] INFO Partition [lasagna_prod_memstore,2] on
> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
> 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)


and eventually removed them again before going on its merry way:

[2016-11-28 19:58:05,408] INFO Partition [arches_stage,0] on broker 1002:
> Shrinking ISR for partition [arches_stage,0] from 1002,1003,1001 to 1002
> (kafka.cluster.Partition)
> [2016-11-28 19:58:05,415] INFO Partition [connect-offsets,13] on broker
> 1002: Shrinking ISR for partition [connect-offsets,13] from 1002,1003,1001
> to 1002 (kafka.cluster.Partition)
> [2016-11-28 19:58:05,416] INFO Partition [lasagna_prod_memstore,2] on
> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
> 1002,1003,1001 to 1002 (kafka.cluster.Partition)


Node 1002 continued running from that point on normally (outside of the
fact that all of it's partitions were under replicated). Also there were no
WARN/ERROR before/after this.


The other two nodes were not so happy however, with both failing to connect
to via the ReplicaFetcherThread to the node in question. The reported this
around the same time as that error:

[2016-11-28 19:57:16,087] WARN [ReplicaFetcherThread-0-1002], Error in
> fetch kafka.server.ReplicaFetcherThread$FetchRequest@6eb44718
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 1002 was disconnected before the
> response was read
> at
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
> at
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
> at scala.Option.foreach(Option.scala:257)
> at
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
> at
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
> at
> kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
> at
> kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
> at
> kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
> at
> kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
> at
> kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
> at
> kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> at
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> at
> kafka.server