I see no errors related to zookeeper. I searched all the logs (kafka and
zookeeper) on all 4 servers for every entry in the minute around the ISR
change at 08:23:54. Here are the logs:

Node n1
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Broker 1 cached leader info
(LeaderAndIsrInfo:(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63),ReplicationFactor:4),AllReplicas:1,5,2,4)
for partition topic1-0 in response to UpdateMetadata request sent by
controller 4 epoch 64 with correlation id 456 (state.change.logger)

Node n2
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Broker 2 cached leader info
(LeaderAndIsrInfo:(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63),ReplicationFactor:4),AllReplicas:1,5,2,4)
for partition topic1-0 in response to UpdateMetadata request sent by
controller 4 epoch 64 with correlation id 490 (state.change.logger)

Node n4 (controller)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,150] TRACE
Controller 4 epoch 64 sending UpdateMetadata request
(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63) to brokers Set(1,
5, 2, 4) for partition topic1-0 (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Broker 4 cached leader info
(LeaderAndIsrInfo:(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63),ReplicationFactor:4),AllReplicas:1,5,2,4)
for partition topic1-0 in response to UpdateMetadata request sent by
controller 4 epoch 64 with correlation id 490 (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Controller 4 epoch 64 received response {error_code=0} for a request sent
to broker n5:9090 (id: 5 rack: null) (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Controller 4 epoch 64 received response {error_code=0} for a request sent
to broker n4:9090 (id: 4 rack: null) (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Controller 4 epoch 64 received response {error_code=0} for a request sent
to broker n2:9090 (id: 2 rack: null) (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Controller 4 epoch 64 received response {error_code=0} for a request sent
to broker n1:9090 (id: 1 rack: null) (state.change.logger)
kafka_2.12-0.10.2.0/logs/controller.log:[2017-03-23 08:23:55,149] DEBUG
[IsrChangeNotificationListener on Controller 4]: ISR change notification
listener fired (kafka.controller.IsrChangeNotificationListener)
kafka_2.12-0.10.2.0/logs/controller.log:[2017-03-23 08:23:55,150] DEBUG
[IsrChangeNotificationListener on Controller 4]: Sending MetadataRequest to
Brokers:Vector(1, 2, 4, 5) for TopicAndPartitions:Set([topic1,0])
(kafka.controller.IsrChangeNotificationListener)
kafka_2.12-0.10.2.0/logs/controller.log:[2017-03-23 08:23:55,152] DEBUG
[IsrChangeNotificationListener on Controller 4]: ISR change notification
listener fired (kafka.controller.IsrChangeNotificationListener)

Node n5 (leader for topic1-0)
kafka_2.12-0.10.2.0/logs/server.log:[2017-03-23 08:23:54,896] INFO
Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0]
from 5,4,1,2 to 5,4,2 (kafka.cluster.Partition)
kafka_2.12-0.10.2.0/logs/server.log:[2017-03-23 08:23:54,898] INFO
Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from
5,4,2 to 5,4,2,1 (kafka.cluster.Partition)

kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Broker 5 cached leader info
(LeaderAndIsrInfo:(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63),ReplicationFactor:4),AllReplicas:1,5,2,4)
for partition topic1-0 in response to UpdateMetadata request sent by
controller 4 epoch 64 with correlation id 490 (state.change.logger)

There are no log entries before "Shrinking ISR for partition [topic1,0]
from 5,4,1,2 to 5,4,2". The cluster is idle with no network problems: 4
kafka nodes serving 2 topics with 1 partition each, 6 consumers, and time
synchronized across all nodes. I find it strange that in my case the time
difference between shrinking and expanding the ISR is very small (3-4 ms).
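
For reference, a minimal sketch (in Java, just illustrative) that measures
those shrink/expand gaps from the leader's server.log; it assumes the log
path and timestamp layout shown above and simply pairs each "Shrinking ISR"
line with the next "Expanding ISR" line:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

public class IsrGap {
    // Matches the "[2017-03-23 08:23:54,896]" prefix of the broker logs.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0]
                : "kafka_2.12-0.10.2.0/logs/server.log";
        List<String> lines = Files.readAllLines(Paths.get(path));
        LocalDateTime lastShrink = null;
        for (String line : lines) {
            if (!line.startsWith("[") || line.length() < 25) continue;
            if (line.contains("Shrinking ISR for partition")) {
                lastShrink = LocalDateTime.parse(line.substring(1, 24), FMT);
            } else if (line.contains("Expanding ISR for partition")
                    && lastShrink != null) {
                LocalDateTime expand = LocalDateTime.parse(line.substring(1, 24), FMT);
                System.out.println(lastShrink + " -> " + expand + ": "
                        + Duration.between(lastShrink, expand).toMillis() + " ms");
                lastShrink = null;
            }
        }
    }
}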

Best regards,
Radu

On Thu, Mar 23, 2017 at 12:18 AM, James Cheng <wushuja...@gmail.com> wrote:

> Marcos, Radu,
>
> Are either of you seeing messages saying "Cached zkVersion [...] not equal
> to that in zookeeper"? If so, you may be hitting
> https://issues.apache.org/jira/browse/KAFKA-3042
>
> Not sure if that helps you, since I haven't been able to isolate a cause.
> I think it's related to flaky connections to the zookeeper cluster.
>
> -James
>
>
> > On Mar 22, 2017, at 11:46 AM, Marcos Juarez <mjua...@gmail.com> wrote:
> >
> > We're seeing the same exact pattern of ISR shrinking/resizing, mostly on
> partitions with the largest volume, with thousands of messages per second.
> It happens at least a hundred times a day in our production cluster. We do
> have hundreds of topics in our cluster, most of them with 20 or more
> partitions, but most of them see only a few hundred messages per minute.
> >
> > We're running Kafka 0.10.0.1, and we thought upgrading to the 0.10.1.1
> version would fix the issue, but we've already deployed that version to our
> staging cluster, and we're seeing the same problem.  We still haven't tried
> out the latest 0.10.2.0 version, but I don't see any evidence pointing to a
> fix for that.
> >
> > This ticket seems to have some similar details, but it doesn't seem like
> there has been any follow-up, and there's no target release for a fix:
> >
> > https://issues.apache.org/jira/browse/KAFKA-4674
> >
> > Jun Ma, what exactly did you do to fail over the controller to a new
> broker? If that works for you, I'd like to try it in our staging clusters.
> >
> > Thanks,
> >
> > Marcos Juarez
> >
> >
> >
> >
> >
> > On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj.saber1...@gmail.com> wrote:
> > I have a similar issue with our cluster. We don’t know the root cause,
> but we have some interesting observations.
> >
> > 1. We do see correlation between ISR churn and fetcher connection
> close/create.
> >
> >
> > 2. We’ve tried adding a broker which doesn’t have any partitions on it,
> dedicated to the controller (rolling restart the existing brokers and fail
> over the controller to the newly added broker), and that indeed eliminated
> the random ISR churn. We have a cluster of 6 brokers (besides the dedicated
> controller) and each one has about 300 partitions on it. I suspect that a
> kafka broker cannot handle running the controller + 300 partitions.
> >
> > Anyway, that’s what I’ve got so far; I’d also like to know how to debug this.
> > We’re running kafka 0.9.0.1 with heap size 8G.
> >
> > Thanks,
> > Jun
> >
> >> On Mar 22, 2017, at 7:06 AM, Manikumar <manikumar.re...@gmail.com> wrote:
> >>
> >> Any errors related to zookeeper session timeouts? We can also check for
> >> excessive GC.
> >> Sometimes this may be due to multiple controllers forming because of soft
> >> failures.
> >> You can check ActiveControllerCount on the brokers; only one broker in the
> >> cluster should have 1.
> >> Also check for network issues/partitions.
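> >>
> >> A minimal JMX sketch (just illustrative; it assumes the brokers are started
> >> with JMX enabled, e.g. JMX_PORT=9999, and that host/port are passed on the
> >> command line) to read that ActiveControllerCount gauge from one broker:
> >>
> >> import javax.management.MBeanServerConnection;
> >> import javax.management.ObjectName;
> >> import javax.management.remote.JMXConnector;
> >> import javax.management.remote.JMXConnectorFactory;
> >> import javax.management.remote.JMXServiceURL;
> >>
> >> public class ControllerCountCheck {
> >>     public static void main(String[] args) throws Exception {
> >>         // Host/port are assumptions; point them at each broker's JMX endpoint.
> >>         String host = args.length > 0 ? args[0] : "localhost";
> >>         String port = args.length > 1 ? args[1] : "9999";
> >>         JMXServiceURL url = new JMXServiceURL(
> >>                 "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
> >>         JMXConnector connector = JMXConnectorFactory.connect(url);
> >>         try {
> >>             MBeanServerConnection conn = connector.getMBeanServerConnection();
> >>             // Gauge published by every broker; exactly one should report 1.
> >>             ObjectName mbean = new ObjectName(
> >>                     "kafka.controller:type=KafkaController,name=ActiveControllerCount");
> >>             System.out.println(host + ":" + port + " ActiveControllerCount="
> >>                     + conn.getAttribute(mbean, "Value"));
> >>         } finally {
> >>             connector.close();
> >>         }
> >>     }
> >> }
> >>
> >> Run it against every broker in the cluster; if more than one reports 1, you
> >> are probably in the multiple-controller situation described above.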
> >>
> >> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rradu...@gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> Does anyone know how I can debug high ISR churn on the kafka leader in a
> >>> cluster without traffic? I have 2 topics on a 4 node cluster (replication
> >>> factor 4 and 3) and both show constant changes in the number of in-sync
> >>> replicas:
> >>>
> >>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
> >>> (kafka.cluster.Partition)
> >>>
> >>> and
> >>>
> >>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5:
> Shrinking
> >>> ISR for partition [topic1,0] from 5,2,4 to 5,4
> (kafka.cluster.Partition)
> >>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> >>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5:
> Shrinking
> >>> ISR for partition [topic1,0] from 5,4,2 to 5,4
> (kafka.cluster.Partition)
> >>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> >>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1
> (kafka.cluster.Partition)
> >>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5:
> Shrinking
> >>> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2
> (kafka.cluster.Partition)
> >>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5:
> Shrinking
> >>> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4
> (kafka.cluster.Partition)
> >>>
> >>> I have tried increasing the num.network.threads (now 8) and
> >>> num.replica.fetchers (now 2) but nothing has changed.
> >>>
> >>> The kafka server config is:
> >>>
> >>> default.replication.factor=4
> >>> log.retention.check.interval.ms=300000
> >>> log.retention.hours=168
> >>> log.roll.hours=24
> >>> log.segment.bytes=104857600
> >>> min.insync.replicas=2
> >>> num.io.threads=8
> >>> num.network.threads=15
> >>> num.partitions=1
> >>> num.recovery.threads.per.data.dir=1
> >>> num.replica.fetchers=2
> >>> offsets.topic.num.partitions=1
> >>> offsets.topic.replication.factor=3
> >>> replica.lag.time.max.ms=500
> >>> socket.receive.buffer.bytes=102400
> >>> socket.request.max.bytes=104857600
> >>> socket.send.buffer.bytes=102400
> >>> unclean.leader.election.enable=false
> >>> zookeeper.connection.timeout.ms=3000
> >>>
> >>> Best regards,
> >>> Radu
> >>>
> >
> >
>
>
