Hi Kafka users,

Every ~3-4 days we are seeing a broker logging "Shrinking ISR for
partition..." only to log "Expanding ISR for partition..." a few seconds
later.

- The broker logging "Shrinking ISR..." is the leader of all partitions for
which the ISR is shrunk.
- The number of partitions is always a multiple of the total number of
partitions it is leading, divided by the number of fetcher threads.
- There seem to be no particular pattern to which broker kicks which broker
out of the ISR.
- After the broker has logged "Shrinking ISR..." for all affected
partitions, it logs "Closing socket connection to..." for the broker that
was kicked out, and then "Expanding ISR..." begins.
- The broker that got kicked out logs "Reconnect due to socket error:
java.net.SocketTimeoutException" when the leader broker closes the
connection.
- Consumers sometimes show latency spikes when this happens.


Example leader log:
...Shrinking...
[2015-06-28 05:24:51,141] INFO Partition [my-topic,24] on broker 762855627:
Shrinking ISR for partition [my-topic,24] from
762855627,729301197,1249886642 to 762855627,729301197
(kafka.cluster.Partition)
[2015-06-28 05:24:51,149] INFO Partition [my-topic,16] on broker 762855627:
Shrinking ISR for partition [my-topic,16] from
762855627,729301197,1249886642 to 762855627,729301197
(kafka.cluster.Partition)
[2015-06-28 05:25:03,295] INFO Closing socket connection to /10.95.197.138.
(kafka.network.Processor)
[2015-06-28 05:25:03,301] INFO Partition [my-topic,60] on broker 762855627:
Expanding ISR for partition [my-topic,60] from 762855627,729301197 to
762855627,729301197,1249886642 (kafka.cluster.Partition)
[2015-06-28 05:25:03,339] INFO Partition [my-topic,206] on broker
762855627: Expanding ISR for partition [my-topic,206] from
762855627,729301197 to 762855627,729301197,1249886642
(kafka.cluster.Partition)
...Expanding...

Example follower log:
[2015-06-28 05:25:03,299] INFO Reconnect due to socket error:
java.net.SocketTimeoutException (kafka.consumer.SimpleConsumer)


I am tempted to just increase "replica.lag.time.max.ms" from the default 10
seconds that we are currently using, but the SocketTimeoutException is
suggesting that there may be something else going on here. It also
surprises me that consumers are affected by this, since no broker partition
leadership re-election is taking place. Any advice on how to proceed would
be much appreciated.


The setup:
Three Kafka 0.8.2 brokers running on AWS EC2 c3.2xlarge instances.
Three ZooKeepers 3.3.6 running on AWS EC2 m1.large instances.

Broker configs:
auto.leader.rebalance.enable=true
log.segment.bytes=536870912
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
log.flush.scheduler.interval.ms=3000
log.flush.interval.ms=3000
controlled.shutdown.enable=true
delete.topic.enable=true
zookeeper.connection.timeout.ms=1000000
replica.fetch.max.bytes=10485760
num.replica.fetchers=4
log.dirs=XXX
log.retention.minutes=5256000
message.max.bytes=10485760
log.flush.interval.messages=10000
auto.create.topics.enable=false
port=9092
zookeeper.connect=XXX
broker.id=XXX
log.roll.hours=24

Peak load:
total messages per second: 80
total bytes in per second: 110 KiB
total bytes out per second: 440 KiB (220 of which is replication)


Thanks,
Christofer

Reply via email to