Hi Kafka users, Every ~3-4 days we are seeing a broker logging "Shrinking ISR for partition..." only to log "Expanding ISR for partition..." a few seconds later.
- The broker logging "Shrinking ISR..." is the leader of all partitions for which the ISR is shrunk. - The number of partitions is always a multiple of the total number of partitions it is leading, divided by the number of fetcher threads. - There seem to be no particular pattern to which broker kicks which broker out of the ISR. - After the broker has logged "Shrinking ISR..." for all affected partitions, it logs "Closing socket connection to..." for the broker that was kicked out, and then "Expanding ISR..." begins. - The broker that got kicked out logs "Reconnect due to socket error: java.net.SocketTimeoutException" when the leader broker closes the connection. - Consumers sometimes show latency spikes when this happens. Example leader log: ...Shrinking... [2015-06-28 05:24:51,141] INFO Partition [my-topic,24] on broker 762855627: Shrinking ISR for partition [my-topic,24] from 762855627,729301197,1249886642 to 762855627,729301197 (kafka.cluster.Partition) [2015-06-28 05:24:51,149] INFO Partition [my-topic,16] on broker 762855627: Shrinking ISR for partition [my-topic,16] from 762855627,729301197,1249886642 to 762855627,729301197 (kafka.cluster.Partition) [2015-06-28 05:25:03,295] INFO Closing socket connection to /10.95.197.138. (kafka.network.Processor) [2015-06-28 05:25:03,301] INFO Partition [my-topic,60] on broker 762855627: Expanding ISR for partition [my-topic,60] from 762855627,729301197 to 762855627,729301197,1249886642 (kafka.cluster.Partition) [2015-06-28 05:25:03,339] INFO Partition [my-topic,206] on broker 762855627: Expanding ISR for partition [my-topic,206] from 762855627,729301197 to 762855627,729301197,1249886642 (kafka.cluster.Partition) ...Expanding... Example follower log: [2015-06-28 05:25:03,299] INFO Reconnect due to socket error: java.net.SocketTimeoutException (kafka.consumer.SimpleConsumer) I am tempted to just increase "replica.lag.time.max.ms" from the default 10 seconds that we are currently using, but the SocketTimeoutException is suggesting that there may be something else going on here. It also surprises me that consumers are affected by this, since no broker partition leadership re-election is taking place. Any advice on how to proceed would be much appreciated. The setup: Three Kafka 0.8.2 brokers running on AWS EC2 c3.2xlarge instances. Three ZooKeepers 3.3.6 running on AWS EC2 m1.large instances. Broker configs: auto.leader.rebalance.enable=true log.segment.bytes=536870912 socket.receive.buffer.bytes=1048576 socket.send.buffer.bytes=1048576 log.flush.scheduler.interval.ms=3000 log.flush.interval.ms=3000 controlled.shutdown.enable=true delete.topic.enable=true zookeeper.connection.timeout.ms=1000000 replica.fetch.max.bytes=10485760 num.replica.fetchers=4 log.dirs=XXX log.retention.minutes=5256000 message.max.bytes=10485760 log.flush.interval.messages=10000 auto.create.topics.enable=false port=9092 zookeeper.connect=XXX broker.id=XXX log.roll.hours=24 Peak load: total messages per second: 80 total bytes in per second: 110 KiB total bytes out per second: 440 KiB (220 of which is replication) Thanks, Christofer
