3 node Kafka - 0.8.2.1
3 node ZK - 3.4.6

We experienced a soft-node failure on one of our brokers (#2). The process was still running, but no logs were being generated, it was not responding to JMX queries, etc.
Several consumers were unable to read from certain partitions while this was occurring; these are partitions with a replication factor of 3.

I have all the server, controller and state-change logs from the event, and am trying to filter out the relevant information. But this sequence of events for one partition in the ISR struck me (just the kafka.cluster.Partition messages); there are others just like it:

"_time",sourcetype,host,"_raw"
"2015-06-05T07:31:47.878-0600","kafka_server_log","qd-kafka8-01","[2015-06-05 07:31:47,878] INFO Partition [birdseed-user-stream,5] on broker 1: Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 3,1 (kafka.cluster.Partition)"
"2015-06-05T07:31:47.884-0600","kafka_server_log","qd-kafka8-01","[2015-06-05 07:31:47,884] INFO Partition [birdseed-user-stream,5] on broker 1: Cached zkVersion [537] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)"
"2015-06-05T07:31:57.934-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:31:57,934] INFO Partition [birdseed-user-stream,5] on broker 3: Shrinking ISR for partition [birdseed-user-stream,5] from 3,2 to 3 (kafka.cluster.Partition)"
"2015-06-05T07:31:59.454-0600","kafka_server_log","qd-kafka8-01","[2015-06-05 07:31:59,454] INFO Partition [birdseed-user-stream,5] on broker 1: Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 1 (kafka.cluster.Partition)"
"2015-06-05T07:31:59.462-0600","kafka_server_log","qd-kafka8-01","[2015-06-05 07:31:59,462] INFO Partition [birdseed-user-stream,5] on broker 1: Cached zkVersion [537] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)"
"2015-06-05T07:32:07.387-0600","kafka_server_log","qd-kafka8-01","[2015-06-05 07:32:07,387] INFO Partition [birdseed-user-stream,5] on broker 1: Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 2,1 (kafka.cluster.Partition)"
"2015-06-05T07:32:07.397-0600","kafka_server_log","qd-kafka8-01","[2015-06-05 07:32:07,397] INFO Partition [birdseed-user-stream,5] on broker 1: Cached zkVersion [537] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)"
"2015-06-05T07:32:33.619-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:32:33,619] INFO Partition [birdseed-user-stream,5] on broker 3: Expanding ISR for partition [birdseed-user-stream,5] from 3 to 3,2 (kafka.cluster.Partition)"
"2015-06-05T07:33:12.038-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:33:12,038] INFO Partition [birdseed-user-stream,5] on broker 3: Expanding ISR for partition [birdseed-user-stream,5] from 3,2 to 3,2,1 (kafka.cluster.Partition)"
"2015-06-05T07:33:38.959-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:33:38,959] INFO Partition [birdseed-user-stream,5] on broker 3: Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 3,1 (kafka.cluster.Partition)"
"2015-06-05T07:34:47.266-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:34:47,266] INFO Partition [birdseed-user-stream,5] on broker 3: Expanding ISR for partition [birdseed-user-stream,5] from 3,1 to 3,1,2 (kafka.cluster.Partition)"
"2015-06-05T07:37:00.584-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:37:00,584] INFO Partition [birdseed-user-stream,5] on broker 3: Shrinking ISR for partition [birdseed-user-stream,5] from 3,1,2 to 3 (kafka.cluster.Partition)"
"2015-06-05T07:37:00.590-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:37:00,590] INFO Partition [birdseed-user-stream,5] on broker 3: Cached zkVersion [543] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)"
"2015-06-05T07:37:09.801-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:37:09,801] INFO Partition [birdseed-user-stream,5] on broker 3: Shrinking ISR for partition [birdseed-user-stream,5] from 3,1,2 to 3 (kafka.cluster.Partition)"
"2015-06-05T07:37:09.804-0600","kafka_server_log","qd-kafka8-03","[2015-06-05 07:37:09,804] INFO Partition [birdseed-user-stream,5] on broker 3: Cached zkVersion [543] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)"

The last 2 lines repeated every 10 seconds until broker #2 was bounced.

We've seen this twice now in production, and unfortunately did not get a jstack from the frozen VM. What more information can I provide to help with this?

Thanks,
Bob Cotton
Rally Software
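
For anyone trying to filter the same kind of logs, here is a minimal sketch of how the kafka.cluster.Partition messages could be tallied per partition and broker. It assumes the exact message wording shown in the excerpt above; the names ISR_CHANGE, ZK_SKIP and summarize are hypothetical and not part of Kafka.

```python
import re
from collections import defaultdict

# Both patterns assume the message wording seen in the log excerpt above.
PREFIX = (
    r"\[(?P<ts>[\d-]+ [\d:,]+)\] INFO Partition "
    r"\[(?P<topic>[^,\]]+),(?P<part>\d+)\] on broker (?P<broker>\d+): "
)
ISR_CHANGE = re.compile(
    PREFIX
    + r"(?P<action>Shrinking|Expanding) ISR for partition \[[^\]]+\] "
    + r"from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)
ZK_SKIP = re.compile(PREFIX + r"Cached zkVersion \[(?P<version>\d+)\] not equal")


def summarize(lines):
    """Count ISR changes and skipped ISR updates per (topic, partition, broker)."""
    stats = defaultdict(lambda: {"isr_changes": 0, "zk_skips": 0})
    for line in lines:
        match = ISR_CHANGE.search(line)
        if match:
            field = "isr_changes"
        else:
            match = ZK_SKIP.search(line)
            field = "zk_skips"
        if not match:
            continue  # not a kafka.cluster.Partition message of interest
        key = (match["topic"], int(match["part"]), int(match["broker"]))
        stats[key][field] += 1
    return dict(stats)
```

Passing an open log file (or the "_raw" column of an export like the one above) to summarize gives one entry per partition and broker; a high zk_skips count with few successful ISR changes would point at the brokers stuck in the repeating pattern.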