[jira] [Comment Edited] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"

2019-07-02 Thread sandeep gupta (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877122#comment-16877122 ]

sandeep gupta edited comment on KAFKA-3410 at 7/2/19 4:37 PM:
--

I also encountered the same issue. Is there any solution for this? I don't want to lose any data.

In our Kafka-based network there are 4 Kafka brokers and 3 ZooKeeper nodes running as Docker containers; we have 3 channels and one orderer system channel, testchainid. Everything was working fine until Sunday, after which we started seeing errors from the ordering service during invocations. We then restarted our ZooKeeper services and restarted kafka0, kafka1, kafka2 and kafka3 in that order, leaving a 10-second gap after each restart. This is the process we have used (roughly every 3 weeks) whenever we faced such an issue.

This time, however, kafka2 and kafka1 shut down after the restart, and their logs showed this error on broker2 (and the same error on broker1): *FATAL [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=0] Exiting because log truncation is not allowed for partition testchainid-0, current leader's latest offset 96672 is less than replica's latest offset 96674 (kafka.server.ReplicaFetcherThread)*. Broker0 is the leader for two channels, ort and testchainid; the remaining channels have their leaders on the other brokers.

Also, when we stopped the kafka0 broker and then restarted kafka1, kafka2 and kafka3, kafka1 and kafka2 did not shut down. So the problem is that as soon as I restart the kafka0 broker, kafka1 and kafka2 shut down immediately.

Now, with kafka0 stopped and kafka1, kafka2 and kafka3 running, I can invoke on the other channels, but the orderer logs show this error for the ort channel: *[orderer/consensus/kafka] processMessagesToBlocks -> ERRO 3a0123d [channel: ort] Error during consumption: kafka: error while consuming ort/0: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.*

I am using docker-compose to start the ZooKeeper nodes and Kafka brokers. Below is the docker-compose configuration for one ZooKeeper node and one Kafka broker:

zookeeper0:
  container_name: zookeeper0
  image: hyperledger/fabric-zookeeper:latest
  dns_search: .
  ports:
    - 2181:2181
    - 2888:2888
    - 3888:3888
  environment:
    - ZOO_MY_ID=1
    - ZOO_SERVERS=server.1=zookeeper0:2888:3888 server.2=zookeeper1:2888:3888 server.3=zookeeper2:2888:3888
  networks:
    - fabric-ca
  volumes:
    - ./hosts/zookeeper0hosts/hosts:/etc/hosts

kafka0:
  container_name: kafka0
  image: hyperledger/fabric-kafka:latest
  dns_search: .
  environment:
    - KAFKA_MESSAGE_MAX_BYTES=103809024 # 99 * 1024 * 1024 B
    - KAFKA_REPLICA_FETCH_MAX_BYTES=103809024 # 99 * 1024 * 1024 B
    - KAFKA_UNCLEAN_LEADER_ELECTION_ENABLE=false
    - KAFKA_BROKER_ID=0
    - KAFKA_HOST_NAME=kafka0
    - KAFKA_LISTENERS=EXTERNAL://0.0.0.0:9092,REPLICATION://0.0.0.0:9093
    - KAFKA_ADVERTISED_LISTENERS=EXTERNAL://10.64.67.212:9092,REPLICATION://kafka0:9093
    - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=EXTERNAL:PLAINTEXT,REPLICATION:PLAINTEXT
    - KAFKA_INTER_BROKER_LISTENER_NAME=REPLICATION
    - KAFKA_MIN_INSYNC_REPLICAS=2
    - KAFKA_DEFAULT_REPLICATION_FACTOR=3
    - KAFKA_ZOOKEEPER_CONNECT=zookeeper0:2181,zookeeper1:2181,zookeeper2:2181
  ports:
    - 9092:9092
    - 9093:9093
  networks:
    - fabric-ca
  volumes:
    - ./hosts/kafka0hosts/hosts:/etc/hosts

Also, the Kafka broker logs after the restart are linked below:
Broker0 - [https://hastebin.com/zavocatace.sql]
Broker1 - [https://hastebin.com/latojedemu.sql]
Broker2 - [https://hastebin.com/poxudijepi.sql]
Broker3 - [https://hastebin.com/doliqohufa.sql]
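
For anyone hitting the same state, a minimal sketch of how the leader/ISR and the leader's latest offset for testchainid-0 can be inspected before deciding how to recover. This assumes the commands are run from inside one of the Kafka containers, with hostnames/ports taken from the compose file above; the exact script locations inside the hyperledger/fabric-kafka image may differ.

# Show the current leader, replicas and ISR for the orderer system channel topic
kafka-topics.sh --zookeeper zookeeper0:2181 --describe --topic testchainid

# Show the current leader's latest (log-end) offset for partition 0,
# to compare against the offsets printed in the FATAL replica-fetcher message
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list kafka0:9092 --topic testchainid --partitions 0 --time -1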

 

 

 



[jira] [Comment Edited] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"

2018-10-09 Thread Nico Meyer (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643575#comment-16643575 ]

Nico Meyer edited comment on KAFKA-3410 at 10/9/18 3:01 PM:


This problem is not fixed by KAFKA-1211, at least not for a failed disk that needs to be replaced: not if the broker with the problem becomes the only ISR just before failing and unclean leader election is disabled.

I would assume that this is not an uncommon problem. In our case it seems that a faulty disk on the leader was extremely slow just before producing hard I/O errors, which in turn blocked the fetches from the followers for longer than replica.lag.time.max.ms. The leader therefore removed both of the followers from the ISR about half a second before taking the partitions on the faulty disk offline. I think this is not desirable, since the followers were actually up to date. I believe the error is comparing the last time the follower fetched up to the LEO against the wall clock; it should instead be compared against the last time something was written to the log.

But let's say the followers were removed from the ISR for some other reason. At that point the partition is offline, and one option to proceed is to enable unclean leader election. This can lead to lost data, even if min.insync.replicas=2 is used and all producers use acks=all. :(

Another option is to replace the faulty disk and restart the broker, which leads to the shutdown of the followers described in this issue. Even worse, if unclean leader election is enabled to mitigate that problem, all the logs are truncated to offset 0. I would prefer it if the broker could not become leader for a partition after its log dir has been deleted.

I just confirmed that the same problem still exists in 2.0.0.

Properly recovering from this problem in the middle of the night without messing up is pretty hard:
 * Shut down the failed leader.
 * For each affected partition, check which broker has the highest LEO. Is there a tool for that? kafka-log-dirs.sh is a good start, but it only returns the number of bytes in the log, which is not equivalent.
 * Write a JSON file for kafka-reassign-partitions.sh that lists only the broker with the highest LEO for each partition, or at least lists it as the preferred leader. Execute it and save the original partition assignment (a rough sketch follows after this list).
 * Enable unclean leader election for all affected topics.
 * Wait for the new leaders to be elected.
 * Disable unclean leader election again.
 * Replace the disk on the failed broker.
 * Reinstate the original partition assignment saved in step 3.
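
A rough sketch of what steps 3 to 6 could look like with the stock CLI tools; the topic name, replica list and ZooKeeper address below are placeholders, and the exact flags may differ between versions:

# Step 3: put the broker with the highest LEO (assumed to be broker 2 here)
# first in the replica list of each affected partition. Save the current
# assignment beforehand (e.g. with --generate) so it can be restored in step 8.
cat > reassign.json <<'EOF'
{"version":1,
 "partitions":[{"topic":"mytopic","partition":0,"replicas":[2,1,3]}]}
EOF
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassign.json --execute

# Steps 4 to 6: temporarily allow unclean leader election on the affected topic,
# wait until a leader shows up in the --describe output, then remove the override.
kafka-configs.sh --zookeeper zk:2181 --alter --entity-type topics \
  --entity-name mytopic --add-config unclean.leader.election.enable=true
kafka-topics.sh --zookeeper zk:2181 --describe --topic mytopic
kafka-configs.sh --zookeeper zk:2181 --alter --entity-type topics \
  --entity-name mytopic --delete-config unclean.leader.election.enable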

 

 



[jira] [Comment Edited] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"

2018-04-30 Thread Ashish (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458652#comment-16458652 ]

Ashish edited comment on KAFKA-3410 at 4/30/18 4:29 PM:


We had a similar issue with version 0.10.2.1. Is there a fix for this issue in later versions?

The broker process was still running, and to fix it we had to hard-kill the broker and then start it again.

The cluster has very low volume, with 5 topics and 10 brokers in total.

This is what we saw:

ZK session expired on broker 1.
The controller resigned.
Broker 1 re-registered itself in ZK.
The ISR for all partitions on the broker shrank to just broker 1 (for topic __consumer_offsets):
2018-04-30 03:35:01,018 INFO [cluster.Partition:kafka-scheduler-0] Partition 
[__consumer_offsets,26] on broker 1: Shrinking ISR for partition 
[__consumer_offsets,26] from 9,1,5 to 1
2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition 
[__consumer_offsets,26] on broker 1: Cached zkVersion [205] not equal to that 
in zookeeper, skip updating ISR
2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition 
[__consumer_offsets,36] on broker 1: Shrinking ISR for partition 
[__consumer_offsets,36] from 10,6,1 to 1

Broker 1 said:

2018-04-30 03:35:19,164 INFO [log.Log:kafka-request-handler-4] Truncating log 
__consumer_offsets-46 to offset 146800439.

2018-04-30 03:35:20,020 FATAL 
[server.ReplicaFetcherThread:ReplicaFetcherThread-1-3] 
[ReplicaFetcherThread-1-3], Exiting because log truncation is not allowed for 
partition __consumer_offsets-46, Current leader 3's latest offset 146800434 is 
less than replica 1's latest offset 146800439

2018-04-30 03:35:20,112 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], shutting down
2018-04-30 03:35:20,123 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], 
Starting controlled shutdown
2018-04-30 03:35:26,183 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], 
Controlled shutdown succeeded
2018-04-30 03:35:26,258 INFO [network.SocketServer:Thread-3] [Socket Server on 
Broker 1], Shutting down
2018-04-30 03:35:26,802 INFO [network.SocketServer:Thread-3] [Socket Server on 
Broker 1], Shutdown completed
2018-04-30 03:35:26,813 INFO [server.KafkaRequestHandlerPool:Thread-3] [Kafka 
Request Handler on Broker 1], shutting down
2018-04-30 03:35:29,088 INFO [coordinator.GroupCoordinator:executor-Heartbeat] 
[GroupCoordinator 1]: Group TEST_TOPIC_CG with generation 106 is now empty
2018-04-30 03:35:29,121 WARN [coordinator.GroupCoordinator:executor-Heartbeat] 
[GroupCoordinator 1]: Failed to write empty metadata for group TEST_TOPIC_CG: 
This is not the correct coordinator for this group.

 

It was a controlled shutdown; however, the broker Java process was still running.

Does this suggest that the leader has an older offset than the follower?

We have:

controlled.shutdown.enable=true

unclean.leader.election.enable=false
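
In case it helps with comparing offsets, a minimal sketch of how the current leader's latest offset for the affected partition could be checked against the replica offset from the FATAL line above (the broker host/port is a placeholder):

# Latest (log-end) offset on whichever broker currently leads __consumer_offsets-46;
# compare against replica 1's offset 146800439 from the FATAL message.
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list broker3:9092 --topic __consumer_offsets --partitions 46 --time -1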

