Hi, we're having an outage in our production Kafka cluster and are getting 
desperate. Any help would be appreciated.

On 3/14 our consumer (a Storm spout) started getting messages from only 20 out 
of 40 partitions on a topic. We only noticed yesterday. Restarting the consumer 
with a new consumer group does not fix the problem.

We just found some errors in the Kafka state change log which look like they 
may be related. The example below is for one of the affected partitions, and 
the timestamp lines up with when the problem started. It seems to be related to 
KAFKA-3963. What can we do to mitigate this and prevent it from happening again?

kafka.common.NoReplicaOnlineException: No replica for partition 
[transcription-results,9] is alive. Live brokers are: [Set()], Assigned 
replicas are: [List(1, 4, 0)]
[2018-03-14 03:11:40,863] TRACE Controller 0 epoch 44 changed state of replica 
1 for partition [transcription-results,9] from OnlineReplica to OfflineReplica 
(state.change.logger)
[2018-03-14 03:11:41,141] TRACE Controller 0 epoch 44 sending become-follower 
LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to 
broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,145] TRACE Controller 0 epoch 44 sending become-follower 
LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to 
broker 0 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,208] TRACE Controller 0 epoch 44 changed state of replica 
4 for partition [transcription-results,9] from OnlineReplica to OnlineReplica 
(state.change.logger)
[2018-03-14 03:11:41,218] TRACE Controller 0 epoch 44 changed state of replica 
1 for partition [transcription-results,9] from OfflineReplica to OnlineReplica 
(state.change.logger)
[2018-03-14 03:11:41,226] TRACE Controller 0 epoch 44 sending become-follower 
LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to 
broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,230] TRACE Controller 0 epoch 44 sending become-follower 
LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to 
broker 1 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,450] TRACE Broker 0 received LeaderAndIsr request 
(LeaderAndIsrInfo:(Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0)
 correlation id 158 from controller 0 epoch 44 for partition 
[transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,454] TRACE Broker 0 handling LeaderAndIsr request 
correlationId 158 from controller 0 epoch 44 starting the become-follower 
transition for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,455] ERROR Broker 0 received LeaderAndIsrRequest with 
correlation id 158 from controller 0 epoch 44 for partition 
[transcription-results,9] but cannot become follower since the new leader -1 is 
unavailable. (state.change.logger)
[2018-03-14 03:11:41,459] TRACE Broker 0 completed LeaderAndIsr request 
correlationId 158 from controller 0 epoch 44 for the become-follower transition 
for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,682] TRACE Controller 0 epoch 44 started leader election 
for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,687] TRACE Controller 0 epoch 44 elected leader 4 for 
Offline partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,689] TRACE Controller 0 epoch 44 changed partition 
[transcription-results,9] from OfflinePartition to OnlinePartition with leader 
4 (state.change.logger)
[2018-03-14 03:11:41,825] TRACE Controller 0 epoch 44 sending become-leader 
LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:443,ControllerEpoch:44) to 
broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,826] TRACE Controller 0 epoch 44 sending become-follower 
LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:443,ControllerEpoch:44) to 
broker 1 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,899] TRACE Broker 0 cached leader info 
(LeaderAndIsrInfo:(Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0)
 for partition [transcription-results,9] in response to UpdateMetadata request 
sent by controller 1 epoch 47 with correlation id 0 (state.change.logger)
[2018-03-14 03:11:41,906] TRACE Broker 0 received LeaderAndIsr request 
(LeaderAndIsrInfo:(Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0)
 correlation id 1 from controller 1 epoch 47 for partition 
[transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,908] WARN Broker 0 ignoring LeaderAndIsr request from 
controller 1 with correlation id 1 epoch 47 for partition 
[transcription-results,9] since its associated leader epoch 441 is old. Current 
leader epoch is 441 (state.change.logger)
[2018-03-14 03:11:41,982] TRACE Broker 0 cached leader info 
(LeaderAndIsrInfo:(Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0)
 for partition [transcription-results,9] in response to UpdateMetadata request 
sent by controller 1 epoch 47 with correlation id 2 (state.change.logger)
[2018-03-22 14:43:36,098] TRACE Broker 0 received LeaderAndIsr request 
(LeaderAndIsrInfo:(Leader:-1,ISR:,LeaderEpoch:444,ControllerEpoch:47),ReplicationFactor:3),AllReplicas:1,4,0)
 correlation id 679 from controller 1 epoch 47 for partition 
[transcription-results,9] (state.change.logger)
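
For anyone who wants to check their own brokers for the same symptom, here is a rough sketch of how the state change log could be scanned for partitions whose most recent "received LeaderAndIsr request" entry names leader -1. The function name and the regular expression are assumptions based only on the line format in the excerpt above, not on anything in Kafka itself, and "most recent" simply means last in file order.

```python
import re

# Pattern inferred from the sample log lines above (an assumption, not a Kafka API).
# It captures leader, ISR, leader epoch, topic and partition from entries of the
# form "Broker N received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:...".
# \s* and DOTALL tolerate entries that were wrapped across lines, as in the email.
PATTERN = re.compile(
    r"received LeaderAndIsr request\s*"
    r"\(LeaderAndIsrInfo:\(Leader:(-?\d+),ISR:([\d,]*),LeaderEpoch:(\d+),"
    r".*?for partition\s*\[([^,\]]+),(\d+)\]",
    re.DOTALL,
)


def leaderless_partitions(log_text):
    """Return {(topic, partition): leader_epoch} for every partition whose
    last matching entry in log_text has leader -1."""
    latest = {}
    for match in PATTERN.finditer(log_text):
        leader, _isr, epoch, topic, partition = match.groups()
        # Later entries overwrite earlier ones, so file order decides.
        latest[(topic, int(partition))] = (int(leader), int(epoch))
    return {
        key: epoch
        for key, (leader, epoch) in latest.items()
        if leader == -1
    }
```

Usage would be along the lines of `leaderless_partitions(open(path).read())`, where `path` points at the broker's state change log (the exact location depends on the installation). This only reports what one broker was told; it does not say anything about the controller's current view.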

