hello Fredo,

My guess is that a partition leader election was triggered, possibly by the
detection of an offline partition in the Kafka cluster. The "soft failure" in
the log usually means the controller's ZooKeeper session expired (a long GC
pause is the classic cause), so another broker was elected controller with a
higher epoch; broker 1008 then tried to run a leader election that the new
controller had already completed, hence this log:

> [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 initiated state
> change for partition [consup-0000000043,78] from OnlinePartition to
> OnlinePartition failed (state.change.logger)
> kafka.common.StateChangeFailedException: encountered error while electing
> leader for partition [consup-0000000043,78] due to: aborted leader election
> for partition [consup-0000000043,78]
> since the LeaderAndIsr path was already written by another controller. This
> probably means that the current controller 1008 went through a soft failure
> and another controller was elected with epoch 14..
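
Since this happens weekly, if the root cause turns out to be ZooKeeper
session expiry from long GC pauses, a common mitigation (a sketch only; the
timeout value below is an assumption, tune it for your environment) is to
give the session more headroom on each broker and to review the broker GC
logs:

  # server.properties on each broker
  # default is 6000 ms in the 0.8.x/0.9.x era; a larger value makes
  # controller fail-over less sensitive to GC pauses, at the cost of
  # slower detection of real broker failures
  zookeeper.session.timeout.ms=15000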

The questions are:
- Was there any offline partition?
- Was there more than one active controller at the time? (see the commands below to check both)
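
A quick way to answer both, assuming ZooKeeper is reachable at zk:2181
(replace with your actual connect string):

  # partitions that currently have no live leader (offline)
  bin/kafka-topics.sh --zookeeper zk:2181 --describe --unavailable-partitions

  # which broker holds the controller role right now, and the current epoch
  bin/zookeeper-shell.sh zk:2181 get /controller
  bin/zookeeper-shell.sh zk:2181 get /controller_epoch

The ActiveControllerCount JMX metric
(kafka.controller:type=KafkaController,name=ActiveControllerCount) should sum
to exactly 1 across all brokers; more than 1 points at the stale-controller
situation described above.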

CMIIW


On Mon, May 30, 2016 at 2:41 PM, Fredo Lee <buaatianwa...@gmail.com> wrote:

> my server.log >>>>>>>>>>>>>>>>
>
> lots of error messages:
>
> [2016-05-28 12:12:31,132] WARN [ReplicaFetcherThread-0-1007], Replica 1008
> for partition [consup-0000000003,16] reset its fetch offset from 13985537
> to current leader 1007's latest offset 13985537
> (kafka.server.ReplicaFetcherThread)
> [2016-05-28 12:12:31,132] ERROR [ReplicaFetcherThread-0-1007], Current
> offset 13987676 for partition [consup-0000000003,16] out of range; reset
> offset to 13985537 (kafka.server.ReplicaFetcherThread)
>
>
> the other error message:
>
> [2016-05-28 12:12:31,708] ERROR [Replica Manager on Broker 1008]: Error
> when processing fetch request for partition [consup-0000000003,35] offset
> 13848954 from consumer with correlation id 0. Possible cause: Request for
> offset 13848954 but we only have log segments in the range 12356946 to
> 13847167. (kafka.server.ReplicaManager)
>
>
>
> 2016-05-30 15:37 GMT+08:00 Fredo Lee <buaatianwa...@gmail.com>:
>
> > My state-change.log>>>>>>>>>>>
> >
> >
> > [2016-05-28 12:33:31,510] ERROR Controller 1008 epoch 13 aborted leader
> > election for partition [consup-0000000043,78] since the LeaderAndIsr path
> > was already written by another controller. This probably means that the
> > current controller 1008 went through a soft failure and another controller
> > was elected with epoch 14. (state.change.logger)
> > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(1020,-1) for a request sent to broker id:1009,host:
> > consup-kafka22.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(1066,-1) for a request sent to broker id:1001,host:
> > consup-kafka20.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(777,-1) for a request sent to broker id:1006,host:
> > consup-kafka11.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(742,-1) for a request sent to broker id:1018,host:
> > consup-kafka10.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(1021,-1) for a request sent to broker id:1009,host:
> > consup-kafka22.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(1067,-1) for a request sent to broker id:1001,host:
> > consup-kafka20.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(778,-1) for a request sent to broker id:1006,host:
> > consup-kafka11.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(1022,-1) for a request sent to broker id:1009,host:
> > consup-kafka22.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 encountered error
> > while electing leader for partition [consup-0000000043,78] due to: aborted
> > leader election for partition [consup-0000000043,78] since the LeaderAndIsr
> > path was already written by another controller. This probably means that
> > the current controller 1008 went through a soft failure and another
> > controller was elected with epoch 14.. (state.change.logger)
> > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(743,-1) for a request sent to broker id:1018,host:
> > consup-kafka10.com,port:9092 (state.change.logger)
> > [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 initiated state
> > change for partition [consup-0000000043,78] from OnlinePartition to
> > OnlinePartition failed (state.change.logger)
> > kafka.common.StateChangeFailedException: encountered error while electing
> > leader for partition [consup-0000000043,78] due to: aborted leader election
> > for partition [consup-0000000043,78] since the LeaderAndIsr path was
> > already written by another controller. This probably means that the current
> > controller 1008 went through a soft failure and another controller was
> > elected with epoch 14..
> >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:380)
> >         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:208)
> >         at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:146)
> >         at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:145)
> >         at scala.collection.immutable.Set$Set1.foreach(Set.scala:79)
> >         at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:145)
> >         at kafka.controller.KafkaController.onPreferredReplicaElection(KafkaController.scala:631)
> >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply$mcV$sp(KafkaController.scala:1158)
> >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply(KafkaController.scala:1153)
> >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply(KafkaController.scala:1153)
> >         at kafka.utils.Utils$.inLock(Utils.scala:535)
> >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17.apply(KafkaController.scala:1150)
> >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17.apply(KafkaController.scala:1148)
> >         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4.apply(KafkaController.scala:1148)
> >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4.apply(KafkaController.scala:1127)
> >         at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
> >         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> >         at kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance(KafkaController.scala:1127)
> >         at kafka.controller.KafkaController$$anonfun$onControllerFailover$1.apply$mcV$sp(KafkaController.scala:326)
> >         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:99)
> >         at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
> >         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> >         at java.lang.Thread.run(Thread.java:679)
> > Caused by: kafka.common.StateChangeFailedException: aborted leader
> > election for partition [consup-0000000043,78] since the LeaderAndIsr path
> > was already written by another controller. This probably means that the
> > current controller 1008 went through a soft failure and another controller
> > was elected with epoch 14.
> >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:354)
> >         ... 33 more
> > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response
> > UpdateMetadataResponse(744,-1) for a request sent to broker id:1018,host:
> > consup-kafka10.com,port:9092 (state.change.logger)
> >
> >
> > 2016-05-28 15:31 GMT+08:00 Muqtafi Akhmad <muqt...@traveloka.com>:
> >
> >> hello Fredo,
> >>
> >> Can you elaborate on the 'soft' failure?
> >>
> >>
> >>
> >> On Sat, May 28, 2016 at 1:53 PM, Fredo Lee <buaatianwa...@gmail.com>
> >> wrote:
> >>
> >> > We have a Kafka cluster with 19 nodes. Every week we suffer a soft
> >> > failure in this cluster. How can we resolve this problem?
> >> >
> >>
> >>
> >>
> >> --
> >> Muqtafi Akhmad
> >> Software Engineer
> >> Traveloka
> >>
> >
> >
>



-- 
Muqtafi Akhmad
Software Engineer
Traveloka
