Is this cluster on 0.8? There are a few known bugs in 0.8 that can lead to this
situation. I'd recommend upgrading to 0.9 as soon as is viable to prevent
this and many other kinds of issues that were fixed in 0.9.
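
In the meantime, one way to see the split-brain as it happens is to compare the controller epoch a broker is still logging (13 in your state-change.log) against what is actually registered in ZooKeeper. This is only a minimal sketch, not something from your cluster: it assumes the ZooKeeper Java client is on the classpath and uses "zk1:2181" as a placeholder connect string.

import org.apache.zookeeper.ZooKeeper;

public class WhoIsController {
    public static void main(String[] args) throws Exception {
        // "zk1:2181" is a placeholder; use your cluster's ZooKeeper connect string
        ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, event -> { });
        // /controller holds JSON naming the broker id that currently owns the controller role
        String controller = new String(zk.getData("/controller", false, null), "UTF-8");
        // /controller_epoch holds the current controller epoch as a plain integer
        String epoch = new String(zk.getData("/controller_epoch", false, null), "UTF-8");
        System.out.println("controller znode:  " + controller);
        System.out.println("controller_epoch:  " + epoch);
        zk.close();
    }
}

If the epoch stored in ZooKeeper is higher than the epoch a broker is still logging, that broker is acting as a stale controller left over from the soft failure, and restarting that broker is typically how people recover in the short term.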

Thanks

Tom Crayford
Heroku Kafka

On Tue, May 31, 2016 at 6:19 AM, Fredo Lee <buaatianwa...@gmail.com> wrote:

> Thanks for your reply.
>
> Yes, there is more than one controller. The "soft failure" message is reported by the old controller.
>
> 2016-05-31 11:42 GMT+08:00 Muqtafi Akhmad <muqt...@traveloka.com>:
>
> > hello Fredo,
> >
> > My guess is that there was a partition leader election, possibly triggered by the detection of an offline partition in the Kafka cluster. Somehow the broker tried to trigger a leader election while the previous election had already completed, hence this log (a small sketch for inspecting the partition's LeaderAndIsr znode in ZooKeeper follows the excerpt):
> >
> > > [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 initiated state change for partition [consup-0000000043,78] from OnlinePartition to OnlinePartition failed (state.change.logger)
> > > kafka.common.StateChangeFailedException: encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14..
> >
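> > (For reference, the "LeaderAndIsr path" in that message is the per-partition state znode in ZooKeeper, /brokers/topics/<topic>/partitions/<n>/state, which also records the controller epoch that last wrote it. A minimal sketch for peeking at it with the ZooKeeper Java client; "zk1:2181" is a placeholder connect string and the topic/partition are the ones from your log:)
> >
> > import org.apache.zookeeper.ZooKeeper;
> >
> > public class PartitionStateCheck {
> >     public static void main(String[] args) throws Exception {
> >         ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, event -> { });
> >         // JSON with leader, leader_epoch, isr and the controller_epoch that last wrote it
> >         String state = new String(zk.getData(
> >                 "/brokers/topics/consup-0000000043/partitions/78/state", false, null), "UTF-8");
> >         System.out.println(state);
> >         zk.close();
> >     }
> > }
> >
> > If the controller_epoch recorded there is already 14 while a broker still logs epoch 13, that broker's controller is the stale one.
> >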
> > The questions are (a rough JMX check for both is sketched below):
> > - was there any offline partition?
> > - was there more than one active controller?
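> >
> > A rough way to check both at once is the controller gauges each broker exposes over JMX. This is only a sketch, not from the original thread: it assumes JMX is enabled on the broker, uses "broker1:9999" as a placeholder host/port, and the MBean names below are the ones recent brokers use (older 0.8.x releases name them slightly differently).
> >
> > import javax.management.MBeanServerConnection;
> > import javax.management.ObjectName;
> > import javax.management.remote.JMXConnector;
> > import javax.management.remote.JMXConnectorFactory;
> > import javax.management.remote.JMXServiceURL;
> >
> > public class BrokerControllerCheck {
> >     public static void main(String[] args) throws Exception {
> >         // "broker1:9999" is a placeholder; point it at a broker with JMX enabled
> >         JMXServiceURL url = new JMXServiceURL(
> >                 "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
> >         JMXConnector connector = JMXConnectorFactory.connect(url);
> >         MBeanServerConnection mbs = connector.getMBeanServerConnection();
> >         // 1 on the broker that currently believes it is the controller, 0 elsewhere;
> >         // summed across all brokers this should be exactly 1
> >         Object active = mbs.getAttribute(new ObjectName(
> >                 "kafka.controller:type=KafkaController,name=ActiveControllerCount"), "Value");
> >         // number of partitions that currently have no live leader
> >         Object offline = mbs.getAttribute(new ObjectName(
> >                 "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"), "Value");
> >         System.out.println("ActiveControllerCount=" + active
> >                 + ", OfflinePartitionsCount=" + offline);
> >         connector.close();
> >     }
> > }
> >
> > Run it against every broker: if the ActiveControllerCount values sum to more than 1, more than one broker thinks it is the controller.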
> >
> > CMIIW
> >
> >
> > On Mon, May 30, 2016 at 2:41 PM, Fredo Lee <buaatianwa...@gmail.com> wrote:
> >
> > > my server.log >>>>>>>>>>>>>>>>
> > >
> > > Lots of error messages like these:
> > >
> > > [2016-05-28 12:12:31,132] WARN [ReplicaFetcherThread-0-1007], Replica 1008 for partition [consup-0000000003,16] reset its fetch offset from 13985537 to current leader 1007's latest offset 13985537 (kafka.server.ReplicaFetcherThread)
> > > [2016-05-28 12:12:31,132] ERROR [ReplicaFetcherThread-0-1007], Current offset 13987676 for partition [consup-0000000003,16] out of range; reset offset to 13985537 (kafka.server.ReplicaFetcherThread)
> > >
> > >
> > > The other error message:
> > >
> > > [2016-05-28 12:12:31,708] ERROR [Replica Manager on Broker 1008]: Error when processing fetch request for partition [consup-0000000003,35] offset 13848954 from consumer with correlation id 0. Possible cause: Request for offset 13848954 but we only have log segments in the range 12356946 to 13847167. (kafka.server.ReplicaManager)
> > >
> > >
> > >
> > > 2016-05-30 15:37 GMT+08:00 Fredo Lee <buaatianwa...@gmail.com>:
> > >
> > > > My state-change.log>>>>>>>>>>>
> > > >
> > > >
> > > > [2016-05-28 12:33:31,510] ERROR Controller 1008 epoch 13 aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14. (state.change.logger)
> > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1020,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1066,-1) for a request sent to broker id:1001,host:consup-kafka20.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(777,-1) for a request sent to broker id:1006,host:consup-kafka11.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(742,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1021,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1067,-1) for a request sent to broker id:1001,host:consup-kafka20.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(778,-1) for a request sent to broker id:1006,host:consup-kafka11.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1022,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14.. (state.change.logger)
> > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(743,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
> > > > [2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 initiated state change for partition [consup-0000000043,78] from OnlinePartition to OnlinePartition failed (state.change.logger)
> > > > kafka.common.StateChangeFailedException: encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14..
> > > >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:380)
> > > >         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:208)
> > > >         at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:146)
> > > >         at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:145)
> > > >         at scala.collection.immutable.Set$Set1.foreach(Set.scala:79)
> > > >         at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:145)
> > > >         at kafka.controller.KafkaController.onPreferredReplicaElection(KafkaController.scala:631)
> > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply$mcV$sp(KafkaController.scala:1158)
> > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply(KafkaController.scala:1153)
> > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply(KafkaController.scala:1153)
> > > >         at kafka.utils.Utils$.inLock(Utils.scala:535)
> > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17.apply(KafkaController.scala:1150)
> > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17.apply(KafkaController.scala:1148)
> > > >         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4.apply(KafkaController.scala:1148)
> > > >         at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4.apply(KafkaController.scala:1127)
> > > >         at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
> > > >         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> > > >         at kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance(KafkaController.scala:1127)
> > > >         at kafka.controller.KafkaController$$anonfun$onControllerFailover$1.apply$mcV$sp(KafkaController.scala:326)
> > > >         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:99)
> > > >         at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
> > > >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > > >         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
> > > >         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
> > > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
> > > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
> > > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > > >         at java.lang.Thread.run(Thread.java:679)
> > > > Caused by: kafka.common.StateChangeFailedException: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14.
> > > >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:354)
> > > >         ... 33 more
> > > > [2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(744,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
> > > >
> > > >
> > > > 2016-05-28 15:31 GMT+08:00 Muqtafi Akhmad <muqt...@traveloka.com>:
> > > >
> > > >> hello Fredo,
> > > >>
> > > >> Can you elaborate on the 'soft' failure?
> > > >>
> > > >>
> > > >>
> > > >> On Sat, May 28, 2016 at 1:53 PM, Fredo Lee <buaatianwa...@gmail.com> wrote:
> > > >>
> > > >> > We have a Kafka cluster with 19 nodes. Every week we suffer a soft failure in this cluster. How can we resolve this problem?
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Muqtafi Akhmad
> > > >> Software Engineer
> > > >> Traveloka
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Muqtafi Akhmad
> > Software Engineer
> > Traveloka
> >
>
