Double post. Please keep discussion in the other thread. Cheers, Jens
On Wed, Feb 24, 2016 at 4:39 PM, Anthony Sparks <anthony.spark...@gmail.com> wrote:

> Hello,
>
> Our Kafka cluster (3 servers, each running both Zookeeper and Kafka)
> crashed, and out of the 6 processes only one Zookeeper instance remained
> alive. The logs do not indicate much; the only errors shown were:
>
> 2016-02-21T12:21:36.881+0000: 27445381.013: [GC (Allocation Failure)
> 27445381.013: [ParNew: 136472K->159K(153344K), 0.0047077 secs]
> 139578K->3265K(507264K), 0.0048552 secs] [Times: user=0.01 sys=0.00,
> real=0.01 secs]
>
> These entries appeared in both the Zookeeper and the Kafka logs, and it
> appears they have been happening every day (with no impact on Kafka,
> except for maybe now?).
>
> The crash is concerning, but not as concerning as what we are
> encountering right now: I am unable to get the cluster back up. Two of
> the three nodes halt with this fatal error:
>
> [2016-02-23 21:18:47,251] FATAL [ReplicaFetcherThread-0-0], Halting because
> log truncation is not allowed for topic audit_data, Current leader 0's
> latest offset 52844816 is less than replica 1's latest offset 52844835
> (kafka.server.ReplicaFetcherThread)
>
> The other node, which does stay alive, is unable to fulfill writes
> because we have min.ack set to 2 on the producers (requiring at least
> two nodes to be available). We could change this, but that doesn't fix
> our overall problem.
>
> Browsing the Kafka code, ReplicaFetcherThread.scala contains this little
> nugget:
>
> // Prior to truncating the follower's log, ensure that doing so is not disallowed by the configuration for unclean leader election.
> // This situation could only happen if the unclean election configuration for a topic changes while a replica is down. Otherwise,
> // we should never encounter this situation since a non-ISR leader cannot be elected if disallowed by the broker configuration.
> if (!LogConfig.fromProps(brokerConfig.toProps, AdminUtils.fetchTopicConfig(replicaMgr.zkClient,
>   topicAndPartition.topic)).uncleanLeaderElectionEnable) {
>   // Log a fatal error and shutdown the broker to ensure that data loss does not unexpectedly occur.
>   fatal("Halting because log truncation is not allowed for topic %s,".format(topicAndPartition.topic) +
>     " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
>     .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId, replica.logEndOffset.messageOffset))
>   Runtime.getRuntime.halt(1)
> }
>
> Each of our Kafka instances is set with:
>
> unclean.leader.election.enable=false
>
> and this hasn't changed at all since we deployed the cluster (verified
> by file modification timestamps). To me this indicates the assertion in
> the comment above is incorrect: we have encountered a non-ISR leader
> being elected even though the configuration disallows it.
>
> Any ideas on how to work around this?
>
> Thank you,
>
> Tony Sparks

--
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
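For readers following the thread, the follower-side decision quoted from ReplicaFetcherThread.scala can be sketched as a small Python function. This is a hypothetical simplification for illustration only — the function name, parameters, and string return values are not Kafka's actual API; they just model the branch that produced the FATAL line above:

```python
# Illustrative model of the follower's out-of-range handling quoted
# above from kafka.server.ReplicaFetcherThread (Scala). Names and
# return values are hypothetical, not Kafka's real API.

def handle_out_of_range(leader_end_offset: int,
                        replica_end_offset: int,
                        unclean_election_enabled: bool) -> str:
    """Decide what a follower does when its log extends past the leader's."""
    if replica_end_offset <= leader_end_offset:
        # Follower is not ahead of the leader; normal fetching continues.
        return "fetch"
    if not unclean_election_enabled:
        # Truncating here would silently drop data the follower considers
        # committed, so the broker halts instead (Runtime.getRuntime.halt(1)
        # in the quoted Scala).
        return "halt"
    # Unclean election is allowed: drop the divergent tail and resync.
    return "truncate"

# The scenario from the log line above: leader 0 at offset 52844816,
# replica 1 at offset 52844835, unclean.leader.election.enable=false.
print(handle_out_of_range(52844816, 52844835, False))  # -> halt
```

Under this model, the only paths out of the halt are making the leader's log catch up past the replica's offset or accepting truncation by permitting unclean election for the topic — which matches why the reporter's cluster cannot restart while the flag stays false.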