Double post. Please keep discussion in the other thread. Cheers, Jens
On Wed, Feb 24, 2016 at 4:39 PM, Anthony Sparks <anthony.spark...@gmail.com> wrote:

> Hello,
>
> Our Kafka cluster (3 servers, each running both Zookeeper and Kafka)
> crashed, and out of the 6 processes only one Zookeeper instance remained
> alive. The logs do not indicate much; the only errors shown were:
>
> 2016-02-21T12:21:36.881+0000: 27445381.013: [GC (Allocation Failure)
> 27445381.013: [ParNew: 136472K->159K(153344K), 0.0047077 secs]
> 139578K->3265K(507264K), 0.0048552 secs] [Times: user=0.01 sys=0.00,
> real=0.01 secs]
>
> These entries appeared in both the Zookeeper and the Kafka logs, and it
> appears they have been happening every day (with no impact on Kafka,
> except for maybe now?).
>
> The crash is concerning, but not as concerning as what we are
> encountering right now: I am unable to get the cluster back up. Two of
> the three nodes halt with this fatal error:
>
> [2016-02-23 21:18:47,251] FATAL [ReplicaFetcherThread-0-0], Halting because
> log truncation is not allowed for topic audit_data, Current leader 0's
> latest offset 52844816 is less than replica 1's latest offset 52844835
> (kafka.server.ReplicaFetcherThread)
>
> The other node, which does stay alive, is unable to fulfill writes
> because we have min.ack set to 2 on the producers (requiring at least
> two nodes to be available). We could change this, but that doesn't fix
> our overall problem.
>
> Browsing the Kafka code, ReplicaFetcherThread.scala contains this little
> nugget:
>
> // Prior to truncating the follower's log, ensure that doing so is not disallowed by the configuration for unclean leader election.
> // This situation could only happen if the unclean election configuration for a topic changes while a replica is down. Otherwise,
> // we should never encounter this situation since a non-ISR leader cannot be elected if disallowed by the broker configuration.
> if (!LogConfig.fromProps(brokerConfig.toProps, AdminUtils.fetchTopicConfig(replicaMgr.zkClient,
>   topicAndPartition.topic)).uncleanLeaderElectionEnable) {
>   // Log a fatal error and shutdown the broker to ensure that data loss does not unexpectedly occur.
>   fatal("Halting because log truncation is not allowed for topic %s,".format(topicAndPartition.topic) +
>     " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
>     .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId, replica.logEndOffset.messageOffset))
>   Runtime.getRuntime.halt(1)
> }
>
> Each of our Kafka instances is set with:
>
> unclean.leader.election.enable=false
>
> and this hasn't changed at all since we deployed the cluster (verified
> by file modification timestamps). To me this indicates the assertion in
> the comment above is incorrect: we have encountered a non-ISR leader
> being elected even though the configuration disallows it.
>
> Any ideas on how to work around this?
>
> Thank you,
>
> Tony Sparks

--
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
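For readers following the thread, the follower-side decision quoted from ReplicaFetcherThread.scala can be sketched as a small Python function. This is a hypothetical simplification for illustration only — the function name, parameters, and string return values are not Kafka's actual API; they just model the branch that produced the FATAL line above:

```python
# Illustrative model of the follower's out-of-range handling quoted
# above from kafka.server.ReplicaFetcherThread (Scala). Names and
# return values are hypothetical, not Kafka's real API.

def handle_out_of_range(leader_end_offset: int,
                        replica_end_offset: int,
                        unclean_election_enabled: bool) -> str:
    """Decide what a follower does when its log extends past the leader's."""
    if replica_end_offset <= leader_end_offset:
        # Follower is not ahead of the leader; normal fetching continues.
        return "fetch"
    if not unclean_election_enabled:
        # Truncating here would silently drop data the follower considers
        # committed, so the broker halts instead (Runtime.getRuntime.halt(1)
        # in the quoted Scala).
        return "halt"
    # Unclean election is allowed: drop the divergent tail and resync.
    return "truncate"

# The scenario from the log line above: leader 0 at offset 52844816,
# replica 1 at offset 52844835, unclean.leader.election.enable=false.
print(handle_out_of_range(52844816, 52844835, False))  # -> halt
```

Under this model, the only paths out of the halt are making the leader's log catch up past the replica's offset or accepting truncation by permitting unclean election for the topic — which matches why the reporter's cluster cannot restart while the flag stays false.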