[ https://issues.apache.org/jira/browse/KAFKA-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394103#comment-16394103 ]
Dong Lin edited comment on KAFKA-3978 at 3/10/18 8:35 AM:
----------------------------------------------------------

Finally confirmed the root cause and was able to reproduce it with a test:

1) Partition P1 has replica set size 1. Broker A is the leader. The segment is empty and the log start offset is 100.

2) The user executes a partition reassignment to change the replica set from {A} to {B, C}.

3) Broker B starts ReplicaFetcherThread, which triggers handleOffsetOutOfRange(), truncates the log fully, and starts at offset 100. At this moment its high watermark (HW) is still 0 (or -1). The same holds for broker C.

4) Broker B sends a FetchRequest to A at offset 100, broker A immediately adds broker B to the ISR set, and the controller moves leadership to broker B.

5) Broker B handles the LeaderAndIsrRequest to become leader. It calls `leaderReplica.convertHWToLocalOffsetMetadata()` to initialize its HW. Since its HW was smaller than logStartOffset=100, its HW is now overridden to LogOffsetMetadata.UnknownOffsetMetadata, i.e. -1.

6) Broker C handles the LeaderAndIsrRequest to fetch from broker B. Broker C updates its HW to the HW returned in broker B's FetchResponse, i.e. -1. Then broker C calls replica.maybeIncrementLogStartOffset(leaderLogStartOffset) where leaderLogStartOffset=100. This causes an exception because leaderLogStartOffset > HW.

> Cannot truncate to a negative offset (-1) exception at broker startup
> ---------------------------------------------------------------------
>
> Key: KAFKA-3978
> URL: https://issues.apache.org/jira/browse/KAFKA-3978
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.10.0.0
> Environment: 3.13.0-87-generic
> Reporter: Juho Mäkinen
> Assignee: Dong Lin
> Priority: Critical
> Labels: reliability, startup
>
> During broker startup sequence the broker server.log has this exception.
> Problem persists after multiple restarts and also on another broker in the
> cluster.
> {code}
> INFO [Socket Server on Broker 1002], Started 1 acceptor threads (kafka.network.SocketServer)
> INFO [Socket Server on Broker 1002], Started 1 acceptor threads (kafka.network.SocketServer)
> INFO [ExpirationReaper-1002], Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> INFO [ExpirationReaper-1002], Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> INFO [ExpirationReaper-1002], Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> INFO [ExpirationReaper-1002], Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> INFO [ExpirationReaper-1002], Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> INFO [ExpirationReaper-1002], Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> INFO [ExpirationReaper-1002], Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> INFO [ExpirationReaper-1002], Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> INFO [GroupCoordinator 1002]: Starting up. (kafka.coordinator.GroupCoordinator)
> INFO [GroupCoordinator 1002]: Starting up. (kafka.coordinator.GroupCoordinator)
> INFO [GroupCoordinator 1002]: Startup complete. (kafka.coordinator.GroupCoordinator)
> INFO [GroupCoordinator 1002]: Startup complete. (kafka.coordinator.GroupCoordinator)
> INFO [Group Metadata Manager on Broker 1002]: Removed 0 expired offsets in 9 milliseconds. (kafka.coordinator.GroupMetadataManager)
> INFO [Group Metadata Manager on Broker 1002]: Removed 0 expired offsets in 9 milliseconds. (kafka.coordinator.GroupMetadataManager)
> INFO [ThrottledRequestReaper-Produce], Starting (kafka.server.ClientQuotaManager$ThrottledRequestReaper)
> INFO [ThrottledRequestReaper-Produce], Starting (kafka.server.ClientQuotaManager$ThrottledRequestReaper)
> INFO [ThrottledRequestReaper-Fetch], Starting (kafka.server.ClientQuotaManager$ThrottledRequestReaper)
> INFO [ThrottledRequestReaper-Fetch], Starting (kafka.server.ClientQuotaManager$ThrottledRequestReaper)
> INFO Will not load MX4J, mx4j-tools.jar is not in the classpath (kafka.utils.Mx4jLoader$)
> INFO Will not load MX4J, mx4j-tools.jar is not in the classpath (kafka.utils.Mx4jLoader$)
> INFO Creating /brokers/ids/1002 (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
> INFO Creating /brokers/ids/1002 (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
> INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
> INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
> INFO Registered broker 1002 at path /brokers/ids/1002 with addresses: PLAINTEXT -> EndPoint(172.16.2.22,9092,PLAINTEXT) (kafka.utils.ZkUtils)
> INFO Registered broker 1002 at path /brokers/ids/1002 with addresses: PLAINTEXT -> EndPoint(172.16.2.22,9092,PLAINTEXT) (kafka.utils.ZkUtils)
> INFO Kafka version : 0.10.0.0 (org.apache.kafka.common.utils.AppInfoParser)
> INFO Kafka commitId : b8642491e78c5a13 (org.apache.kafka.common.utils.AppInfoParser)
> INFO [Kafka Server 1002], started (kafka.server.KafkaServer)
> INFO [Kafka Server 1002], started (kafka.server.KafkaServer)
> Error when handling request {controller_id=1004,controller_epoch=1,partition_states=[..REALLY LONG OUTPUT SNIPPED AWAY..], live_leaders=[{id=1004,host=172.16.6.187,port=9092},{id=1003,host=172.16.2.21,port=9092}]} (kafka.server.KafkaApis)
> ERROR java.lang.IllegalArgumentException: Cannot truncate to a negative offset (-1).
>     at kafka.log.Log.truncateTo(Log.scala:731)
>     at kafka.log.LogManager$$anonfun$truncateTo$2.apply(LogManager.scala:288)
>     at kafka.log.LogManager$$anonfun$truncateTo$2.apply(LogManager.scala:280)
>     at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>     at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
>     at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>     at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>     at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>     at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>     at kafka.log.LogManager.truncateTo(LogManager.scala:280)
>     at kafka.server.ReplicaManager.makeFollowers(ReplicaManager.scala:802)
>     at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:648)
>     at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:144)
>     at kafka.server.KafkaApis.handle(KafkaApis.scala:80)
>     at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>     at java.lang.Thread.run(Thread.java:745)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
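
For readers following the root-cause comment above, steps 5 and 6 can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kafka's code: the class `HwRaceSketch`, the `ToyReplica` type, the helper `followerFails`, and the use of `IllegalStateException` are all hypothetical names chosen for illustration. Only the two checks (HW reset to -1 when it is below the log start offset, and a failure when the leader's log start offset exceeds the follower's HW) are taken from the comment.

```java
// Toy model of steps 5 and 6 from the comment; NOT Kafka's actual classes.
final class HwRaceSketch {
    // Stands in for LogOffsetMetadata.UnknownOffsetMetadata (offset -1).
    static final long UNKNOWN_OFFSET = -1L;

    // Hypothetical, simplified replica state: just two offsets.
    static final class ToyReplica {
        long highWatermark;
        long logStartOffset;

        ToyReplica(long highWatermark, long logStartOffset) {
            this.highWatermark = highWatermark;
            this.logStartOffset = logStartOffset;
        }

        // Step 5: the new leader initializes its HW. An HW below the
        // log start offset is replaced by the unknown offset (-1).
        void convertHwToLocalOffsetMetadata() {
            if (highWatermark < logStartOffset) {
                highWatermark = UNKNOWN_OFFSET;
            }
        }

        // Step 6: the follower advances its log start offset to the leader's.
        // The exception type is an assumption made for this sketch.
        void maybeIncrementLogStartOffset(long newLogStartOffset) {
            if (newLogStartOffset > highWatermark) {
                throw new IllegalStateException("log start offset "
                        + newLogStartOffset + " is above HW " + highWatermark);
            }
            if (newLogStartOffset > logStartOffset) {
                logStartOffset = newLogStartOffset;
            }
        }
    }

    // Returns true if the follower (broker C) hits the exception after the
    // new leader (broker B) has initialized its HW.
    static boolean followerFails(long leaderHw, long leaderLogStartOffset) {
        ToyReplica leaderB = new ToyReplica(leaderHw, leaderLogStartOffset);
        leaderB.convertHwToLocalOffsetMetadata();

        ToyReplica followerC = new ToyReplica(0L, leaderLogStartOffset);
        followerC.highWatermark = leaderB.highWatermark; // HW learned from the leader
        try {
            followerC.maybeIncrementLogStartOffset(leaderB.logStartOffset);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Scenario from the comment: HW 0, log start offset 100.
        System.out.println(followerFails(0L, 100L));
        // Control case: HW already equal to the log start offset.
        System.out.println(followerFails(100L, 100L));
    }
}
```

In this model the first call reports a failure because the leader's HW is reset to -1 and then propagated to the follower, while the control case passes because the HW is never below the log start offset.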