[ https://issues.apache.org/jira/browse/KAFKA-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koelli Mungee updated KAFKA-7022:
---------------------------------
    Summary: Setting a very small segment.bytes can cause ReplicaFetcherThread to crash and in turn an unhealthy cluster due to under-replicated partitions  (was: Setting a very small segment.bytes can cause ReplicaFetcher threads to crash and in turn an unhealthy cluster due to under-replicated partitions)

> Setting a very small segment.bytes can cause ReplicaFetcherThread to crash and in turn an unhealthy cluster due to under-replicated partitions
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7022
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7022
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.0.1
>            Reporter: Koelli Mungee
>            Assignee: Rajini Sivaram
>            Priority: Major
>
> The topic configuration segment.bytes was changed to 14 using the alter command. This resulted in ReplicaFetcher threads dying with the following exception:
> {code:java}
> [2018-06-07 21:02:15,669] ERROR [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> kafka.common.KafkaException: Error processing data for partition test-11 offset 2362
>     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
>     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
>     at scala.Option.foreach(Option.scala:257)
>     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
>     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
>     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
>     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
>     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: kafka.common.KafkaException: Trying to roll a new log segment for topic partition test-11 with start offset 2362 while it already exists.
>     at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1349)
>     at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1316)
>     at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
>     at kafka.log.Log.roll(Log.scala:1316)
>     at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1303)
>     at kafka.log.Log$$anonfun$append$2.apply(Log.scala:726)
>     at kafka.log.Log$$anonfun$append$2.apply(Log.scala:640)
>     at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
>     at kafka.log.Log.append(Log.scala:640)
>     at kafka.log.Log.appendAsFollower(Log.scala:623)
>     at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
>     at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
>     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
>     at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:256)
>     at kafka.cluster.Partition.appendRecordsToFollower(Partition.scala:559)
>     at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:112)
>     at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:43)
>     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:183)
>     ... 13 more
> [2018-06-07 21:02:15,669] INFO [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
> {code}
> To fix the issue, the topic configuration must be changed back to a reasonable value, and the brokers whose ReplicaFetcher threads died need to be restarted one at a time to recover the under-replicated partitions.
> A value like 14 bytes is too small to store a message in the log segment. An ls -al of the topic partition directory would look something like:
> {code:java}
> -rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.index
> -rw-r--r--. 1 root root 0 Jun 7 21:02 00000000000000002362.log
> -rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.timeindex
> -rw-r--r--. 1 root root 4 Jun 7 21:53 leader-epoch-checkpoint
> {code}
> It would be good to add a check that prevents this configuration from being set to such a small value.
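
To make the remediation above concrete, here is a minimal sketch of putting segment.bytes back to the broker default of 1 GiB for the affected topic using the Java AdminClient. The bootstrap address is a placeholder and the topic name is taken from the partition shown in the logs; an equivalent kafka-configs.sh --alter invocation would work just as well, and this is not necessarily the exact command used in the incident.

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RestoreSegmentBytes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; substitute a real broker.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test");
            // Put segment.bytes back to the broker default of 1 GiB (1073741824 bytes).
            Config restored = new Config(
                    Collections.singleton(new ConfigEntry("segment.bytes", "1073741824")));
            // alterConfigs replaces the full set of topic-level overrides for the resource,
            // so include any other overrides the topic should keep in the same call.
            admin.alterConfigs(Collections.singletonMap(topic, restored)).all().get();
        }
    }
}
{code}

Changing the configuration back only stops the problem from recurring; the brokers whose fetcher threads already died still need the rolling restart described above to clear the under-replicated partitions.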
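
As for the suggested guard, one possible shape for it is sketched below using the ConfigDef validators Kafka already uses for its configuration definitions. The 1 MiB floor is purely illustrative and not a proposal for the actual minimum; choosing the real bound, and deciding how to treat topics that already carry a tiny override, would be part of the fix.

{code:java}
import java.util.Collections;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigException;

public class SegmentBytesValidation {

    // Illustrative lower bound only; the appropriate minimum would need discussion.
    static final int MIN_SEGMENT_BYTES = 1024 * 1024;

    static final ConfigDef TOPIC_CONFIG_DEF = new ConfigDef()
            .define("segment.bytes",
                    ConfigDef.Type.INT,
                    1024 * 1024 * 1024,                         // default: 1 GiB
                    ConfigDef.Range.atLeast(MIN_SEGMENT_BYTES), // reject values below the floor
                    ConfigDef.Importance.MEDIUM,
                    "The maximum size of a single log segment file");

    public static void main(String[] args) {
        try {
            // With such a validator in place, the alter request would fail up front
            // instead of crashing ReplicaFetcher threads later.
            TOPIC_CONFIG_DEF.parse(Collections.singletonMap("segment.bytes", "14"));
        } catch (ConfigException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
{code}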