Koelli Mungee created KAFKA-7022:
------------------------------------
Summary: Setting segment.bytes for a topic too small can cause
ReplicaFetcher thread crash and in turn an unhealthy cluster due to
under-replicated partitions
Key: KAFKA-7022
URL: https://issues.apache.org/jira/browse/KAFKA-7022
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 1.0.1
Reporter: Koelli Mungee
The topic configuration segment.bytes was changed to 14 using the alter
command. This resulted in ReplicaFetcher threads dying with the following
exception:
{code:java}
[2018-06-07 21:02:15,669] ERROR [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: Error processing data for partition test-11 offset 2362
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
    at scala.Option.foreach(Option.scala:257)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: kafka.common.KafkaException: Trying to roll a new log segment for topic partition ledger-entry-request-5-11 with start offset 2362 while it already exists.
    at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1349)
    at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1316)
    at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
    at kafka.log.Log.roll(Log.scala:1316)
    at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1303)
    at kafka.log.Log$$anonfun$append$2.apply(Log.scala:726)
    at kafka.log.Log$$anonfun$append$2.apply(Log.scala:640)
    at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
    at kafka.log.Log.append(Log.scala:640)
    at kafka.log.Log.appendAsFollower(Log.scala:623)
    at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
    at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
    at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:256)
    at kafka.cluster.Partition.appendRecordsToFollower(Partition.scala:559)
    at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:112)
    at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:43)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:183)
    ... 13 more
[2018-06-07 21:02:15,669] INFO [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
{code}
To recover, the topic configuration must be changed back to a reasonable
value, and each broker whose ReplicaFetcher threads died must then be
restarted, one at a time, to bring the under-replicated partitions back in sync.
A value like 14 bytes is too small to store a message in the log segment. An ls
-al of the topic partition directory would look something like:
{code:java}
-rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.index
-rw-r--r--. 1 root root 0 Jun 7 21:02 00000000000000002362.log
-rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.timeindex
-rw-r--r--. 1 root root 4 Jun 7 21:53 leader-epoch-checkpoint
{code}
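As a rough illustration of why 14 bytes cannot hold a message: assuming the topic uses the v2 record batch format (the default since 0.11), the fixed batch header alone adds up to 61 bytes before any record data. The class and field names in this sketch are illustrative only and are not taken from the Kafka code base.

```java
public class BatchOverhead {
    // Fixed header fields of a v2 record batch, in bytes.
    // Assumption: the topic uses the v2 message format.
    static final int[] HEADER_FIELD_SIZES = {
        8, // base offset
        4, // batch length
        4, // partition leader epoch
        1, // magic
        4, // CRC
        2, // attributes
        4, // last offset delta
        8, // first timestamp
        8, // max timestamp
        8, // producer id
        2, // producer epoch
        4, // base sequence
        4  // record count
    };

    // Sum of the fixed header fields.
    static int headerBytes() {
        int total = 0;
        for (int size : HEADER_FIELD_SIZES) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        int segmentBytes = 14; // value from this report
        System.out.println("batch header bytes: " + headerBytes());
        System.out.println("header fits in segment: " + (headerBytes() <= segmentBytes));
    }
}
```

Under that assumption, even an empty batch header is more than four times the configured segment size, which is consistent with the 0-byte .log file in the listing above.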
It would be good to add a check that prevents this configuration from being
set to such a small value.
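A minimal sketch of what such a check could look like, written as standalone Java rather than as a patch to the broker. The class name, the validate method and the MIN_SEGMENT_BYTES floor are all hypothetical; the appropriate minimum would need to be decided by the Kafka developers.

```java
import java.util.Collections;
import java.util.Map;

public class SegmentBytesValidator {
    // Hypothetical floor chosen only for illustration; not a Kafka constant.
    static final int MIN_SEGMENT_BYTES = 1024 * 1024;

    // Rejects a topic config whose segment.bytes override is below the floor.
    static void validate(Map<String, String> topicConfig) {
        String raw = topicConfig.get("segment.bytes");
        if (raw == null) {
            return; // no override, nothing to check
        }
        int value;
        try {
            value = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("segment.bytes must be an integer, got: " + raw);
        }
        if (value < MIN_SEGMENT_BYTES) {
            throw new IllegalArgumentException(
                "segment.bytes=" + value + " is below the minimum of " + MIN_SEGMENT_BYTES);
        }
    }

    public static void main(String[] args) {
        // A typical large value passes silently.
        validate(Collections.singletonMap("segment.bytes", "1073741824"));
        try {
            // The value from this report is rejected.
            validate(Collections.singletonMap("segment.bytes", "14"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the value when the config is altered, rather than when a follower first tries to roll a segment, would surface the mistake to the operator before any ReplicaFetcher thread is affected.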