[ https://issues.apache.org/jira/browse/KAFKA-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram reassigned KAFKA-7022:
-------------------------------------

    Assignee: Rajini Sivaram

> Setting a very small segment.bytes can cause ReplicaFetcher threads to crash and, in turn, an unhealthy cluster due to under-replicated partitions
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7022
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7022
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.0.1
>            Reporter: Koelli Mungee
>            Assignee: Rajini Sivaram
>            Priority: Major
>
> The topic configuration segment.bytes was changed to 14 bytes using the alter 
> command (a programmatic sketch of the change follows the stack trace below). 
> This caused the ReplicaFetcher threads to die with the following exception:
> {code:java}
> [2018-06-07 21:02:15,669] ERROR [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> kafka.common.KafkaException: Error processing data for partition test-11 offset 2362
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
>         at scala.Option.foreach(Option.scala:257)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: kafka.common.KafkaException: Trying to roll a new log segment for topic partition ledger-entry-request-5-11 with start offset 2362 while it already exists.
>         at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1349)
>         at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1316)
>         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
>         at kafka.log.Log.roll(Log.scala:1316)
>         at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1303)
>         at kafka.log.Log$$anonfun$append$2.apply(Log.scala:726)
>         at kafka.log.Log$$anonfun$append$2.apply(Log.scala:640)
>         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
>         at kafka.log.Log.append(Log.scala:640)
>         at kafka.log.Log.appendAsFollower(Log.scala:623)
>         at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
>         at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
>         at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:256)
>         at kafka.cluster.Partition.appendRecordsToFollower(Partition.scala:559)
>         at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:112)
>         at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:43)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:183)
>         ... 13 more
> [2018-06-07 21:02:15,669] INFO [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
> {code}
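> The change that triggered this was issued with the topic alter command; a minimal sketch of the programmatic equivalent via the Java AdminClient is below. The topic name and bootstrap address are placeholders, not values from the original report:
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.Config;
> import org.apache.kafka.clients.admin.ConfigEntry;
> import org.apache.kafka.common.config.ConfigResource;
>
> public class AlterSegmentBytes {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         try (AdminClient admin = AdminClient.create(props)) {
>             ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test");
>             // The broker accepts 14 bytes without complaint, which is the root of the problem.
>             Config config = new Config(Collections.singletonList(
>                     new ConfigEntry("segment.bytes", "14")));
>             admin.alterConfigs(Collections.singletonMap(topic, config)).all().get();
>         }
>     }
> }
> {code}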
> To fix the issue, the topic configuration must be changed back to a reasonable 
> value, and the brokers whose ReplicaFetcher threads died must be restarted one 
> at a time to recover the under-replicated partitions (see the sketch after the 
> directory listing below). 
> A value like 14 bytes is too small to hold even a single message, so the log 
> segment stays empty. An ls -al of the topic partition directory looks 
> something like this:
> {code:java}
> -rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.index 
> -rw-r--r--. 1 root root 0 Jun 7 21:02 00000000000000002362.log 
> -rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.timeindex 
> -rw-r--r--. 1 root root 4 Jun 7 21:53 leader-epoch-checkpoint
> {code}
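> The recovery described above can be sketched programmatically, continuing the hypothetical AdminClient example from earlier (1073741824 is the broker-default segment size):
> {code:java}
> // Continuing the sketch above: revert segment.bytes to the 1 GiB default,
> // then restart the affected brokers one at a time.
> ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test");
> Config revert = new Config(Collections.singletonList(
>         new ConfigEntry("segment.bytes", "1073741824")));
> admin.alterConfigs(Collections.singletonMap(topic, revert)).all().get();
> {code}
> The rolling restart itself has no API equivalent; each affected broker must be bounced and allowed to rejoin the ISR before moving on to the next.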
> It would be good to add a check that prevents this configuration from being 
> set to such a small value.
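> One possible shape for that check, as a minimal sketch against Kafka's ConfigDef validator API (the 1 MiB floor here is illustrative, not a value Kafka actually enforces):
> {code:java}
> import java.util.Collections;
> import org.apache.kafka.common.config.ConfigDef;
> import org.apache.kafka.common.config.ConfigException;
>
> public class SegmentBytesValidation {
>     public static void main(String[] args) {
>         ConfigDef def = new ConfigDef()
>                 .define("segment.bytes", ConfigDef.Type.INT, 1073741824,
>                         ConfigDef.Range.atLeast(1024 * 1024), // illustrative floor, not Kafka's actual minimum
>                         ConfigDef.Importance.MEDIUM,
>                         "The hard maximum size of a single log segment file");
>         try {
>             def.parse(Collections.singletonMap("segment.bytes", "14"));
>         } catch (ConfigException e) {
>             // With a floor in place the alter request fails fast,
>             // instead of crashing ReplicaFetcher threads later.
>             System.out.println("Rejected: " + e.getMessage());
>         }
>     }
> }
> {code}
> ConfigDef.Range.atLeast is the same validator mechanism Kafka's config definitions already use for numeric bounds, so a guard along these lines would reject the alter request at validation time rather than leaving followers unable to replicate.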



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
