Ricardo Bartolome created KAFKA-6762:
Summary: log-cleaner thread terminates due to java.lang.IllegalStateException
Key: KAFKA-6762
URL: https://issues.apache.org/jira/browse/KAFKA-6762
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 1.0.0
Environment: os: GNU/Linux
arch: x86_64
Kernel: 4.9.77
jvm: OpenJDK 1.8.0
Reporter: Ricardo Bartolome
We are experiencing problems with the log-cleaner thread on Kafka 1.0.0. We
plan to upgrade this cluster to 1.1.0 next week to pick up the fix for
KAFKA-6683, but until then we can only confirm that the issue happens on 1.0.0.
The log-cleaner thread crashes after a while with the following error:
{code:java}
[2018-03-28 11:14:40,199] INFO Cleaner 0: Beginning cleaning of log
__consumer_offsets-31. (kafka.log.LogCleaner)
[2018-03-28 11:14:40,199] INFO Cleaner 0: Building offset map for
__consumer_offsets-31... (kafka.log.LogCleaner)
[2018-03-28 11:14:40,218] INFO Cleaner 0: Building offset map for log
__consumer_offsets-31 for 16 segments in offset range [1612869, 14282934).
(kafka.log.LogCleaner)
[2018-03-28 11:14:58,566] INFO Cleaner 0: Offset map for log
__consumer_offsets-31 complete. (kafka.log.LogCleaner)
[2018-03-28 11:14:58,566] INFO Cleaner 0: Cleaning log __consumer_offsets-31
(cleaning prior to Tue Mar 27 09:25:09 GMT 2018, discarding tombstones prior to
Sat Feb 24 11:04:21 GMT 2018
)... (kafka.log.LogCleaner)
[2018-03-28 11:14:58,567] INFO Cleaner 0: Cleaning segment 0 in log
__consumer_offsets-31 (largest timestamp Fri Feb 23 11:40:54 GMT 2018) into 0,
discarding deletes. (kafka.log.LogCleaner)
[2018-03-28 11:14:58,570] INFO Cleaner 0: Growing cleaner I/O buffers from
262144bytes to 524288 bytes. (kafka.log.LogCleaner)
[2018-03-28 11:14:58,576] INFO Cleaner 0: Growing cleaner I/O buffers from
524288bytes to 1000012 bytes. (kafka.log.LogCleaner)
[2018-03-28 11:14:58,593] ERROR [kafka-log-cleaner-thread-0]: Error due to
(kafka.log.LogCleaner)
java.lang.IllegalStateException: This log contains a message larger than
maximum allowable size of 1000012.
at kafka.log.Cleaner.growBuffers(LogCleaner.scala:622)
at kafka.log.Cleaner.cleanInto(LogCleaner.scala:574)
at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:459)
at kafka.log.Cleaner.$anonfun$doClean$6(LogCleaner.scala:396)
at kafka.log.Cleaner.$anonfun$doClean$6$adapted(LogCleaner.scala:395)
at scala.collection.immutable.List.foreach(List.scala:389)
at kafka.log.Cleaner.doClean(LogCleaner.scala:395)
at kafka.log.Cleaner.clean(LogCleaner.scala:372)
at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:263)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:243)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
[2018-03-28 11:14:58,601] INFO [kafka-log-cleaner-thread-0]: Stopped
(kafka.log.LogCleaner)
[2018-04-04 14:25:12,773] INFO The cleaning for partition
__broker-11-health-check-0 is aborted and paused (kafka.log.LogCleaner)
[2018-04-04 14:25:12,773] INFO Compaction for partition
__broker-11-health-check-0 is resumed (kafka.log.LogCleaner)
[2018-04-04 14:25:12,774] INFO The cleaning for partition
__broker-11-health-check-0 is aborted (kafka.log.LogCleaner)
[2018-04-04 14:25:22,850] INFO Shutting down the log cleaner.
(kafka.log.LogCleaner)
[2018-04-04 14:25:22,850] INFO [kafka-log-cleaner-thread-0]: Shutting down
(kafka.log.LogCleaner)
[2018-04-04 14:25:22,850] INFO [kafka-log-cleaner-thread-0]: Shutdown completed
(kafka.log.LogCleaner)
{code}
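For context, the exception comes from the cleaner's buffer-growth path
(kafka.log.Cleaner.growBuffers, LogCleaner.scala:622 in the trace above). Below
is a simplified, illustrative reconstruction of that logic, inferred from the
stack trace and log output rather than copied from Kafka source; the field
names and starting sizes are assumptions:
{code:scala}
import java.nio.ByteBuffer

object CleanerBufferSketch {
  // Assumed starting capacities, matching the first "Growing cleaner I/O
  // buffers" line in the log above.
  var readBuffer: ByteBuffer = ByteBuffer.allocate(262144)
  var writeBuffer: ByteBuffer = ByteBuffer.allocate(262144)
  val maxIoBufferSize: Int = 524288 // assumed cleaner I/O buffer ceiling

  // The cleaner doubles its buffers up to max(maxLogMessageSize, maxIoBufferSize).
  // Once both buffers sit at that cap and a record batch still does not fit,
  // it throws, and the cleaner thread dies.
  def growBuffers(maxLogMessageSize: Int): Unit = {
    val maxBufferSize = math.max(maxLogMessageSize, maxIoBufferSize)
    if (readBuffer.capacity >= maxBufferSize || writeBuffer.capacity >= maxBufferSize)
      throw new IllegalStateException(
        s"This log contains a message larger than maximum allowable size of $maxBufferSize.")
    val newSize = math.min(readBuffer.capacity * 2, maxBufferSize)
    readBuffer = ByteBuffer.allocate(newSize)  // 262144 -> 524288 -> 1000012 in the log
    writeBuffer = ByteBuffer.allocate(newSize)
  }
}
{code}
Under these assumptions the sketch matches the log above: the buffers grow
262144 -> 524288 -> 1000012 (the cap, with maxLogMessageSize = message.max.bytes
= 1000012 by default), and the next record that still does not fit raises the
IllegalStateException that kills the thread. That would also explain why
raising message.max.bytes, and with it the cap, works around the crash.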
What we know so far is:
* We have not yet been able to reproduce it consistently.
* It only happens in the PRO cluster and not in the PRE cluster for the same
customer (whose message payloads are very similar).
* Checking our Kafka logs, it has only happened on the internal
*__consumer_offsets-** topics.
* When we restart the broker process, the log-cleaner starts working again,
but it can take anywhere from 3 minutes to several hours to die again.
* We work around it by temporarily increasing message.max.bytes and
replica.fetch.max.bytes to 10485760 (10MB) from the default 1000012 (~1MB); see
the snippet after this list.
** Before settling on message.max.bytes = 10MB, we tried matching
message.max.bytes to the value of replica.fetch.max.bytes (1048576), but the
log-cleaner died with the same error and a different limit.
** This change kept the log-cleaner alive and let it compact enough data for
used disk space to drop from ~600GB to ~100GB.
** Without this change, the log-cleaner dies after a while, used disk space
stays at ~450GB, and it starts growing again due to cluster activity.
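For reference, the temporary workaround described above amounts to these two
lines in server.properties on each broker (10485760 = 10MB):
{code:java}
# Temporary workaround: raise the max message size so the cleaner's buffer cap
# exceeds the oversized records in __consumer_offsets.
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
{code}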
Our server.properties content is as follows, as printed in server.log at
broker startup.
{code:java}
broker.id=11
delete.topic.enable=true
advertised.listeners=PLAINTEXT://broker-ip:9092
num.network.threads=3
num.io.