Hello,

We are doing a Kafka POC on our CDH cluster. We are running 3 brokers with 24TB (48TB raw) of available RAID10 storage (XFS filesystem mounted with nobarrier/largeio; HP Smart Array P420i controller, latest firmware) and 48GB of RAM. The broker is running with "-Xmx4G -Xms4G -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC". This is on RHEL 6.6 with the 2.6.32-504.8.1.el6.x86_64 kernel. JDK is jdk1.7.0_67 64-bit.

We were using the 1.2.0 version of the Cloudera Kafka 0.8.2.0 build. We are upgrading to 1.3.0 after the RAID testing, but none of the fixes they included in 1.3.0 seem to be related to what we're seeing.

We are using a custom producer to push copies of real messages from our existing messaging system onto Kafka in order to test ingestion rates and compression ratios. After a couple of hours (during which we push about 4.3 billion messages, ~2.2 terabytes before replication), one of our brokers will fail with an I/O error (2 slightly different ones so far) followed by a memory error. We're currently stress testing the arrays (write/verify with IOzone set for 24 threads), but assuming the tests don't find anything on the I/O side, what could cause this? Errors are included below.

Thanks,
-Jeff

Occurrence 1:

2015-05-12 03:55:08,291 FATAL kafka.server.KafkaApis: [KafkaApi-834] Halting due to unrecoverable I/O error while handling produce request:
kafka.common.KafkaStorageException: I/O exception in append to log 'TEST_TOPIC-1'
    at kafka.log.Log.append(Log.scala:266)
    at kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:379)
    at kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:365)
    at kafka.utils.Utils$.inLock(Utils.scala:561)
    at kafka.utils.Utils$.inReadLock(Utils.scala:567)
    at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:365)
    at kafka.server.KafkaApis$$anonfun$appendToLocalLog$2.apply(KafkaApis.scala:291)
    at kafka.server.KafkaApis$$anonfun$appendToLocalLog$2.apply(KafkaApis.scala:282)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at kafka.server.KafkaApis.appendToLocalLog(KafkaApis.scala:282)
    at kafka.server.KafkaApis.handleProducerOrOffsetCommitRequest(KafkaApis.scala:204)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:59)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Map failed
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888)
    at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:74)
    at kafka.log.LogSegment.<init>(LogSegment.scala:57)
    at kafka.log.Log.roll(Log.scala:565)
    at kafka.log.Log.maybeRoll(Log.scala:539)
    at kafka.log.Log.append(Log.scala:306)
    ... 21 more
Caused by: java.lang.OutOfMemoryError: Map failed
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:885)
    ... 26 more

Occurrence 2:

2015-05-12 20:08:15,052 FATAL kafka.server.KafkaApis: [KafkaApi-835] Halting due to unrecoverable I/O error while handling produce request:
kafka.common.KafkaStorageException: I/O exception in append to log 'TEST_TOPIC-23'
    at kafka.log.Log.append(Log.scala:266)
    at kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:379)
    at kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:365)
    at kafka.utils.Utils$.inLock(Utils.scala:561)
    at kafka.utils.Utils$.inReadLock(Utils.scala:567)
    at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:365)
    at kafka.server.KafkaApis$$anonfun$appendToLocalLog$2.apply(KafkaApis.scala:291)
    at kafka.server.KafkaApis$$anonfun$appendToLocalLog$2.apply(KafkaApis.scala:282)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at kafka.server.KafkaApis.appendToLocalLog(KafkaApis.scala:282)
    at kafka.server.KafkaApis.handleProducerOrOffsetCommitRequest(KafkaApis.scala:204)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:59)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Map failed
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888)
    at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:74)
    at kafka.log.LogSegment.<init>(LogSegment.scala:57)
    at kafka.log.Log.roll(Log.scala:565)
    at kafka.log.Log.maybeRoll(Log.scala:539)
    at kafka.log.Log.append(Log.scala:306)
    ... 21 more
Caused by: java.lang.OutOfMemoryError: Map failed
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:885)
    ... 26 more

Occurrence 3:

2015-05-13 01:11:14,626 FATAL kafka.server.ReplicaFetcherThread: [ReplicaFetcherThread-0-835], Disk error while replicating data.
kafka.common.KafkaStorageException: I/O exception in append to log 'TEST_TOPIC-17'
    at kafka.log.Log.append(Log.scala:266)
    at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:54)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1$$anonfun$apply$mcV$sp$2.apply(AbstractFetcherThread.scala:128)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1$$anonfun$apply$mcV$sp$2.apply(AbstractFetcherThread.scala:109)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply$mcV$sp(AbstractFetcherThread.scala:109)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply(AbstractFetcherThread.scala:109)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply(AbstractFetcherThread.scala:109)
    at kafka.utils.Utils$.inLock(Utils.scala:561)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:108)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:86)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Caused by: java.io.IOException: Map failed
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888)
    at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:74)
    at kafka.log.LogSegment.<init>(LogSegment.scala:57)
    at kafka.log.Log.roll(Log.scala:565)
    at kafka.log.Log.maybeRoll(Log.scala:539)
    at kafka.log.Log.append(Log.scala:306)
    ... 13 more
Caused by: java.lang.OutOfMemoryError: Map failed
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:885)
    ... 18 more
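
All three traces bottom out in FileChannelImpl.map while an OffsetIndex is being created during a log roll, so one thing that can be checked independently of the disk tests is how many memory-mapped regions the broker process holds compared with the kernel's per-process limit (vm.max_map_count on Linux). Below is a minimal Python sketch of such a check. It has not been run on this cluster and is not a confirmed diagnosis; the function names are made up for illustration, and it assumes the standard Linux /proc layout.

```python
from pathlib import Path


def count_mappings(maps_text: str) -> int:
    """Count mapped regions in the text of /proc/<pid>/maps (one region per line)."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def count_index_mappings(maps_text: str) -> int:
    """Count regions backed by a file whose name ends in '.index'.

    Assumption: Kafka offset index files use the '.index' suffix.
    Mappings of already-deleted files (shown with a ' (deleted)' suffix)
    are not counted by this simple check.
    """
    count = 0
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].strip().endswith(".index"):
            count += 1
    return count


def report(pid="self") -> str:
    """Summarize mapping usage for a process against vm.max_map_count."""
    maps_text = Path(f"/proc/{pid}/maps").read_text()
    limit = int(Path("/proc/sys/vm/max_map_count").read_text())
    total = count_mappings(maps_text)
    index = count_index_mappings(maps_text)
    return f"{total} mappings ({index} .index files) of {limit} allowed"
```

Calling print(report(12345)) as a user allowed to read the broker's /proc entry, with the broker's actual PID in place of the hypothetical 12345, prints the counts. If the total is close to the limit shortly before a failure, that would point toward the mapping limit rather than the arrays; if it is far below, this line of inquiry can probably be dropped.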