[ https://issues.apache.org/jira/browse/KAFKA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553829#comment-14553829 ]

Albert Visagie commented on KAFKA-2201:
---------------------------------------

Some new information:
1. We adjusted the segment size to about 30MB instead of roughly 1MB as before (see the snippet at the end of this comment). The segments roll over more slowly, as one would expect.
2. The attached stats graphs show the previous climb in open files right up to the crash.
3. The shallower slope of the line after the high peak shows that the open file count still climbs, but much more slowly.
4. The more recent sudden drop in open file handles corresponds exactly in time with the following output from kafkaServer-gc.log (logging that by default right out of the box is a good idea, by the way).

2015-05-21T07:38:47.294+0000: 65001.794: [GC2015-05-21T07:38:47.294+0000: 65001.794: [ParNew: 5827K->5191K(314560K), 0.0083640 secs] 648593K->648101K(1013632K), 0.0084390 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
2015-05-21T07:38:53.377+0000: 65007.877: [GC2015-05-21T07:38:53.377+0000: 65007.877: [ParNew: 284807K->6304K(314560K), 0.0356520 secs] 927717K->649362K(1013632K), 0.0357490 secs] [Times: user=0.29 sys=0.00, real=0.04 secs]
2015-05-21T07:38:59.836+0000: 65014.336: [GC2015-05-21T07:38:59.836+0000: 65014.336: [ParNew: 285920K->5286K(314560K), 0.0391160 secs] 928978K->648489K(1013632K), 0.0392210 secs] [Times: user=0.30 sys=0.00, real=0.04 secs]
2015-05-21T07:38:59.876+0000: 65014.376: [GC [1 CMS-initial-mark: 643203K(699072K)] 650842K(1013632K), 0.0078780 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2015-05-21T07:38:59.884+0000: 65014.384: [CMS-concurrent-mark-start]
2015-05-21T07:38:59.903+0000: 65014.403: [CMS-concurrent-mark: 0.017/0.019 secs] [Times: user=0.13 sys=0.01, real=0.02 secs]
2015-05-21T07:38:59.903+0000: 65014.403: [CMS-concurrent-preclean-start]
2015-05-21T07:38:59.908+0000: 65014.408: [CMS-concurrent-preclean: 0.004/0.005 secs] [Times: user=0.02 sys=0.01, real=0.00 secs]
2015-05-21T07:38:59.908+0000: 65014.408: [CMS-concurrent-abortable-preclean-start]
2015-05-21T07:39:03.094+0000: 65017.595: [CMS-concurrent-abortable-preclean: 3.079/3.186 secs] [Times: user=4.73 sys=0.36, real=3.19 secs]
2015-05-21T07:39:03.095+0000: 65017.595: [GC[YG occupancy: 147607 K (314560 K)]2015-05-21T07:39:03.095+0000: 65017.595: [Rescan (parallel) , 0.0171330 secs]2015-05-21T07:39:03.112+0000: 65017.612: [weak refs processing, 0.0015230 secs]2015-05-21T07:39:03.113+0000: 65017.614: [class unloading, 0.0037630 secs]2015-05-21T07:39:03.117+0000: 65017.617: [scrub symbol table, 0.0015450 secs]2015-05-21T07:39:03.119+0000: 65017.619: [scrub string table, 0.0002790 secs] [1 CMS-remark: 643203K(699072K)] 790811K(1013632K), 0.0261450 secs] [Times: user=0.21 sys=0.00, real=0.03 secs]
2015-05-21T07:39:03.121+0000: 65017.621: [CMS-concurrent-sweep-start]
2015-05-21T07:39:03.121+0000: 65017.622: [GC2015-05-21T07:39:03.121+0000: 65017.622: [ParNew: 147607K->3379K(314560K), 0.0178100 secs] 790811K->646718K(1013632K), 0.0178920 secs] [Times: user=0.14 sys=0.00, real=0.02 secs]
2015-05-21T07:39:04.378+0000: 65018.879: [CMS-concurrent-sweep: 1.238/1.257 secs] [Times: user=2.12 sys=0.20, real=1.26 secs]
2015-05-21T07:39:04.378+0000: 65018.879: [CMS-concurrent-reset-start]
2015-05-21T07:39:04.380+0000: 65018.881: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2015-05-21T07:39:09.187+0000: 65023.688: [GC2015-05-21T07:39:09.187+0000: 65023.688: [ParNew: 282995K->5097K(314560K), 0.0353740 secs] 289202K->11447K(1013632K), 0.0354670 secs] [Times: user=0.26 sys=0.01, real=0.04 secs]
2015-05-21T07:39:15.904+0000: 65030.404: [GC2015-05-21T07:39:15.904+0000: 65030.404: [ParNew: 284713K->5448K(314560K), 0.0376610 secs] 291063K->11942K(1013632K), 0.0377660 secs] [Times: user=0.29 sys=0.01, real=0.04 secs]
2015-05-21T07:39:22.546+0000: 65037.047: [GC2015-05-21T07:39:22.547+0000: 65037.047: [ParNew: 285064K->5615K(314560K), 0.0354600 secs] 291558K->12109K(1013632K), 0.0355580 secs] [Times: user=0.30 sys=0.00, real=0.04 secs]
2015-05-21T07:39:28.956+0000: 65043.456: [GC2015-05-21T07:39:28.956+0000: 65043.456: [ParNew: 285231K->5984K(314560K), 0.0420590 secs] 291725K->12642K(1013632K), 0.0421550 secs] [Times: user=0.31 sys=0.00, real=0.05 secs]
2015-05-21T07:39:35.536+0000: 65050.037: [GC2015-05-21T07:39:35.536+0000: 65050.037: [ParNew: 285600K->5180K(314560K), 0.0457240 secs] 292258K->11992K(1013632K), 0.0458610 secs] [Times: user=0.32 sys=0.00, real=0.05 secs]

Normally the JVM only ever does the parallel (ParNew) collection on the young generation; the drop in open file handles corresponds to the full CMS collection shown above.

This appears to support the hypothesis that the segment files are only closed during a full GC, and thus are not closed right after they are deleted.
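
A minimal sketch of the suspected mechanism, outside Kafka: on JDK 7/8 a memory-mapped file still shows up in lsof (marked "deleted") after the channel is closed and the file is deleted, until the MappedByteBuffer is reclaimed by GC or explicitly unmapped through the JDK-internal cleaner. The path, size, and class name below are made up for illustration; this is not Kafka's code.

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustration only: why a deleted, mmap'ed segment can linger in lsof output
// until its buffer is garbage collected.
public class MmapLingerSketch {
    public static void main(String[] args) throws Exception {
        File segment = new File("/tmp/demo-segment.index"); // hypothetical path
        RandomAccessFile raf = new RandomAccessFile(segment, "rw");
        FileChannel channel = raf.getChannel();
        MappedByteBuffer index = channel.map(FileChannel.MapMode.READ_WRITE, 0, 10 * 1024);

        // Closing the channel and deleting the file does NOT release the mapping;
        // "lsof -p <pid>" would still list the file, marked "(deleted)", here.
        channel.close();
        raf.close();
        segment.delete();

        // The mapping is released only when 'index' is collected (for a tenured
        // buffer, effectively on a full GC), or via the unsupported explicit unmap:
        ((sun.nio.ch.DirectBuffer) index).cleaner().clean();
    }
}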

I will upgrade the JDK from 1.7u40 to 1.7u79 next.
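
For reference, the segment setting from point 1 as it now stands in our server.properties; the exact byte value below is only an approximation of "about 30MB", not a recommended number:

log.segment.bytes=31457280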


> Open file handle leak
> ---------------------
>
>                 Key: KAFKA-2201
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2201
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.2.1
>         Environment: Debian Linux 7, 64 bit
> Oracle JDK 1.7.0u40, 64-bit
>            Reporter: Albert Visagie
>
> The kafka broker crashes with the following stack trace from the server.log roughly every 18 hours:
> [2015-05-19 07:39:22,924] FATAL [KafkaApi-0] Halting due to unrecoverable I/O error while handling produce request:  (kafka.server.KafkaApis)
> kafka.common.KafkaStorageException: I/O exception in append to log 'nnnnnnn-1'
>         at kafka.log.Log.append(Log.scala:266)
>         at kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:379)
>         at kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:365)
>         at kafka.utils.Utils$.inLock(Utils.scala:535)
>         at kafka.utils.Utils$.inReadLock(Utils.scala:541)
>         at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:365)
>         at kafka.server.KafkaApis$$anonfun$appendToLocalLog$2.apply(KafkaApis.scala:291)
>         at kafka.server.KafkaApis$$anonfun$appendToLocalLog$2.apply(KafkaApis.scala:282)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at kafka.server.KafkaApis.appendToLocalLog(KafkaApis.scala:282)
>         at kafka.server.KafkaApis.handleProducerOrOffsetCommitRequest(KafkaApis.scala:204)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:59)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
>         at java.lang.Thread.run(Thread.java:724)
> Caused by: java.io.IOException: Map failed
>         at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888)
>         at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:286)
>         at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:276)
>         at kafka.utils.Utils$.inLock(Utils.scala:535)
>         at kafka.log.OffsetIndex.resize(OffsetIndex.scala:276)
>         at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply$mcV$sp(OffsetIndex.scala:265)
>         at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
>         at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
>         at kafka.utils.Utils$.inLock(Utils.scala:535)
>         at kafka.log.OffsetIndex.trimToValidSize(OffsetIndex.scala:264)
>         at kafka.log.Log.roll(Log.scala:563)
>         at kafka.log.Log.maybeRoll(Log.scala:539)
>         at kafka.log.Log.append(Log.scala:306)
>         ... 21 more
> Caused by: java.lang.OutOfMemoryError: Map failed
>         at sun.nio.ch.FileChannelImpl.map0(Native Method)
>         at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:885)
>         ... 33 more
> The Kafka broker's open file handle count, as seen by
> lsof | grep pid | wc -l
> grows steadily as it runs. Under our load it lasts about 18 hours before crashing with the stack trace above.
> We were experimenting with the Log Retention Policy settings in server.properties:
> log.retention.hours=168
> log.retention.bytes=107374182
> log.segment.bytes=1073741
> log.retention.check.interval.ms=3000
> The result is that the broker rolls over segments quite rapidly. We don't have to run it that way, of course.
> We are running only one broker at the moment.
> lsof shows many open files with the suffix ".deleted" that have no size and do not appear in an ls of the log directory.
> This is Kafka 0.8.2.1 with Scala 2.10.4, as downloaded from the website last week.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
