[ https://issues.apache.org/jira/browse/HADOOP-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eli Acherkan updated HADOOP-14376:
----------------------------------
    Attachment: HADOOP-14376.001.patch

Patch attached. First-time contributor; I hope I followed the guidelines correctly.

For testing, I enhanced an existing unit test, TestCodec.codecTest(), since it's already invoked for different types of native and pure-Java codecs. I added an assertion that the number of leased decompressors after the test equals the number before it. This exposed a similar bug in BZip2Codec.BZip2CompressionInputStream.close(), which also doesn't call its super.close() method, and thus doesn't return the decompressor to the pool.

Adding the same assertion for compressors uncovered similar issues in CompressorStream.close(), GzipCodec.GzipOutputStream.close(), and BZip2Codec.BZip2CompressionOutputStream.close(), which I attempted to fix as well. For BZip2Codec.BZip2CompressionOutputStream.close(), I removed the overriding method altogether, because the superclass's close() method invokes finish(). The finish() method handles internalReset() if needed, and also calls output.finish(), which eliminates the need to call output.flush() or output.close().

Testing GzipCodec without the native libraries showed that CodecPool erroneously calls updateLeaseCounts even for compressors/decompressors that are null, or ones with the @DoNotPool annotation. I added a condition that checks for that.

The memory leak only manifests when using the native libraries. In Eclipse I reproduced this by setting java.library.path in the unit test launcher. Seeing the usage of assumeTrue(isNative*Loaded()), I understand that the native-related tests are covered in Maven builds as well.

Looking forward to a code review.
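The lease-accounting guard described above can be sketched as follows. This is a simplified, hypothetical model, not Hadoop's actual CodecPool: the names updateLeaseCount, leaseCount, and the @DoNotPool annotation here are stand-ins mirroring the comment, and the codec classes are placeholders.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for Hadoop's annotation marking codecs that must not be pooled.
@Retention(RetentionPolicy.RUNTIME)
@interface DoNotPool {}

public class PoolGuardDemo {
    static final AtomicInteger leaseCount = new AtomicInteger();

    // The fix: only adjust the lease count for codecs that are actually
    // poolable -- skip nulls and anything annotated @DoNotPool.
    static void updateLeaseCount(Object codec, int delta) {
        if (codec != null
                && !codec.getClass().isAnnotationPresent(DoNotPool.class)) {
            leaseCount.addAndGet(delta);
        }
    }

    static class PooledCodec {}

    @DoNotPool
    static class UnpooledCodec {}

    public static void main(String[] args) {
        updateLeaseCount(new PooledCodec(), +1);   // counted
        updateLeaseCount(new UnpooledCodec(), +1); // skipped: @DoNotPool
        updateLeaseCount(null, +1);                // skipped: null
        System.out.println("leased: " + leaseCount.get());
    }
}
```

Without the guard, the non-poolable and null cases would inflate the count and break the before/after assertion in the test.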
> Memory leak when reading a compressed file using the native library
> -------------------------------------------------------------------
>
>                 Key: HADOOP-14376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14376
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common, io
>    Affects Versions: 2.7.0
>            Reporter: Eli Acherkan
>            Assignee: Eli Acherkan
>        Attachments: Bzip2MemoryTester.java, HADOOP-14376.001.patch, log4j.properties
>
> Opening and closing a large number of bzip2-compressed input streams causes the process to be killed on OutOfMemory when using the native bzip2 library.
> Our initial analysis suggests that this can be caused by {{DecompressorStream}} overriding the {{close()}} method, and therefore skipping the line from its parent: {{CodecPool.returnDecompressor(trackedDecompressor)}}. When the decompressor object is a {{Bzip2Decompressor}}, its native {{end()}} method is never called, and the allocated memory isn't freed.
> If this analysis is correct, the simplest way to fix this bug would be to replace {{in.close()}} with {{super.close()}} in {{DecompressorStream}}.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
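The leak mechanism quoted above can be illustrated with a minimal, self-contained model. The classes below (Pool, BaseStream, LeakyStream, FixedStream) are hypothetical stand-ins, not Hadoop's DecompressorStream or CodecPool: a subclass that overrides close() without delegating to super.close() never returns its borrowed resource, while the fixed subclass does.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy pool that tracks how many decompressors are currently leased.
class Pool {
    static final AtomicInteger leased = new AtomicInteger();
    static Object borrow() { leased.incrementAndGet(); return new Object(); }
    static void giveBack(Object d) { leased.decrementAndGet(); }
}

// Base stream: its close() is responsible for returning the decompressor.
class BaseStream implements AutoCloseable {
    protected final Object decompressor = Pool.borrow();
    @Override public void close() { Pool.giveBack(decompressor); }
}

// Mirrors the bug: close() is overridden and never calls super.close(),
// so the decompressor is never returned to the pool.
class LeakyStream extends BaseStream {
    @Override public void close() { /* super.close() is skipped -> leak */ }
}

// Mirrors the fix: delegate to the superclass so the pool gets it back.
class FixedStream extends BaseStream {
    @Override public void close() { super.close(); }
}

public class LeakDemo {
    public static void main(String[] args) throws Exception {
        try (LeakyStream s = new LeakyStream()) { /* use stream */ }
        System.out.println("after leaky close: " + Pool.leased.get());
        try (FixedStream s = new FixedStream()) { /* use stream */ }
        System.out.println("after fixed close: " + Pool.leased.get());
    }
}
```

The leaky stream leaves one decompressor outstanding after close; the fixed stream returns its own, so the count does not grow further. In the real bug the leaked object is a Bzip2Decompressor whose native end() is never invoked, so native memory, not just a Java reference, is what accumulates.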