[ https://issues.apache.org/jira/browse/HADOOP-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Acherkan updated HADOOP-14376:
----------------------------------
    Attachment: HADOOP-14376.001.patch

Patch attached. As a first-time contributor, I hope I've followed the 
guidelines correctly.

For testing, I enhanced an existing unit test, TestCodec.codecTest(), since 
it's already invoked for different native and pure-Java codecs. I added an 
assertion that the number of leased decompressors after the test equals the 
number before it. This exposed a similar bug in 
BZip2Codec.BZip2CompressionInputStream.close(), which also doesn't call its 
super.close() method and thus doesn't return the decompressor to the pool.
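
For illustration, a minimal sketch of the assertion's shape. The test class 
and payload here are hypothetical, and the pool-introspection helper 
(getLeasedDecompressorsCount()) is the one the patch introduces - the name 
is assumed, not part of the released CodecPool API:

{code:java}
import static org.junit.Assert.assertEquals;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.junit.Test;

public class LeasedDecompressorSketchTest {

  @Test
  public void decompressorReturnsToPoolOnClose() throws Exception {
    BZip2Codec codec = new BZip2Codec();
    codec.setConf(new Configuration());

    // Produce a valid compressed stream to read back.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    CompressionOutputStream out = codec.createOutputStream(buf);
    out.write("some payload".getBytes("UTF-8"));
    out.close();

    // Helper assumed to be added by the patch: current leased count.
    int leasedBefore = CodecPool.getLeasedDecompressorsCount(codec);

    CompressionInputStream in =
        codec.createInputStream(new ByteArrayInputStream(buf.toByteArray()));
    IOUtils.copyBytes(in, new ByteArrayOutputStream(), 4096, false);
    in.close(); // must reach CompressionInputStream.close() to return it

    assertEquals("decompressor not returned to the pool",
        leasedBefore, CodecPool.getLeasedDecompressorsCount(codec));
  }
}
{code}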

Adding an assertion for compressors as well as decompressors uncovered a 
similar issue in CompressorStream.close(), GzipCodec.GzipOutputStream.close(), 
and BZip2Codec.BZip2CompressionOutputStream.close(), which I attempted to fix 
as well.

Regarding BZip2Codec.BZip2CompressionOutputStream.close(), I removed the 
overriding method altogether, because the superclass's close() method invokes 
finish(). The finish() method handles internalReset() if needed, and also calls 
output.finish(), which eliminates the need to call output.flush() or 
output.close().
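
For reference, a simplified paraphrase of the two methods involved (not 
verbatim from the Hadoop sources) showing why the override is redundant:

{code:java}
// CompressionOutputStream.close(), simplified: it already finishes the
// compressed stream and closes the underlying one (and, with this patch,
// also returns the tracked compressor to the CodecPool).
public void close() throws IOException {
  finish();
  out.close();
}

// BZip2CompressionOutputStream.finish(), simplified: internalReset()
// (re)creates the bzip2 output stream if needed, and output.finish()
// writes the trailer - so the removed close() override had nothing left
// to do beyond what the superclass does.
public void finish() throws IOException {
  if (needsReset) {
    internalReset();
  }
  this.output.finish();
  needsReset = true;
}
{code}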

Testing GzipCodec without the native libraries showed that CodecPool 
erroneously calls updateLeaseCounts even for compressors/decompressors that 
are null, or for ones carrying the @DoNotPool annotation. I added a 
condition that checks for both cases.
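
A minimal sketch of that guard as it might sit inside CodecPool (the method 
signature, counts map, and getLeaseCount() helper are illustrative, not the 
exact patch contents):

{code:java}
// Illustrative shape of the guard around CodecPool's lease bookkeeping.
private static <T> void updateLeaseCount(
    Map<Class<T>, AtomicInteger> counts, T codec, int delta) {
  // Without native zlib, GzipCodec.createCompressor() returns null, and
  // its pure-Java decompressor carries @DoNotPool - neither is actually
  // pooled, so neither should touch the lease counters.
  if (codec != null &&
      !codec.getClass().isAnnotationPresent(DoNotPool.class)) {
    getLeaseCount(counts, codec).addAndGet(delta); // hypothetical helper
  }
}
{code}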

The memory leak only manifests when using the native libraries. In Eclipse I 
reproduced this by setting java.library.path in the unit test launcher. 
Given the existing usage of assumeTrue(isNative*Loaded()), I understand that 
native-related tests are covered in Maven builds as well.
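
For anyone reproducing this locally, a sketch of that guard pattern (the 
test class is hypothetical; the java.library.path value is 
environment-specific):

{code:java}
import static org.junit.Assume.assumeTrue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.bzip2.Bzip2Factory;
import org.junit.Test;

public class NativeBzip2GuardSketch {

  // Run the JVM with e.g. -Djava.library.path=<hadoop-native-lib-dir>
  // so the JNI codec loads; otherwise the test is skipped, not failed.
  @Test
  public void nativeOnlyAssertions() {
    assumeTrue(Bzip2Factory.isNativeBzip2Loaded(new Configuration()));
    // ... native-only leak assertions would go here ...
  }
}
{code}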

Looking forward to a code review.

> Memory leak when reading a compressed file using the native library
> -------------------------------------------------------------------
>
>                 Key: HADOOP-14376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14376
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common, io
>    Affects Versions: 2.7.0
>            Reporter: Eli Acherkan
>            Assignee: Eli Acherkan
>         Attachments: Bzip2MemoryTester.java, HADOOP-14376.001.patch, 
> log4j.properties
>
>
> Opening and closing a large number of bzip2-compressed input streams causes 
> the process to be killed on OutOfMemory when using the native bzip2 library.
> Our initial analysis suggests that this can be caused by 
> {{DecompressorStream}} overriding the {{close()}} method, and therefore 
> skipping the line from its parent: 
> {{CodecPool.returnDecompressor(trackedDecompressor)}}. When the decompressor 
> object is a {{Bzip2Decompressor}}, its native {{end()}} method is never 
> called, and the allocated memory isn't freed.
> If this analysis is correct, the simplest way to fix this bug would be to 
> replace {{in.close()}} with {{super.close()}} in {{DecompressorStream}}.
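
For clarity, a sketch of that proposed one-line fix in context (paraphrased 
from DecompressorStream; the closed flag matches the existing field):

{code:java}
// DecompressorStream.close(), sketched with the proposed fix applied:
// calling super.close() reaches CompressionInputStream.close(), which
// runs CodecPool.returnDecompressor(trackedDecompressor) and lets
// Bzip2Decompressor.end() free the native buffers.
@Override
public void close() throws IOException {
  if (!closed) {
    try {
      super.close(); // was: in.close(), which skipped the pool return
    } finally {
      closed = true;
    }
  }
}
{code}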


