[ https://issues.apache.org/jira/browse/HADOOP-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Acherkan updated HADOOP-14376:
----------------------------------
    Attachment: HADOOP-14376.002.patch

Thanks [~jlowe]. Those are excellent points; I completely agree that the patch 
introduced subtle behavioral differences when some of the streams throw 
exceptions on close(). My previous reasoning was that in that case something 
has probably gone horribly and irrevocably wrong anyway. Following your 
comments, I prepared a more defensive patch: even if some of the close() or 
finish() methods throw exceptions, we still try to close/recover whatever we 
can. The price of this is the assumption that it's okay to call a stream's 
close() method multiple times.
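To illustrate that assumption, here is a minimal sketch (not part of the patch; the class name is made up) of the guard-flag pattern that makes repeated close() calls safe:

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical example: a stream whose close() tolerates being called
// more than once, releasing its resources only on the first call.
class IdempotentStream extends OutputStream {
    private boolean closed = false;
    int closeCount = 0; // for illustration: counts real close operations

    @Override
    public void write(int b) throws IOException {
        if (closed) {
            throw new IOException("stream closed");
        }
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;        // subsequent calls are no-ops
        }
        closed = true;
        closeCount++;      // release resources exactly once here
    }
}
```

A defensive close path can then call close() on the same stream from more than one place without double-releasing anything.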

*BZip2CompressionOutputStream*:
bq. Other maintainers will come along and see that BZip2CompressionOutputStream 
never calls close() on one of its private member streams which is usually a 
bug. Even if the close() ends up being redundant today that doesn't mean it 
always will. The root cause for this JIRA is a great example.
In patch 002 I put back the BZip2CompressionOutputStream.close() method with a 
call to super.close() and some explanatory documentation. It still seems to me 
that calling super.close() should be sufficient; let me try to explain why.

bq. I'm not seeing how the superclass's finish() method ever ends up closing 
the out stream. I see it write some final bytes and sets it to null, which in 
turn prevents the close() method from trying to call out.close(), so I'm 
wondering how the output stream normally gets closed.
My understanding is that the output stream _does_ get closed, thanks to the 
out.close() call in CompressionOutputStream.close(). The {{out}} data member of 
CBZip2OutputStream is indeed nullified, but the {{out}} data member of 
CompressionOutputStream should still reference the actual stream object.
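To make the two-references argument concrete, here is a minimal model (class and method names are illustrative stand-ins, not the actual Hadoop sources): the outer wrapper keeps its own reference to the raw stream, so the raw stream still gets closed even after the inner compressor nulls its copy.

```java
import java.io.IOException;
import java.io.OutputStream;

// Stand-in for the underlying file/socket stream.
class RawStream extends OutputStream {
    boolean closed = false;
    @Override public void write(int b) {}
    @Override public void close() { closed = true; }
}

// Plays the role of CBZip2OutputStream: drops its reference on finish().
class InnerCompressor {
    private OutputStream out;
    InnerCompressor(OutputStream out) { this.out = out; }

    void finish() throws IOException {
        // ... write trailer bytes, then nullify the inner copy ...
        out = null;        // inner reference is gone ...
    }
}

// Plays the role of CompressionOutputStream: holds its own reference.
class OuterWrapper {
    private final OutputStream out;
    private final InnerCompressor compressor;

    OuterWrapper(OutputStream out) {
        this.out = out;
        this.compressor = new InnerCompressor(out);
    }

    void close() throws IOException {
        compressor.finish();
        out.close();       // ... but the outer copy still closes the stream
    }
}
```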

My reasoning was as follows: the only difference between 
BZip2CompressionOutputStream.finish() and close() is that finish() calls 
output.finish(), whereas close() calls output.flush() and output.close(). 
Changing BZip2CompressionOutputStream.close() to super.close() means we invoke 
finish() only, instead of flush() and close(). Looking at CBZip2OutputStream 
(the only possible class of the {{output}} data member in the current 
implementation), it seems to me that invoking finish() instead of flush() + 
close() is okay, because the only additional work done by close() is calling 
out.flush() + out.close(). As I said above, out.close() will be called anyway 
by CompressionOutputStream.close(), and I'm assuming that any reasonable 
stream flushes itself internally on close().

*BZip2CompressionInputStream*:
In BZip2CompressionInputStream, patch 002 puts the call to super.close() in a 
finally block. This preserves the previous logic (set needsReset to true only 
if input.close() didn't throw) while ensuring that super.close() will 
unconditionally close the {{in}} stream and return the trackedDecompressor to 
the pool.
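The shape of that change is the familiar try/finally pattern. A self-contained sketch (not the literal diff; Base/Wrapper are made-up stand-ins for the superclass and BZip2CompressionInputStream):

```java
import java.io.IOException;

// Stand-in for the superclass whose close() must always run
// (closing `in`, returning the decompressor to the pool).
class Base {
    boolean cleanedUp = false;
    void close() throws IOException { cleanedUp = true; }
}

// Stand-in for BZip2CompressionInputStream: needsReset is set only if
// the inner close succeeds, but superclass cleanup runs unconditionally.
class Wrapper extends Base {
    boolean needsReset = false;
    private final boolean innerThrows;

    Wrapper(boolean innerThrows) { this.innerThrows = innerThrows; }

    private void closeInner() throws IOException {
        if (innerThrows) {
            throw new IOException("inner close failed");
        }
    }

    @Override
    void close() throws IOException {
        try {
            closeInner();
            needsReset = true;   // only reached if closeInner() succeeded
        } finally {
            super.close();       // always runs, even on exception
        }
    }
}
```

Note that if closeInner() throws, its exception still propagates to the caller after the finally block runs.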

*CompressorStream/CompressionInput/OutputStream*:
bq. This is subtly different than the previous code because finish() can throw. 
In the old code, finish() could throw and out.close() would still be called but 
now we'll skip calling out.close() but still set closed=true so we can't retry 
the close. (...) Similarly the CompressionInputStream/CompressionOutputStream 
code won't return the codec to the pool if finish() throws or the underlying 
stream's close() throws.
In patch 002 I wrapped each of the internal steps of 
CompressionInput/OutputStream.close() in try/finally. (For 
CompressionOutputStream.close() this leaves the corner case where both 
finish() _and_ out.close() throw, in which case only one of the two exceptions 
is propagated; I think that's reasonable, since it's a doomed stream anyway.) 
This brings the behavior of the patched CompressorStream.close() back to what 
it was before my changes: if finish() throws, out.close() is still called.
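A sketch of that structure (illustrative names, not the actual patched class): finish() gets its own try/finally, so a failure there no longer skips closing the underlying stream, and only the later exception would propagate if both steps failed.

```java
import java.io.IOException;

// Stand-in for the patched CompressionOutputStream.close() shape:
// the underlying stream is closed regardless of whether finish() throws.
class Closer {
    boolean outClosed = false;
    private final boolean finishThrows;

    Closer(boolean finishThrows) { this.finishThrows = finishThrows; }

    private void finish() throws IOException {
        if (finishThrows) {
            throw new IOException("finish failed");
        }
    }

    private void closeOut() {
        outClosed = true;        // stand-in for out.close()
    }

    void close() throws IOException {
        try {
            finish();            // may throw ...
        } finally {
            closeOut();          // ... but the stream is closed regardless
        }
    }
}
```

If closeOut() could also throw, the exception raised in the finally block would replace the one from finish(), which is the "only one of them is propagated" corner case described above.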

> Memory leak when reading a compressed file using the native library
> -------------------------------------------------------------------
>
>                 Key: HADOOP-14376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14376
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common, io
>    Affects Versions: 2.7.0
>            Reporter: Eli Acherkan
>            Assignee: Eli Acherkan
>         Attachments: Bzip2MemoryTester.java, HADOOP-14376.001.patch, 
> HADOOP-14376.002.patch, log4j.properties
>
>
> Opening and closing a large number of bzip2-compressed input streams causes 
> the process to be killed on OutOfMemory when using the native bzip2 library.
> Our initial analysis suggests that this can be caused by 
> {{DecompressorStream}} overriding the {{close()}} method, and therefore 
> skipping the line from its parent: 
> {{CodecPool.returnDecompressor(trackedDecompressor)}}. When the decompressor 
> object is a {{Bzip2Decompressor}}, its native {{end()}} method is never 
> called, and the allocated memory isn't freed.
> If this analysis is correct, the simplest way to fix this bug would be to 
> replace {{in.close()}} with {{super.close()}} in {{DecompressorStream}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
