[ https://issues.apache.org/jira/browse/HADOOP-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852715#action_12852715 ]

Xiao Kang commented on HADOOP-4196:
-----------------------------------

Thanks to Hong Tang for noticing the duplication with another JIRA, HADOOP-6662. 

Since the first performance enhancement suggestion is clear and easy to 
implement, maybe we can resolve it separately: close HADOOP-6662 and move its 
patch here, or discuss it further in HADOOP-6662.
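A minimal sketch of that first suggestion, assuming the LzoCompressor-style 
approach of staging raw input in an internal buffer so the native deflater runs 
once per full buffer rather than once per small write(); the class, fields, and 
buffer size below are illustrative, not the actual patch:

    // Illustrative sketch only -- not the real ZlibCompressor.
    public class BufferingZlibCompressorSketch {
      private final byte[] uncompressed = new byte[64 * 1024]; // staging buffer
      private int len = 0;

      // Report "needs input" while there is room, so CompressorStream keeps
      // appending instead of triggering a JNI call on every write().
      public boolean needsInput() {
        return len < uncompressed.length;
      }

      // Append as much as fits; a real implementation must also handle input
      // larger than the remaining space.
      public void setInput(byte[] b, int off, int n) {
        int toCopy = Math.min(n, uncompressed.length - len);
        System.arraycopy(b, off, uncompressed, len, toCopy);
        len += toCopy;
      }

      // Placeholder for the expensive native call (deflateBytesDirect in
      // ZlibCompressor); invoked only once the staging buffer has filled up.
      public int compress(byte[] out, int off, int n) {
        int produced = 0; // ... pass uncompressed[0..len) to the native deflater ...
        len = 0;
        return produced;
      }
    }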

> Possible performance enhancement in Hadoop compress module
> ----------------------------------------------------------
>
>                 Key: HADOOP-4196
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4196
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.18.0
>            Reporter: Hong Tang
>
> There are several performance issues in the implementation of the current 
> Hadoop compression module. Generally, the opportunities all come from the 
> fact that the granularity of I/O operations from the CompressionStream and 
> DecompressionStream is not controllable by users, who are thus forced to 
> attach a BufferedInputStream or BufferedOutputStream to both ends of the 
> CompressionStream and DecompressionStream:
> - ZlibCompressor: always returns false from needsInput() after setInput(), 
> which leads to a native deflateBytesDirect() call for almost every write() 
> operation on the CompressorStream. This becomes problematic when applications 
> call write() on the CompressorStream with small write sizes (e.g. one byte at 
> a time). It would be better to follow a code path similar to LzoCompressor's 
> and append to an internal uncompressed-data buffer.
> - CompressorStream: whenever the compressor produces some compressed data, it 
> directly issues write() calls to the downstream. This could be improved by 
> appending to the byte[] until it is full (or half full) before writing to the 
> downstream. Otherwise, applications have to use a BufferedOutputStream as the 
> downstream in case the output sizes from CompressorStream are too small, 
> which generally causes double buffering.
> - BlockCompressorStream: a similar issue to the one described above.
> - BlockDecompressorStream: getCompressedData() reads only one compressed 
> chunk at a time. It would be better to read a full buffer and then obtain 
> compressed chunks from that buffer (similar to what DecompressorStream does, 
> but admittedly a bit more complicated).
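A minimal sketch of the buffered-write idea from the CompressorStream and 
BlockCompressorStream items above, assuming a staging byte[] that is flushed to 
the downstream only when it is at least half full; class and field names are 
illustrative, not the actual Hadoop code:

    import java.io.IOException;
    import java.io.OutputStream;
    import org.apache.hadoop.io.compress.Compressor;

    class BufferedCompressorOutputSketch {
      private final OutputStream out;        // the downstream
      private final Compressor compressor;
      private final byte[] buffer;           // staging buffer for compressed bytes
      private int bufferLen = 0;

      BufferedCompressorOutputSketch(OutputStream out, Compressor compressor,
                                     int bufferSize) {
        this.out = out;
        this.compressor = compressor;
        this.buffer = new byte[bufferSize];
      }

      // Called after compressor.setInput(): drain the compressor into the staging
      // buffer and write downstream in large chunks, not once per compress() call.
      void drain() throws IOException {
        while (!compressor.needsInput()) {
          int n = compressor.compress(buffer, bufferLen, buffer.length - bufferLen);
          if (n == 0) {
            break; // nothing produced yet
          }
          bufferLen += n;
          if (bufferLen >= buffer.length / 2) {
            out.write(buffer, 0, bufferLen);
            bufferLen = 0;
          }
        }
      }
    }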
> In general, the following could serve as guidelines for the design and 
> implementation of Compressor/Decompressor and 
> CompressorStream/DecompressorStream that would give users some performance 
> guarantees:
> - Compressor and Decompressor keep two DirectByteBuffers, whose sizes should 
> be tuned to be optimal for the specific compression/decompression algorithm. 
> Ensure that Compressor.compress() is always called with a full (or nearly 
> full) DirectBuffer of uncompressed data.
> - CompressorStream and DecompressorStream maintain a byte[] for exchanging 
> data with the downstream. The size of the byte[] should be user customizable 
> (add a bufferSize parameter to CompressionCodec's createInputStream and 
> createOutputStream interface). Ensure that I/O with the downstream happens at 
> or near the granularity of the size of the byte[], so applications can simply 
> rely on the buffering inside CompressorStream and DecompressorStream (in the 
> case of LZO: BlockCompressorStream and BlockDecompressorStream).
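A sketch of the bufferSize parameter suggested in the item above; these 
overloads do not exist in CompressionCodec today and are purely illustrative:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.io.compress.CompressionInputStream;
    import org.apache.hadoop.io.compress.CompressionOutputStream;

    // Hypothetical additions letting the caller pick the internal byte[] size,
    // so downstream I/O happens at that granularity and no extra
    // BufferedInputStream/BufferedOutputStream is needed.
    interface BufferedCompressionCodecSketch {
      CompressionInputStream createInputStream(InputStream in, int bufferSize)
          throws IOException;

      CompressionOutputStream createOutputStream(OutputStream out, int bufferSize)
          throws IOException;
    }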
> A more radical change would be to let the downstream InputStream directly 
> deposit data into a ByteBuffer, or the downstream OutputStream accept input 
> data from a ByteBuffer. We may call these ByteBufferInputStream and 
> ByteBufferOutputStream. CompressorStream and DecompressorStream may simply 
> test whether the downstream indeed implements such an interface and bypass 
> their own byte[] buffer if so.
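A sketch of what such an interface might look like on the output side; the name 
ByteBufferOutputStream follows the naming in the report and is not an existing 
Hadoop class:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.ByteBuffer;

    // Hypothetical downstream that accepts data straight from a (possibly direct)
    // ByteBuffer. A CompressorStream could test
    // 'out instanceof ByteBufferOutputStream' and, if true, hand over its
    // DirectByteBuffer instead of copying into its own byte[].
    abstract class ByteBufferOutputStream extends OutputStream {
      public abstract void write(ByteBuffer src) throws IOException;
    }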

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
