[ https://issues.apache.org/jira/browse/HADOOP-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer resolved HADOOP-4196.
--------------------------------------

    Resolution: Incomplete

I believe the compression code has changed quite a bit since this was filed.  
Closing as stale.

> Possible performance enhancement in Hadoop compress module
> ----------------------------------------------------------
>
>                 Key: HADOOP-4196
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4196
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.18.0
>            Reporter: Hong Tang
>
> There are several performance issues in the implementation of the current
> Hadoop compression module. Generally, the opportunities all stem from the fact
> that the granularity of I/O operations performed by the CompressionStream and
> DecompressionStream is not controllable by users, and thus users are forced to
> attach a BufferedInputStream or BufferedOutputStream to both ends of the
> CompressionStream and DecompressionStream (see the sketch after the list
> below):
> - ZlibCompressor: always returns false from needInput() after setInput(), which
> leads to a native deflateBytesDirect() call for almost every write() operation
> on the CompressorStream. This becomes problematic when applications call
> write() on the CompressorStream with small write sizes (e.g. one byte at a
> time). It would be better to follow a code path similar to LzoCompressor's and
> append to an internal uncompressed-data buffer.
> - CompressorStream: whenever the compressor produces some compressed data, it
> directly issues write() calls to the down stream. This could be improved by
> appending to the byte[] until it is full (or half full) before writing to the
> down stream. Otherwise, applications have to use a BufferedOutputStream as the
> down stream in case the outputs from CompressorStream are too small, which
> generally causes double buffering.
> - BlockCompressorStream: similar issue as described above.
> - BlockDecompressorStream: getCompressedData() reads only one compressed chunk
> at a time. It would be better to read a full buffer and then obtain the
> compressed chunks from that buffer (similar to what DecompressorStream does,
> though admittedly a bit more complicated).
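> For illustration, a minimal sketch of the double buffering mentioned above; the
> codec classes are the existing ones in org.apache.hadoop.io.compress, while the
> ByteArrayOutputStream sink, the 64K buffer sizes, and the one-byte write loop
> are arbitrary and only illustrative:
>
>   import java.io.BufferedOutputStream;
>   import java.io.ByteArrayOutputStream;
>   import java.io.OutputStream;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.io.compress.CompressionCodec;
>   import org.apache.hadoop.io.compress.DefaultCodec;
>   import org.apache.hadoop.util.ReflectionUtils;
>
>   public class TinyWrites {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       CompressionCodec codec =
>           ReflectionUtils.newInstance(DefaultCodec.class, conf);
>       OutputStream rawSink = new ByteArrayOutputStream();
>       // Buffer below the CompressorStream: absorbs the small compressed
>       // writes it issues to the down stream.
>       OutputStream buffered = new BufferedOutputStream(rawSink, 64 * 1024);
>       // Buffer above the CompressorStream: absorbs the application's tiny
>       // writes, which would otherwise each reach ZlibCompressor and a native
>       // deflate call.
>       OutputStream out = new BufferedOutputStream(
>           codec.createOutputStream(buffered), 64 * 1024);
>       for (int i = 0; i < (1 << 20); i++) {
>         out.write(i & 0xff);  // one byte at a time
>       }
>       out.close();
>     }
>   }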
> In general, the following could serve as guidelines for the design and
> implementation of Compressor/Decompressor and
> CompressorStream/DecompressorStream that give users some performance
> guarantees:
> - Compressor and Decompressor keep two DirectByteBuffers, whose sizes should be
> tuned to be optimal for the specific compression/decompression algorithm.
> Ensure that Compressor.compress() is always called with a full (or near-full)
> uncompressed-data DirectBuffer.
> - CompressorStream and DecompressorStream maintain a byte[] for exchanging data
> with the down stream. The size of the byte[] should be user customizable (add a
> bufferSize parameter to CompressionCodec's createInputStream and
> createOutputStream interfaces; a possible shape is sketched after this list).
> Ensure that I/O to and from the down stream happens at or near the granularity
> of the byte[] size, so that applications can simply rely on the buffering
> inside CompressorStream and DecompressorStream (in the case of LZO:
> BlockCompressorStream and BlockDecompressorStream).
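> A hypothetical shape for that bufferSize extension (the overloads below do not
> exist on CompressionCodec today; the signatures are only illustrative):
>
>   import java.io.IOException;
>   import java.io.InputStream;
>   import java.io.OutputStream;
>
>   public interface CompressionCodec {
>     // ... existing factory methods ...
>
>     // Hypothetical overloads carrying an explicit buffer size, so the streams
>     // can size their internal byte[] and do down-stream I/O at that
>     // granularity.
>     CompressionOutputStream createOutputStream(OutputStream out, int bufferSize)
>         throws IOException;
>     CompressionInputStream createInputStream(InputStream in, int bufferSize)
>         throws IOException;
>   }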
> A more radical change would be to let the downstream InputStream directly
> deposit data into a ByteBuffer, or the downstream OutputStream directly accept
> input data from a ByteBuffer. We may call these ByteBufferInputStream and
> ByteBufferOutputStream. CompressorStream and DecompressorStream may simply test
> whether the down stream indeed implements such interfaces and, if so, bypass
> their own byte[] buffer.
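> A rough sketch of what such interfaces might look like (the names follow the
> ones suggested above; nothing here exists in Hadoop):
>
>   import java.io.IOException;
>   import java.nio.ByteBuffer;
>
>   interface ByteBufferOutputStream {
>     // Accept all remaining bytes of buf, avoiding the intermediate byte[]
>     // copy through the stream's own buffer.
>     void write(ByteBuffer buf) throws IOException;
>   }
>
>   interface ByteBufferInputStream {
>     // Deposit data directly into buf; returns the number of bytes read, or -1
>     // at end of stream.
>     int read(ByteBuffer buf) throws IOException;
>   }
>
> CompressorStream and DecompressorStream would then test, for example,
> "downStream instanceof ByteBufferOutputStream", and hand the compressor's
> DirectByteBuffer to the down stream directly.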



--
This message was sent by Atlassian JIRA
(v6.2#6252)
