[ https://issues.apache.org/jira/browse/HADOOP-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer resolved HADOOP-4196.
--------------------------------------
    Resolution: Incomplete

I believe the compression code has changed quite a bit since this was filed. Closing as stale.

> Possible performance enhancement in Hadoop compress module
> ----------------------------------------------------------
>
>                 Key: HADOOP-4196
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4196
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.18.0
>            Reporter: Hong Tang
>
> There are several performance issues in the implementation of the current Hadoop compression module. Generally, the opportunities all stem from the fact that the granularity of I/O operations performed by the CompressionStream and DecompressionStream is not controllable by users, who are therefore forced to attach a BufferedInputStream or BufferedOutputStream to both ends of the CompressionStream and DecompressionStream:
> - ZlibCompressor: always returns false from needInput() after setInput(), which leads to a native deflateBytesDirect() call for almost every write() operation on the CompressorStream. This becomes problematic when applications call write() on the CompressorStream with small write sizes (e.g. one byte at a time). It would be better to follow a code path similar to LzoCompressor's and append to an internal uncompressed-data buffer.
> - CompressorStream: whenever the compressor produces some compressed data, it directly issues write() calls to the downstream. This could be improved by appending to the byte[] until it is full (or half full) before writing downstream. Otherwise, applications have to use a BufferedOutputStream as the downstream in case the output sizes from CompressorStream are too small, which generally causes double buffering.
> - BlockCompressorStream: the same issue as described above.
> - BlockDecompressorStream: getCompressedData() reads only one compressed chunk at a time. It could be better to read a full buffer and then extract compressed chunks from it (similar to what DecompressorStream does, though admittedly a bit more complicated).
> In general, the following could serve as guidelines for the design/implementation of Compressor/Decompressor and CompressorStream/DecompressorStream that give users some performance guarantees:
> - Compressor and Decompressor keep two DirectByteBuffers, whose sizes should be tuned to be optimal for the specific compression/decompression algorithm. Ensure that Compressor.compress() is always called with a full (or nearly full) uncompressed-data DirectBuffer.
> - CompressorStream and DecompressorStream maintain a byte[] used for I/O against the downstream. The size of the byte[] should be user-customizable (add a bufferSize parameter to CompressionCodec's createInputStream and createOutputStream interfaces). Ensure that I/O against the downstream happens at or near the granularity of the byte[] size, so that applications can simply rely on the buffering inside CompressorStream and DecompressorStream (in the LZO case: BlockCompressorStream and BlockDecompressorStream).
> A more radical change would be to let the downstream InputStream directly deposit data into a ByteBuffer, or the downstream OutputStream accept input data from a ByteBuffer. We may call these ByteBufferInputStream and ByteBufferOutputStream. CompressorStream and DecompressorStream could then simply test whether the downstream implements such an interface and, if so, bypass their own byte[] buffer.
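As a rough sketch of the output-buffering idea from the CompressorStream item above: accumulate compressed bytes in a byte[] and only write downstream once the buffer fills, so applications no longer need to wrap the sink in a BufferedOutputStream. The class and field names below are made up for illustration and are not the actual Hadoop classes.

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Illustrative only: a sink that batches small compressed writes into
    // large downstream writes, the way CompressorStream itself could.
    class BufferingCompressedSink extends FilterOutputStream {
        private final byte[] buffer;   // e.g. sized by a user-supplied bufferSize
        private int count = 0;

        BufferingCompressedSink(OutputStream out, int bufferSize) {
            super(out);
            this.buffer = new byte[bufferSize];
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            if (count + len > buffer.length) {
                flushBuffer();                    // drain what has accumulated so far
            }
            if (len >= buffer.length) {
                out.write(b, off, len);           // larger than the buffer: write through
            } else {
                System.arraycopy(b, off, buffer, count, len);
                count += len;
            }
        }

        @Override
        public void write(int b) throws IOException {
            if (count == buffer.length) {
                flushBuffer();
            }
            buffer[count++] = (byte) b;
        }

        private void flushBuffer() throws IOException {
            if (count > 0) {
                out.write(buffer, 0, count);      // one large write downstream
                count = 0;
            }
        }

        @Override
        public void flush() throws IOException {
            flushBuffer();
            out.flush();
        }
    }

The same pattern would apply to BlockCompressorStream and, mirrored for reads, to BlockDecompressorStream's getCompressedData().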
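The bufferSize guideline above could surface in the codec factory roughly as follows; the interface name and signatures are hypothetical, not the existing CompressionCodec API.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Hypothetical overloads illustrating the proposed bufferSize parameter.
    interface BufferSizedCompressionCodec {
        OutputStream createOutputStream(OutputStream out, int bufferSize) throws IOException;
        InputStream createInputStream(InputStream in, int bufferSize) throws IOException;
    }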
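Finally, a minimal sketch of the "more radical" ByteBuffer-aware stream idea. The interface names come from the description above, but the method shapes are assumptions rather than an existing API.

    import java.io.IOException;
    import java.nio.ByteBuffer;

    // Hypothetical: a downstream that can accept compressed output straight
    // from a ByteBuffer, letting CompressorStream skip its byte[] copy.
    interface ByteBufferOutputStream {
        void write(ByteBuffer src) throws IOException;   // consume all remaining bytes of src
    }

    // Hypothetical: a downstream that can deposit data directly into a
    // ByteBuffer, letting DecompressorStream skip its byte[] copy.
    interface ByteBufferInputStream {
        int read(ByteBuffer dst) throws IOException;     // return bytes read, or -1 at EOF
    }

CompressorStream and DecompressorStream would then test, e.g., out instanceof ByteBufferOutputStream, and hand over their DirectByteBuffers when the check succeeds.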
--
This message was sent by Atlassian JIRA
(v6.2#6252)