[ 
https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259836#comment-13259836
 ] 

Tim Broberg commented on HADOOP-8148:
-------------------------------------

More dust:
1 - Block-based, non-scatter-gather libraries (basically all software codecs 
except gzip) won't readily support the scatter-gather List<ByteBuffer> 
interface. I think we should drop it and just pass plain ByteBuffers.
2 - Direct buffers have a reputation for being costly to create. As I 
understand it, the codec pool class exists so that compressors with direct 
buffers can be reused without allocating a new direct buffer every time a 
record is read. The proposed interface does not address ownership or recycling 
of these buffers. We could add buffer-management calls to each interface that 
passes buffers around, or the buffers themselves could carry a call that 
returns them to a pool for reuse. Managing the number of elements in the pool 
and the size of the buffers is a nontrivial task.
3 - If we do address buffer recycling, the codec pool approach would appear to 
be obsolete. Note that, outside of compression streams, the codec pool is the 
only remaining customer that cares about the compressor interface - an extreme 
statement, but witness that bzip doesn't implement a compressor interface at 
all except for dummy stubs presented to the codec pool.
4 - The interfaces of the existing compressor / decompressor classes carry a 
lot of baggage from the gzip interface, which decouples the input from the 
output for a streaming compressor class. setInput, needsInput, finished, 
finish, reset, and reinit all manage state between the input and the output, 
where a single compress(ByteBuffer src, ByteBuffer dst) call could replace the 
existing call and all the rest. (Full disclosure: I want all those other calls 
dead personally, because all that state makes asynchronous compression a 
nightmare.)

So, I'm highly tempted to sweep away the compressor interface and replace it 
with a much simpler one -
 - compress(src, dst) to process data
 - finish() to allow cleaning up open streams
 - getBytesRead(), getBytesWritten() for statistics
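As a rough sketch, the simplified interface above might look like the 
following. The names (ByteBufferCompressor, PassThroughCompressor) are 
illustrative assumptions, not the API in the attached patch; the pass-through 
"codec" is just there to show how a caller would drive the interface - a real 
implementation would call into native zlib/snappy/etc.

```java
import java.nio.ByteBuffer;

interface ByteBufferCompressor {
    /** Consume from src and produce into dst, advancing both buffers'
     *  positions; replaces setInput/needsInput/finished/reset/reinit. */
    void compress(ByteBuffer src, ByteBuffer dst);

    /** Clean up any open streams / native resources. */
    void finish();

    /** Statistics. */
    long getBytesRead();
    long getBytesWritten();
}

/** Trivial identity "codec" used only to illustrate the calling pattern. */
class PassThroughCompressor implements ByteBufferCompressor {
    private long bytesRead, bytesWritten;

    @Override
    public void compress(ByteBuffer src, ByteBuffer dst) {
        // Copy as much as fits in dst; a real codec compresses here instead.
        int n = Math.min(src.remaining(), dst.remaining());
        for (int i = 0; i < n; i++) {
            dst.put(src.get());
        }
        bytesRead += n;
        bytesWritten += n;
    }

    @Override public void finish() { /* nothing to clean up */ }
    @Override public long getBytesRead() { return bytesRead; }
    @Override public long getBytesWritten() { return bytesWritten; }
}
```

Note that with this shape, all inter-call state lives in the buffers' own 
positions rather than in the codec, which is what makes asynchronous use 
tractable.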

Replace the codec pool with a pool of buffers extending ByteBuffer which have a 
callback method to recycle them.
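A sketch of that buffer pool idea follows. One caveat: ByteBuffer cannot 
actually be subclassed outside java.nio, so this sketch wraps a direct buffer 
rather than extending it; the names (PooledBuffer, DirectBufferPool) are 
hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Wrapper around a direct ByteBuffer with a release() callback that
 *  recycles it into its pool, replacing per-record allocation. */
final class PooledBuffer {
    private final ByteBuffer buf;
    private final DirectBufferPool pool;

    PooledBuffer(ByteBuffer buf, DirectBufferPool pool) {
        this.buf = buf;
        this.pool = pool;
    }

    ByteBuffer buffer() { return buf; }

    /** The recycling callback: clear and return this buffer for reuse. */
    void release() {
        buf.clear();
        pool.recycle(this);
    }
}

final class DirectBufferPool {
    private final ConcurrentLinkedQueue<PooledBuffer> free =
            new ConcurrentLinkedQueue<>();
    private final int bufferSize;

    DirectBufferPool(int bufferSize) { this.bufferSize = bufferSize; }

    /** Reuse a pooled buffer if one is free; otherwise pay the direct
     *  allocation cost once. */
    PooledBuffer acquire() {
        PooledBuffer b = free.poll();
        return (b != null)
                ? b
                : new PooledBuffer(ByteBuffer.allocateDirect(bufferSize), this);
    }

    void recycle(PooledBuffer b) { free.offer(b); }

    int freeCount() { return free.size(); }
}
```

This sketch deliberately ignores the hard parts called out in point 2 - 
bounding the pool size and handling buffers of varying sizes.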

Too radical? What would be a better way to solve the problems? Any problems 
this doesn't solve?
                
> Zero-copy ByteBuffer-based compressor / decompressor API
> --------------------------------------------------------
>
>                 Key: HADOOP-8148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8148
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Assignee: Tim Broberg
>         Attachments: hadoop8148.patch
>
>
> Per Todd Lipcon's comment in HDFS-2834, "
>   Whenever a native decompression codec is being used, ... we generally have 
> the following copies:
>   1) Socket -> DirectByteBuffer (in SocketChannel implementation)
>   2) DirectByteBuffer -> byte[] (in SocketInputStream)
>   3) byte[] -> Native buffer (set up for decompression)
>   4*) decompression to a different native buffer (not really a copy - 
> decompression necessarily rewrites)
>   5) native buffer -> byte[]
>   with the proposed improvement we can hopefully eliminate #2,#3 for all 
> applications, and #2,#3,and #5 for libhdfs.
> "
> The interfaces in the attached patch attempt to address:
>  A - Compression and decompression based on ByteBuffers (HDFS-2834)
>  B - Zero-copy compression and decompression (HDFS-3051)
>  C - Provide the caller a way to know the maximum space required to hold 
> the compressed output.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
