[ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405492#comment-13405492 ]

Tim Broberg commented on HADOOP-8148:
-------------------------------------

Ok, Owen. Sorry not to get back to you sooner. Last week was crazy.

A non-empty tarball with my proposal is attached.

Comments on yours are below, keeping in mind that I'm somewhat of a Java 
neophyte. If I say something stupid, please try to be gentle:
 1 - By declaring abstract stream classes in the codec, you are forcing 
implementations to extend those classes, which means they can't reuse any of 
the existing compression stream code. What benefits do you see coming from 
this that warrant the tradeoff, and/or what am I missing?
 2 - What are the benefits of creating a separate codec / factory class vs. 
just returning a compression stream and checking whether it implements 
ZeroCopy<whatever>? (The first sketch after this list shows the kind of check 
I mean.)
 3 - The read(ByteBuffer) definition just fundamentally doesn't work for my 
particular use case, where there is significant latency on reads. If the 
caller provides the ByteBuffer, then read() either has to wait out the 
latency of the entire operation, or it has to copy from the stream's own pool 
of completed buffers, which defeats the whole point of the exercise. This is 
why I felt I needed "ByteBuffer readBuffer()" in my interface. (The second 
sketch after this list contrasts the two shapes.)
 4 - Am I seeing compress and decompress functions at the codec level? I 
take it you're trying to hide the complex setInput / compress interface of 
the existing compressor? That is, you want the compressors to still be 
available in non-stream form, but with a cleaner interface? (The third sketch 
below shows the loop I mean.)
 5 - What about the codec pool? Do you see that disappearing?
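
To make (2) concrete, here is a minimal sketch of the check I have in mind. 
ZeroCopyReadable and readBuffer() are names from my proposal, so treat them 
as illustrative rather than settled API:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;

    // Hypothetical marker interface from my proposal, not settled API.
    interface ZeroCopyReadable {
        // Hand back a buffer the stream has already filled; the caller
        // takes ownership and can recycle it when done.
        ByteBuffer readBuffer() throws IOException;
    }

    class Consumer {
        void consume(InputStream in) throws IOException {
            if (in instanceof ZeroCopyReadable) {
                // Zero-copy path: take a completed buffer directly.
                ByteBuffer buf = ((ZeroCopyReadable) in).readBuffer();
                // ... process buf ...
            } else {
                // Fallback: ordinary copying read into a caller-owned array.
                byte[] b = new byte[64 * 1024];
                int n = in.read(b, 0, b.length);
                // ... process b[0..n) ...
            }
        }
    }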
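
And here is why (3) matters to me. ReadAheadStream and its "completed" queue 
are hypothetical, but they capture the shape of any stream that decompresses 
ahead of the caller:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical skeleton of a read-ahead decompressor; "completed" holds
    // buffers the background decompression threads have already finished.
    abstract class ReadAheadStream {
        protected BlockingQueue<ByteBuffer> completed;

        // Caller-supplied destination: by the time the caller asks, the
        // data is already sitting in one of my buffers, so I am forced to
        // copy it into dst. (Pushing back a partially drained source
        // buffer is omitted for brevity.)
        public int read(ByteBuffer dst) throws IOException {
            try {
                ByteBuffer src = completed.take();
                int n = Math.min(src.remaining(), dst.remaining());
                int oldLimit = src.limit();
                src.limit(src.position() + n);
                dst.put(src);              // the copy I want to avoid
                src.limit(oldLimit);
                return n;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted", e);
            }
        }

        // Stream-supplied buffer: hand over a completed buffer directly;
        // no copy, and the read-ahead hides the latency.
        public ByteBuffer readBuffer() throws IOException {
            try {
                return completed.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted", e);
            }
        }
    }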
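
For reference on (4), this is the loop the existing block-level interface 
makes everyone drive; compressAll is just my illustrative wrapper, and it 
assumes out is large enough, which is exactly what point C in the 
description is about:

    import java.io.IOException;
    import org.apache.hadoop.io.compress.Compressor;

    class OneShot {
        // One-shot compression via the existing setInput / compress API.
        static int compressAll(Compressor c, byte[] in, byte[] out)
                throws IOException {
            c.setInput(in, 0, in.length);
            c.finish();
            int total = 0;
            while (!c.finished()) {
                total += c.compress(out, total, out.length - total);
            }
            return total;
        }
    }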

You mentioned channels. One alternative approach would be to use the new 
Java 7 asynchronous channels, but making compression require Java 7 would 
have some pretty broad implications that I don't expect we're ready to deal 
with. Also, that approach makes the caller deal with the asynchronicity 
instead of having the stream just read ahead and return buffers on command, 
as the sketch below shows.
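
For comparison, a minimal sketch of what the Java 7 route looks like with 
java.nio.channels.AsynchronousFileChannel; note that the Future lands in the 
caller's lap:

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.Future;

    public class Java7Alternative {
        public static void main(String[] args) throws Exception {
            AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                    Paths.get(args[0]), StandardOpenOption.READ);
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
            Future<Integer> pending = ch.read(buf, 0); // returns immediately
            // ... the caller now has to carry the Future around and decide
            // when to block on it ...
            int n = pending.get();                     // latency paid here
            ch.close();
        }
    }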

Ideally, it would be great to see an interface that doesn't require the 
caller to deal with pooling the buffers below its level in the stream, but I 
haven't found a way to do this that doesn't require copies for streams that 
implement any kind of read-ahead.

I'm sure that, at present, I'm firmly in the minority in caring about 
multithreaded decompressors. I've tried to keep the overhead innocuous enough 
that it's worth the trouble. Any ideas you have on addressing the problem 
more efficiently would be most welcome, but read(ByteBuffer) doesn't do it.
                
> Zero-copy ByteBuffer-based compressor / decompressor API
> --------------------------------------------------------
>
>                 Key: HADOOP-8148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8148
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io, performance
>            Reporter: Tim Broberg
>            Assignee: Tim Broberg
>         Attachments: hadoop-8148.patch, hadoop8148.patch, zerocopyifc.tgz
>
>
> Per Todd Lipcon's comment in HDFS-2834, "
>   Whenever a native decompression codec is being used, ... we generally have 
> the following copies:
>   1) Socket -> DirectByteBuffer (in SocketChannel implementation)
>   2) DirectByteBuffer -> byte[] (in SocketInputStream)
>   3) byte[] -> Native buffer (set up for decompression)
>   4*) decompression to a different native buffer (not really a copy - 
> decompression necessarily rewrites)
>   5) native buffer -> byte[]
>   with the proposed improvement we can hopefully eliminate #2 and #3 for 
> all applications, and #2, #3, and #5 for libhdfs.
> "
> The interfaces in the attached patch attempt to address:
>  A - Compression and decompression based on ByteBuffers (HDFS-2834)
>  B - Zero-copy compression and decompression (HDFS-3051)
>  C - Provide the caller a way to know the max space required to hold the 
> compressed output.
