[jira] [Commented] (HDFS-4953) enable HDFS local reads via mmap

Owen O'Malley (JIRA) Tue, 10 Sep 2013 14:18:33 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763526#comment-13763526
 ]


Owen O'Malley commented on HDFS-4953:
-------------------------------------

{quote}
This seems like a fairly arbitrary requirement
{quote}

Actually, it is a standard best practice. Building exception objects is 
*expensive* and handling exceptions is error-prone.

{quote}
Unfortunately, zero copy being available is not a binary thing. It might be 
available sometimes, but not other times.
{quote}

I clearly stated in the previous comment it was about whether the zero copy was 
available AT THE CURRENT POINT IN THE STREAM.

{quote}
This is the point of the fallback buffer-- it will be used when a block 
boundary, non-local block, etc. prevents the request being fulfilled with an 
mmap. We've made this as efficient as it can be made.
{quote}

No, you haven't made it efficient at all. If the code is going to do a buffer 
copy and use the fallback buffer if it crosses a block, the performance will be 
dreadful. With my interface, you get no extra allocation, and no buffer copies.

The two reasonable strategies are:
* short read to cut it to a record boundary
* return multiple buffers

You can't make it easier than that, because the whole point of this API is to 
avoid buffer copies. You can't violate the goal of the interface because you 
don't like those choices.

{quote}
Well, what are the alternatives? We can't subclass java.nio.ByteBuffer, since 
all of its constructors are package-private.
{quote}

Agreed that you can't extend ByteBuffer. But other than close, you don't really 
need anything.

{quote}
There has to be a close method on whatever we return
{quote}

That is NOT a requirement. Especially when it conflicts with fundamental 
performance. As I wrote, adding a return method on the stream is fine.

{quote}
I'm sort of getting the impression that you plan on having super-huge buffers, 
which scares me.
{quote}

For ORC, I need to read typically 200 MB at a time. Obviously, if I don't need 
the whole range, I won't get it, but getting it in large sets of bytes is much 
much more efficient than lots of little reads.

{quote}
The factory API also gives me the impression that you plan on allocating a new 
buffer for each read, which would also be problematic.
{quote}

No, that is not the case. The point of the API is to avoid allocating the 
buffers if they aren't needed. The current API requires a buffer whether it is 
needed or not. Obviously the application will need a cache of the buffers to 
reuse, but the factory lets them write efficient code.

{quote}
If we have a "Factory" object, it needs to have not only a "get" method, but 
also a "put" method, where ByteBuffers are placed back when we're done with 
them. At that point, it becomes more like a cache. This might be a reasonable 
API, but I wonder if the additional complexity is worth it.
{quote}

Of course the implementations of the factory will have a release method. The 
question was just whether the FSDataInputStream needed to access the release 
method. If we add the releaseByteBuffer then we'd need the release to the 
factory.

Based on this, I'd propose:

{code}
/**
 * Is the current location of the stream available via zero copy?
 */
public boolean isZeroCopyAvailable();

/**
 * Read from the current location at least 1 and up to maxLength bytes. In most 
situations, the returned
 * buffer will contain maxLength bytes unless either:
 *  * the read crosses a block boundary and zero copy is being used
 *  * the stream has fewer than maxLength bytes left
 * The returned buffer will either be one that was created by the factory or a 
MappedByteBuffer.
 */
public ByteBuffer readByteBuffer(ByteBufferFactory factory, int maxLength) 
throws IOException;

/**
 * Release a buffer that was returned from readByteBuffer. If the method was 
created by the factory
 * it will be returned to the factory.
 */
public void releaseByteBuffer(ByteBufferFactory factory, ByteBuffer buffer);

/**
 * Allow application to manage how ByteBuffers are created for fallback 
buffers. Only buffers created by
 * the factory will be released to it.
 */
public interface ByteBufferFactory {
  ByteBuffer createBuffer(int capacity);
  void releaseBuffer(ByteBuffer buffer);
}
{code}

This will allow applications to:
* determine whether zero copy is available for the next read
* the user can use the same read interface for all filesystems and files, using 
zero copy if available
* no extra buffer copies
* no bytebuffers are allocated if they are not needed
* applications have to deal with short reads, but only get a single byte buffer
* allow applications to create buffer managers that reuse buffers
* allow applications to control whether direct or byte[] byte buffers are used

The example code would look like:
{code}
FSDataInputStream in = fs.open(path);
in.seek(100*1024*1024);
List<ByteBuffer> buffers = new ArrayList<ByteBuffer>();
long remaining = 100 * 1024 * 1024;
while (remaining > 0) {
  ByteBuffer buf = in.readByteBuffer(fallbackFactory, 100*1024*1024);
  remaining -= buf.remaining();
  buffers.add(buf);
}
...
for(ByteBuffer buf: buffers) {
  in.releaseByteBuffer(fallbackFactory, buf);
}
{code}
                
> enable HDFS local reads via mmap
> --------------------------------
>
>                 Key: HDFS-4953
>                 URL: https://issues.apache.org/jira/browse/HDFS-4953
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 2.3.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: HDFS-4949
>
>         Attachments: benchmark.png, HDFS-4953.001.patch, HDFS-4953.002.patch, 
> HDFS-4953.003.patch, HDFS-4953.004.patch, HDFS-4953.005.patch, 
> HDFS-4953.006.patch, HDFS-4953.007.patch, HDFS-4953.008.patch
>
>
> Currently, the short-circuit local read pathway allows HDFS clients to access 
> files directly without going through the DataNode.  However, all of these 
> reads involve a copy at the operating system level, since they rely on the 
> read() / pread() / etc family of kernel interfaces.
> We would like to enable HDFS to read local files via mmap.  This would enable 
> truly zero-copy reads.
> In the initial implementation, zero-copy reads will only be performed when 
> checksums were disabled.  Later, we can use the DataNode's cache awareness to 
> only perform zero-copy reads when we know that checksum has already been 
> verified.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-4953) enable HDFS local reads via mmap

Reply via email to