[ 
https://issues.apache.org/jira/browse/HDFS-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759740#comment-13759740
 ] 

Owen O'Malley commented on HDFS-4953:
-------------------------------------

{quote}
An unsophisticated user just sets the fallback buffer once for the cursor, and 
then calls away at the new API.
{quote}

But to the user, it isn't clear if or how to set the fallback buffer. What is 
the scope of the fallback buffer? How large does it need to be? Should it be 
direct or heap-backed? The user just wants to read their HDFS data via 
ByteBuffer (especially if it is available via zero copy). My concern is not 
abstract: I was thinking about how to use this while implementing a reader for 
the ORC file format, and I absolutely need to manage the byte buffers myself 
and have a single path for reading both cached and non-cached files. It is much 
better API design to have the filesystem create buffers as needed than to make 
the application preemptively create "fallback" byte buffers for the filesystem 
to use.

Let's say I need to read 100 MB that may cross a block boundary. To read it 
safely under the current API, I need to do:

{code}
FSDataInputStream in = fs.open(path);
in.seek(offset);
List<ByteBuffer> result = new ArrayList<ByteBuffer>();
try {
  ZeroCopyCursor cursor = in.createZeroCopyCursor();
  // don't fail if we cross block boundaries
  cursor.setAllowShortReads(true);
  long done = 0;
  while (done < len) {
    // can't reuse previous buffers since they are still referenced from result
    cursor.setFallbackBuffer(ByteBuffer.allocate(len - done));
    cursor.read(len - done);
    ByteBuffer buffer = cursor.getData();
    done += buffer.remaining();
    result.add(buffer);
  }
} catch (ZeroCopyUnavailableException zcu) {
  // start over with an ordinary copying read
  result.clear();
  in.seek(offset);
  ByteBuffer buffer = ByteBuffer.allocate(len);
  IOUtils.readFully(in, buffer.array(), buffer.arrayOffset(), len);
  result.add(buffer);
}
{code}

compared to my proposed:

{code}
ByteBuffer[] result = in.readByteBuffers(offset, len);
...
in.releaseByteBuffers(result);
{code}
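To make the proposed call shape concrete, here is a minimal self-contained sketch of how {{readByteBuffers}} / {{releaseByteBuffers}} would compose for a caller. Note these are proposed names, not an existing HDFS API; {{ProposedStream}} below is a toy in-memory stand-in (the real implementation would hand out mmapped, or on fallback heap, buffers), and the 100-byte block size is purely illustrative.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the proposed FSDataInputStream additions.
class ProposedStream {
    private final byte[] data;   // pretend file contents
    private final int blockSize; // pretend HDFS block size

    ProposedStream(byte[] data, int blockSize) {
        this.data = data;
        this.blockSize = blockSize;
    }

    // Proposed API: one call returns the whole range, split at block
    // boundaries into separate buffers.
    ByteBuffer[] readByteBuffers(long offset, int len) {
        List<ByteBuffer> out = new ArrayList<ByteBuffer>();
        long pos = offset;
        long end = offset + len;
        while (pos < end) {
            long blockEnd = (pos / blockSize + 1) * blockSize;
            int chunk = (int) (Math.min(end, blockEnd) - pos);
            out.add(ByteBuffer.wrap(data, (int) pos, chunk).slice());
            pos += chunk;
        }
        return out.toArray(new ByteBuffer[0]);
    }

    void releaseByteBuffers(ByteBuffer[] buffers) {
        // Real implementation would unmap or return buffers to a pool.
    }
}

public class Demo {
    public static void main(String[] args) {
        byte[] file = new byte[256];
        ProposedStream in = new ProposedStream(file, 100);

        // Read 150 bytes starting at offset 30: crosses one block
        // boundary, so we get two buffers (70 + 80 bytes).
        ByteBuffer[] result = in.readByteBuffers(30, 150);
        assert result.length == 2;
        assert result[0].remaining() == 70 && result[1].remaining() == 80;
        in.releaseByteBuffers(result);
    }
}
```

The caller never sees fallback buffers, short reads, or a zero-copy exception path; the read-then-release pairing is the entire contract.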

Am I missing something? This is a single read, and of course real clients will 
do many of these throughout their code.

Using exceptions for nominal conditions like ZeroCopyUnavailable is bad 
practice and very expensive, since constructing the exception fills in the 
stack trace.
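The cost is easy to demonstrate: a Java {{Throwable}} captures its entire stack trace at construction time (via {{fillInStackTrace}}), before it is ever thrown or caught. A small self-contained sketch:

```java
public class ExceptionCostSketch {
    public static void main(String[] args) {
        // The stack trace is captured when the exception is constructed,
        // not when it is thrown -- that work happens on every occurrence.
        Exception e = new Exception("zero-copy unavailable");
        assert e.getStackTrace().length > 0;

        // The protected four-argument constructor can disable stack trace
        // capture (writableStackTrace = false) -- which is exactly what a
        // control-flow exception would want, and a hint that exceptions
        // are the wrong tool for an expected condition in the first place.
        Exception cheap = new Exception("no trace", null, false, false) {};
        assert cheap.getStackTrace().length == 0;
    }
}
```

A boolean or return-value check costs nothing per call; an exception pays for a full stack walk every time zero copy is unavailable, which for non-cached files is the common case.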

Requiring the fallback buffer to be allocated in all cases means extra 
allocations are done even when zero copy succeeds and the buffer is never used.

{quote}
A sophisticated user might want to know if a read involves copying or not,
{quote}

At *scheduling* time they want to know how "local" the data is (cached, local, 
on-rack, off-rack), but at read time they just want to get the bytes. Making a 
distinction at read time just complicates the API. Furthermore, if the data 
crosses a block boundary, a single request may be satisfied partly by zero copy 
and partly by copying, so a per-read copy/no-copy distinction isn't even well 
defined.

{quote}
I'll also note that it isn't easy for apps to deal with multiple returned 
buffers
{quote}

You'll need to return multiple buffers if the read request crosses a block 
boundary: a mmapped byte buffer can only cover a single file, and each block is 
a separate file on the DataNode. It is much better to have the entire request 
fulfilled in multiple byte buffers than to force the caller to loop externally. 
Of course, in the vast majority of cases, where the request doesn't cross a 
block boundary, it will be a single buffer.
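Consuming a multi-buffer result is also straightforward for the caller. A minimal sketch (the {{concat}} helper is illustrative, not part of any proposed API): treat the array as one logical byte range and drain the buffers in order.

```java
import java.nio.ByteBuffer;

public class MultiBufferRead {
    // What a caller (e.g. an ORC reader) does with a multi-buffer result:
    // copy each buffer's remaining bytes into one contiguous array.
    static byte[] concat(ByteBuffer[] buffers) {
        int total = 0;
        for (ByteBuffer b : buffers) {
            total += b.remaining();
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (ByteBuffer b : buffers) {
            int n = b.remaining();
            // duplicate() so we don't disturb the original position
            b.duplicate().get(out, pos, n);
            pos += n;
        }
        return out;
    }

    public static void main(String[] args) {
        // Simulate a read that crossed one block boundary: two buffers.
        ByteBuffer[] result = {
            ByteBuffer.wrap(new byte[]{1, 2, 3}),
            ByteBuffer.wrap(new byte[]{4, 5})
        };
        byte[] all = concat(result);
        assert all.length == 5 && all[0] == 1 && all[4] == 5;
    }
}
```

In practice a columnar reader would usually parse directly out of each buffer rather than copy, but even the worst case is one simple loop at the call site instead of a read loop plus fallback handling.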


                
> enable HDFS local reads via mmap
> --------------------------------
>
>                 Key: HDFS-4953
>                 URL: https://issues.apache.org/jira/browse/HDFS-4953
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 2.3.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: HDFS-4949
>
>         Attachments: benchmark.png, HDFS-4953.001.patch, HDFS-4953.002.patch, 
> HDFS-4953.003.patch, HDFS-4953.004.patch, HDFS-4953.005.patch, 
> HDFS-4953.006.patch, HDFS-4953.007.patch, HDFS-4953.008.patch
>
>
> Currently, the short-circuit local read pathway allows HDFS clients to access 
> files directly without going through the DataNode.  However, all of these 
> reads involve a copy at the operating system level, since they rely on the 
> read() / pread() / etc family of kernel interfaces.
> We would like to enable HDFS to read local files via mmap.  This would enable 
> truly zero-copy reads.
> In the initial implementation, zero-copy reads will only be performed when 
> checksums were disabled.  Later, we can use the DataNode's cache awareness to 
> only perform zero-copy reads when we know that checksum has already been 
> verified.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira