[ https://issues.apache.org/jira/browse/HDFS-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759740#comment-13759740 ]
Owen O'Malley commented on HDFS-4953:
-------------------------------------

{quote}
An unsophisticated user just sets the fallback buffer once for the cursor, and then calls away at the new API.
{quote}

But to the user it isn't clear whether, or how, to set the fallback buffer. What is the scope of the fallback buffer? How large does it need to be? Should it be direct or a byte array? The user just wants to read their HDFS data via ByteBuffer (especially if it is available via zero-copy).

My concern is not abstract: I was thinking about how to use this to implement a reader for the ORC file format, and I absolutely need to manage the byte buffers myself and have a single path for reading both cached and non-cached files. It is much, much better API design to have the filesystem create buffers as needed, rather than making the application preemptively create "fallback" byte buffers for the filesystem to use.

Let's say I need to read 100 MB that may cross a block boundary. Under the current API, to read it safely I need to do:

{code}
FSDataInputStream in = fs.open(path);
in.seek(offset);
List<ByteBuffer> result = new ArrayList<ByteBuffer>();
try {
  ZeroCopyCursor cursor = in.createZeroCopyCursor();
  // don't fail if we cross block boundaries
  cursor.setAllowShortReads(true);
  long done = 0;
  while (done < len) {
    // can't reuse previous buffers since they are still referenced from result
    cursor.setFallbackBuffer(ByteBuffer.allocate(len));
    cursor.read(len - done);
    ByteBuffer buffer = cursor.getData();
    done += buffer.remaining();
    result.add(buffer);
  }
} catch (ZeroCopyUnavailableException zcu) {
  ByteBuffer buffer = ByteBuffer.allocate(len);
  IOUtils.readFully(in, buffer.array(), buffer.arrayOffset(), len);
  buffer.limit(len);
  result.add(buffer);
}
{code}

compared to my proposed:

{code}
ByteBuffer[] result = in.readByteBuffers(offset, len);
...
in.releaseByteBuffers(result);
{code}

Am I missing something?
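To make the multi-buffer handling concrete: the proposed call returns a ByteBuffer[], and an application that needs the bytes contiguous can simply concatenate the returned buffers. The sketch below is plain java.nio with no Hadoop dependency; `flatten` is a hypothetical helper name for illustration, not part of any proposed API.

```java
import java.nio.ByteBuffer;

public class MultiBufferRead {
    // Concatenate the remaining() regions of each buffer, in order, into one
    // byte[]. This mirrors how an application would consume the ByteBuffer[]
    // returned by a read that crossed a block boundary.
    static byte[] flatten(ByteBuffer[] buffers) {
        int total = 0;
        for (ByteBuffer b : buffers) {
            total += b.remaining();
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (ByteBuffer b : buffers) {
            int n = b.remaining();
            // duplicate() so the caller's position/limit are left untouched
            b.duplicate().get(out, pos, n);
            pos += n;
        }
        return out;
    }

    public static void main(String[] args) {
        // Simulate a read split across a block boundary: two buffers come back.
        ByteBuffer first = ByteBuffer.wrap(new byte[]{1, 2, 3});
        ByteBuffer second = ByteBuffer.wrap(new byte[]{4, 5});
        byte[] all = flatten(new ByteBuffer[]{first, second});
        System.out.println(all.length); // prints 5
    }
}
```

In the common case where the read stays within one block, the array has a single element and `flatten` degenerates to one copy (or the application can use the buffer directly and skip the copy entirely).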
This is a single read, and of course real clients are going to do many of these throughout their code. Using exceptions for nominal conditions like ZeroCopyUnavailable is bad practice and very expensive, since constructing the exception builds a stack trace. And by requiring allocation of the fallback buffer in all cases, extra allocations are forced even when they are never used.

{quote}
A sophisticated user might want to know if a read involves copying or not,
{quote}

At *scheduling* time they want to know how "local" the data is (cached, local, on-rack, off-rack), but at read time they just want to get the bytes. Making a distinction at read time just complicates the API. Furthermore, if the data crosses a block boundary, part of the read may be copied and part not, so a single answer doesn't even exist.

{quote}
I'll also note that it isn't easy for apps to deal with multiple returned buffers
{quote}

You'll need to return multiple buffers whenever a read request crosses a block boundary: a mmapped byte buffer can only cover a single block file. It is much better to have the entire request fulfilled in multiple byte buffers than to force the application to loop externally. And in the vast majority of cases, where the read doesn't cross a block boundary, it will be a single buffer.

> enable HDFS local reads via mmap
> --------------------------------
>
>                 Key: HDFS-4953
>                 URL: https://issues.apache.org/jira/browse/HDFS-4953
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 2.3.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: HDFS-4949
>
>         Attachments: benchmark.png, HDFS-4953.001.patch, HDFS-4953.002.patch,
> HDFS-4953.003.patch, HDFS-4953.004.patch, HDFS-4953.005.patch,
> HDFS-4953.006.patch, HDFS-4953.007.patch, HDFS-4953.008.patch
>
>
> Currently, the short-circuit local read pathway allows HDFS clients to access
> files directly without going through the DataNode. However, all of these
> reads involve a copy at the operating system level, since they rely on the
> read() / pread() / etc. family of kernel interfaces.
> We would like to enable HDFS to read local files via mmap.
> This would enable truly zero-copy reads.
> In the initial implementation, zero-copy reads will only be performed when
> checksums are disabled. Later, we can use the DataNode's cache awareness to
> only perform zero-copy reads when we know that the checksum has already been
> verified.