[ https://issues.apache.org/jira/browse/HDFS-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760761#comment-13760761 ]
Colin Patrick McCabe commented on HDFS-4953:
--------------------------------------------

Your proposed API doesn't address one of the big asks we had when designing ZCR, which is to provide a mechanism for notifying the user that he cannot get an mmap. As I mentioned earlier, for performance reasons, many users who might like to have access to a 128 MB mmap segment do not want to copy into a 128 MB backing buffer. Doing such a large copy would blow the L2 cache (and possibly the page cache), and rather than improving performance, might degrade it. Similarly, users don't want to get multiple byte buffers back -- the big advantage of mmap is getting a single buffer back (in the cases where that's possible).

What if the user wants to use a direct byte buffer as his fallback buffer? With the current code, that is easy -- I just call setFallbackBuffer(ByteBuffer.allocateDirect(...)). With your proposed API, there's no way to do this.

Creating a new ByteBuffer for each read is going to be slower than reusing the same ByteBuffer -- especially for direct ByteBuffers. Sure, we could have some kind of ByteBuffer cache inside the FSDataInputStream, but that's going to be very complicated. What if someone needs a ByteBuffer of size 100 but we only have ones of size 10 and 900 in the cache? Do we use the big one for the small read or leave it around? How long do we cache them? Do we prefer the direct ones? And so on. Really, the only design that makes sense is having the user pass in the fallback buffer. We do not want to be re-inventing malloc inside FSDataInputStream.

The design principles of the current API are:
* some users want a fallback path, and some don't. We have to satisfy both.
* we don't want to manage buffers inside FSDataInputStream. It's a messy and hard problem with no optimal solutions that fit all cases.
* nobody wants to receive more than one buffer in response to a read.
* most programmers don't correctly handle short reads, so there should be an option to disable them.

One thing that we could and should do is provide a generic fallback path that is independent of the filesystem.

> enable HDFS local reads via mmap
> --------------------------------
>
>                 Key: HDFS-4953
>                 URL: https://issues.apache.org/jira/browse/HDFS-4953
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 2.3.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: HDFS-4949
>
>         Attachments: benchmark.png, HDFS-4953.001.patch, HDFS-4953.002.patch,
> HDFS-4953.003.patch, HDFS-4953.004.patch, HDFS-4953.005.patch,
> HDFS-4953.006.patch, HDFS-4953.007.patch, HDFS-4953.008.patch
>
>
> Currently, the short-circuit local read pathway allows HDFS clients to access
> files directly without going through the DataNode. However, all of these
> reads involve a copy at the operating system level, since they rely on the
> read() / pread() / etc family of kernel interfaces.
> We would like to enable HDFS to read local files via mmap. This would enable
> truly zero-copy reads.
> In the initial implementation, zero-copy reads will only be performed when
> checksums are disabled. Later, we can use the DataNode's cache awareness to
> only perform zero-copy reads when we know that checksum has already been
> verified.
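
To make the design principles in the comment concrete, here is a minimal sketch of a reader whose fallback buffer is owned by the caller. It is not the code from the HDFS-4953 patches: the class ZeroCopyReaderSketch, its constructor flag, the read(int) signature, and the use of UnsupportedOperationException as the "no mmap available" notification are all assumptions made for illustration. Only the idea of calling setFallbackBuffer(ByteBuffer.allocateDirect(...)) comes from the comment. A byte array stands in for the block file, and a read-only view of it stands in for an mmap'd region.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a stream that can sometimes serve reads without copying. */
final class ZeroCopyReaderSketch {
    private final byte[] data;     // stand-in for the local block file
    private final boolean canMap;  // assumption: whether a mapping is available
    private ByteBuffer fallback;   // caller-owned; null means "no fallback path wanted"
    private int pos = 0;

    ZeroCopyReaderSketch(byte[] data, boolean canMap) {
        this.data = data;
        this.canMap = canMap;
    }

    /** The caller supplies (and may reuse) the buffer; the reader never allocates one. */
    void setFallbackBuffer(ByteBuffer buf) {
        this.fallback = buf;
    }

    /**
     * Returns exactly one buffer with up to maxLength bytes (empty at end of data).
     * Throws UnsupportedOperationException when no mapping is available and the
     * caller has not opted in to a fallback copy.
     */
    ByteBuffer read(int maxLength) {
        int n = Math.min(maxLength, data.length - pos);
        if (canMap) {
            // "Zero-copy" path: hand back a read-only view, no bytes are copied.
            ByteBuffer view = ByteBuffer.wrap(data, pos, n).slice().asReadOnlyBuffer();
            pos += n;
            return view;
        }
        if (fallback == null) {
            // Notify the caller instead of silently doing a large copy.
            throw new UnsupportedOperationException(
                "zero-copy read unavailable and no fallback buffer set");
        }
        // Fallback path: copy into the caller's buffer, possibly a short read.
        fallback.clear();
        n = Math.min(n, fallback.capacity());
        fallback.put(data, pos, n);
        fallback.flip();
        pos += n;
        return fallback;
    }
}
```

A caller that wants a fallback path sets one buffer once, for example with setFallbackBuffer(ByteBuffer.allocateDirect(4096)), and reuses it across reads. A caller that would rather not copy leaves it unset and handles the exception. Either way the reader returns a single buffer per read and does no buffer management of its own.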