[ https://issues.apache.org/jira/browse/HDFS-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763654#comment-13763654 ]
Colin Patrick McCabe commented on HDFS-4953: -------------------------------------------- bq. Actually, it is a standard best practice. Building exception objects is expensive and handling exceptions is error-prone. There's no bundling going on here. {{DFSClient}} throws the exception, not a remote daemon. Can you be more clear about what the alternative is to an exception here? It seems like a null return would have the same issues. (unless I am missing something?) bq. I clearly stated in the previous comment [that isZeroCopyAvailable] was about whether the zero copy was available AT THE CURRENT POINT IN THE STREAM. You can't know ahead of time whether your call to mmap will succeed. As I said, mmap can fail, for dozens of reasons. And of course blocks move over time. There is a fundamental "time of check, time of use" (TOCTOU) race condition in this kind of API. bq. [block boundary comments] Unless I'm missing something, I think we're both in agreement here-- we return a short read when near the block boundary, to avoid using the fallback buffer. bq. [factory comments] OK. I misinterpreted what you said earlier. I see now that you plan on reusing buffers. Since I am not knowledgeable about ORC, I have to ask you, Owen: is it necessary to read 200 MB at a time to decode this file format? As Todd wrote, our benchmarks show that large reads may not be the most efficient due to cache effects. bq. [new API] This proposal seems a lot better than the previous ones we've come up with. There is already a method named {{FSDataInputStream#read(ByteBuffer buf)}} in {{FSDataInputStream}}. If we create a new method named {{FSDataInputStream#readByteBuffer}}, I would expect there to be some confusion between the two. That's why I proposed {{FSDataInputStream#readZero}} for the new name. Does that make sense? {code} public void releaseByteBuffer(ByteBufferFactory factory, ByteBuffer buffer); {code} If, instead of returning a {{ByteBuffer}} from the {{readByteBuffer}} call, we returned a {{ZeroBuffer}} object wrapping the {{ByteBuffer}}, we could simply call {{ZeroBuffer#close()}}. The {{ZeroBuffer}} itself could retain a reference to its factory (if it came from a factory), and the {{DFSInputStream}} (if it needed to be munmapped). This seems a lot more object-oriented to me. To make the client pass in this information seems error-prone and confusing. I don't buy the argument that having a wrapper object imposes an "unacceptable performance hit." As I mentioned before, if you *don't* have a wrapper object, you have to do a hash table lookup in {{DFSInputStream#releaseByteBuffer}} to see if the buffer being released is one that we mmap'ed. A hash table lookup is surely slower than a single object dereference. I'd like to get some other prospective zero-copy API users to comment on whether they like the wrapper object or the {{DFSInputStream#releaseByteBuffer}} approach better... > enable HDFS local reads via mmap > -------------------------------- > > Key: HDFS-4953 > URL: https://issues.apache.org/jira/browse/HDFS-4953 > Project: Hadoop HDFS > Issue Type: New Feature > Affects Versions: 2.3.0 > Reporter: Colin Patrick McCabe > Assignee: Colin Patrick McCabe > Fix For: HDFS-4949 > > Attachments: benchmark.png, HDFS-4953.001.patch, HDFS-4953.002.patch, > HDFS-4953.003.patch, HDFS-4953.004.patch, HDFS-4953.005.patch, > HDFS-4953.006.patch, HDFS-4953.007.patch, HDFS-4953.008.patch > > > Currently, the short-circuit local read pathway allows HDFS clients to access > files directly without going through the DataNode. However, all of these > reads involve a copy at the operating system level, since they rely on the > read() / pread() / etc family of kernel interfaces. > We would like to enable HDFS to read local files via mmap. This would enable > truly zero-copy reads. > In the initial implementation, zero-copy reads will only be performed when > checksums were disabled. Later, we can use the DataNode's cache awareness to > only perform zero-copy reads when we know that checksum has already been > verified. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira