[ https://issues.apache.org/jira/browse/HDFS-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763526#comment-13763526 ]
Owen O'Malley commented on HDFS-4953: ------------------------------------- {quote} This seems like a fairly arbitrary requirement {quote} Actually, it is a standard best practice. Building exception objects is *expensive* and handling exceptions is error-prone. {quote} Unfortunately, zero copy being available is not a binary thing. It might be available sometimes, but not other times. {quote} I clearly stated in the previous comment it was about whether the zero copy was available AT THE CURRENT POINT IN THE STREAM. {quote} This is the point of the fallback buffer-- it will be used when a block boundary, non-local block, etc. prevents the request being fulfilled with an mmap. We've made this as efficient as it can be made. {quote} No, you haven't made it efficient at all. If the code is going to do a buffer copy and use the fallback buffer if it crosses a block, the performance will be dreadful. With my interface, you get no extra allocation, and no buffer copies. The two reasonable strategies are: * short read to cut it to a record boundary * return multiple buffers You can't make it easier than that, because the whole point of this API is to avoid buffer copies. You can't violate the goal of the interface because you don't like those choices. {quote} Well, what are the alternatives? We can't subclass java.nio.ByteBuffer, since all of its constructors are package-private. {quote} Agreed that you can't extend ByteBuffer. But other than close, you don't really need anything. {quote} There has to be a close method on whatever we return {quote} That is NOT a requirement. Especially when it conflicts with fundamental performance. As I wrote, adding a return method on the stream is fine. {quote} I'm sort of getting the impression that you plan on having super-huge buffers, which scares me. {quote} For ORC, I need to read typically 200 MB at a time. Obviously, if I don't need the whole range, I won't get it, but getting it in large sets of bytes is much much more efficient than lots of little reads. {quote} The factory API also gives me the impression that you plan on allocating a new buffer for each read, which would also be problematic. {quote} No, that is not the case. The point of the API is to avoid allocating the buffers if they aren't needed. The current API requires a buffer whether it is needed or not. Obviously the application will need a cache of the buffers to reuse, but the factory lets them write efficient code. {quote} If we have a "Factory" object, it needs to have not only a "get" method, but also a "put" method, where ByteBuffers are placed back when we're done with them. At that point, it becomes more like a cache. This might be a reasonable API, but I wonder if the additional complexity is worth it. {quote} Of course the implementations of the factory will have a release method. The question was just whether the FSDataInputStream needed to access the release method. If we add the releaseByteBuffer then we'd need the release to the factory. Based on this, I'd propose: {code} /** * Is the current location of the stream available via zero copy? */ public boolean isZeroCopyAvailable(); /** * Read from the current location at least 1 and up to maxLength bytes. In most situations, the returned * buffer will contain maxLength bytes unless either: * * the read crosses a block boundary and zero copy is being used * * the stream has fewer than maxLength bytes left * The returned buffer will either be one that was created by the factory or a MappedByteBuffer. */ public ByteBuffer readByteBuffer(ByteBufferFactory factory, int maxLength) throws IOException; /** * Release a buffer that was returned from readByteBuffer. If the method was created by the factory * it will be returned to the factory. */ public void releaseByteBuffer(ByteBufferFactory factory, ByteBuffer buffer); /** * Allow application to manage how ByteBuffers are created for fallback buffers. Only buffers created by * the factory will be released to it. */ public interface ByteBufferFactory { ByteBuffer createBuffer(int capacity); void releaseBuffer(ByteBuffer buffer); } {code} This will allow applications to: * determine whether zero copy is available for the next read * the user can use the same read interface for all filesystems and files, using zero copy if available * no extra buffer copies * no bytebuffers are allocated if they are not needed * applications have to deal with short reads, but only get a single byte buffer * allow applications to create buffer managers that reuse buffers * allow applications to control whether direct or byte[] byte buffers are used The example code would look like: {code} FSDataInputStream in = fs.open(path); in.seek(100*1024*1024); List<ByteBuffer> buffers = new ArrayList<ByteBuffer>(); long remaining = 100 * 1024 * 1024; while (remaining > 0) { ByteBuffer buf = in.readByteBuffer(fallbackFactory, 100*1024*1024); remaining -= buf.remaining(); buffers.add(buf); } ... for(ByteBuffer buf: buffers) { in.releaseByteBuffer(fallbackFactory, buf); } {code} > enable HDFS local reads via mmap > -------------------------------- > > Key: HDFS-4953 > URL: https://issues.apache.org/jira/browse/HDFS-4953 > Project: Hadoop HDFS > Issue Type: New Feature > Affects Versions: 2.3.0 > Reporter: Colin Patrick McCabe > Assignee: Colin Patrick McCabe > Fix For: HDFS-4949 > > Attachments: benchmark.png, HDFS-4953.001.patch, HDFS-4953.002.patch, > HDFS-4953.003.patch, HDFS-4953.004.patch, HDFS-4953.005.patch, > HDFS-4953.006.patch, HDFS-4953.007.patch, HDFS-4953.008.patch > > > Currently, the short-circuit local read pathway allows HDFS clients to access > files directly without going through the DataNode. However, all of these > reads involve a copy at the operating system level, since they rely on the > read() / pread() / etc family of kernel interfaces. > We would like to enable HDFS to read local files via mmap. This would enable > truly zero-copy reads. > In the initial implementation, zero-copy reads will only be performed when > checksums were disabled. Later, we can use the DataNode's cache awareness to > only perform zero-copy reads when we know that checksum has already been > verified. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira