[ https://issues.apache.org/jira/browse/HDFS-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838159#comment-13838159 ]
Colin Patrick McCabe commented on HDFS-5182:
--------------------------------------------

So, previously we discussed a few different ways for the {{DataNode}} to notify the {{DFSClient}} about a change in a block's mlock status.

One way (let's call this choice #1) was using a shared memory segment. This would take the form of a third file descriptor passed from the {{DataNode}} to the {{DFSClient}}. On Linux, this would simply be a 4 KB file on the {{/dev/shm}} filesystem, which is a {{tmpfs}} filesystem. {{tmpfs}} is the best choice here because its pages are never written back to disk on the {{dirty_writeback_centisecs}} interval. However, on looking into this further, I found some issues with this method. There is no way for the {{DataNode}} to know when the {{DFSClient}} has closed the file descriptor for the shared memory area. We could add some kind of keepalive protocol, where the client keeps the area alive by periodically writing to an agreed-upon location, but that would add a fair amount of complexity, and a long garbage collection pause on either the {{DFSClient}} or the {{DataNode}} could cause the keepalive to be missed and the segment to be expired accidentally. Another issue is that there is no way for the {{DataNode}} to revoke access to this shared memory segment. If the {{DFSClient}} wants to hold on to it forever, leaking memory, it can do that. This opens a security hole: the client might not have UNIX permissions to grab space in {{/dev/shm}} directly, but through this mechanism it could consume an arbitrary amount of space there.

The other way (let's call this choice #2) is for the client to keep open the domain socket it used to request the two file descriptors. If we listen for messages sent on this socket, we get a truly edge-triggered notification method. The messages can be as short as a single byte, since our message needs are very simple. This requires adding an epoll loop to handle these notifications without consuming a whole thread per socket.
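To make choice #1 concrete, here is a minimal standalone sketch (not the actual HDFS patch) of two processes' worth of mappings over one small {{tmpfs}} file, using plain {{FileChannel.map}}. The file name, the one-byte flag, and its meaning are invented for illustration; the sketch falls back to the JVM temp directory when {{/dev/shm}} is absent so it also runs on non-Linux hosts.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ShmSegmentSketch {
    static final int SEGMENT_SIZE = 4096;  // one page, as in the proposal

    // Map a one-page region of the file. With MapMode.READ_WRITE the
    // file is grown to SEGMENT_SIZE automatically if it is smaller.
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, SEGMENT_SIZE);
        }
    }

    public static void main(String[] args) throws IOException {
        // Prefer /dev/shm (tmpfs, never written back to disk); fall back
        // to the temp dir so the sketch runs anywhere.
        Path shm = Paths.get("/dev/shm");
        Path dir = Files.isDirectory(shm)
                ? shm : Paths.get(System.getProperty("java.io.tmpdir"));
        Path seg = Files.createTempFile(dir, "hdfs-shm-sketch", ".tmp");
        try {
            // "DataNode" and "DFSClient" each map the same file; both
            // mappings are MAP_SHARED, so a store on one side is
            // immediately visible through the other.
            MappedByteBuffer dnSide = map(seg);
            MappedByteBuffer clientSide = map(seg);
            dnSide.put(0, (byte) 1);  // hypothetical flag: 1 = block is mlocked
            System.out.println("client sees mlock flag = " + clientSide.get(0));
        } finally {
            Files.deleteIfExists(seg);
        }
    }
}
```

Note that the sketch also illustrates the lifecycle problem described above: nothing here tells the "DataNode" side when the "client" mapping has gone away.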
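The single-byte, edge-triggered notification in choice #2 can be sketched in pure Java, assuming JDK 16+ for its UNIX domain socket support (the real {{DataNode}} uses its own JNI {{DomainSocket}} class). One {{Selector}}, backed by epoll on Linux, watches all client sockets, so no thread is parked per socket; the {{REVOKE_MLOCK}} message name and value are invented for illustration.

```java
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class NotificationLoopSketch {
    static final byte REVOKE_MLOCK = 1;  // hypothetical one-byte message

    public static void main(String[] args) throws IOException {
        Path sockPath = Files.createTempDirectory("dn-sock").resolve("dn.sock");
        UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(sockPath);

        try (ServerSocketChannel server =
                 ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
            server.bind(addr);

            // "DFSClient" side: keep the request socket open after the two
            // FDs were passed, and register it with a shared Selector.
            SocketChannel client = SocketChannel.open(addr);
            client.configureBlocking(false);
            Selector selector = Selector.open();
            client.register(selector, SelectionKey.OP_READ);

            // "DataNode" side: send the one-byte revocation notice.
            SocketChannel dnSide = server.accept();
            dnSide.write(ByteBuffer.wrap(new byte[] { REVOKE_MLOCK }));

            // The edge-triggered part: select() blocks until a message (or
            // EOF) arrives on some registered socket, with no per-socket
            // thread.
            selector.select();
            ByteBuffer buf = ByteBuffer.allocate(1);
            client.read(buf);
            System.out.println("got notification byte = " + buf.get(0));

            dnSide.close();
            client.close();
            selector.close();
        } finally {
            Files.deleteIfExists(sockPath);
        }
    }
}
```

An EOF on the same socket doubles as the missing-close signal from choice #1: when the client process exits, the {{DataNode}} side's epoll wakes up with a readable-at-EOF event.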
Regardless of whether we go with choice #1 or #2, there are some other things that need to be done:

* Right now, we don't allow {{BlockReaderLocal}} instances to share file descriptors with each other. However, this would be advisable, to avoid creating 100 pipes/shm areas when someone re-opens the same file 100 times. Doing this is actually an easy change (I wrote and tested the patch already).
* We need to revise {{FileInputStreamCache}} to store the communication method (pipe or shared memory area) which will be giving us notifications. This cache also needs support for dealing with mmap regions, and for {{BlockReaderLocal}} instances sharing FDs and mmaps. I have a patch which reworks this cache, but it's not quite done yet.
* {{BlockReaderLocal}} needs support for switching back and forth between honoring checksums and not. I have a patch which substantially reworks {{BlockReaderLocal}} to add this capability; I'm considering posting it as a separate JIRA.

> BlockReaderLocal must allow zero-copy reads only when the DN believes it's
> valid
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-5182
>                 URL: https://issues.apache.org/jira/browse/HDFS-5182
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>
> BlockReaderLocal must allow zero-copy reads only when the DN believes it's
> valid. This implies adding a new field to the response to
> REQUEST_SHORT_CIRCUIT_FDS. We also need some kind of heartbeat from the
> client to the DN, so that the DN can inform the client when the mapped region
> is no longer locked into memory.

--
This message was sent by Atlassian JIRA
(v6.1#6144)