[ https://issues.apache.org/jira/browse/HDFS-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838159#comment-13838159 ]

Colin Patrick McCabe commented on HDFS-5182:
--------------------------------------------

So, previously we discussed a few different ways for the {{DataNode}} to notify 
the {{DFSClient}} about a change in the block's mlock status.

One way (let's call this choice #1) was using a shared memory segment.  This 
would take the form of a third file descriptor passed from the {{DataNode}} to 
the {{DFSClient}}.  On Linux, this would simply be a 4 KB file on the 
{{/dev/shm}} filesystem, which is a {{tmpfs}} filesystem.  That filesystem is 
the best choice because it will not cause the file to be periodically written 
back to disk every {{dirty_writeback_centisecs}} interval.
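As a rough illustration of what such a segment might look like (a sketch only: 
the class name, the flag-at-offset-0 convention, and the use of plain Java file 
APIs are assumptions, not the actual HDFS code), the {{DataNode}} could back 
the segment with an unlinked {{/dev/shm}} file and expose the mlock flag 
through the mapping, handing the descriptor to the {{DFSClient}} over the 
existing domain socket:

{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Illustrative sketch of choice #1: the DataNode backs a one-page shared
 * memory segment with a file on /dev/shm (tmpfs), maps it, and would hand
 * the underlying file descriptor to the DFSClient over the existing domain
 * socket.  Class name and the flag-at-offset-0 convention are made up.
 */
public class ShmSegmentSketch implements AutoCloseable {
  private static final int SEGMENT_SIZE = 4096;   // one page

  private final RandomAccessFile raf;
  private final MappedByteBuffer mmap;

  public ShmSegmentSketch(String name) throws IOException {
    File f = new File("/dev/shm", name);
    raf = new RandomAccessFile(f, "rw");
    raf.setLength(SEGMENT_SIZE);
    mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, SEGMENT_SIZE);
    // Unlink the path right away; the segment stays alive as long as
    // somebody holds the descriptor or the mapping.
    f.delete();
  }

  /** DataNode side: flip the flag when the block's mlock status changes. */
  public void setMlocked(boolean mlocked) {
    mmap.put(0, (byte) (mlocked ? 1 : 0));
  }

  /** DFSClient side: check the flag before doing a zero-copy read. */
  public boolean isMlocked() {
    return mmap.get(0) != 0;
  }

  @Override
  public void close() throws IOException {
    raf.close();
  }
}
{code}

Note that this scheme is inherently level-triggered: the client has to re-check 
the flag, which is part of what makes the socket-based notification in choice 
#2 attractive.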

However, on looking into this further, I found some issues with this method.  
There is no way for the {{DataNode}} to know when the {{DFSClient}} has closed 
the file descriptor for the shared memory area.  We could add some kind of 
protocol for keeping the area alive by periodically writing to an agreed-upon 
location, but that would add a fair amount of complexity, and the keepalive 
could expire spuriously during a long garbage collection pause on the 
{{DFSClient}} or {{DataNode}}.
Another issue is that there is no way for the {{DataNode}} to revoke access to 
this shared memory segment.  If the {{DFSClient}} wants to hold on to it 
forever, leaking memory, it can do that.  This opens a resource-exhaustion 
hole: the client might not have UNIX permissions to allocate space in 
{{/dev/shm}} itself, but through this mechanism it can consume an arbitrary 
amount of space there.

The other way (let's call this choice #2) is for the client to keep open the 
UNIX domain socket it used to request the two file descriptors.  If we can 
listen for messages sent on this socket, we can have a truly edge-triggered 
notification method.  The messages can be as short as a single byte, since our 
message needs are very simple.  This requires adding an epoll loop to handle 
these notifications without consuming a whole thread per socket.
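
A minimal sketch of that loop, assuming ordinary {{java.nio}} channels (the JDK 
has no portable epoll API for UNIX domain sockets, so the real code would go 
through the native domain socket layer; class and method names here are made 
up):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

/**
 * Illustrative notification loop for choice #2: one thread multiplexes many
 * sockets and wakes up only when a peer sends a one-byte message or hangs
 * up.  The real sockets are UNIX domain sockets handled via native code;
 * plain SocketChannels are used here only to show the structure.
 */
public class NotificationLoopSketch implements Runnable {
  private final Selector selector;

  public NotificationLoopSketch() throws IOException {
    selector = Selector.open();
  }

  /**
   * Register a socket whose one-byte messages we want to watch.
   * (Simplified: a production loop would queue registrations and apply
   * them on the selector thread to avoid lock contention with select().)
   */
  public void watch(SocketChannel ch, Object state) throws IOException {
    ch.configureBlocking(false);
    selector.wakeup();
    ch.register(selector, SelectionKey.OP_READ, state);
  }

  @Override
  public void run() {
    ByteBuffer oneByte = ByteBuffer.allocate(1);
    try {
      while (!Thread.currentThread().isInterrupted()) {
        selector.select();                  // sleep until a socket is readable
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
          SelectionKey key = it.next();
          it.remove();
          SocketChannel ch = (SocketChannel) key.channel();
          oneByte.clear();
          int n = ch.read(oneByte);
          if (n < 0) {                      // peer closed: clean up its state
            key.cancel();
            ch.close();
          } else if (n == 1) {
            handleNotification(key.attachment(), oneByte.get(0));
          }
        }
      }
    } catch (IOException e) {
      // A real implementation would log and shut the loop down cleanly.
    }
  }

  private void handleNotification(Object state, byte msg) {
    // e.g. mark the corresponding replica as no longer mlocked.
  }
}
{code}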

Regardless of whether we go with choice #1 or #2, there are some other things 
that need to be done.

* Right now, we don't allow {{BlockReaderLocal}} instances to share file 
descriptors with each other.  However, this would be advisable, to avoid 
creating 100 pipes/shm areas when someone re-opens the same file 100 times.  
Doing this is actually an easy change (I wrote and tested the patch already); 
see the sketch after this list for the general shape of the idea.

* We need to revise {{FileInputStreamCache}} to store the communication method 
(pipe or shared memory area) which will be giving us notifications.  This cache 
also needs to get support for dealing with mmap regions, and for BRL instances 
sharing FDs / mmaps.  I have a patch which reworks this cache, but it's not 
quite done yet.

* {{BlockReaderLocal}} needs to get support for switching back and forth 
between honoring checksums and not.  I have a patch which substantially reworks 
BRL to add this capability, which I'm considering posting as a separate JIRA.
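
For the FD / mmap sharing in the first two bullets, one plausible shape is a 
small reference-counted holder per replica; the sketch below is purely 
illustrative (names are invented, and cache eviction, error handling, and the 
notification hookup are glossed over):

{code:java}
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Illustrative reference-counted holder for the block/meta file descriptors
 * (and an optional shared mmap) of one replica.  Several BlockReaderLocal
 * instances for the same replica would share one of these instead of each
 * opening its own descriptors.
 */
public class SharedFdSketch {
  private final FileInputStream blockStream;
  private final FileInputStream metaStream;
  private MappedByteBuffer mmap;     // created lazily, shared by all readers
  private int refCount = 1;

  public SharedFdSketch(FileInputStream blockStream, FileInputStream metaStream) {
    this.blockStream = blockStream;
    this.metaStream = metaStream;
  }

  /** Called when another reader for the same replica is created. */
  public synchronized SharedFdSketch ref() {
    refCount++;
    return this;
  }

  /** Called when a reader closes; the FDs go away with the last reference. */
  public synchronized void unref() throws IOException {
    if (--refCount == 0) {
      blockStream.close();
      metaStream.close();
      mmap = null;     // the mapping is reclaimed once it becomes unreachable
    }
  }

  /** Lazily create one shared read-only mapping of the block file. */
  public synchronized MappedByteBuffer getOrCreateMmap() throws IOException {
    if (mmap == null) {
      FileChannel ch = blockStream.getChannel();
      mmap = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
    return mmap;
  }
}
{code}

A {{FileInputStreamCache}} entry could then hold one of these holders together 
with whichever notification channel (pipe or shared memory segment) goes with 
it.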

> BlockReaderLocal must allow zero-copy reads only when the DN believes it's 
> valid
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-5182
>                 URL: https://issues.apache.org/jira/browse/HDFS-5182
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>
> BlockReaderLocal must allow zero-copy reads only when the DN believes it's 
> valid.  This implies adding a new field to the response to 
> REQUEST_SHORT_CIRCUIT_FDS.  We also need some kind of heartbeat from the 
> client to the DN, so that the DN can inform the client when the mapped region 
> is no longer locked into memory.


