[ 
https://issues.apache.org/jira/browse/HDFS-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402620#comment-13402620
 ] 

Jesse Yates commented on HDFS-3370:
-----------------------------------

I'd like to propose an alternative to 'real' hardlinks: "reference-counted 
soft-links", or all the hardness you really need in a distributed FS.

In this implementation of "hard" links, I would propose that the namespace in 
which a file is created is considered the "owner" of that file. Initially, when 
created, the file has a reference count of 1 in the local namespace. If you 
want another hardlink to the file in the same namespace, you talk to the NN and 
request another handle to that file, which implicitly increments the file's 
reference count. The reference count could be stored in memory (and journaled) 
or written as part of the file metadata (more on that later, but let's ignore 
it for the moment). 
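As a rough sketch of the in-memory variant of this bookkeeping (all class and 
method names here are invented for illustration, not actual HDFS NameNode 
code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the in-memory ref-count table described above; names are
// hypothetical, and the edit-log/journal machinery is elided.
class RefCountTable {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // A newly created file starts with a single reference from its
    // owner namespace.
    void createFile(String inodeId) {
        refCounts.put(inodeId, 1);
        // A real NN would also journal this edit for crash recovery.
    }

    // Granting another handle in the same namespace implicitly bumps
    // the count; returns the updated count.
    int addLocalLink(String inodeId) {
        return refCounts.merge(inodeId, 1, Integer::sum);
    }

    int getRefCount(String inodeId) {
        return refCounts.getOrDefault(inodeId, 0);
    }
}
```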

Suppose instead that you are in a separate namespace and want a hardlink to the 
file in the original namespace. Then you would make a request to your NN (NNa) 
for a hardlink. Since NNa doesn't own the file you want to reference, it makes 
a hardlink request to the NN that originally created the file, the file's 
'owner' (call it NNb). NNb then says, "Cool, I've got your request and 
incremented the ref-count for the file." NNa can then grant your request and 
give you a link to that file. The failure cases here are:
1) NNb goes down, in which case NNa can just keep the reference requests around 
and batch them when NNb comes back up.
2) NNa goes down mid-request - if NNb doesn't receive an ACK back for the 
granted request, it can then disregard that request and re-decrement the count 
for that hardlink. 
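To make the failure handling concrete, here is a hedged sketch of the 
owner-side (NNb) bookkeeping: the increment is applied when the remote request 
arrives and rolled back if no ACK ever comes. Everything here is illustrative, 
not real NameNode code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative owner-NN (NNb) side of the cross-namespace link protocol.
class OwnerNameNode {
    private final Map<String, Integer> refCounts = new HashMap<>();

    void createFile(String inodeId) {
        refCounts.put(inodeId, 1);
    }

    // A remote NN (NNa) asks for a hardlink: speculatively bump the count.
    int grantRemoteLink(String inodeId) {
        return refCounts.merge(inodeId, 1, Integer::sum);
    }

    // NNa went down mid-request and never ACKed the grant: disregard the
    // request and re-decrement the count, as in failure case (2).
    int rollbackUnackedGrant(String inodeId) {
        return refCounts.merge(inodeId, -1, Integer::sum);
    }

    int refCount(String inodeId) {
        return refCounts.getOrDefault(inodeId, 0);
    }
}
```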

Deleting the hardlink then follows a similar process. You issue a request to 
the owner NN, either directly from the client if you are deleting a link in the 
current namespace or through a proxy NN to the original namenode. It then 
decrements the reference count on the file and allows the deletion of the link. 
If the reference count ever hits 0, then the NN also deletes the file since 
there are no valid references to that file. 
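A minimal sketch of that delete path (hypothetical names; the real edit-log 
and block-deletion machinery is elided):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative owner-NN delete path: decrement on unlink, reap at zero.
class HardlinkReaper {
    private final Map<String, Integer> refCounts = new HashMap<>();

    void createFile(String inodeId) {
        refCounts.put(inodeId, 1);
    }

    void link(String inodeId) {
        refCounts.merge(inodeId, 1, Integer::sum);
    }

    // Returns true if the underlying file itself was deleted, i.e. the
    // last valid reference was just removed.
    boolean unlink(String inodeId) {
        int remaining = refCounts.merge(inodeId, -1, Integer::sum);
        if (remaining <= 0) {
            refCounts.remove(inodeId);
            // A real NN would schedule the file's blocks for deletion here.
            return true;
        }
        return false;
    }
}
```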

One implicit implication, though, is that the file will not be visible in the 
namespace that created it if all the local hardlinks to it are removed. It 
essentially becomes a 'hidden' inode. We could, in the future, also work out a 
mechanism to transfer the hidden inode to a NN that has valid references to it 
(maybe via a gossip-style protocol), but that would be out of the current 
scope.

There are some implications to this model. If the owner NN manages the 
ref-count in memory and that NN goes down, its whole namespace becomes 
inaccessible, including _creating new hardlinks_ to any of the files (inodes) 
that it owns. However, the owner NN going down doesn't preclude the other NNs 
from serving the file from their own 'soft' inodes. 

Alternatively, the NN could hold a lock on a hardlinked file, with the 
ref-counts and ownership info kept in the file metadata. This might introduce 
some overhead when creating new hardlinks (you need to reopen and modify the 
block, or periodically write a new block with the new information - the latter 
actually opens a route to do ref-count management via appends to a file-ref 
file), but it has the added advantage that if the owner NN crashed, an 
alternative NN could come along and claim ownership of that file. This is 
similar to doing Paxos-style leader election for a given hardlinked file, 
combined with leader leases. However, ownership is very unlikely to see much 
fluctuation, as the leader can just reclaim the leader token via appends to the 
file-owner file, with periodic rewrites to minimize file size. 
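The lease mechanics could be sketched like this (record layout, names, and 
timestamps are invented for illustration): the current owner renews its lease 
by appending a fresh record, and any other NN can claim ownership only once 
the lease has lapsed.

```java
// Illustrative owner-lease record for a hardlinked file; in the proposal
// above this would live in the file's metadata, renewed via appends.
class OwnerLease {
    private String ownerNn;
    private long expiresAtMillis;

    OwnerLease(String ownerNn, long nowMillis, long leaseMillis) {
        this.ownerNn = ownerNn;
        this.expiresAtMillis = nowMillis + leaseMillis;
    }

    // The current owner may always renew; another NN may take over only
    // once the lease has expired (e.g. the owner crashed).
    boolean tryClaim(String candidateNn, long nowMillis, long leaseMillis) {
        if (candidateNn.equals(ownerNn) || nowMillis > expiresAtMillis) {
            ownerNn = candidateNn;
            expiresAtMillis = nowMillis + leaseMillis;
            return true;
        }
        return false;
    }

    String owner() {
        return ownerNn;
    }
}
```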

The on-disk representation of the extreme version I'm proposing is this: the 
full file is actually composed of three pieces: (1) the actual data, plus two 
metadata files, "extents" (to add a new word/definition); (2) an 
external-reference extent: each time a reference is made to the file, a new 
count is appended, and it can periodically be recompacted to a single value; 
(3) an owner-extent with the current NN owner and the lease time on the file, 
dictating who controls overall deletion of the file (since ref counts are done 
via the external-ref extent). This means (2) and (3) are hidden inodes, only 
accessible to the namenode. We can minimize overhead on these file extents by 
ensuring a single writer via messaging to the owner NN (as specified by the 
owner-extent), though this is not strictly necessary.
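The external-reference extent could behave roughly like this append-only log 
(a sketch under the assumptions above, not a real on-disk format):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative append-only external-reference extent: +1 per new link,
// -1 per removed link, periodically recompacted to a single value.
class RefExtent {
    private final List<Integer> entries = new ArrayList<>();

    void append(int delta) {
        entries.add(delta);
    }

    // The effective ref count is the sum of all appended entries.
    int refCount() {
        int sum = 0;
        for (int e : entries) {
            sum += e;
        }
        return sum;
    }

    // Periodic recompaction collapses the history to one value,
    // keeping the extent small.
    void compact() {
        int total = refCount();
        entries.clear();
        entries.add(total);
    }

    int entryCount() {
        return entries.size();
    }
}
```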

Further, (1) could become a hidden inode if all the local namespace references 
are removed, but it could eventually be transferred over to another NN shard 
(namespace) to keep overhead at a minimum, though (again), this is not a strict 
necessity.

The design retains the NN view of files as directory entries, just entries with 
a little bit of metadata. The metadata could be in memory or part of the file 
and periodically modified, but that’s more implementation detail than anything 
(as mentioned above).
                
> HDFS hardlink
> -------------
>
>                 Key: HDFS-3370
>                 URL: https://issues.apache.org/jira/browse/HDFS-3370
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Hairong Kuang
>            Assignee: Liyin Tang
>         Attachments: HDFS-HardLink.pdf
>
>
> We'd like to add a new feature hardlink to HDFS that allows hardlinked files 
> to share data without copying. Currently we will support hardlinking only 
> closed files, but it could be extended to unclosed files as well.
> Among many potential use cases of the feature, the following two are 
> primarily used in facebook:
> 1. This provides a lightweight way for applications like hbase to create a 
> snapshot;
> 2. This also allows an application like Hive to move a table to a different 
> directory without breaking current running hive queries.
