[ 
https://issues.apache.org/jira/browse/HADOOP-8598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HADOOP-8598:
--------------------------------

     Target Version/s: 3.0.0  (was: 2.1.0-alpha)
    Affects Version/s:     (was: 2.0.0-alpha)

Filed HADOOP-8689 for v2 per ATM's suggestion so re-targeting this change for 
trunk/v3.
                
> Server-side Trash
> -----------------
>
>                 Key: HADOOP-8598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8598
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>            Priority: Critical
>
> There are a number of problems with Trash that continue to result in 
> permanent data loss for users. The primary reasons trash is not used:
> - Trash is configured client-side and not enabled by default.
> - Trash is shell-only. FileSystem, WebHDFS, HttpFs, etc. never use trash.
> - If trash fails, for example, because we can't create the trash directory or 
> the move itself fails, trash is bypassed and the data is deleted.
> Trash was designed as a feature to help end users via the shell; however, in 
> my experience the primary use of trash is to help administrators implement 
> data retention policies (this was also the motivation for HADOOP-7460). One 
> could argue that (periodic read-only) snapshots are a better solution to this 
> problem; however, snapshots are not slated for Hadoop 2.x and trash is 
> complementary to snapshots (and backup) - e.g. you may create and delete data 
> within your snapshot or backup window - so it makes sense to revisit trash's 
> design. I think it's worth bringing trash's functionality in line with what 
> users need.
> I propose we enable trash on a per-filesystem basis and implement it 
> server-side, i.e. trash becomes an HDFS feature enabled by administrators. 
> Because the trash emptier lives in HDFS and users already have a 
> per-filesystem trash directory we're mostly there already. The design 
> preference from HADOOP-2514 was for trash to be implemented in "user code"; 
> however, given (a) these problems, (b) that we have a lot more user-facing 
> APIs than the shell and (c) that clients increasingly span file systems (via 
> federation and symlinks), this design choice makes less sense. This is why we 
> already use a per-filesystem trash/home directory instead of the user's 
> client-configured one - otherwise trash would not work because renames can't 
> span file systems.
> In short, HDFS trash would work similarly to how it does today; the 
> difference is that client delete APIs would result in a rename into trash 
> (a la TrashPolicyDefault#moveToTrash) if trash is enabled. Like today it would 
> be renamed to the trash directory on the file system where the file being 
> removed resides. The primary difference is that enablement and policy are 
> configured server-side by administrators and apply regardless of the API 
> used to access the file system. The one exception to this is that I think we 
> should continue to support the explicit skipTrash shell option. The rationale 
> for skipTrash (HADOOP-6080) is that a move to trash may fail in cases where an 
> rm would not, for example if a user has a home directory quota and does a 
> rmr /tonsOfData. Without a way to bypass this the user has no way (unless we 
> revisit quotas, permissions or trash paths) to remove a directory they have 
> permissions to remove without getting their quota adjusted by an admin. The 
> skip trash API can be implemented by adding an explicit FileSystem API that 
> bypasses trash and modifying the shell to use it when skipTrash is enabled. 
> Given that users must explicitly specify skipTrash the API is less error 
> prone. We could have the shell ask for confirmation and annotate the API 
> private to FsShell to discourage programmatic use. This is not ideal but can be done 
> compatibly (unlike redefining quotas, permissions or trash paths).
> In terms of compatibility, while this proposal is technically an incompatible 
> change (client-side configuration that disables trash and use of skipTrash 
> with a previous FsShell release will both be ignored if server-side trash is 
> enabled, and non-HDFS file systems would need to make similar changes) I 
> think it's worth targeting for Hadoop 2.x given that the new semantics 
> preserve the current semantics. In 2.x I think we should preserve FsShell 
> based trash and support both it and server-side trash (disabled by default). 
> For trunk/3.x I think we should remove the FsShell based trash entirely and 
> enable server-side trash by default.
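
For illustration only, the delete semantics described above could be sketched 
roughly as follows. This is not Hadoop code and not the proposed patch: the 
class ServerSideTrashSketch, its methods and the .Trash/Current layout are 
hypothetical names chosen for the example, and a local directory (java.nio) 
stands in for a single file system. It assumes three things taken from the 
description: enablement is a server-side setting, a failed move to trash must 
leave the data in place rather than delete it, and an explicit skip-trash call 
bypasses trash. Name collisions in trash and recursive deletes are not handled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch; a local directory tree stands in for one file system. */
final class ServerSideTrashSketch {
    private final Path root;            // root of the simulated file system
    private final boolean trashEnabled; // chosen by the administrator, not the client

    ServerSideTrashSketch(Path root, boolean trashEnabled) {
        this.root = root.toAbsolutePath().normalize();
        this.trashEnabled = trashEnabled;
    }

    /** Trash lives on the same file system as the data, so the rename never spans file systems. */
    Path trashDir() {
        return root.resolve(".Trash").resolve("Current");
    }

    /**
     * Regular delete, whatever client API it arrived through.
     * Returns the location in trash, or null if trash is disabled and the
     * path was removed permanently.
     */
    Path delete(Path file) throws IOException {
        Path target = file.toAbsolutePath().normalize();
        if (!trashEnabled) {
            Files.delete(target);
            return null;
        }
        // Mirror the original path underneath the trash directory.
        Path dest = trashDir().resolve(root.relativize(target));
        // If either call throws, the exception propagates and the data stays
        // where it was; there is no fall-through to a permanent delete.
        Files.createDirectories(dest.getParent());
        return Files.move(target, dest);
    }

    /** Explicit bypass, intended only for the shell's skipTrash option. */
    void deleteSkipTrash(Path file) throws IOException {
        Files.delete(file.toAbsolutePath().normalize());
    }
}
```

With trash enabled, delete() on <root>/user/data/a.txt leaves the file at 
<root>/.Trash/Current/user/data/a.txt, while deleteSkipTrash() removes it 
outright. Keeping the bypass as a separate method, rather than a flag on 
delete(), reflects the point that callers have to ask for it explicitly.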

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

