[ https://issues.apache.org/jira/browse/HADOOP-8598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eli Collins updated HADOOP-8598:
--------------------------------
    Target Version/s: 3.0.0  (was: 2.1.0-alpha)
    Affects Version/s:     (was: 2.0.0-alpha)

Filed HADOOP-8689 for v2 per ATM's suggestion, so re-targeting this change for trunk/v3.

> Server-side Trash
> -----------------
>
>                 Key: HADOOP-8598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8598
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>            Priority: Critical
>
> There are a number of problems with Trash that continue to result in
> permanent data loss for users. The primary reasons trash is not used:
> - Trash is configured client-side and not enabled by default.
> - Trash is shell-only. FileSystem, WebHDFS, HttpFs, etc. never use trash.
> - If trash fails, for example, because we can't create the trash directory or
> the move itself fails, trash is bypassed and the data is deleted.
> Trash was designed as a feature to help end users via the shell; however, in
> my experience the primary use of trash is to help administrators implement
> data retention policies (this was also the motivation for HADOOP-7460). One
> could argue that (periodic read-only) snapshots are a better solution to this
> problem; however, snapshots are not slated for Hadoop 2.x and trash is
> complementary to snapshots (and backup) - e.g. you may create and delete data
> within your snapshot or backup window - so it makes sense to revisit trash's
> design. I think it's worth bringing trash's functionality in line with what
> users need.
> I propose we enable trash on a per-filesystem basis and implement it
> server-side. I.e. trash becomes an HDFS feature enabled by administrators.
> Because the trash emptier lives in HDFS and users already have a
> per-filesystem trash directory we're mostly there already.
> The design preference from HADOOP-2514 was for trash to be implemented in
> "user code"; however, (a) in light of these problems, (b) because we have a
> lot more user-facing APIs than the shell, and (c) because clients
> increasingly span file systems (via federation and symlinks), this design
> choice makes less sense. This is why we already use a per-filesystem
> trash/home directory instead of the user's client-configured one - otherwise
> trash would not work because renames can't span file systems.
> In short, HDFS trash would work similarly to how it does today; the
> difference is that client delete APIs would result in a rename into trash
> (a la TrashPolicyDefault#moveToTrash) if trash is enabled. Like today, the
> file would be renamed into the trash directory on the file system where it
> resides. The primary difference is that enablement and policy are
> configured server-side by administrators and are used regardless of the API
> used to access the filesystem. The one exception to this is that I think we
> should continue to support the explicit skipTrash shell option. The rationale
> for skipTrash (HADOOP-6080) is that a move to trash may fail in cases where an
> rm may not, for example if a user has a home directory quota and does
> rmr /tonsOfData. Without a way to bypass this, the user has no way (unless we
> revisit quotas, permissions or trash paths) to remove a directory they have
> permissions to remove without getting their quota adjusted by an admin. The
> skip trash API can be implemented by adding an explicit FileSystem API that
> bypasses trash and modifying the shell to use it when skipTrash is enabled.
> Given that users must explicitly specify skipTrash the API is less error
> prone. We could have the shell ask for confirmation and annotate the API as
> private to FsShell to discourage programmatic use. This is not ideal but can
> be done compatibly (unlike redefining quotas, permissions or trash paths).
> In terms of compatibility, while this proposal is technically an incompatible
> change (client-side configuration that disables trash, and use of skipTrash
> with a previous FsShell release, will both be ignored if server-side trash is
> enabled, and non-HDFS file systems would need to make similar changes), I
> think it's worth targeting for Hadoop 2.x given that the new semantics
> preserve the current semantics. In 2.x I think we should preserve
> FsShell-based trash and support both it and server-side trash (disabled by
> default). For trunk/3.x I think we should remove the FsShell-based trash
> entirely and enable server-side trash by default.
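
For illustration only, the delete path proposed in the issue description could be modelled roughly as below. This is a toy in-memory sketch, not Hadoop code: the class ServerSideTrashSketch, its methods, the deleteSkipTrash name and the /user/<user>/.Trash/Current layout are all assumptions made for the example, and the rule that deleting a file already in trash removes it permanently is likewise assumed rather than taken from the proposal.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of the proposed delete path; not Hadoop code. */
final class ServerSideTrashSketch {
    private final Map<String, String> files = new HashMap<>(); // path -> contents
    private final boolean trashEnabled; // server-side setting, controlled by the admin

    ServerSideTrashSketch(boolean trashEnabled) {
        this.trashEnabled = trashEnabled;
    }

    void create(String path, String contents) {
        files.put(path, contents);
    }

    boolean exists(String path) {
        return files.containsKey(path);
    }

    /** Assumed layout: the user's trash directory on the same file system. */
    static String trashRoot(String user) {
        return "/user/" + user + "/.Trash/Current";
    }

    /** Ordinary delete, as any client API (not just the shell) would call it. */
    boolean delete(String user, String path) {
        if (!files.containsKey(path)) {
            return false;
        }
        // Assumption: files that are already in trash are removed for good.
        if (!trashEnabled || path.startsWith(trashRoot(user) + "/")) {
            return deleteSkipTrash(path);
        }
        // Trash enabled: the delete becomes a rename into the trash directory.
        String contents = files.remove(path);
        files.put(trashRoot(user) + path, contents);
        return true;
    }

    /** Hypothetical explicit bypass, meant only for the shell's skipTrash option. */
    boolean deleteSkipTrash(String path) {
        return files.remove(path) != null;
    }

    public static void main(String[] args) {
        ServerSideTrashSketch fs = new ServerSideTrashSketch(true);
        fs.create("/data/report.csv", "a,b,c");
        fs.create("/data/scratch.tmp", "tmp");

        fs.delete("eli", "/data/report.csv");     // renamed into trash
        fs.deleteSkipTrash("/data/scratch.tmp");  // removed permanently

        System.out.println(fs.exists("/user/eli/.Trash/Current/data/report.csv")); // true
        System.out.println(fs.exists("/data/scratch.tmp"));                        // false
    }
}
```

The point of the sketch is that the trash decision sits behind delete() itself, so the caller's API and client configuration no longer matter; only the explicit bypass method avoids trash.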