[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032964#comment-14032964 ]
Colin Patrick McCabe commented on HDFS-6382:
--------------------------------------------

bq. Plus, in the places that need this the most, one has to deal with getting what essentially becomes a critical part of uptime getting scheduled, competing with all of the other things running.... and, to remind you, to just delete files. It's sort of ridiculous to require YARN running for what is fundamentally a file system problem. It simply doesn't work in the real world.

In the examples you give, you're already using YARN for Hive and Pig, so it's already a critical part of the infrastructure. Anyway, you should be able to put the cleanup job in a different queue. It's not like YARN is strictly FIFO.

bq. One eventually gets to the point that the auto cleaner job is now running hourly just so /tmp doesn't overrun the rest of HDFS. Because these run outside of HDFS, they are slow and tedious and generally fall in the lap of teams that don't do Java so end up doing all sorts of squirrely things to make these jobs work. This also sucks.

Well, presumably the implementation in this JIRA won't be done by a team that "doesn't do Java", so we should be able to skip that problem, right?

The comments about /tmp are, I think, another example of how this needs to be highly configurable. Rather than modifying Hive or Pig to set TTLs on things, we probably want to be able to configure the scanner to look at everything under /tmp. Perhaps the scanner should attach a TTL to things in /tmp that don't already have one.

Running this under YARN has an intuitive appeal to the upstream developers, since YARN is a scheduler. If we write our own scheduler for this inside HDFS, we're duplicating some of that work, including the monitoring, logging, and other operational features. I think Steve's comments (and a lot of the earlier comments) reflect that. Of course, to users not already running YARN, a standalone daemon might seem more appealing.

The proposal to put this in the balancer seems like a reasonable compromise. We can reuse some of the balancer code, and that way we're not adding another daemon to manage. I wonder if we could have YARN run the balancer periodically? That might be interesting.
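To make the scanner idea a bit more concrete, here is a minimal sketch of what a standalone /tmp scanner could look like. It assumes the TTL would be surfaced through the extended-attribute (xattr) API under a hypothetical {{user.ttl}} key holding a duration in milliseconds, and it uses a made-up {{ttl.scanner.use.trash}} setting for the trash behavior; none of those names come from the design docs, they're just placeholders for illustration:

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

/**
 * Rough sketch of a /tmp scanner.  Assumes the TTL is stored in a
 * hypothetical "user.ttl" xattr as a decimal number of milliseconds;
 * the real attribute name and encoding would come out of the design doc.
 */
public class TmpTtlScanner {
  private static final String TTL_XATTR = "user.ttl";                   // assumed name
  private static final long DEFAULT_TTL_MS = 7L * 24 * 60 * 60 * 1000;  // assumed default: 7 days

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical global switch: trash expired entries or delete them outright.
    boolean useTrash = conf.getBoolean("ttl.scanner.use.trash", true);
    scan(fs, conf, new Path("/tmp"), useTrash);
  }

  static void scan(FileSystem fs, Configuration conf, Path dir, boolean useTrash)
      throws IOException {
    long now = System.currentTimeMillis();
    for (FileStatus st : fs.listStatus(dir)) {
      Map<String, byte[]> xattrs = fs.getXAttrs(st.getPath());
      byte[] raw = xattrs.get(TTL_XATTR);
      if (raw == null) {
        // Attach a default TTL to things in /tmp that don't already have one.
        fs.setXAttr(st.getPath(), TTL_XATTR,
            Long.toString(DEFAULT_TTL_MS).getBytes(StandardCharsets.UTF_8));
        continue;
      }
      long ttl = Long.parseLong(new String(raw, StandardCharsets.UTF_8));
      if (now - st.getModificationTime() > ttl) {
        if (useTrash) {
          Trash.moveToAppropriateTrash(fs, st.getPath(), conf);
        } else {
          fs.delete(st.getPath(), true);
        }
      }
    }
  }
}
{code}

For brevity this only looks at the direct children of /tmp; a real scanner would recurse and would also need to decide what a directory's modification time actually means for expiry.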
> HDFS File/Directory TTL
> -----------------------
>
>                 Key: HDFS-6382
>                 URL: https://issues.apache.org/jira/browse/HDFS-6382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>         Attachments: HDFS-TTL-Design -2.pdf, HDFS-TTL-Design.pdf
>
>
> In production environments we often have scenarios like this: we want to back up files on HDFS for some time and then delete them automatically. For example, we keep only one day's logs on local disk due to limited disk space, but we need to keep about one month's logs in order to debug program bugs, so we keep all the logs on HDFS and delete logs that are older than one month. This is a typical HDFS TTL scenario, so we propose that HDFS support TTLs.
> Some details of this proposal:
> 1. HDFS can support a TTL on a specified file or directory.
> 2. If a TTL is set on a file, the file will be deleted automatically after the TTL expires.
> 3. If a TTL is set on a directory, its child files and directories will be deleted automatically after the TTL expires.
> 4. A child file/directory's TTL configuration overrides its parent directory's.
> 5. A global configuration is needed to specify whether deleted files/directories should go to the trash or not.
> 6. A global configuration is needed to specify whether a directory with a TTL should be deleted once the TTL mechanism has emptied it.
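As a thought experiment on points 2-4, the "child overrides parent" rule boils down to a nearest-explicit-TTL-wins lookup. Again, the {{user.ttl}} xattr name and the milliseconds-as-a-decimal-string encoding are assumptions for illustration only, not something the proposal specifies:

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Resolves the TTL that applies to a path under the "child overrides
 * parent" rule (point 4): the nearest explicitly set TTL wins, walking
 * from the path up toward the root.  The "user.ttl" xattr name and the
 * milliseconds encoding are assumptions, not part of the proposal.
 */
public class TtlResolver {
  private static final String TTL_XATTR = "user.ttl";  // assumed attribute name

  /** Returns the effective TTL in milliseconds, or -1 if nothing on the path sets one. */
  public static long effectiveTtl(FileSystem fs, Path path) throws IOException {
    for (Path p = path; p != null; p = p.getParent()) {
      Map<String, byte[]> xattrs = fs.getXAttrs(p);
      byte[] raw = xattrs.get(TTL_XATTR);
      if (raw != null) {
        return Long.parseLong(new String(raw, StandardCharsets.UTF_8));
      }
    }
    return -1;  // no TTL set on the path or any ancestor
  }
}
{code}

A deletion pass would then compare the current time minus {{getModificationTime()}} against the resolved TTL and either move the expired entry to trash or delete it outright, depending on the global setting in point 5.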