[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031018#comment-14031018 ]
Steve Loughran commented on HDFS-6382:
--------------------------------------

My comments:
# This can be done as an MR job.
# If you are worried about excessive load, start exactly one mapper, and consider throttling requests. As some object stores throttle heavy load and reject on a very high DELETE rate, throttling is going to be needed for anything that works against them.
# You can then use Oozie as the scheduler.
# MR restart handles failures: you just re-enumerate the directories, and deleted files don't show up.
# If you really, really can't do it as MR, write it as a one-node YARN app, for which I'd recommend Apache Twill as the starting point. In fact, this project would make for a nice example.

Don't rush to write a new service here for an intermittent job; that just adds a new cost: "a service to install and monitor". Especially when you consider that this new service will need
# a launcher entry point
# tests
# commitment from the HDFS team to maintain it

{quote}
We can implement TTL within a MapReduce job that is similar with DistCp. We could run this MapReduce job over and over again or nightly or weekly to delete the expired files and directories.
{quote}
Yes, and schedule with Oozie.

{quote}
(1) Advantages: The major advantage of the MapReduce framework is concurrency control, if we want to run multiple tasks concurrently, choose a MapReduce approach will ease of concurrency control.
{quote}
There are other advantages:
# The MR job will be simple to write and can be submitted remotely.
# It's trivial to test and therefore maintain.
# There's no need to wait for a new version of Hadoop; you can evolve it locally.
# Different users, submitting jobs with different Kerberos tickets, can work on their own files securely.
# There's no need to install and maintain a new service.

{quote}
(2) Disadvantages: For implementing the TTL functionality, one task is enough, multiple tasks will give too much race and load to the NameNode.
{quote}

# Demonstrate this by writing an MR job and assessing its load when you have a throttled executor.

{quote}
On another hand, use a MapReduce job will introduce additional dependencies and have additional overheads.
{quote}

# Additional dependencies? In a cluster with MapReduce installed? The only additional dependency is the JAR with the mapper and the reducer.
# What "additional overheads"? Are they really any less than running another service in your cluster, with its own classpath, failure modes, and security needs?

My recommendation, before writing a single line of a new service, is to write it as an MR job. You will find it easy to write and maintain; server load is handled by making sleep time a configurable parameter. If you can then actually demonstrate that this is inadequate on a large cluster, then consider a service. But start with MapReduce first.

If you haven't written an MR job before, don't worry -- it doesn't take that long to learn, and having done it you'll understand your users' workflow better.

> HDFS File/Directory TTL
> -----------------------
>
>                 Key: HDFS-6382
>                 URL: https://issues.apache.org/jira/browse/HDFS-6382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>         Attachments: HDFS-TTL-Design -2.pdf, HDFS-TTL-Design.pdf
>
>
> In production environment, we always have scenario like this, we want to backup files on hdfs for some time and then hope to delete these files automatically. For example, we keep only 1 day's logs on local disk due to limited disk space, but we need to keep about 1 month's logs in order to debug program bugs, so we keep all the logs on hdfs and delete logs which are older than 1 month. This is a typical scenario of HDFS TTL. So here we propose that hdfs can support TTL.
> Following are some details of this proposal:
> 1. HDFS can support TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after the TTL is expired
> 3. If a TTL is set on a directory, the child files and directories will be deleted automatically after the TTL is expired
> 4. The child file/directory's TTL configuration should override its parent directory's
> 5. A global configuration is needed to configure that whether the deleted files/directories should go to the trash or not
> 6. A global configuration is needed to configure that whether a directory with TTL should be deleted when it is emptied by TTL mechanism or not.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
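The throttled delete loop recommended in the comment (one mapper, sleep time as a configurable parameter) could be sketched as below. This is an illustrative sketch only, not an HDFS-6382 API: the `(path, mtime)` listing format, the `delete_fn` callback, and all parameter names are assumptions for clarity; a real mapper would call `FileSystem.delete()` on paths enumerated from the NameNode.

```python
import time


def is_expired(mtime, ttl_seconds, now=None):
    """Return True if a file with modification time `mtime`
    (epoch seconds) has outlived its TTL."""
    now = time.time() if now is None else now
    return now - mtime > ttl_seconds


def delete_expired(listing, delete_fn, ttl_seconds, sleep_between=0.0, now=None):
    """Scan `listing` (an iterable of (path, mtime) pairs, standing in
    for a directory enumeration from the NameNode) and call
    `delete_fn(path)` for each expired entry.

    Sleeping `sleep_between` seconds after each delete throttles the
    request rate, so a single task does not overload the NameNode.
    Restart safety comes for free: re-running the scan simply skips
    paths that were already deleted, since they no longer enumerate.
    """
    deleted = []
    for path, mtime in listing:
        if is_expired(mtime, ttl_seconds, now=now):
            delete_fn(path)  # hypothetical hook; real code: fs.delete(path)
            deleted.append(path)
            if sleep_between:
                time.sleep(sleep_between)
    return deleted
```

For example, with a 30-day TTL, a log file last modified 40 days ago is deleted while a 5-day-old one is kept; tuning `sleep_between` upward trades scan duration for lower NameNode load.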