[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031018#comment-14031018
 ] 

Steve Loughran commented on HDFS-6382:
--------------------------------------

My comments

# This can be done as an MR job. 
# If you are worried about excessive load, start exactly one mapper and 
consider throttling requests. Some object stores throttle heavy load and 
reject a very high DELETE rate, so throttling is going to be needed for 
anything that works against them.
# You can then use Oozie as the scheduler.
# MR restart handles failures: you just re-enumerate the directories, and 
deleted files don't show up.
# If you really, really can't do it as MR, write it as a one-node YARN app, for 
which I'd recommend Apache Twill as the starting point. In fact, this project 
would make for a nice example.
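
To make the throttling point concrete, here is a minimal sketch (a hypothetical 
helper class, not from any existing patch) of the kind of rate limiter a single 
mapper could call before each delete; the permitted rate would come from the 
job configuration:

```java
// Hypothetical sketch: caps the rate of DELETE calls issued by one mapper
// by enforcing a minimum gap between consecutive operations, so object
// stores that reject a high DELETE rate are not overloaded.
public class DeleteThrottle {
    private final long minIntervalMillis;  // configurable gap between deletes
    private long lastOpMillis = 0;

    public DeleteThrottle(double maxDeletesPerSecond) {
        this.minIntervalMillis = (long) (1000.0 / maxDeletesPerSecond);
    }

    /** Blocks until enough time has passed to issue the next delete. */
    public synchronized void acquire() throws InterruptedException {
        long now = System.currentTimeMillis();
        long waitFor = (lastOpMillis + minIntervalMillis) - now;
        if (waitFor > 0) {
            Thread.sleep(waitFor);
        }
        lastOpMillis = System.currentTimeMillis();
    }
}
```

The mapper would call {{acquire()}} before each {{FileSystem.delete()}}; 
exposing the rate as a job parameter lets operators tune the NameNode load 
without code changes.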

Don't rush to write a new service here for an intermittent job. That just adds 
a new cost: a service to install and monitor. Especially when you consider 
that this new service will need
# a launcher entry point
# tests
# commitment from the HDFS team to maintain it

{quote}
We can implement TTL within a MapReduce job that is similar with DistCp. We 
could run this MapReduce job over and over again or nightly or weekly to delete 
the expired files and directories.
{quote}

Yes, and schedule it with Oozie.
{quote}
 (1) Advantages:
The major advantage of the MapReduce framework is concurrency control, if we 
want to run multiple tasks concurrently, choose a MapReduce approach will ease 
of concurrency control.
{quote}

There are other advantages:
# The MR job will be simple to write and can be submitted remotely. 
# It's trivial to test and therefore to maintain. 
# There's no need to wait for a new version of Hadoop; you can evolve it 
locally.
# Different users submitting jobs with different Kerberos tickets can work on 
their own files securely.
# There's no need to install and maintain a new service.

{quote}
(2) Disadvantages:
For implementing the TTL functionality, one task is enough, multiple tasks will 
give too much race and load to the NameNode. 
{quote}

# Demonstrate this by writing an MR job and assessing its load when you use a 
throttled executor.
{quote}

On another hand, use a MapReduce job will introduce additional dependencies and 
have additional overheads.
{quote}

# Additional dependencies? In a cluster with MapReduce installed? The only 
additional dependency is the JAR with the mapper and the reducer.
# What "additional overheads"? Are they really any less than those of running 
another service in your cluster, with its own classpath, failure modes, and 
security needs?
 
My recommendation, before writing a single line of a new service, is to write 
it as an MR job. You will find it easy to write and maintain; server load is 
handled by making the sleep time a configurable parameter. 

If you can then actually demonstrate that this is inadequate on a large 
cluster, then consider a service. But start with MapReduce first. If you 
haven't written an MR job before, don't worry: it doesn't take that long to 
learn, and having done it you'll understand your users' workflow better.
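
For reference, the per-file expiry test such a job would apply is tiny. A 
minimal sketch (hypothetical helper; the TTL value is assumed to come from 
wherever the job stores it, with a child's TTL overriding its parent's as the 
proposal below describes):

```java
// Hypothetical sketch: decides whether a path has outlived its TTL.
// The caller resolves the effective TTL for the path first (a child's
// TTL taking precedence over its parent's, per the proposal).
public class TtlCheck {
    /**
     * @param modificationTimeMillis the file's last modification time
     * @param effectiveTtlMillis     TTL in effect for this path; a value
     *                               <= 0 means "no TTL set"
     * @param nowMillis              current time
     */
    public static boolean isExpired(long modificationTimeMillis,
                                    long effectiveTtlMillis,
                                    long nowMillis) {
        if (effectiveTtlMillis <= 0) {
            return false;  // no TTL anywhere on the path: never delete
        }
        return nowMillis - modificationTimeMillis > effectiveTtlMillis;
    }
}
```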

> HDFS File/Directory TTL
> -----------------------
>
>                 Key: HDFS-6382
>                 URL: https://issues.apache.org/jira/browse/HDFS-6382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>         Attachments: HDFS-TTL-Design -2.pdf, HDFS-TTL-Design.pdf
>
>
> In production environment, we always have scenario like this, we want to 
> backup files on hdfs for some time and then hope to delete these files 
> automatically. For example, we keep only 1 day's logs on local disk due to 
> limited disk space, but we need to keep about 1 month's logs in order to 
> debug program bugs, so we keep all the logs on hdfs and delete logs which are 
> older than 1 month. This is a typical scenario of HDFS TTL. So here we 
> propose that hdfs can support TTL.
> Following are some details of this proposal:
> 1. HDFS can support TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after 
> the TTL is expired
> 3. If a TTL is set on a directory, the child files and directories will be 
> deleted automatically after the TTL is expired
> 4. The child file/directory's TTL configuration should override its parent 
> directory's
> 5. A global configuration is needed to configure that whether the deleted 
> files/directories should go to the trash or not
> 6. A global configuration is needed to configure that whether a directory 
> with TTL should be deleted when it is emptied by TTL mechanism or not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
