[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-07-18 Thread Doug Cutting (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066507#comment-14066507 ]

Doug Cutting commented on HDFS-6382:


bq. But the trash emptier runs inside NN as a daemon thread, instead of a 
separate daemon process.

The trash emptier was embedded in the NN mostly just to avoid making folks have 
to manage another daemon process.  However, embedding the emptier has many of 
the hazards that Chris and Colin described above for embedding TTL.  So, if we 
add a separate daemon process for TTL, then we might also have that process 
empty the trash and remove the embedded emptier.

 HDFS File/Directory TTL
 ---

 Key: HDFS-6382
 URL: https://issues.apache.org/jira/browse/HDFS-6382
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client, namenode
Affects Versions: 2.4.0
Reporter: Zesheng Wu
Assignee: Zesheng Wu
 Attachments: HDFS-TTL-Design -2.pdf, HDFS-TTL-Design-3.pdf, 
 HDFS-TTL-Design.pdf


 In a production environment we often have a scenario like this: we want to back up files on HDFS for some time and then delete them automatically. For example, we keep only one day's logs on local disk due to limited disk space, but we need to keep about one month's logs in order to debug program bugs, so we keep all the logs on HDFS and delete the logs that are older than one month. This is a typical HDFS TTL scenario, so we propose that HDFS support TTL.
 Some details of this proposal:
 1. HDFS can support a TTL on a specified file or directory
 2. If a TTL is set on a file, the file will be deleted automatically after the TTL expires
 3. If a TTL is set on a directory, its child files and directories will be deleted automatically after the TTL expires
 4. A child file/directory's TTL configuration should override its parent directory's
 5. A global configuration is needed to control whether deleted files/directories go to the trash or not
 6. A global configuration is needed to control whether a directory with a TTL should itself be deleted when it is emptied by the TTL mechanism





[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-07-18 Thread Zesheng Wu (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067254#comment-14067254 ]

Zesheng Wu commented on HDFS-6382:
--

Thanks [~cutting]
I will move the trash emptier into the TTL daemon after HDFS-6525 and HDFS-6526 are resolved.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-23 Thread Zesheng Wu (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040683#comment-14040683 ]

Zesheng Wu commented on HDFS-6382:
--

Hi guys, I've uploaded initial implementations to HDFS-6525 and HDFS-6526 separately; I hope you can take a look, and any comments will be appreciated. Thanks in advance.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-16 Thread Colin Patrick McCabe (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032964#comment-14032964 ]

Colin Patrick McCabe commented on HDFS-6382:


bq. Plus, in the places that need this the most, one has to deal with getting 
what essentially becomes a critical part of uptime getting scheduled, competing 
with all of the other things running and, to remind you, to just delete 
files. It's sort of ridiculous to require YARN running for what is 
fundamentally a file system problem. It simply doesn't work in the real world.

In the examples you give, you're already using YARN for Hive and Pig, so it's 
already a critical part of the infrastructure.  Anyway, you should be able to 
put the cleanup job in a different queue.  It's not like YARN is strictly FIFO.

bq. One eventually gets to the point that the auto cleaner job is now running 
hourly just so /tmp doesn't overrun the rest of HDFS. Because these run outside 
of HDFS, they are slow and tedious and generally fall in the lap of teams that 
don't do Java so end up doing all sorts of squirrely things to make these jobs 
work. This also sucks.

Well, presumably the implementation in this JIRA won't be done by a team that 
doesn't do Java, so we should skip that problem, right?

The comments about /tmp are, I think, another example of how this needs to be 
highly configurable.  Rather than modifying Hive or Pig to set TTLs on things, 
we probably want to be able to configure the scanner to look at everything 
under /tmp.  Perhaps the scanner should attach a TTL to things in /tmp that 
don't already have one.

Running this under YARN has an intuitive appeal to the upstream developers, 
since YARN is a scheduler.  If we write our own scheduler for this inside HDFS, 
we're kind of duplicating some of that work, including the monitoring, logging, 
etc. features.  I think Steve's comments (and a lot of the earlier comments) 
reflect that.  Of course, to users not already using YARN, a standalone daemon 
might seem more appealing.

The proposal to put this in the balancer seems like a reasonable compromise.  
We can reuse some of the balancer code, and that way, we're not adding another 
daemon to manage.  I wonder if we could have YARN run the balancer 
periodically?  That might be interesting.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-16 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033079#comment-14033079 ]

Steve Loughran commented on HDFS-6382:
--

{quote}
It's sort of ridiculous to require YARN running for what is fundamentally a 
file system problem. It simply doesn't work in the real world.
{quote}

Allen, that's like saying it's ridiculous to require bash scripts to perform 
what is fundamentally a Unix filesystem problem. One is data, the other is the 
mechanism to run code near the data. I don't try to hide any local /tmp 
cleanup init.d scripts inside an ext3 plugin, after all.
YARN:
# handles security by having you include Kerberos tickets in the launch.
# stops you having to choose a specific server to run this thing (and hence a single 
point of failure). 
# lets you scale up when needed.





[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-15 Thread Zesheng Wu (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032082#comment-14032082 ]

Zesheng Wu commented on HDFS-6382:
--

[~ste...@apache.org] Thanks for your feedback.
We have discussed whether to use an MR job or a standalone daemon, and most 
people upstream have come to an agreement that a standalone daemon is reasonable 
and acceptable. You can go through the earlier discussion.  

[~aw] Thanks for your feedback.
Your suggestion is really valuable and strengthens our confidence in implementing 
it as a standalone daemon.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-14 Thread Allen Wittenauer (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031762#comment-14031762 ]

Allen Wittenauer commented on HDFS-6382:


Let's take a step back and give a more concrete example: /tmp

/tmp is the plague of Hadoop operations teams everywhere.  Between Hive and Pig 
leaving bits hanging around because the client can't properly handle signals, and 
Crunch, which by default just leaves (seemingly) as much crap laying around as 
it possibly can because someone somewhere might want to debug it someday, /tmp 
is the cesspool of HDFS.  Every competent admin ends up writing some sort of 
auto-/tmp-cleaner because of these issues.  At scale, /tmp can have hundreds of 
TB and millions of objects in it in less than 24 hours. It sucks.  

One eventually gets to the point that the auto cleaner job is running 
hourly just so /tmp doesn't overrun the rest of HDFS.  Because these run 
outside of HDFS, they are slow and tedious, and they generally fall in the lap of 
teams that don't do Java, who end up doing all sorts of squirrely things to make 
these jobs work.  This also sucks. 

Now, I can see why using an MR job is appealing (easy!), but it isn't very 
effective.  For one, we've already been here once and the result was distch.  
Hell, there was a big fight just to get distch written and that--years and 
years later!--still isn't documented because of how slowly it works.  Throw in 
directories like /tmp that simply have WAY too much churn and one can see that 
depending upon MR to work here just isn't viable.  Plus, in the places that 
need this the most, one has to deal with getting what essentially becomes a 
critical part of uptime scheduled, competing with all of the other 
things running, and, to remind you, just to delete files.  It's sort of 
ridiculous to require YARN running for what is fundamentally a file system 
problem.   It simply doesn't work in the real world.  

While at Hadoop Summit, a bunch of us sat around a table and talked about 
this issue with regard specifically to /tmp.  (We didn't know about this JIRA, 
BTW.) The solution we came up with was basically a service that would bootstrap 
by reading the fsimage and then follow the edits stream by having the audit 
information sent to Kafka.  One of the big advantages of this is that we can get 
near real-time updates of the parts of the file system we need to operate 
on.  Since we only care about a subsection of the file system, the memory 
requirements are significantly lower, and it might be possible to coalesce 
deletes in a smart way to cut back on RPCs.  I suspect it wouldn't be hard to 
generalize this type of solution to handle multiple use cases.  But for me, 
this is critical admin functionality that HDFS needs desperately, and throwing 
the problem to MR just isn't workable.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-13 Thread Zesheng Wu (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030366#comment-14030366 ]

Zesheng Wu commented on HDFS-6382:
--

I filed two sub-tasks to track the development of this feature.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-13 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031018#comment-14031018 ]

Steve Loughran commented on HDFS-6382:
--

My comments:

# This can be done as an MR job. 
# If you are worried about excessive load, start exactly one mapper, and 
consider throttling requests. As some object stores throttle heavy load and 
reject a very high DELETE rate, throttling is going to be needed for 
anything that works against them.
# You can then use Oozie as the scheduler.
# MR restart handles failures: you just re-enumerate the directories, and deleted 
files don't show up.
# If you really, really can't do it as MR, write it as a one-node YARN app, for 
which I'd recommend Apache Twill as the starting point. In fact, this project 
would make for a nice example.

Don't rush to write a new service here for an intermittent job: that just adds 
a new cost, a service to install and monitor. Especially when you consider 
that this new service will need:
# a launcher entry point
# tests
# commitment from the HDFS team to maintain it

{quote}
We can implement TTL within a MapReduce job that is similar to DistCp. We 
could run this MapReduce job over and over again, or nightly or weekly, to delete 
the expired files and directories.
{quote}

Yes, and schedule it with Oozie.
{quote}
 (1) Advantages:
The major advantage of the MapReduce framework is concurrency control; if we 
want to run multiple tasks concurrently, choosing a MapReduce approach will ease 
concurrency control.
{quote}

There are other advantages:
# The MR job will be simple to write and can be submitted remotely. 
# It's trivial to test and therefore maintain. 
# No need to wait for a new version of Hadoop; you can evolve it locally.
# Different users, submitting jobs with different Kerberos tickets, can work on 
their own files securely.
# There's no need to install and maintain a new service.

{quote}
(2) Disadvantages:
For implementing the TTL functionality, one task is enough; multiple tasks will 
create too much contention and load on the NameNode. 
{quote}

# Demonstrate this by writing an MR job and assessing its load when you have a 
throttled executor.
{quote}
On the other hand, using a MapReduce job will introduce additional dependencies and 
have additional overheads.
{quote}

# Additional dependencies? In a cluster with MapReduce installed? The only 
additional dependency is the JAR with the mapper and the reducer.
# What additional overheads? Are they really any worse than running another 
service in your cluster, with its own classpath, failure modes, and security needs?
 
My recommendation, before writing a single line of a new service, is to write 
it as an MR job. You will find it easy to write and maintain; server load is 
handled by making sleep time a configurable parameter. 
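
As a concrete illustration of this recommendation, here is a minimal sketch of 
such a mapper, not a committed design: it assumes the job's input is a text file 
listing candidate paths whose TTLs have already expired, and the config key name 
is hypothetical.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TtlDeleteMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private FileSystem fs;
  private long sleepMs;

  @Override
  protected void setup(Context ctx) throws IOException {
    fs = FileSystem.get(ctx.getConfiguration());
    // Hypothetical config key: sleep between deletes to throttle NN load.
    sleepMs = ctx.getConfiguration().getLong("ttl.delete.sleep.ms", 100);
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Each input line is the path of a file whose TTL has already expired.
    fs.delete(new Path(value.toString()), false);
    Thread.sleep(sleepMs);  // the configurable sleep mentioned above
  }
}
{code}

Running it with exactly one mapper (e.g. a single unsplittable input file) keeps 
the delete stream serial, which is the load-control property being argued for.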

If you can then actually demonstrate that this is inadequate on a large 
cluster, then consider a service. But start with MapReduce first. If you 
haven't written an MR job before, don't worry: it doesn't take that long to 
learn, and having done it you'll understand your users' workflow better.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-11 Thread Tsz Wo Nicholas Sze (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028682#comment-14028682 ]

Tsz Wo Nicholas Sze commented on HDFS-6382:
---

Checked the design doc.  It looks good.  Some comments:

- "Standalone Daemon Approach ... Implementing a completely new standalone 
daemon can rarely reuse existing code and will need lots of work."
I don't agree.  We may refactor the Balancer or other tools if necessary.

- Using xattrs for TTL is a good idea.  Do we really need the ttl in milliseconds?  
Do you think that the daemon could guarantee such accuracy?  We don't want to 
waste namenode memory storing trailing zeros/digits for each ttl.  How 
about supporting symbolic ttl notation, e.g. 10h, 5d?  (A parsing sketch follows 
after these comments.)

- The name "Supervisor" sounds too general.  How about calling it "TtlManager" 
for the moment?  If more new features are added to the tool, we may 
change the name later.

- For setting a ttl on a directory foo, write permission on the parent 
directory of foo is not enough.  The Namenode also checks rwx on all 
subdirectories of foo for a recursive delete.  BTW, permissions can change 
from time to time: a user may be able to delete a file/dir at the time of 
setting the TTL, but the same user may not have permission to delete the same 
file/dir when the ttl expires.
I suggest not checking additional permission requirements when setting the ttl, 
but instead running as the particular user when deleting the file.  Then we need 
to add the username to the ttl xattr.
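
To make the symbolic ttl notation suggested above concrete, here is a minimal 
parsing sketch; the class name and the exact set of unit suffixes are 
illustrative, not something from the design doc:

{code:java}
import java.util.concurrent.TimeUnit;

public final class TtlParser {
  /** Parses symbolic ttl notation such as "10h" or "5d" into milliseconds. */
  public static long parseTtlMillis(String ttl) {
    char unit = ttl.charAt(ttl.length() - 1);
    long n = Long.parseLong(ttl.substring(0, ttl.length() - 1));
    switch (unit) {
      case 's': return TimeUnit.SECONDS.toMillis(n);
      case 'm': return TimeUnit.MINUTES.toMillis(n);
      case 'h': return TimeUnit.HOURS.toMillis(n);
      case 'd': return TimeUnit.DAYS.toMillis(n);
      case 'w': return TimeUnit.DAYS.toMillis(7 * n);
      default:  throw new IllegalArgumentException("unknown ttl unit: " + unit);
    }
  }
}
{code}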





[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-11 Thread Tsz Wo Nicholas Sze (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028729#comment-14028729 ]

Tsz Wo Nicholas Sze commented on HDFS-6382:
---

bq. How about supporting symbolic ttl notation, e.g. 10h, 5d?

The actual value in xattr could be encoded as two bytes -- 3 bits for the unit 
(year, month, week, day, hour, minute, second, millisecond) and 13 bits for 
value.  If we store it using milliseconds, it probably needs eight bytes.
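
A sketch of that packing, with illustrative names; 3 bits select the unit and 
13 bits hold the value, so the whole ttl fits in a Java {{short}}:

{code:java}
public final class TtlCodec {
  // 3-bit unit codes: 0=ms, 1=sec, 2=min, 3=hour, 4=day, 5=week, 6=month, 7=year.
  public static short encode(int unit, int value) {
    if (unit < 0 || unit > 7 || value < 0 || value >= (1 << 13)) {
      throw new IllegalArgumentException("unit in [0,7], value in [0,8191]");
    }
    return (short) ((unit << 13) | value);
  }

  public static int unitOf(short encoded)  { return (encoded >> 13) & 0x7; }
  public static int valueOf(short encoded) { return encoded & 0x1FFF; }
}
{code}

Two bytes per inode instead of an eight-byte millisecond counter is exactly the 
memory saving being argued for.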



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-11 Thread Zesheng Wu (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028768#comment-14028768 ]

Zesheng Wu commented on HDFS-6382:
--

[~szetszwo], thanks for your valuable suggestions.
bq. Using xattrs for TTL is a good idea. Do we really need ttl in milliseconds? 
Do you think that the daemon could guarantee such accuracy? We don't want to 
waste namenode memory space to store trailing zeros/digits for each ttl. How 
about supporting symbolic ttl notation, e.g. 10h, 5d?
Yes, I agree with you that the daemon can't guarantee millisecond accuracy, 
and in fact there's no need to guarantee such accuracy. As you suggested, we 
can use encoded bytes to save the NN's memory.

bq. The name Supervisor sounds too general. How about calling it TtlManager 
for the moment? If there are more new features added to the tool, we may change 
the name later.
OK, TtlManager is more suitable for the moment.

bq. For setting ttl on a directory foo, write permission on the 
parent directory of foo is not enough. Namenode also checks rwx for all 
subdirectories of foo for recursive delete. 
Nice catch. If we want to conform to the delete semantics mentioned by Colin, 
we should check the subdirectories recursively.

bq. BTW, permission could be changed from time to time. A user may be able to 
delete a file/dir at the time of setting TTL but the same user may not have 
permission to delete the same file/dir when the ttl expires.
The deletion will be done by a superuser (which the TtlManager runs as), so it 
seems this is not a problem?

bq. I suggest not to check additional permission requirement on setting ttl but 
run as the particular user when deleting the file. Then we need to add username 
to the ttl xattr.
Good point, but adding the username to the ttl xattr requires more NN memory; 
we should weigh whether it's worth doing.




[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-10 Thread Colin Patrick McCabe (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026772#comment-14026772 ]

Colin Patrick McCabe commented on HDFS-6382:


bq. You mean that we scan the whole namespace at first and then split it into 5 
pieces according to hash of the path, why do we just complete the work during 
the first scanning process? If I misunderstand your meaning, please point out.

You need to make one RPC for each file or directory you delete.  In contrast, 
when listing a directory you make only one RPC for every {{dfs.ls.limit}} 
elements (by default 1000).  So if you have 5 workers all listing all 
directories, but only calling delete on some of the files, you still might come 
out ahead in terms of number of RPCs, provided you had a high ratio of files to 
directories.

There are other ways to partition the namespace which are smarter, but rely on 
some knowledge of what is in it, which you'd have to keep track of.

A single-node design will work for now, though, considering that you probably 
want rate-limiting anyway.
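
To make the RPC arithmetic concrete, here is a minimal sketch of such a worker 
using the public FileSystem listing API; the class name and the 
modification-time check (standing in for a real ttl-xattr lookup) are 
illustrative only:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class TtlScanWorker {

  /** Illustrative expiry check; a real implementation would read the ttl xattr. */
  static boolean isExpired(FileStatus st, long maxAgeMs) {
    return System.currentTimeMillis() - st.getModificationTime() > maxAgeMs;
  }

  /**
   * Every worker walks the whole tree (listing batches many entries per RPC),
   * but deletes only the files in its own hash bucket (one RPC per delete).
   */
  static void scan(FileSystem fs, Path root, int workerId, int numWorkers,
      long maxAgeMs) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    while (it.hasNext()) {
      LocatedFileStatus st = it.next();
      // Mask the sign bit so the bucket index is non-negative.
      int bucket = (st.getPath().toString().hashCode() & Integer.MAX_VALUE)
          % numWorkers;
      if (bucket == workerId && isExpired(st, maxAgeMs)) {
        fs.delete(st.getPath(), false);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    scan(fs, new Path(args[0]), Integer.parseInt(args[1]),
        Integer.parseInt(args[2]), Long.parseLong(args[3]));
  }
}
{code}

With numWorkers = 1 this degenerates to the single-node design discussed above.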

bq. For the simplicity purpose, in the initial version, we will use logs to 
record which file/directory is deleted by TTL, and errors during the deleting 
process.

Even if it's not implemented at first, we should think about the configuration 
required here.  I think we want the ability to email the admins when things go 
wrong.  Possibly the notifier could be pluggable or have several policies.  
There was nothing in the doc about configuration in general, which I think we 
need to fix.  For example, how is rate limiting configurable?  How do we notify 
admins that the rate is too slow to finish in the time given?

bq. It doesn't need to be an administrator command, user only can setTtl on 
file/directory that they have write permission, and can getTtl on 
file/directory that they have read permission.

You can't delete a file in HDFS unless you have write permission on the 
containing directory.  Whether you have write permission on the file itself is 
not relevant.  So I would expect the same semantics here (probably enforced by 
setfacl itself).



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-10 Thread Zesheng Wu (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027325#comment-14027325 ]

Zesheng Wu commented on HDFS-6382:
--

bq. Even if it's not implemented at first, we should think about the 
configuration required here. I think we want the ability to email the admins 
when things go wrong. Possibly the notifier could be pluggable or have several 
policies. There was nothing in the doc about configuration in general, which I 
think we need to fix. For example, how is rate limiting configurable? How do we 
notify admins that the rate is too slow to finish in the time given?
OK, I will update the document and post a new version soon.

bq. You can't delete a file in HDFS unless you have write permission on the 
containing directory. Whether you have write permission on the file itself is 
not relevant. So I would expect the same semantics here (probably enforced by 
setfacl itself).
That's reasonable; I'll spell it out clearly in the document.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-09 Thread Colin Patrick McCabe (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025585#comment-14025585 ]

Colin Patrick McCabe commented on HDFS-6382:


For the MR strategy, it seems like this could be parallelized fairly easily.  
For example, if you have 5 MR tasks, you can calculate the hash of each path, 
and then task 1 can do all the paths that are 0 mod 5, task 2 can do all the 
paths that are 1 mod 5, and so forth.  MR also doesn't introduce extra 
dependencies since HDFS and MR are packaged together.

I don't understand what you mean by "the mapreduce strategy will have 
additional overheads".  What overheads are you foreseeing?

It is true that you need to avoid overloading the NameNode.  But this is a 
concern with any approach, not just the MR one.  It would be good to see a 
section on this.  I think the simplest way to do it is to rate-limit RPCs to 
the NameNode to a configurable rate.
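
One way to do that, sketched under the assumption that a Guava version with 
RateLimiter (13.0+) is on the classpath; the class name and constructor 
parameter are illustrative, and the rate would presumably come from a config key:

{code:java}
import java.io.IOException;
import com.google.common.util.concurrent.RateLimiter;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ThrottledDeleter {
  private final FileSystem fs;
  private final RateLimiter limiter;

  public ThrottledDeleter(FileSystem fs, double maxRpcsPerSecond) {
    this.fs = fs;
    // Token-bucket limiter capping delete RPCs to the NameNode.
    this.limiter = RateLimiter.create(maxRpcsPerSecond);
  }

  public boolean delete(Path path) throws IOException {
    limiter.acquire();  // blocks until a permit is available
    return fs.delete(path, false);
  }
}
{code}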

bq. \[for the standalone daemon\] The major advantage of this approach is that 
we don’t need any extra work to finish the TTL work, all will be done in the 
daemon automatically. 

I don't understand what you mean by this.  What will be done automatically?

How are you going to implement HA for the standalone daemon?  I suppose if all 
the state is kept in HDFS, you can simply restart it when it fails.  However, 
it seems like you need to checkpoint how far along in the FS you are, so that 
if you die and later get restarted, you don't have to redo the whole FS scan.  
This implies reading directories in alphabetical order, or similar.  You also 
need to somehow record when the last scan was, perhaps in a file in HDFS.

I don't see a lot of discussion of logging and monitoring in general.  How is 
the user going to become aware that a file was deleted because of a TTL?  Or if 
there is an error during the delete, how will the user know?  Logging is one 
choice here.  Creating a file in HDFS is another.

The setTtl command seems reasonable.  Does this need to be an administrator 
command?



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-09 Thread Zesheng Wu (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026047#comment-14026047 ]

Zesheng Wu commented on HDFS-6382:
--

Thanks [~cmccabe] for your feedback.
bq. For the MR strategy, it seems like this could be parallelized fairly 
easily. For example, if you have 5 MR tasks, you can calculate the hash of each 
path, and then task 1 can do all the paths that are 0 mod 5, task 2 can do all 
the paths that are 1 mod 5, and so forth. MR also doesn't introduce extra 
dependencies since HDFS and MR are packaged together.
You mean that we scan the whole namespace first and then split it into 5 
pieces according to the hash of each path? Why don't we just complete the work 
during the first scanning pass? If I've misunderstood your meaning, please 
point it out.

bq. I don't understand what you mean by the mapreduce strategy will have 
additional overheads. What overheads are you foreseeing?
Possible overheads: starting a MapReduce job needs to split the input, start an 
AppMaster, and collect results from random machines. (Perhaps 'overheads' is not 
the right word here.)

bq. I don't understand what you mean by this. What will be done automatically?
Here 'automatically' means we do not have to rely on external tools; the daemon 
itself can manage the work well.

bq. How are you going to implement HA for the standalone daemon?
Good point. As you suggested, one approach is to save the state in HDFS and simply 
restart the daemon when it fails. But managing the state is complex, and I am 
considering how to simplify this. One simpler approach is to treat the 
daemon as stateless and simply restart it when it fails: we needn't 
checkpoint, and we just scan from the beginning when it restarts. 
Because we can require that the work the daemon does is idempotent, starting 
from the beginning will be harmless. Possible drawbacks of the latter approach 
are that it may waste some time and may delay the work, but they are 
acceptable. 

bq. I don't see a lot of discussion of logging and monitoring in general. How 
is the user going to become aware that a file was deleted because of a TTL? Or 
if there is an error during the delete, how will the user know? 
For simplicity, in the initial version we will use logs to record 
which files/directories are deleted by TTL, and any errors during the deletion 
process.

bq. Does this need to be an administrator command?
It doesn't need to be an administrator command: users can only setTtl on 
files/directories they have write permission on, and can only getTtl on 
files/directories they have read permission on.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-05 Thread Hangjun Ye (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018506#comment-14018506 ]

Hangjun Ye commented on HDFS-6382:
--

Thanks Colin. We will start drafting a design doc and ask for you guys' help 
reviewing it.

Yes, xattrs remove the big burden of storing the policy; the major 
question left is where to run the logic.

Besides these 3 options, another related piece might be the trash. Currently 
the trash is implemented as a client-side capability; the trash cleanup logic 
(the trash emptier) depends on FileSystem to operate on the namespace and is 
basically a client-side function. But the trash emptier runs *inside* the NN as 
a daemon thread, instead of as a separate daemon process. I guess it interacts 
with the NN via RPC even though it runs inside the NN.

We can observe some similarities among the trash, the balancer, and the proposed 
TTL: they mainly need data from the NN; they could be implemented as client-side 
capabilities (via RPC); and they need to run periodically. So could we unify all 
of these in one framework/daemon? It also echoes Haohui's earlier points. And if 
it's implemented cleanly enough, the user could optionally run it inside the NN 
as a daemon thread to have fewer jobs to maintain, as long as the user is 
willing to take the risk of running additional logic inside the NN (without 
changing the NN's logic for this, as it still interacts with the NN like a 
client).

That's just a preliminary idea; we might still want to have the TTL as a separate 
daemon first, as that's the most straightforward. Let's discuss more after we 
have the design doc.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-04 Thread Hangjun Ye (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017427#comment-14017427 ]

Hangjun Ye commented on HDFS-6382:
--

Thanks Colin! That's exactly what we want. It seems it's on the way to being 
merged to branch-2; we will wait for it.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-03 Thread Hangjun Ye (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016335#comment-14016335 ]

Hangjun Ye commented on HDFS-6382:
--

Thanks Haohui and Colin.

The balancer, or a balancer-like standalone daemon, sounds like a feasible 
approach to us. A special requirement of the TTL cleanup is that we need 
persistent storage to hold all TTL policies set by users, which the balancer and 
DistCp don't require. It would be nice if the namenode could store such 
information so we don't have to find somewhere else.

So, just wondering: could we add an opaque feature to INode to store 
arbitrary bytes? The NN would just store it, not interpret it. As an analogy, 
HBase supports tags to store arbitrary metadata on a cell: 
https://issues.apache.org/jira/browse/HBASE-8496

Then we could have external tools/daemons that let end users set their TTL 
policies and run the cleanup logic. The only change to the NN is to add the new 
feature and expose APIs to set/get it; the complicated and volatile logic 
(metadata encoding, interpretation, cleanup) is done outside the NN. And the 
change might have much broader uses beyond TTL.

Any thoughts?



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016805#comment-14016805
 ] 

Colin Patrick McCabe commented on HDFS-6382:


bq. So we're wondering: could we add an opaque feature to INode that stores 
arbitrary bytes? The NN would just store it, not interpret it. As an analogy, 
HBase supports tags to store arbitrary metadata on a cell: 
https://issues.apache.org/jira/browse/HBASE-8496

It sounds like extended attributes (xattrs) might work here.  They were 
recently implemented in HDFS-2006 and subtasks.  They basically let you 
associate some arbitrary key/value pairs with each inode.  Check out 
https://issues.apache.org/jira/secure/attachment/12644341/HDFS-XAttrs-Design-3.pdf
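
For concreteness, here is a minimal sketch of how an external tool could stash 
a TTL on an inode using the {{FileSystem}} xattr API from HDFS-2006. The 
{{user.ttl}} attribute name and the millisecond encoding are assumptions made 
for illustration; the NN just stores the bytes and never interprets them.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TtlXAttr {
  // Hypothetical attribute name; xattrs in the "user." namespace are
  // governed by ordinary file permissions, and the NN treats the value
  // as opaque bytes.
  private static final String TTL_XATTR = "user.ttl";

  public static void setTtl(FileSystem fs, Path path, long ttlMillis)
      throws IOException {
    fs.setXAttr(path, TTL_XATTR,
        Long.toString(ttlMillis).getBytes(StandardCharsets.UTF_8));
  }

  public static long getTtl(FileSystem fs, Path path) throws IOException {
    return Long.parseLong(
        new String(fs.getXAttr(path, TTL_XATTR), StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path(args[0]);
    setTtl(fs, p, 30L * 24 * 60 * 60 * 1000);  // e.g. 30 days
    System.out.println("ttl.ms = " + getTtl(fs, p));
  }
}
{code}

An external daemon could then list paths, read this xattr, and delete expired 
files entirely through public client APIs.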



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-02 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015678#comment-14015678
 ] 

Haohui Mai commented on HDFS-6382:
--

I think the comments against implementing it in the NN are legit. Popping up 
one level, I'm wondering what the best approach is to meet the following 
requirements:

# Fine-tune the behavior of HDFS, which requires information from HDFS's 
internal data structures.
# Perform the above task without MapReduce, to simplify operation of the 
cluster.

To meet these requirements, today it looks to me like there is no way other 
than making massive changes in HDFS.

What I'm wondering is whether it is possible to architect the system to make 
things easier. For example, is it possible to generalize the architecture of 
the balancer we have today to accomplish these types of tasks? From a very 
high level it looks to me like most of the code could sit outside of the NN 
while still meeting the above requirements. Since this is aimed at advanced 
usage, there is more freedom in the design of the architecture. For instance, 
the architecture might choose to expose the details of the implementation and 
not guarantee compatibility (like an Exokernel type of system).

Thoughts?



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-06-02 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015922#comment-14015922
 ] 

Colin Patrick McCabe commented on HDFS-6382:


I'm -1 on the idea of putting this in the NameNode.  Let's see if we can work 
together to figure out where the best place for it is, though.  Can you comment 
on why MR is not an option for you?  I am concerned that there will be a lot of 
wheel reinvention if we don't use MR (authentication, resource management, 
scheduling, etc.).  Why not do as DistCp does?  As Haohui said, another 
option is the balancer or a completely standalone daemon.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-30 Thread Hangjun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013411#comment-14013411
 ] 

Hangjun Ye commented on HDFS-6382:
--

I think we have two discussions here now: a TTL cleanup policy (implemented 
inside or outside the NN), and a general mechanism to help implement such a 
policy easily inside the NN.

I've been convinced that a specific TTL cleanup policy implementation is NOT 
likely to fly directly in the NN's core code; I'm more interested in pursuing 
a mechanism that enables such policy implementations.

Consider HBase's coprocessors 
(https://blogs.apache.org/hbase/entry/coprocessor_introduction): people can 
extend functionality easily (w/o modifying the base classes), e.g. for 
counting rows or secondary indexes. One could argue that most such usages 
don't necessarily have to be implemented server side, but having such a 
mechanism gives users the opportunity to choose what is most suitable for 
their requirements.

If the NN had such an extensible mechanism (as Haohui suggested earlier), we 
could implement a TTL cleanup policy in the NN in an elegant way (w/o touching 
the base classes). And since the NN has already abstracted out INode.Feature, 
we could implement a TTLFeature to hold the metadata. The policy 
implementation doesn't have to go into the community's codebase if it's too 
specific; we could keep it in our private branch. But basing it on a general 
mechanism (w/o touching the base classes) makes it easy to maintain 
(considering we would upgrade to new Hadoop releases regularly).
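
Purely as an illustration (not a proposed patch), a TTLFeature along these 
lines could be very small; note that {{INode.Feature}} is a NameNode-internal, 
unstable interface, so anything like this would live and die with NN internals:

{code:java}
package org.apache.hadoop.hdfs.server.namenode;

// Illustrative sketch only: a feature holding an absolute expiry time,
// modeled on existing features such as FileUnderConstructionFeature.
public class TTLFeature implements INode.Feature {
  private final long expiryTimeMs;  // absolute expiry time, epoch millis

  public TTLFeature(long expiryTimeMs) {
    this.expiryTimeMs = expiryTimeMs;
  }

  public long getExpiryTimeMs() {
    return expiryTimeMs;
  }

  public boolean isExpired(long nowMs) {
    return nowMs >= expiryTimeMs;
  }
}
{code}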

If you guys think such a general mechanism deserves consideration, we are 
happy to contribute some effort.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-30 Thread Hangjun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013624#comment-14013624
 ] 

Hangjun Ye commented on HDFS-6382:
--

BTW: TTL is just one application that could benefit from a general mechanism; 
Haohui gave several other nice use cases that would benefit as well.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-30 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014030#comment-14014030
 ] 

Colin Patrick McCabe commented on HDFS-6382:


Chris, Andrew, and I have brought up a lot of reasons why this probably doesn't 
make sense in the NameNode.

Just to summarize:
* security / correctness concerns: it's easy to make a mistake that could bring 
down the NameNode or the entire FS
* non-generality to systems using S3 or another FS in addition to HDFS
* issues with federation (which NN does the cleanup, and how do you decide?)
* complexities surrounding our client-side Trash implementation and our 
server-side snapshots
* configuration burden on sysadmins
* inability to change the cleanup code without restarting the NameNode
* HA concerns (need to avoid split-brain or lost updates)
* error handling (where do users find out about errors?)
* semantics: disappearing or time-limited files is an unfamiliar API, not like 
the traditional FS APIs we usually implement

Making this pluggable doesn't fix any of those problems, and it adds some more:
* API stability issues (the INode and Feature classes have changed a lot, and 
we make no guarantees there)
* CLASSPATH issues (if I want to send an email about a cleanup job with the 
FooEmailer library, how do I get that into the NameNode's CLASSPATH? How do I 
avoid jar conflicts?)

The only points I've seen raised in favor of doing this in the NameNode are:
* the NameNode already has an authorization system which this could use.
* HBase has coprocessors which also allow loading arbitrary code.

To the first point, there are lots of other ways to deal with authorization, 
like by using YARN (which also has authorization), or configuring the cleanup 
using files in HDFS.

To the second point, HBase doesn't use coprocessors for cleanup jobs... it uses 
them for things like secondary indices, a much better-defined problem.  The 
functionality you want is not something that should be implemented as a 
coprocessor, even if we had those.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-29 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012442#comment-14012442
 ] 

Chris Nauroth commented on HDFS-6382:
-

bq. The mechanism implemented inside the NameNode would (maybe periodically) 
execute all policies specified by users, and it could do so safely as a 
superuser, since authentication/authorization were done when users set their 
policies on the NameNode.

This logic is subject to time of check/time of use race conditions, possibly 
resulting in incorrect deletion of data.  For example, imagine the following 
sequence:
# A user calls the setttl command on /file1.  Authentication is successful, and 
the authenticated user is the file owner, so NN decides the user is authorized 
to set a TTL.
# An admin changes the owner of /file1 in order to revoke the user's access.
# Now the NN's background expiration thread/job starts running.  It finds a TTL 
on /file1 and deletes the file.  Since this is running as the HDFS super-user, 
nothing blocks the delete, even though the user who set the TTL really no 
longer has permission to delete.

With an external process, authentication and authorization are enforced at the 
time of delete for the specific user, so there is no time of check/time of use 
race condition, and there is no chance of an incorrect delete.

Running some code as a privileged user might look expedient in some ways, but 
it also compromises the file system permissions model somewhat.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-29 Thread Hangjun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012480#comment-14012480
 ] 

Hangjun Ye commented on HDFS-6382:
--

bq. This logic is subject to time of check/time of use race conditions, 
possibly resulting in incorrect deletion of data. For example, imagine the 
following sequence: ...

It doesn't sound like a race condition to me. We could consider TTL an 
independent attribute of the file, just like the file owner or replication 
factor. In the above scenario, it seems to work as expected: the admin only 
changes the owner of /file1 but leaves the TTL attribute as is, so the TTL 
should still be in effect. If the admin doesn't want that to happen, he/she 
should first unset the TTL attribute (i.e. set it to infinity) before changing 
the owner of /file1, and the new owner of /file1 can set a new TTL attribute 
later if needed.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-29 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012551#comment-14012551
 ] 

Colin Patrick McCabe commented on HDFS-6382:


bq. One approach, as you suggested, is that we implement a separate cleanup 
platform: users submit their policies to this platform, and we perform the 
actual cleanup on HDFS on behalf of users (as a superuser or another powerful 
user). But the separate platform has to implement an 
authentication/authorization mechanism to make sure users are who they claim 
to be and have the required permissions (authentication is a must; 
authorization might be optional, but it's better to have it). That duplicates 
work the NameNode already does with Kerberos/ACLs. If it's implemented inside 
the NameNode, we could leverage the NameNode's authentication/authorization 
mechanism.

YARN / MR / etc already have authentication frameworks that you can use.  For 
example, you can set up a YARN queue with certain permissions so that only 
certain users or groups can submit to it.

Another idea is to have an HDFS directory where each group (or user) puts the 
files containing the cleanup policies they want, and to let HDFS take care of 
the permissions; a sketch of that layout follows.
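
To sketch that idea (the directory layout, file format, and keys below are all 
hypothetical), each user could own a subdirectory whose permissions HDFS 
already enforces, and the cleanup daemon would simply read whatever policy 
files it finds there:

{code:java}
import java.io.InputStream;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PolicyReader {
  // Assumed layout: /cleanup-policies/<user>/<name>.properties, each
  // containing e.g. "path=/logs/app1" and "ttl.days=30". Write access
  // to /cleanup-policies/<user> is controlled by ordinary HDFS
  // permissions, so no new auth mechanism is needed.
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus userDir : fs.listStatus(new Path("/cleanup-policies"))) {
      for (FileStatus f : fs.listStatus(userDir.getPath())) {
        Properties policy = new Properties();
        try (InputStream in = fs.open(f.getPath())) {
          policy.load(in);
        }
        System.out.printf("owner=%s path=%s ttlDays=%s%n",
            userDir.getPath().getName(),
            policy.getProperty("path"), policy.getProperty("ttl.days"));
      }
    }
  }
}
{code}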



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-29 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012591#comment-14012591
 ] 

Andrew Wang commented on HDFS-6382:
---

Even if implementing security were a major hurdle (and I really don't think it 
is as hard as you think, considering we have quite a few examples of Hadoop 
auth besides the NN), the rest of Chris's points still stand. I also think 
that, semantically, a TTL is not an expected type of file attribute for those 
of us with a Unix background, which leads to TOCTOU issues like the one Chris 
pointed out, if only because of user expectations.

So, at this point, I think there are strong technical reasons not to implement 
this in the NN, and strong reasons to do this type of data lifecycle management 
externally.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-28 Thread Hangjun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010841#comment-14010841
 ] 

Hangjun Ye commented on HDFS-6382:
--

Thanks Haohui for your reply.

Let me confirm I got your point. Your suggestion is that we'd better have a 
general mechanism/framework to run a job (maybe periodically) over the 
namespace inside the NN, and the TTL policy would just be a specific job that 
might be implemented by the user?
That's an interesting direction; we will think about it.

We are heavy users of Hadoop and also make some in-house improvements per our 
business requirements. We definitely want to contribute the improvements back 
to the community, as long as they're helpful for the community.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-28 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010879#comment-14010879
 ] 

Haohui Mai commented on HDFS-6382:
--

bq. Your suggestion is that we'd better have a general mechanism/framework to 
run a job (maybe periodically) over the namespace inside the NN, and the TTL 
policy would just be a specific job that might be implemented by the user?

This is correct. There are a couple of additional use cases that might be 
useful to keep in mind:

# Archiving data. TTL is one of the use cases here.
# Backing up or syncing data between clusters. It's nice to back up / sync 
data between clusters for disaster recovery, without running an MR job.
# Balancing data between datanodes.

A mechanism that can support the above use cases could be quite powerful and 
improve the state of the art. I'm happy to collaborate if this is the 
direction you guys want to pursue.

bq. We are heavy users of Hadoop and also make some in-house improvements per 
our business requirements. We definitely want to contribute the improvements 
back to the community.

This is great to hear. Patches are welcome.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-28 Thread Hangjun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010891#comment-14010891
 ] 

Hangjun Ye commented on HDFS-6382:
--

Thanks Haohui, that's clear to us now.
That's interesting, and we'd like to pursue the more general approach.
We will take time to work out a rough design and will ask you guys to review it.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-28 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011305#comment-14011305
 ] 

Chris Nauroth commented on HDFS-6382:
-

bq. ...run a job (maybe periodically) over the namespace inside the NN...

Please correct me if I misunderstood, but this sounds like execution of 
arbitrary code inside the NN process.  If so, this opens the risk of resource 
exhaustion at the NN by buggy or malicious code.  Even if there is a fork for 
process isolation, it's still sharing machine resources with the NN process.  
If the code is running as the HDFS super-user, then it has access to sensitive 
resources like the fsimage file.  If multiple such in-process jobs are 
submitted concurrently, then it would cause resource contention with the main 
work of the NN.  Multiple concurrent jobs also get into the realm of 
scheduling.  There are lots of tough problems here that would increase the 
complexity of the NN.

Even putting that aside, I see multiple advantages in implementing this 
externally instead of embedded inside the NN.  Here is a list of several 
problems that an embedded design would need to solve, and which I believe are 
already easily addressed by an external design.  This includes/expands on 
issues brought up by others in earlier comments too.

* Trash: The description mentions trash capability as a requirement.  Trash 
functionality is currently implemented as a client-side capability.
** Embedded: We'd need to reimplement trash inside the NN, or heavily refactor 
for code sharing.
** External: The client already has the trash capability, so this problem is 
already solved.
* Integration: Many Hadoop deployments use an alternative file system like S3 
or Azure storage.  In these deployments, there is no NameNode.
** Embedded: The feature is only usable for HDFS-based deployments.  Users of 
alternative file systems can't use the feature.
** External: The client already has the capability to target any Hadoop file 
system implementation, so this problem is already solved.
* HA: In the event of a failover, we must guarantee that the former active NN 
does not drive any expiration activity.
** Embedded: Any background thread or in-process jobs running inside the NN 
must coordinate shutdown during a failover.
** External: Thanks to our client-side retry policies, an external process 
automatically transitions to the new active NN after a failover, and there is 
no risk of a split-brain scenario, so this problem is already solved.
* Authentication/Authorization: Who exactly is the effective user running the 
delete, and how do we manage their login and file permission enforcement?
** Embedded: You mention there is an advantage to running embedded, but I 
didn't quite understand.  Are you suggesting running the deletes inside a 
{{UserGroupInformation#doAs}} for the specific user?
** External: The client already knows how to authenticate RPC, and the NN 
already knows how to enforce authorization on files for that authenticated 
user, so this problem is already solved.
* Error Handling: How do users find out when the deletes don't work?
** Embedded: There is no mechanism for asynchronous user notification inside 
the NN.  As others have mentioned, there is a lot of complexity in this area.  
If it's email, then you need to solve the problem of reliable email delivery 
(i.e. retries if SMTP gateways are down).  If it's monitoring/alerting, then 
you need to expose new monitoring endpoints to publish sufficient information.
** External: The client's exception messages are sufficient to identify file 
paths that failed during synchronous calls, and the NN audit log is another 
source of troubleshooting information, so this problem is already solved.
* Federation: With federation, the HDFS namespace is split across multiple 
NameNodes.
** Embedded: The design needs to coordinate putting the right expiration work 
on the right NN hosting that part of the namespace.
** External: The client has the capability to configure a client-side mount 
table that joins together multiple federated namespaces, and {{ViewFileSystem}} 
then routes RPC to the correct NN depending on the target file path, so this 
problem is already solved.
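
To make the external option concrete, here is a minimal sketch of a 
client-side sweep (the directory argument and TTL handling are assumptions for 
illustration). Because it uses only the public {{FileSystem}} and {{Trash}} 
client APIs, it gets trash semantics, client-side HA failover retries, 
federation via the client-side mount table, and non-HDFS file systems for 
free, per the list above:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TtlSweep {
  // Moves children of dir older than ttlMillis to the trash, falling
  // back to a plain delete only when trash is disabled.
  public static void sweep(FileSystem fs, Path dir, long ttlMillis,
      Configuration conf) throws IOException {
    long cutoff = System.currentTimeMillis() - ttlMillis;
    for (FileStatus st : fs.listStatus(dir)) {
      if (st.getModificationTime() < cutoff) {
        if (!Trash.moveToAppropriateTrash(fs, st.getPath(), conf)) {
          fs.delete(st.getPath(), true);  // recursive delete
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    sweep(fs, new Path(args[0]), Long.parseLong(args[1]), conf);
  }
}
{code}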




[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011357#comment-14011357
 ] 

Colin Patrick McCabe commented on HDFS-6382:


I agree with Chris' comments here.  There are just so many advantages to 
running outside the NameNode that I think that's the design we should start 
with.  If we later find something that would work better with NN support, we 
can think about it then.

Hangjun Ye wrote:
bq. Another benefit of having it inside the NN is that we don't have to handle 
the authentication/authorization problem in a separate system. For example, we 
have a shared HDFS cluster for many internal users, and we don't want someone 
to set a TTL policy on someone else's files. The NN could handle this easily 
with its own authentication/authorization mechanism.

The client handles authentication/authorization very well, actually.  You can 
choose to run your cleanup job as the superuser (which can do anything) or as 
some other, less powerful user who is limited (safer).  But when you run inside 
the NameNode, there are no safeguards... everything is effectively superuser.  
And you can destroy or corrupt the entire filesystem very easily that way, 
especially if your cleanup code is buggy.
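
For example, an external daemon can drop superuser privileges per delete via 
proxy-user impersonation, so the NN enforces the policy owner's permissions at 
delete time. A rough sketch, assuming the daemon's principal is whitelisted 
via the standard hadoop.proxyuser.* settings and "alice" owns the policy:

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class DeleteAsUser {
  public static void main(final String[] args) throws Exception {
    final Configuration conf = new Configuration();
    // Impersonate the policy owner; the NN checks alice's permissions,
    // not the daemon's, when the delete RPC arrives.
    UserGroupInformation alice = UserGroupInformation.createProxyUser(
        "alice", UserGroupInformation.getLoginUser());
    boolean deleted = alice.doAs(new PrivilegedExceptionAction<Boolean>() {
      @Override
      public Boolean run() throws Exception {
        FileSystem fs = FileSystem.get(conf);
        return fs.delete(new Path(args[0]), true);  // recursive delete
      }
    });
    System.out.println("deleted = " + deleted);
  }
}
{code}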



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-28 Thread Hangjun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012008#comment-14012008
 ] 

Hangjun Ye commented on HDFS-6382:
--

Thanks Chris and Colin for your valuable comments; I'd like to address your 
concern about the security problem.

First, our scenario is as follows:
We have a Hadoop cluster shared by multiple teams for their storage and 
computation requirements, and we are the dev/support team that ensures the 
functionality and availability of the cluster. The cluster is security-enabled 
to ensure that every team can only access the files they should. So every team 
is a common user of the cluster, and we hold the superuser.

Currently several teams have a requirement to clean up files based on a TTL 
policy. Obviously they could each have a cron job to do that by themselves, 
but that would mean a lot of repeated work, so we'd like a mechanism that lets 
them specify/implement their policies easily.

One approach, as you suggested, is that we implement a separate cleanup 
platform: users submit their policies to this platform, and we perform the 
actual cleanup on HDFS on behalf of users (as a superuser or another powerful 
user). But the separate platform has to implement an 
authentication/authorization mechanism to make sure users are who they claim 
to be and have the required permissions (authentication is a must; 
authorization might be optional, but it's better to have it). That duplicates 
work the NameNode already does with Kerberos/ACLs.

If it's implemented inside the NameNode, we could leverage the NameNode's 
authentication/authorization mechanism. For example, we could provide a 
./bin/hdfs dfs -setttl path/file command (just like -setrep). Users would 
specify their policy with it, and the NameNode would persist it somewhere, 
maybe as an attribute of the file like the replication factor. The mechanism 
implemented inside the NameNode would (maybe periodically) execute all 
policies specified by users, and it could do so safely as a superuser, since 
authentication/authorization were done when users set their policies on the 
NameNode.

To address several detailed concerns you raised:
* Buggy or malicious code: the proposed concept (which Haohui actually 
proposed) should be pretty similar to HBase's coprocessors 
(http://hbase.apache.org/book.html#cp); it's a plug-in or extension of the 
NameNode, most likely enabled at deployment time. A common user can't submit 
it; only the cluster owner can. So the code is not arbitrary, and its 
quality/safety can be vetted.

* Who exactly is the effective user running the delete, and how do we manage 
their login and file permission enforcement: the extension runs as 
superuser/system, and a specific extension implementation could do any 
permission enforcement needed. For the TTL-based cleanup policy executor, no 
permission enforcement is needed at this stage, as authentication/authorization 
were done when the user set the policy.

I think the idea proposed by Haohui is to have an extensible mechanism in the 
NameNode to run jobs that depend heavily on namespace data, and to keep the 
specific job's code as decoupled from the NameNode's core code as possible. 
Certainly it's not easy, as Chris pointed out several problems, like HA and 
concurrency, but it might deserve further thought.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010053#comment-14010053
 ] 

Colin Patrick McCabe commented on HDFS-6382:


bq. But if there's no internal cleanup mechanism in HDFS, all users (across 
companies) need to write their own cleanup tools, which is a lot of repeated 
work.

Like I said, we should write such a tool and add it to the base Hadoop 
distribution.  This is similar to what we did with {{DistCp}}.  Then users 
would not need to write their own versions of this stuff.

It's important to distinguish between creating a tool to handle deleting old 
files (which we all agree we should do), and putting this into the NameNode 
(which seems questionable).
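
A DistCp-style tool could start from the stock {{Tool}}/{{ToolRunner}} 
scaffolding (the class name and behavior here are hypothetical), which gives 
it the standard -D/-conf/-fs generic options without any NameNode changes:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TtlCleaner extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // Scan the paths given in args and delete expired files here,
    // using getConf() to reach whichever cluster is configured.
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new TtlCleaner(), args));
  }
}
{code}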



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-27 Thread Zesheng Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010520#comment-14010520
 ] 

Zesheng Wu commented on HDFS-6382:
--

bq. Like I said, we should write such a tool and add it to the base Hadoop 
distribution. This is similar to what we did with DistCp. Then users would not 
need to write their own versions of this stuff.
Sure, this is another good option.

bq. It's important to distinguish between creating a tool to handle deleting 
old files (which we all agree we should do), and putting this into the NameNode 
(which seems questionable).
Why do you think that putting the cleanup mechanism into the NameNode seems 
questionable? Can you point out some details?



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010632#comment-14010632
 ] 

Colin Patrick McCabe commented on HDFS-6382:


bq. Why do you think that putting the cleanup mechanism into the NameNode 
seems questionable? Can you point out some details?

Andrew and Chris commented about this earlier.  See:
https://issues.apache.org/jira/browse/HDFS-6382?focusedCommentId=13998933&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13998933

I would add to that:
* Every user of this is going to want a slightly different deletion policy.  
It's just way too much configuration for the NameNode to reasonably handle.  
Much easier to do it in a user process.  For example, maybe you want to keep at 
least 100 GB of logs, 100 GB of foo data, and 1000 GB of bar data.  It's 
easy to handle this complexity in a user process, incredibly complex and 
frustrating to handle it in the NameNode.
* Your nightly MR job (or whatever) also needs to be able to do things like 
email sysadmins when the disks are filling up, which the NameNode can't 
reasonably be expected to do.
* I don't see a big advantage to doing this in the NameNode, and I see a lot of 
disadvantages (more complexity to maintain, difficult configuration, need to 
restart to update config)

Maybe I could be convinced otherwise, but so far the only argument that I've 
seen for doing it in the NN is that it would be re-usable.  And this could just 
as easily apply to an implementation outside the NN.  For example, as I pointed 
out earlier, DistCp is reusable, without being in the NameNode.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-27 Thread Jian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010657#comment-14010657
 ] 

Jian Wang commented on HDFS-6382:
-

I think it would be better for you to provide a (backup & cleanup) platform 
for your users; you could implement many cleanup strategies for the users in 
your company.
This would eliminate a lot of repeated work.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-27 Thread Hangjun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010734#comment-14010734
 ] 

Hangjun Ye commented on HDFS-6382:
--

Implementing it outside the NN is definitely another option, and I agree with 
Colin that it's not feasible to implement a complex cleanup policy (e.g., one 
based on storage space) inside the NN.

TTL is a very simple (but general) policy, and we might even consider it an 
attribute of a file, like the number of replicas (a sketch of this idea 
follows this comment). It seems it wouldn't introduce much complexity to 
handle it in the NN.

Another benefit of having it inside the NN is that we don't have to handle the 
authentication/authorization problem in a separate system. For example, we 
have a shared HDFS cluster for many internal users, and we don't want someone 
to set a TTL policy on someone else's files. The NN could handle this easily 
with its own authentication/authorization mechanism.

So far a TTL-based cleanup policy is good enough for our scenario (Zesheng and 
I are from the same company, where we support our company's internal Hadoop 
usage), and it would be nice to have a simple, workable solution in HDFS.
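
To make the TTL-as-attribute idea concrete, here is a hypothetical sketch 
built on HDFS extended attributes (HDFS-2006). The user.ttl attribute name and 
the scanner are assumptions for illustration; HDFS has no built-in TTL 
attribute, and the scanner would still have to run somewhere:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical TTL-as-attribute sketch. The TTL lives in a
 * user-namespace extended attribute, so the NN's normal permission
 * checks already control who may set it on a given file.
 */
public class TtlAttribute {
  private static final String TTL_XATTR = "user.ttl";  // assumed name

  /** Tag a path with a TTL, in milliseconds. */
  static void setTtl(FileSystem fs, Path path, long ttlMs) throws Exception {
    fs.setXAttr(path, TTL_XATTR,
        Long.toString(ttlMs).getBytes(StandardCharsets.UTF_8));
  }

  /** Delete direct children of dir whose TTL has expired. */
  static void scan(FileSystem fs, Path dir) throws Exception {
    long now = System.currentTimeMillis();
    for (FileStatus st : fs.listStatus(dir)) {
      Map<String, byte[]> xattrs = fs.getXAttrs(st.getPath());
      byte[] raw = xattrs.get(TTL_XATTR);
      if (raw == null) {
        continue;  // no TTL set on this entry
      }
      long ttlMs = Long.parseLong(new String(raw, StandardCharsets.UTF_8));
      if (st.getModificationTime() + ttlMs < now) {
        fs.delete(st.getPath(), true);
      }
    }
  }
}
{code}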



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-27 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010820#comment-14010820
 ] 

Haohui Mai commented on HDFS-6382:
--

bq. TTL is a very simple (but general) policy, and we might even consider it 
an attribute of a file, like the number of replicas. It seems it wouldn't 
introduce much complexity to handle it in the NN.

bq. Another benefit of having it inside the NN is that we don't have to handle 
the authentication/authorization problem in a separate system. For example, we 
have a shared HDFS cluster for many internal users, and we don't want someone 
to set a TTL policy on someone else's files. The NN could handle this easily 
with its own authentication/authorization mechanism.

I agree that running jobs over the namespace without MR should be the 
direction to go. However, I think the main obstacle here is that the design 
mixes the mechanism (running jobs over the namespace without MR) and the 
policy (TTL) together.

As [~cmccabe] pointed out earlier, every user has his or her own policy. Given 
that HDFS has a wide range of users, this type of design/implementation is 
unlikely to fly in the ecosystem.

Currently HDFS does not have the above mechanism; you're more than welcome to 
contribute a patch.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003783#comment-14003783
 ] 

Colin Patrick McCabe commented on HDFS-6382:


I don't think a nightly (or weekly) cleanup job that lives outside HDFS is that 
difficult or complex to write.  If it were done as a MapReduce job, it could 
easily work on the whole cluster.  This is something we could consider putting 
upstream.

Another issue to consider here is snapshots.  Deleting files is not going to 
free space if they exist in a snapshot.
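
A cleanup job can at least detect this case. Below is a hedged sketch, 
assuming the DistributedFileSystem snapshot-listing API; the prefix check is 
approximate and only flags paths that might still be pinned by a snapshot:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshottableDirectoryStatus;

/**
 * Sketch: before deleting to reclaim space, check whether a path lies
 * under a snapshottable directory, since deleting files referenced by
 * a snapshot may not free any space.
 */
public class SnapshotAwareCleanup {
  static boolean underSnapshottableDir(DistributedFileSystem dfs, Path path)
      throws Exception {
    SnapshottableDirectoryStatus[] dirs = dfs.getSnapshottableDirListing();
    if (dirs == null) {
      return false;  // no snapshottable directories visible to this user
    }
    String p = Path.getPathWithoutSchemeAndAuthority(path).toString();
    for (SnapshottableDirectoryStatus s : dirs) {
      // Approximate prefix check; a real tool would compare path
      // components to avoid matching /data/foobar against /data/foo.
      if (p.startsWith(s.getFullPath().toString())) {
        return true;
      }
    }
    return false;
  }
}
{code}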



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-20 Thread Zesheng Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004242#comment-14004242
 ] 

Zesheng Wu commented on HDFS-6382:
--

Thanks [~cmccabe]. Certainly an outside cleanup tool is not too difficult or 
complex to implement, and there are many ways to satisfy the requirements. But 
if HDFS has no internal cleanup mechanism, all users (across companies) need 
to write their own cleanup tools, which is a lot of repeated work. If HDFS 
supported an internal cleanup mechanism, it would clearly be more convenient; 
do you agree?

Regarding snapshots, I think a snapshotted file deleted by the TTL mechanism 
behaves just the same as a snapshotted file deleted manually by a user.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-16 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998933#comment-13998933
 ] 

Chris Nauroth commented on HDFS-6382:
-

I agree with Andrew's opinion that this is better implemented outside the file 
system.  An automatic delete based on a TTL introduces a high risk of 
concurrency bugs for applications.  For example, imagine a MapReduce job gets 
submitted, we derive input splits from a file, and then the file expires after 
input split calculation but before the map tasks start running and reading the 
blocks.  Overall, I think it's preferable to put deletion into the hands of 
the calling application for explicit control.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-16 Thread Zesheng Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999746#comment-13999746
 ] 

Zesheng Wu commented on HDFS-6382:
--

Thanks [~cnauroth]. I agree with your MapReduce example and the risk it 
describes, but this risk can't be avoided even with outside tools. Suppose we 
use a nightly cron job as Andrew suggested: a MapReduce job gets submitted, we 
derive input splits from a file, and then the cron job deletes the file after 
input split calculation but before the map tasks start running and reading the 
blocks; the risk is the same. My point is that TTL is just a convenient way to 
accomplish tasks like those described in the proposal; users should learn how 
to use it correctly, rather than use a more complicated approach that has no 
obvious advantage.



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-15 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997794#comment-13997794
 ] 

Andrew Wang commented on HDFS-6382:
---

This is just my opinion, but isn't this something better done in userspace? A 
nightly cron job could do this for you, and log files are typically already 
timestamped for easy parsing and removal.
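
As a rough illustration of that cron-job approach, here is a sketch that 
deletes dated log directories older than a retention window; the /logs layout, 
yyyy-MM-dd naming, and 30-day window are all assumptions:

{code:java}
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Cron-style cleanup: delete dated log directories older than the
 * retention window. Assumes directories named yyyy-MM-dd under /logs;
 * both the layout and the 30-day window are illustrative.
 */
public class LogJanitor {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    LocalDate cutoff = LocalDate.now().minusDays(30);
    for (FileStatus st : fs.listStatus(new Path("/logs"))) {
      if (!st.isDirectory()) {
        continue;
      }
      try {
        // LocalDate.parse expects ISO yyyy-MM-dd by default.
        LocalDate day = LocalDate.parse(st.getPath().getName());
        if (day.isBefore(cutoff)) {
          fs.delete(st.getPath(), true);  // recursive: whole day's logs
        }
      } catch (DateTimeParseException e) {
        // Not a dated directory; leave it alone.
      }
    }
  }
}
{code}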



[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL

2014-05-15 Thread Zesheng Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998374#comment-13998374
 ] 

Zesheng Wu commented on HDFS-6382:
--

Thanks [~andrew.wang]. Of course a nightly cron job can do this for us, but 
our users have various kinds of data backup requirements, and log backup is 
just one of them; we want to offer a more convenient way for our users to 
satisfy them. Imagine we have many backup requirements, each with a different 
TTL configuration. One way to handle this is for each user to maintain his own 
cron job; another is for the cluster administrator to maintain all the cron 
jobs for all users. Neither is very convenient, and both require a lot of 
manual operational work. If HDFS supported TTL as proposed above, these 
requirements would be satisfied very easily.
