[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Mackrory updated HADOOP-14041:
-----------------------------------
    Attachment: HADOOP-14041-HADOOP-13345.006.patch

Thanks for the reviews, all - good stuff.

The problems [~fabbri] saw boil down to two things, one of which is fixed in 
this patch: I had not tested the command with anything being inferred from an 
S3 path, and I wasn't parsing and using that path the way the other commands 
do. That is now fixed and covered by the tests. The other thing is that the 
command does not appear to be parsing generic options (which does seem wrong - 
according to the docs, implementing Tool should give you that for free, and we 
do implement Tool), but the behavior wouldn't be what you expect anyway, 
because the table config is set from the -m flag or the S3 path you provide. I 
think the CLI behavior is badly defined here in general, so I've filed 
HADOOP-14094 to rethink what options are exposed and how.
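
In case it helps, a minimal sketch of the Tool / ToolRunner wiring I mean 
(illustrative only, not the actual patch; the class name and the prune-specific 
parsing are placeholders): ToolRunner runs GenericOptionsParser over the 
arguments, folds -D / -conf / -fs etc. into the Configuration, and hands only 
the remaining args to run().

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PruneTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Generic options have already been stripped and folded into getConf();
    // only command-specific args (e.g. -m or an s3a:// URI) remain here.
    Configuration conf = getConf();
    // ... resolve the metadata store from -m or the S3 path and prune ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new PruneTool(), args));
  }
}
{code}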

I like [~ste...@apache.org]'s recommendation to just throw the IOException. My 
original thinking was that if there's an issue deleting one row, we could keep 
retrying the others, but an exception that affects one row and not the 
subsequent ones is unlikely, so it's worth bubbling it up so we know about the 
problem. Removing that block also highlighted that my batching logic was wrong: 
instead of processing complete batches inside the loop and then processing 
whatever is left over afterwards, I was effectively processing whatever the 
batch happened to contain at the end of every iteration. That's been fixed, and 
I verified the count was correct with several hundred objects getting pruned.
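
To make the fix concrete, this is the shape of the corrected batching, as a 
rough sketch rather than the literal patch (the entry type and deleteBatch() 
are placeholders): fill a batch inside the loop and flush it only when it is 
full, flush the leftover partial batch exactly once after the loop, and let any 
IOException propagate instead of swallowing it per row.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PruneBatchingSketch {

  /** Delete expired entries in batches; returns how many entries were deleted. */
  static int pruneInBatches(List<String> expired, int batchSize) throws IOException {
    List<String> batch = new ArrayList<>(batchSize);
    int deleted = 0;
    for (String key : expired) {
      batch.add(key);
      if (batch.size() == batchSize) {
        deleteBatch(batch);          // any IOException bubbles up to the caller
        deleted += batch.size();
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {          // flush whatever is left over, once
      deleteBatch(batch);
      deleted += batch.size();
    }
    return deleted;
  }

  /** Placeholder for the real batched delete against the metadata store. */
  static void deleteBatch(List<String> batch) throws IOException {
    // issue the batched delete here
  }
}
{code}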

On a related note, I also changed the log message to INFO and had it count 
items and report the batch size rather than just the number of batches. Without 
that, the last message you get out-of-the-box on the CLI is that the metastore 
has been initialized, which is misleading. It will now log when the metadata 
store connection has been initialized and then finish by logging how many items 
were deleted and what the batch size was. I think that's friendlier, and 
probably something we want to do more of for the other commands if / when we 
rethink the interface.

> CLI command to prune old metadata
> ---------------------------------
>
>                 Key: HADOOP-14041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14041
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HADOOP-14041-HADOOP-13345.001.patch, 
> HADOOP-14041-HADOOP-13345.002.patch, HADOOP-14041-HADOOP-13345.003.patch, 
> HADOOP-14041-HADOOP-13345.004.patch, HADOOP-14041-HADOOP-13345.005.patch, 
> HADOOP-14041-HADOOP-13345.006.patch
>
>
> Add a CLI command that allows users to specify an age at which to prune 
> metadata that hasn't been modified for an extended period of time. Since the 
> primary use-case targeted at the moment is list consistency, it would make 
> sense (especially when authoritative=false) to prune metadata that is 
> expected to have become consistent a long time ago.


