RE: Cleaning up files in HDFS?
Is there any way to ensure that a file is deleted if an MR job crashes? It seems like deleteOnExit should do this, but it's hard to tell from the docs/code.

Michael

From: lohit [EMAIL PROTECTED]
Sent: Friday, November 14, 2008 6:07 PM
To: core-user@hadoop.apache.org
Subject: Re: Cleaning up files in HDFS?

Have you tried the fs.trash.interval property?

  <property>
    <name>fs.trash.interval</name>
    <value>0</value>
    <description>Number of minutes between trash checkpoints.
    If zero, the trash feature is disabled.
    </description>
  </property>

More info about the trash feature here:
http://hadoop.apache.org/core/docs/current/hdfs_design.html

Thanks,
Lohit

----- Original Message -----
From: Erik Holstad [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Friday, November 14, 2008 5:08:03 PM
Subject: Cleaning up files in HDFS?

Hi!
We would like to run a delete script that deletes all files older than x days that are stored in lib l in HDFS. What is the best way of doing that?

Regards Erik
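[Editor's note: for Erik's age-based cleanup, a small client built on the FileSystem API is one option. A minimal sketch, assuming the files sit directly under a single directory and that listStatus behaves as in the FileSystem API of that era; the class name and arguments are illustrative, not from the thread.]

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ExpireOldFiles {
    public static void main(String[] args) throws Exception {
      // Hypothetical arguments: a directory and a maximum age in days.
      Path dir = new Path(args[0]);
      long maxAgeDays = Long.parseLong(args[1]);
      long cutoff = System.currentTimeMillis()
          - maxAgeDays * 24L * 60 * 60 * 1000;

      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // listStatus already returns modification times, so no extra
      // round trip to the namenode is needed per file.
      FileStatus[] stats = fs.listStatus(dir);
      if (stats == null) return;  // directory does not exist
      for (FileStatus stat : stats) {
        if (!stat.isDir() && stat.getModificationTime() < cutoff) {
          fs.delete(stat.getPath(), false);  // non-recursive: plain files only
        }
      }
    }
  }

[This only covers one directory level; recursing into subdirectories or honoring atime would need more work, since HDFS of this era tracks modification time but not reliable access time.]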
Re: How to concatenate hadoop files to a single hadoop file
You might be able to use hars:
http://hadoop.apache.org/core/docs/current/hadoop_archives.html

On 10/2/08 2:51 PM, Steve Gao [EMAIL PROTECTED] wrote:

Anybody knows? Thanks a lot.

--- On Thu, 10/2/08, Steve Gao [EMAIL PROTECTED] wrote:

From: Steve Gao [EMAIL PROTECTED]
Subject: How to concatenate hadoop files to a single hadoop file
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Date: Thursday, October 2, 2008, 3:17 PM

Suppose I have 3 files in Hadoop that I want to cat into a single file. I know it can be done with hadoop dfs -cat to a local file and uploading that back to Hadoop, but that's very expensive for large files. Is there an internal way to do this in Hadoop itself? Thanks
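[Editor's note: besides archives, FileUtil.copyMerge can stream the parts into one HDFS file through the client, so nothing is staged on local disk. A hedged sketch; the paths are illustrative, and the final String argument is a separator appended after each part (null for none).]

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FileUtil;
  import org.apache.hadoop.fs.Path;

  public class ConcatParts {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Merge every file under the source directory into one target file.
      // The bytes still flow through this client process, but they never
      // touch the local filesystem.
      FileUtil.copyMerge(fs, new Path("/user/steve/parts"),
                         fs, new Path("/user/steve/merged.txt"),
                         false,  // keep the source files
                         conf, null);
    }
  }

[Note this is a copy, not a metadata-only concatenation: the data is read and rewritten once, which is still cheaper than a round trip through local disk.]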
Expiring temporary files
Hi,

We have a requirement to expire temporary files that are no longer needed in an HDFS share. I have noticed some traffic on this very same issue and was wondering how best to approach the problem and/or contribute. Basically, we need to remove a user-specified subset of files from HDFS based on mtime or atime.

Possible approaches:

- Mount HDFS over FUSE and use the standard tmpreaper utility.
- Implement a Hadoop version of tmpreaper using the FileSystem and PathFilter classes.
- Place temporary files in a .Trash-like directory and use the Trash class's checkpoint and expunge methods (a sketch follows this message). It would be nice here if the user could choose to expire all checkpoints except the N most recent, or incrementally expire checkpoints to free up space.

Thanks for the feedback,
Michael
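[Editor's note: a rough sketch of the third approach, driving the Trash class directly, assuming its moveToTrash/checkpoint/expunge methods behave as documented for this era; the path and interval are illustrative.]

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.Trash;

  public class TrashReaper {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // fs.trash.interval must be non-zero or trash is disabled entirely.
      conf.set("fs.trash.interval", "1440");  // minutes between checkpoints
      Trash trash = new Trash(conf);

      // Move an expired file into .Trash instead of deleting it outright.
      trash.moveToTrash(new Path("/tmp/scratch/old-file"));

      // Roll the current trash contents into a new checkpoint, then
      // remove checkpoints older than fs.trash.interval.
      trash.checkpoint();
      trash.expunge();
    }
  }

[Keeping only the N most recent checkpoints, as Michael asks, would need new logic: expunge in this era honors only the interval.]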
Passing TupleWritable between map and reduce
Hi,

I am a new hadoop developer and am struggling to understand why I cannot pass TupleWritable between a map and reduce function. I have modified the wordcount example to demonstrate the issue. Also, I am using hadoop 0.17.1.

  package wordcount;

  import java.io.IOException;
  import java.util.*;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.join.*;

  public class WordCount {

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, TupleWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
          OutputCollector<Text, TupleWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        TupleWritable tuple = new TupleWritable(new Writable[] { one });
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, tuple);
        }
      }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, TupleWritable, Text, TupleWritable> {
      public void reduce(Text key, Iterator<TupleWritable> values,
          OutputCollector<Text, TupleWritable> output, Reporter reporter)
          throws IOException {
        IntWritable i = new IntWritable();
        int sum = 0;
        while (values.hasNext()) {
          i = ((IntWritable) values.next().get(0));
          sum += i.get();
        }
        TupleWritable tuple =
            new TupleWritable(new Writable[] { new IntWritable(sum) });
        output.collect(key, tuple);
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(TupleWritable.class);

      conf.setMapperClass(Map.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
  }

The output is always empty tuples ('[]'). Using the debugger, I have determined that the line:

  TupleWritable tuple = new TupleWritable(new Writable[] { one });

is properly constructing the desired tuple. I am not sure if it is being output correctly by output.collect, as I cannot find the field in the OutputCollector data structure. When I check in the reduce method, the values are always empty tuples. I have a feeling it has something to do with this line in the JavaDoc:

  TupleWritable(Writable[] vals)
  Initialize tuple with storage; unknown whether any of them contain written values.

Thanks in advance for any and all help,

Michael
Re: Passing TupleWritable between map and reduce
Sorry about the massive code chunk, I am not used to this mail client. I attached the file instead.

On 8/7/08 4:18 PM, Michael Andrews [EMAIL PROTECTED] wrote:

Hi,

I am a new hadoop developer and am struggling to understand why I cannot pass TupleWritable between a map and reduce function. I have modified the wordcount example to demonstrate the issue. Also, I am using hadoop 0.17.1.

Is properly constructing the desired tuple. I am not sure if it is being output correctly by output.collect, as I cannot find the field in the OutputCollector data structure. When I check in the reduce method, the values are always empty tuples. I have a feeling it has something to do with this line in the JavaDoc:

  TupleWritable(Writable[] vals)
  Initialize tuple with storage; unknown whether any of them contain written values.

Thanks in advance for any and all help,

Michael
Re: Passing TupleWritable between map and reduce
OK, thanks for the information. I guess it seems strange to want to use TupleWritable in this way, but it seemed like the right thing to do based on the API docs. Is it more idiomatic to inherit from Writable when processing structured data? Again, I am really new to the hadoop community, but I will try to file something with JIRA on this. Not really sure how to proceed with a patch; maybe I could just try to clarify the docs?

On 8/7/08 4:38 PM, Chris Douglas [EMAIL PROTECTED] wrote:

You need access to TupleWritable::setWritten(int). If you want to use TupleWritable outside the join package, then you need to make this (and probably related methods, like clearWritten(int)) public and recompile. Please file a JIRA if you think it should be more general. -C

On Aug 7, 2008, at 4:18 PM, Michael Andrews wrote:

Hi, I am a new hadoop developer and am struggling to understand why I cannot pass TupleWritable between a map and reduce function. I have modified the wordcount example to demonstrate the issue. Also, I am using hadoop 0.17.1.
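[Editor's note: on the question of inheriting from Writable, a small custom value type avoids TupleWritable's package-private write machinery entirely. A minimal sketch; the class name and field are illustrative, not from the thread.]

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  // A value type holding a single count; add more fields as needed,
  // serializing each one in write() and reading them back in the
  // same order in readFields().
  public class CountWritable implements Writable {
    private int count;

    public CountWritable() {}                 // required no-arg constructor
    public CountWritable(int count) { this.count = count; }

    public int get() { return count; }
    public void set(int count) { this.count = count; }

    public void write(DataOutput out) throws IOException {
      out.writeInt(count);                    // serialize every field, in order
    }

    public void readFields(DataInput in) throws IOException {
      count = in.readInt();                   // deserialize in the same order
    }

    public String toString() { return Integer.toString(count); }
  }

[Used as the job's output value class and collected from the mapper, everything written in write() comes back in readFields(), unlike TupleWritable's unset "written" bits.]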