RE: Cleaning up files in HDFS?

2008-11-15 Thread Michael Andrews
Is there any way to ensure that a file is deleted if an MR job crashes?  It
seems like deleteOnExit should do this, but it's hard to tell from the
docs/code.

Michael

From: lohit [EMAIL PROTECTED]
Sent: Friday, November 14, 2008 6:07 PM
To: core-user@hadoop.apache.org
Subject: Re: Cleaning up files in HDFS?

Have you tried fs.trash.interval?

<property>
  <name>fs.trash.interval</name>
  <value>0</value>
  <description>Number of minutes between trash checkpoints.
  If zero, the trash feature is disabled.
  </description>
</property>

More info about the trash feature is here:
http://hadoop.apache.org/core/docs/current/hdfs_design.html
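For example (not part of the original reply), setting a nonzero value enables the
trash; with a value such as 1440 minutes, files removed through the shell are moved
to the user's .Trash directory and kept for roughly a day before a later checkpoint
purges them:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Keep trash checkpoints for roughly one day.</description>
</property>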


Thanks,
Lohit

- Original Message 
From: Erik Holstad [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Friday, November 14, 2008 5:08:03 PM
Subject: Cleaning up files in HDFS?

Hi!
We would like to run a delete script that deletes all files older than
x days that are stored in lib l in HDFS. What is the best way of doing that?

Regards Erik
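For illustration (not part of the original thread), a minimal sketch of such a
script against the FileSystem API of this era; the directory and age threshold
are placeholder arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not from the thread: delete plain files under one
// directory whose modification time is older than N days.
public class ExpireOldFiles {
  public static void main(String[] args) throws Exception {
    Path dir = new Path(args[0]);               // directory to clean, e.g. the "lib" dir
    long maxAgeDays = Long.parseLong(args[1]);  // e.g. 7
    long cutoff = System.currentTimeMillis() - maxAgeDays * 24L * 60 * 60 * 1000;

    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] listing = fs.listStatus(dir);
    if (listing == null) return;                // directory does not exist
    for (FileStatus stat : listing) {
      // getModificationTime() is milliseconds since the epoch
      if (!stat.isDir() && stat.getModificationTime() < cutoff) {
        fs.delete(stat.getPath(), false);       // non-recursive delete
      }
    }
  }
}

Scheduling this from cron (or a similar scheduler) outside the cluster would give the
"delete script" behavior described above.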



Re: How to concatenate hadoop files to a single hadoop file

2008-10-02 Thread Michael Andrews

You might be able to use HARs (Hadoop archives):

http://hadoop.apache.org/core/docs/current/hadoop_archives.html

On 10/2/08 2:51 PM, Steve Gao [EMAIL PROTECTED] wrote:

Anybody know? Thanks a lot.

--- On Thu, 10/2/08, Steve Gao [EMAIL PROTECTED] wrote:
From: Steve Gao [EMAIL PROTECTED]
Subject: How to concatenate hadoop files to a single hadoop file
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Date: Thursday, October 2, 2008, 3:17 PM

Suppose I have 3 files in Hadoop that I want to cat into a single
file. I know it can be done with hadoop dfs -cat to a local file and then
uploading it back to Hadoop, but that's very expensive for large files. Is there an
internal way to do this in Hadoop itself? Thanks
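For illustration (not from the thread), one alternative is FileUtil.copyMerge, which
streams each part file and writes the concatenation back into HDFS. The bytes still
flow through the client JVM, so it is not free for large files, but nothing is staged
on local disk. A minimal sketch with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch, not from the thread: concatenate every file in an HDFS
// directory into a single HDFS file without staging anything on local disk.
public class MergeFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);   // directory containing the part files
    Path dstFile = new Path(args[1]);  // single merged output file
    // deleteSource=false keeps the originals; addString=null inserts no separator
    FileUtil.copyMerge(fs, srcDir, fs, dstFile, false, conf, null);
  }
}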









Expiring temporary files

2008-09-25 Thread Michael Andrews
Hi,

We have a requirement to essentially expire temporary files that are no longer
needed in an HDFS share.  I have noticed some traffic on this very same issue
and was wondering how best to approach the problem and/or contribute.
Basically, we need to remove a user-specified subset of files from HDFS based
on mtime or atime.

Possible Approaches:
  - Mount HDFS over FUSE and use standard tmpreaper utility.
  - Implement a Hadoop version of tmpreaper using the FileSystem and PathFilter
classes (a rough sketch follows this list).
  - Place temporary files in a .Trash-like directory and use the Trash class's
checkpoint and expunge methods.  It would be nice here if the user could
choose to expire all checkpoints except the N most recent, or
incrementally expire checkpoints to free up space.
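As a rough illustration of the second approach (not from the original message), a
PathFilter that accepts only paths whose modification time is older than a cutoff;
since PathFilter only sees the Path, the filter has to stat each path itself:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter, not from the original message: keep only paths whose
// modification time is older than a cutoff, for use with FileSystem.listStatus.
public class OlderThanFilter implements PathFilter {
  private final FileSystem fs;
  private final long cutoffMillis;

  public OlderThanFilter(FileSystem fs, long cutoffMillis) {
    this.fs = fs;
    this.cutoffMillis = cutoffMillis;
  }

  public boolean accept(Path path) {
    try {
      return fs.getFileStatus(path).getModificationTime() < cutoffMillis;
    } catch (IOException e) {
      return false;  // skip anything we cannot stat
    }
  }
}

A reaper pass would then call fs.listStatus(shareDir, new OlderThanFilter(fs, cutoff))
and fs.delete(status.getPath(), false) on each returned entry, much like tmpreaper.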

Thanks for the feedback,

Michael


Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews
Hi,

I am a new hadoop developer and am struggling to understand why I cannot pass 
TupleWritable between a map and reduce function.  I have modified the wordcount 
example to demonstrate the issue.  Also I am using hadoop 0.17.1.

package wordcount;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.join.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, TupleWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, TupleWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            TupleWritable tuple = new TupleWritable(new Writable[] { one });
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, tuple);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, TupleWritable, Text, TupleWritable> {
        public void reduce(Text key, Iterator<TupleWritable> values,
                OutputCollector<Text, TupleWritable> output, Reporter reporter)
                throws IOException {
            IntWritable i = new IntWritable();
            int sum = 0;
            while (values.hasNext()) {
                i = ((IntWritable) values.next().get(0));
                sum += i.get();
            }
            TupleWritable tuple = new TupleWritable(new Writable[] { new IntWritable(sum) });
            output.collect(key, tuple);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(TupleWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
The output is always empty tuples ('[]').  Using the debugger, I have 
determined that the line:
TupleWritable tuple = new TupleWritable(new Writable[] { one } );

is properly constructing the desired tuple.  I am not sure whether output.collect
is emitting it correctly, as I cannot find the field in the OutputCollector data
structure.  When I check in the reduce method, the values are always empty tuples.
I have a feeling it has something to do with this line in the JavaDoc:

TupleWritable(Writable[] vals)
  Initialize tuple with storage; unknown whether any of them contain 
written values.

Thanks in advance for any and all help,

Michael




Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews

Sorry about the massive code chunk; I am not used to this mail client. I have
attached the file instead.






Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews
OK, thanks for the information.  I guess it seems strange to want to use
TupleWritable in this way, but it just seemed like the right thing to do based
on the API docs.  Is it more idiomatic to implement Writable when processing
structured data?  Again, I am really new to the hadoop community, but I will try
to file something in JIRA on this.  I am not really sure how to proceed with a
patch; maybe I could just try to clarify the docs?

On 8/7/08 4:38 PM, Chris Douglas [EMAIL PROTECTED] wrote:

You need access to TupleWritable::setWritten(int). If you want to use
TupleWritable outside the join package, then you need to make this
(and probably related methods, like clearWritten(int)) public and
recompile.

Please file a JIRA if you think it should be more general. -C
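
For context (not from the thread), the alternative Michael asks about is to implement
Writable directly, so serialization does not depend on TupleWritable's package-private
written-bit bookkeeping. A minimal sketch with a hypothetical CountWritable value type:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type, not from the thread: every field is serialized
// explicitly, so nothing depends on TupleWritable's internal written bits.
public class CountWritable implements Writable {
  private int count;

  public CountWritable() { }                  // Writables need a no-arg constructor
  public CountWritable(int count) { this.count = count; }

  public int get() { return count; }
  public void set(int count) { this.count = count; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(count);                      // serialize each field explicitly
  }

  public void readFields(DataInput in) throws IOException {
    count = in.readInt();
  }

  public String toString() {
    return Integer.toString(count);           // what TextOutputFormat will print
  }
}

Using CountWritable as the map output value class in place of TupleWritable would make
the word counts survive the trip from map to reduce without any join-package changes.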
