Re: *.gz input files

2009-06-04 Thread jason hadoop
Generally speaking, the .gz extension will be recognized by the input formats that inherit from TextInputFormat, and the correct thing will happen. Is there by chance an error in your log files about a codec loading failure? What version of Hadoop are you using, and can you provide a few more details
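To make the idea of extension-based recognition concrete, here is a minimal sketch in plain Java (standard library only). It is not Hadoop's code: the class name ExtensionCodecSketch and its methods are hypothetical, and it only assumes that a reader picks a decompressor by looking at the file name, which is the behaviour described above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Hadoop.
public class ExtensionCodecSketch {

    /** Wraps the raw stream in a gzip decompressor only when the name ends in ".gz". */
    public static InputStream openByExtension(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        return raw; // no known compression extension: read the bytes as-is
    }

    /** Reads every line of the (possibly gzipped) stream as UTF-8 text. */
    public static List<String> readLines(String fileName, InputStream raw) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(openByExtension(fileName, raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Helper for experiments: gzip a string in memory. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }
}
```

If the file name lacks the extension, or the decompressor cannot be loaded, a reader built this way falls back to handing raw compressed bytes to the line parser, which is one reason to check the logs for codec errors.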

Re: *.gz input files

2009-06-04 Thread Ian Soboroff
If your case is like mine, where you have lots of .gz files and you don't want splits in the middle of those files, you can use the code I just sent in the thread about traversing subdirectories. In brief, your RecordReader could do something like: public static class MyRecordReader
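The RecordReader code from that message is cut off in this listing, so the following is not a reconstruction of it. It is a plain-Java sketch of the underlying idea only: when planning splits, give each .gz file exactly one split covering the whole file. The class WholeFileSplitSketch, the Split type, and the ".gz" rule are all hypothetical names for illustration. In Hadoop itself, the hook I would expect for this is overriding isSplitable in a FileInputFormat subclass to return false.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
public class WholeFileSplitSketch {

    /** A byte range of one input file, handed to a single reader. */
    public static final class Split {
        public final String file;
        public final long start;
        public final long length;

        public Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }
    }

    /** Assumed rule for this sketch: anything ending in ".gz" must be read in one piece. */
    public static boolean isSplittable(String file) {
        return !file.endsWith(".gz");
    }

    /** Plans splits of at most splitSize bytes, except that unsplittable files get one split. */
    public static List<Split> planSplits(String file, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        if (!isSplittable(file) || fileLength <= splitSize) {
            splits.add(new Split(file, 0, fileLength)); // whole file, one reader
            return splits;
        }
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(file, start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }
}
```

With many .gz files of moderate size, one split per file still spreads the work across tasks; the cost only shows up when a single compressed file is very large.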

Re: *.gz input files

2009-06-03 Thread Alex Loddengaard
Hi Adam, Gzipped files don't play that nicely with Hadoop, because they aren't splittable. Can you use bzip2 instead? bzip2 files play more nicely with Hadoop, because they're splittable. If you're stuck with gzip, then take a look here: . I don
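As a small illustration of why a gzip file cannot simply be cut into pieces, the sketch below (plain Java, standard library only) compresses some sample records and then tries to decompress starting from the middle of the compressed bytes. The class name GzipSplitSketch and its helpers are hypothetical. It relies only on the fact that a gzip stream carries its header at the start and that the compressed data depends on what came before it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Hadoop.
public class GzipSplitSketch {

    /** Gzips a byte array in memory. */
    public static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        }
        return buffer.toByteArray();
    }

    /** Returns true if the bytes from offset to the end decode cleanly as a gzip stream. */
    public static boolean canDecodeFrom(byte[] compressed, int offset) {
        try (InputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed, offset, compressed.length - offset))) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // drain the stream; we only care whether decoding succeeds
            }
            return true;
        } catch (IOException e) {
            // bad header, corrupt data, or unexpected end of stream
            return false;
        }
    }

    /** Builds count newline-terminated sample records. */
    public static byte[] sampleLines(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

A task handed the second half of such a file is in the position of canDecodeFrom(bytes, bytes.length / 2): it has no valid starting point, so the file has to be processed from the beginning by a single reader.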

*.gz input files

2009-06-03 Thread Adam Silberstein
Hi, I have some Hadoop code that works properly when the input files are not compressed, but it does not work for the gzipped versions of those files. My files are named *.gz, but the format is not being recognized. I'm under the impression I don't need to set any JobConf parameters to ind

RE: .gz input files having less output than uncompressed version

2009-05-07 Thread Malcolm Matalka
: Re: .gz input files having less output than uncompressed version Hi, What input format are you using for the GZipped file? I don't believe there is a GZip input format although some people have discussed whether it is feasible... Cheers Tim On Thu, May 7, 2009 at 9:05 PM, Malcolm Matal

Re: .gz input files having less output than uncompressed version

2009-05-07 Thread tim robertson
Hi, What input format are you using for the GZipped file? I don't believe there is a GZip input format although some people have discussed whether it is feasible... Cheers Tim On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka wrote: > Problem: > > I am comparing two jobs.  They both have the sa

.gz input files having less output than uncompressed version

2009-05-07 Thread Malcolm Matalka
Problem: I am comparing two jobs. They both have the same input content; however, in one job the input file has been gzipped, and in the other it has not. I get far fewer output rows in the gzipped result than I do in the uncompressed version: Lines in output: Gzipped: 86851 Uncompressed: 65693