Here is part of a shell script I wrote which deals with compressed input and produces compressed output (for streaming):
hadoop dfs -rmr $4
hadoop jar /usr/local/share/hadoop/contrib/streaming/hadoop-*-streaming.jar \
    -mapper $1 -reducer $2 \
    -input $3/* -output $4 \
    -file $1 -file $2 \
    -jobconf mapred.job.name="$5" \
    -jobconf stream.recordreader.compression=gzip \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

2009/7/14 Dmitry Pushkarev <u...@stanford.edu>:
> Dear hadoop users,
>
> Sorry for what is probably a very common question, but is there a way to
> process a folder of .gz files with streaming?
>
> The manual only describes how to create gzipped output, but I couldn't
> figure out how to use gzipped files as input.
>
> Right now I create a list of these files and process them like
> "hadoop -cat $file | gzip -dc |", but that doesn't use the data locality
> of the archives (each file is 64MB - exactly one block).
>
> Sample code or a link to the manual would be greatly appreciated.

--
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
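As a side note, the streaming pipeline in the command above can be simulated locally with plain shell tools, which is handy for testing a mapper/reducer pair before submitting the job. This is only a sketch: `cat` stands in for the mapper and `sort | uniq -c` for the shuffle plus a word-count reducer (both are placeholders, not part of the original script), while on a real cluster Hadoop distributes the work and applies the gzip codecs itself.

```shell
#!/bin/sh
# Build a small gzipped input file, standing in for one 64MB .gz block.
printf 'apple\nbanana\napple\n' | gzip > input.gz

# mapper | shuffle (sort) | reducer, with gzip on both ends,
# mirroring stream.recordreader.compression=gzip on the input side
# and mapred.output.compress=true on the output side.
gzip -dc input.gz | cat | sort | uniq -c | gzip > output.gz

# Inspect the compressed result.
gzip -dc output.gz
```

Swapping in your real mapper and reducer commands for `cat` and `uniq -c` gives a quick local smoke test of the same data flow the streaming job runs.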