Hi Grzegorz,

    Below are the properties you can use for job input and output compression:

The property below is used by the codec factory. The codec is chosen based on 
the type (i.e. the suffix) of the file. By default the LineRecordReader used 
by FileInputFormat relies on this. If you want input compression handled in 
some other way, you can write an input format that does so.

core-site.xml:
---------------

<property> 
  <name>io.compression.codecs</name> 
  
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
 
  <description>A list of the compression codec classes that can be used 
               for compression/decompression.</description> 
</property> 
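The suffix-based lookup described above can be exercised directly through CompressionCodecFactory; a minimal sketch (the class name `CodecLookup` and the path are illustrative, and it assumes the Hadoop client jars are on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The factory registers every codec listed in io.compression.codecs.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Lookup is by file suffix: ".gz" resolves to GzipCodec;
        // an unrecognized suffix resolves to null (no decompression).
        CompressionCodec codec = factory.getCodec(new Path("/input/part-00000.gz"));
        System.out.println(codec == null ? "none" : codec.getClass().getName());
    }
}
```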


   I am not sure which version of Hadoop you are using, so I am giving the 
props for both newer and older versions. These are the props you need to 
configure if you want to compress job outputs. They work only when the output 
format is a FileOutputFormat.

mapred-site.xml:(for version 0.23  and later)
---------------------------------------------------

<property> 
  <name>mapreduce.output.fileoutputformat.compress</name> 
  <value>false</value> 
  <description>Should the job outputs be compressed? 
  </description> 
</property> 

<property> 
  <name>mapreduce.output.fileoutputformat.compression.type</name> 
  <value>RECORD</value> 
  <description>If the job outputs are to be compressed as SequenceFiles, how 
               should they be compressed? Should be one of NONE, RECORD or BLOCK. 
  </description> 
</property> 

<property> 
  <name>mapreduce.output.fileoutputformat.compression.codec</name> 
  <value>org.apache.hadoop.io.compress.DefaultCodec</value> 
  <description>If the job outputs are compressed, how should they be 
               compressed? 
  </description> 
</property> 
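Instead of setting these cluster-wide, you can also set them per job from the driver. A sketch using the new (mapreduce) API helpers, assuming the Hadoop client jars are on the classpath; GzipCodec and BLOCK compression are just example choices:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed-output");
        // Equivalent to mapreduce.output.fileoutputformat.compress=true
        FileOutputFormat.setCompressOutput(job, true);
        // Equivalent to mapreduce.output.fileoutputformat.compression.codec
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // For SequenceFile outputs, equivalent to
        // mapreduce.output.fileoutputformat.compression.type
        SequenceFileOutputFormat.setOutputCompressionType(
                job, SequenceFile.CompressionType.BLOCK);
    }
}
```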




mapred-site.xml:(for older versions)
------------------------------------------

<property> 
  <name>mapred.output.compress</name> 
  <value>false</value> 
  <description>Should the job outputs be compressed? 
  </description> 
</property> 

<property> 
  <name>mapred.output.compression.type</name> 
  <value>RECORD</value> 
  <description>If the job outputs are to be compressed as SequenceFiles, how 
               should they be compressed? Should be one of NONE, RECORD or BLOCK. 
  </description> 
</property> 

<property> 
  <name>mapred.output.compression.codec</name> 
  <value>org.apache.hadoop.io.compress.DefaultCodec</value> 
  <description>If the job outputs are compressed, how should they be 
               compressed? 
  </description> 
</property> 
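For the older (mapred) API, the same per-job settings can be made on a JobConf; a sketch under the same assumptions as above (GzipCodec and BLOCK are example choices):

```java
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class OldApiCompressedOutput {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Equivalent to mapred.output.compress=true
        FileOutputFormat.setCompressOutput(conf, true);
        // Equivalent to mapred.output.compression.codec
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        // Equivalent to mapred.output.compression.type
        SequenceFileOutputFormat.setOutputCompressionType(
                conf, SequenceFile.CompressionType.BLOCK);
    }
}
```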


If you want to use compression with your custom input and output formats, you 
can implement the compression in those classes.
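One common shape for that, shown here as a hypothetical sketch rather than a complete RecordReader: open the split's file, look up a codec for it, and wrap the raw stream. Since the codec is resolved per file, this is also the point where a codec that needs the file name could be given it. The class and method names are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressingReaderSketch {
    private InputStream in;

    // In a real RecordReader this would be called from
    // initialize(InputSplit, TaskAttemptContext) with the split's path.
    void open(Path file, Configuration conf) throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream raw = fs.open(file);
        // Per-file codec lookup: `file` carries the name a custom
        // codec might need at this point.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        in = (codec != null) ? codec.createInputStream(raw) : raw;
    }
}
```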


Thanks
Devaraj
________________________________________
From: Grzegorz Gunia [sawt...@student.agh.edu.pl]
Sent: Wednesday, April 11, 2012 1:46 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: CompressionCodec in MapReduce

Thanks for your reply! That clears some things up.
There is but one problem... My CompressionCodec has to be instantiated on a 
per-file basis, meaning it needs to know the name of the file it is to 
compress/decompress. I'm guessing that would not be possible with the current 
implementation?

Or if so, how would I proceed with injecting it with the file name?
--
Greg

W dniu 2012-04-11 10:12, Zizon Qiu pisze:
Append your custom codec's full class name to "io.compression.codecs", either 
in mapred-site.xml or in the Configuration object passed to the Job 
constructor.

The MapReduce framework will try to guess the compression algorithm from the 
input file's suffix.

If any CompressionCodec.getDefaultExtension() registered in the configuration 
matches the suffix, Hadoop will try to instantiate the codec and, if that 
succeeds, decompress for you automatically.

the default value for "io.compression.codecs" is 
"org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"

On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia 
<sawt...@student.agh.edu.pl<mailto:sawt...@student.agh.edu.pl>> wrote:
Hello,
I am trying to apply a custom CompressionCodec to work with MapReduce jobs, but 
I haven't found a way to inject it during the reading of input data, or during 
the write of the job results.
Am I missing something, or is there no support for compressed files in the 
filesystem?

I am well aware of how to set it up to work during the intermediate phases of 
the MapReduce operation, but I just can't find a way to apply it BEFORE the job 
takes place...
Is there any other way except simply uncompressing the files I need prior to 
scheduling a job?

Huge thanks for any help you can give me!
--
Greg

