Rahul-da

I found bz2 pretty slow (although it is splittable), so I switched to Snappy 
(splittable only inside sequence files, but compression and decompression are 
fast).

Thanks
Sanjay

From: Rahul Bhattacharjee <rahul.rec....@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Tuesday, June 11, 2013 9:53 PM
To: user@hadoop.apache.org
Subject: Re: Now give .gz file as input to the MAP

Nothing special is required to process .gz files with MR. However, as Sanjay 
mentioned, verify the codecs configured in core-site.xml. Another thing to 
note is that these files are not splittable.

You might want to use bz2; those files are splittable.
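
To illustrate why a .gz file cannot be split, here is a minimal sketch that uses only the JDK's java.util.zip classes, not Hadoop. The class and method names (GzipSplitDemo, gzipLines, canDecodeFrom) are made up for this example. It compresses some lines into one gzip stream, then tries to decompress from the start and from the middle, which is roughly what a mapper handed the second half of the file would have to do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; plain JDK only, no Hadoop dependency.
class GzipSplitDemo {

    // Compress lineCount text lines into a single in-memory gzip stream.
    static byte[] gzipLines(int lineCount) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            for (int i = 0; i < lineCount; i++) {
                gz.write(("log line " + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Try to decompress starting at the given byte offset.
    // Returns true only if the whole remainder decodes cleanly.
    static boolean canDecodeFrom(byte[] gzipped, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gzipped, offset, gzipped.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return true;
        } catch (IOException e) {
            // Bad header, corrupt deflate data, or truncated stream.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gzipped = gzipLines(10000);
        System.out.println("from start:  " + canDecodeFrom(gzipped, 0));
        System.out.println("from middle: " + canDecodeFrom(gzipped, gzipped.length / 2));
    }
}
```

Decoding from offset 0 should succeed, while decoding from the middle should fail because there is no gzip header at that point. That is why each .gz input file ends up in a single map task.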

Thanks,
Rahul


On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian 
<sanjay.subraman...@wizecommerce.com> wrote:

hadoopConf.set("mapreduce.job.inputformat.class", 
"com.wizecommerce.utils.mapred.TextInputFormat");

hadoopConf.set("mapreduce.job.outputformat.class", 
"com.wizecommerce.utils.mapred.TextOutputFormat");

No special settings are required for reading Gzip beyond those above.

If you want to output Gzip:


hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");

hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", 
"org.apache.hadoop.io.compress.GzipCodec");


Make sure the Gzip codec is defined in core-site.xml:
<!-- core-site.xml -->
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
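
The reason this list matters is that Hadoop picks the codec for an input file from the file name suffix, using the codecs registered here. Below is a rough sketch of that idea in plain Java. It is a simplified model, not Hadoop's actual CompressionCodecFactory code; the class CodecLookupSketch and the method codecFor are invented for this example, and only the codec class names and their default extensions come from Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of suffix-based codec lookup.
class CodecLookupSketch {

    // Default file extensions of some stock Hadoop codecs.
    static final Map<String, String> DEFAULT_EXTENSIONS = new LinkedHashMap<>();
    static {
        DEFAULT_EXTENSIONS.put("org.apache.hadoop.io.compress.GzipCodec", ".gz");
        DEFAULT_EXTENSIONS.put("org.apache.hadoop.io.compress.DefaultCodec", ".deflate");
        DEFAULT_EXTENSIONS.put("org.apache.hadoop.io.compress.BZip2Codec", ".bz2");
        DEFAULT_EXTENSIONS.put("org.apache.hadoop.io.compress.SnappyCodec", ".snappy");
    }

    // Return the first configured codec whose extension matches the file
    // name, or null if none matches (the file is then treated as plain text).
    static String codecFor(String fileName, String codecsProperty) {
        for (String entry : codecsProperty.split(",")) {
            String className = entry.trim();
            String extension = DEFAULT_EXTENSIONS.get(className);
            if (extension != null && fileName.endsWith(extension)) {
                return className;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String configured = "org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.DefaultCodec";
        System.out.println(codecFor("access.log.gz", configured));
        System.out.println(codecFor("access.log.bz2", configured));
    }
}
```

With the value shown in the property above, a file ending in .gz resolves to GzipCodec, while a .bz2 file resolves to nothing until BZip2Codec is added to the list.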

I have a question

Why are you using gzip as input to the map? Gzip files are not splittable... 
unless you have to read multi-line records (like the lines between a BEGIN and 
END block in a log file) and send them to the mapper as one record.

Also, among non-splittable codecs, Snappy is the better choice.

Good Luck


sanjay

From: samir das mohapatra <samir.help...@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Tuesday, June 11, 2013 9:07 PM
To: cdh-u...@cloudera.com, user@hadoop.apache.org, 
user-h...@hadoop.apache.org
Subject: Now give .gz file as input to the MAP

Hi All,
    Has anyone worked out how to pass a .gz file as input to a MapReduce 
job?

Regards,
samir.

CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.


