Re: Now give .gz file as input to the MAP

Rahul Bhattacharjee Wed, 12 Jun 2013 10:49:18 -0700

Yeah, I too found bz2 quite slow and memory-hungry!

Thanks,
Rahul-da


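For readers following the splittability point in the quoted messages below,
here is a minimal sketch, using only java.util.zip rather than Hadoop, of why
a plain .gz file ends up in a single map task: a reader that starts in the
middle of a gzip stream finds no header there and cannot decode from that
point. The class name GzipSplitDemo, its helper methods, and the sample data
are made up for illustration; this is not how Hadoop itself reads splits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSplitDemo {

    // Compress a byte array into a single gzip stream.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Return true if a gzip reader that starts at `offset` can decode
    // everything from there to the end of the stream without an error.
    static boolean readableFrom(byte[] gzBytes, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gzBytes, offset, gzBytes.length - offset))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return true;
        } catch (IOException e) {
            // Missing gzip header, corrupt deflate data, or a bad trailer.
            return false;
        }
    }

    // Hypothetical sample input: numbered log lines.
    static byte[] sampleLog(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("record ").append(i).append(" status=OK\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] gz = gzip(sampleLog(10000));
        System.out.println("from start:  " + readableFrom(gz, 0));
        System.out.println("from middle: " + readableFrom(gz, gz.length / 2));
    }
}
```

Decoding from offset 0 works, while decoding from the midpoint fails, which
is the position a second mapper would be in if the file were split.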

On Wed, Jun 12, 2013 at 11:13 PM, Sanjay Subramanian <
sanjay.subraman...@wizecommerce.com> wrote:

>  Rahul-da
>
>  I found bz2 pretty slow (although splittable), so I switched to Snappy
> (Snappy data is splittable only when wrapped in a container such as a
> sequence file, but compression and decompression are fast).
>
>  Thanks
> Sanjay
>
>   From: Rahul Bhattacharjee <rahul.rec....@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Tuesday, June 11, 2013 9:53 PM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Now give .gz file as input to the MAP
>
>   Nothing special is required to process .gz files using MR. However,
> as Sanjay mentioned, verify that the codecs are configured in core-site.xml.
> Another thing to note is that .gz files are not splittable.
>
>  You might want to use bz2 instead; bz2 files are splittable.
>
> Thanks,
> Rahul
>
>
> On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian <
> sanjay.subraman...@wizecommerce.com> wrote:
>
>>  hadoopConf.set("mapreduce.job.inputformat.class",
>> "com.wizecommerce.utils.mapred.TextInputFormat");
>>
>> hadoopConf.set("mapreduce.job.outputformat.class",
>> "com.wizecommerce.utils.mapred.TextOutputFormat");
>>  No special settings are required for reading Gzip other than those above.
>>
>>  If you want to output Gzip:
>>
>>  hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");
>>
>> hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec",
>> "org.apache.hadoop.io.compress.GzipCodec");
>>
>> Make sure the Gzip codec is defined in core-site.xml:
>>  <!-- core-site.xml -->
>>  <property>
>>      <name>io.compression.codecs</name>
>>      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
>>  </property>
>>
>>  I have a question
>>
>>  Why are you using Gzip as input to the map? These files are not
>> splittable... unless you have to read multiple lines (like the lines
>> between a BEGIN and an END block in a log file) and send them as one
>> record to the mapper?
>>
>>  Also, among the non-splittable codecs, Snappy is the better choice.
>>
>>  Good Luck
>>
>>
>>  sanjay
>>
>>   From: samir das mohapatra <samir.help...@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Tuesday, June 11, 2013 9:07 PM
>> To: "cdh-u...@cloudera.com" <cdh-u...@cloudera.com>, "
>> user@hadoop.apache.org" <user@hadoop.apache.org>, "
>> user-h...@hadoop.apache.org" <user-h...@hadoop.apache.org>
>> Subject: Now give .gz file as input to the MAP
>>
>>   Hi All,
>>     Has anyone worked on how to pass a .gz file as the input for a
>> MapReduce job?
>>
>> Regards,
>> samir.
>>
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the
>> sender by reply email and destroy all copies of the original message along
>> with any attachments, from your computer system. If you are the intended
>> recipient, please be advised that the content of this message is subject to
>> access, review and disclosure by the sender's Email System Administrator.
>>
>
>
>
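As a footnote to Sanjay's remark about multi-line records: the sketch below
shows, in plain Java, one way the lines between a BEGIN and an END marker
could be collapsed into a single record. It assumes each marker sits alone on
its own line and that lines outside a block can be ignored; both are guesses
about the log format, and the class and method names are hypothetical. In a
real job this logic would sit inside a custom record reader, which is not
shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockRecordDemo {

    // Collapse the lines between each BEGIN/END pair into one record,
    // joining them with '\n'. Lines outside any block are skipped.
    static List<String> groupBlocks(List<String> lines) {
        List<String> records = new ArrayList<>();
        List<String> current = null; // non-null only while inside a block
        for (String line : lines) {
            if (line.equals("BEGIN")) {
                current = new ArrayList<>();
            } else if (line.equals("END")) {
                if (current != null) {
                    records.add(String.join("\n", current));
                    current = null;
                }
            } else if (current != null) {
                current.add(line);
            }
        }
        // A block that is never closed by END is dropped.
        return records;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "BEGIN", "user=a", "action=login", "END",
                "heartbeat",
                "BEGIN", "user=b", "END");
        for (String record : groupBlocks(log)) {
            System.out.println("[" + record.replace("\n", " | ") + "]");
        }
    }
}
```

Because a non-splittable file is read by one mapper from start to finish, no
block can be cut in two at a split boundary, which is presumably why Sanjay
lists this as the case where Gzip input is acceptable.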
