Yeah, I too found that quite slow and memory hungry! Thanks, Rahul-da
On Wed, Jun 12, 2013 at 11:13 PM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:

> Rahul-da
>
> I found bz2 pretty slow (although splittable), so I switched to Snappy
> (with Snappy only sequence files are splittable, but compress-decompress is fast).
>
> Thanks
> Sanjay
>
> From: Rahul Bhattacharjee <rahul.rec....@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Tuesday, June 11, 2013 9:53 PM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Now give .gz file as input to the MAP
>
> Nothing special is required to process .gz files using MR. However,
> as Sanjay mentioned, verify the codecs configured in core-site. Another
> thing to note is that these files are not splittable.
>
> You might want to use bz2; these are splittable.
>
> Thanks,
> Rahul
>
>
> On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian <
> sanjay.subraman...@wizecommerce.com> wrote:
>
>> hadoopConf.set("mapreduce.job.inputformat.class",
>>     "com.wizecommerce.utils.mapred.TextInputFormat");
>>
>> hadoopConf.set("mapreduce.job.outputformat.class",
>>     "com.wizecommerce.utils.mapred.TextOutputFormat");
>>
>> No special settings are required for reading Gzip except those above.
>>
>> If you want to output Gzip:
>>
>> hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");
>>
>> hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec",
>>     "org.apache.hadoop.io.compress.GzipCodec");
>>
>> Make sure the Gzip codec is defined in core-site.xml:
>>
>> <!-- core-site.xml -->
>> <property>
>>   <name>io.compression.codecs</name>
>>   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
>> </property>
>>
>> I have a question:
>>
>> Why are you using GZIP as input to Map?
>> These are not splittable... Unless
>> you have to read multiple lines (like the lines between a BEGIN and END block in a
>> log file) and send them as one record to the mapper.
>>
>> Also, for the non-splittable case, the Snappy codec is better.
>>
>> Good Luck
>>
>>
>> sanjay
>>
>> From: samir das mohapatra <samir.help...@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Tuesday, June 11, 2013 9:07 PM
>> To: "cdh-u...@cloudera.com" <cdh-u...@cloudera.com>,
>>     "user@hadoop.apache.org" <user@hadoop.apache.org>,
>>     "user-h...@hadoop.apache.org" <user-h...@hadoop.apache.org>
>> Subject: Now give .gz file as input to the MAP
>>
>> Hi All,
>> Has anyone worked on how to pass a .gz file as input to a
>> MapReduce job?
>>
>> Regards,
>> samir.
>>
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the
>> sender by reply email and destroy all copies of the original message along
>> with any attachments, from your computer system. If you are the intended
>> recipient, please be advised that the content of this message is subject to
>> access, review and disclosure by the sender's Email System Administrator.
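
For anyone reading this thread later: the point that .gz files are not splittable can be checked without a cluster. The sketch below is only an illustration using the JDK's java.util.zip classes, not Hadoop's input format code. The class name GzipSplitDemo and its helper methods are made up for this example. It compresses some log-like lines in memory, then tries to decompress once from byte 0 and once from the middle of the file, which is roughly what a second mapper would face if it were handed the second half of a .gz file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Hadoop or of the code in this thread.
public class GzipSplitDemo {

    // Build some log-like text so the compressed stream is not trivially small.
    public static String sampleLog() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("line ").append(i).append(" some log payload\n");
        }
        return sb.toString();
    }

    // Gzip-compress a string entirely in memory.
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Try to decompress starting at the given byte offset.
    // Returns true only if everything from that offset onward decodes cleanly.
    public static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Discard the data; we only care whether decoding succeeds.
            }
            return true;
        } catch (IOException e) {
            // Missing gzip header or corrupt deflate data at this offset.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip(sampleLog());
        System.out.println("decode from start:  " + canDecodeFrom(gz, 0));
        System.out.println("decode from middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

Decoding from the start succeeds, while decoding from the middle fails because there is no gzip header at that offset and the deflate stream depends on everything before it. This is why a single mapper ends up reading a whole .gz file, and why the thread suggests bz2 or Snappy inside sequence files when parallelism matters.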