> >4. Reduce reported progress when consuming compressed map outputs: it
> >is generally incorrect, with reducers reporting over 220% completion.
> >This is regardless of whether native compression is used or not.
>
> This smells like a bug, please file a jira asap!
> I'm guessing this could be due to the fact that we are checking the
> size of uncompressed key/value pairs rather than the compressed sizes.
> Devaraj?
>
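A rough sketch of the arithmetic behind that guess, with hypothetical
numbers and variable names rather than actual Hadoop code: if the
numerator accumulates uncompressed key/value bytes while the denominator
is the compressed size of the map outputs, progress overshoots by roughly
the compression ratio:

    long totalCompressedBytes = 1000000L;   // denominator: compressed size
    long uncompressedBytesRead = 2200000L;  // numerator: bytes consumed
    // ~2.2, i.e. the "over 220%" completion the reducers report
    float progress = (float) uncompressedBytesRead / totalCompressedBytes;
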
Riccardo, pls file a jira issue for this one. Thanks, Devaraj.

> -----Original Message-----
> From: Arun C Murthy [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 06, 2007 12:01 AM
> To: Nt Never
> Cc: Devaraj Das; [email protected]
> Subject: Re: map output compression codec setting issue
>
> Riccardo,
>
> On Wed, Sep 05, 2007 at 10:10:31AM -0700, Nt Never wrote:
> >Hi Arun,
> >
> >thanks for your reply, I am CCing this e-mail to hadoop-dev. I will
> >create the appropriate JIRA tickets today. Here are a few insights
> >about my experience with Hadoop compression (all my comments apply
> >to 0.13.0):
> >
>
> Thanks!
>
> >1. Map output compression: besides the issue I mentioned to you guys
> >about choosing two different codecs for map output and overall job
> >output, it works very well for us. I have been using non-native map
> >output compression on jobs that generate over 6 TB of data with no
> >problems. Since I am using 0.13.0, because of HADOOP-1193, I could
> >test native LZO on very small jobs only. Our benchmarks show no
> >degradation in performance whatsoever when using native LZO.
>
> That is good to hear, please keep us posted on things you notice with
> 0.14.* and beyond (i.e. post H-1193).
>
> >2. Compression type configuration: we noticed a small issue with the
> >configuration here. If "io.seqfile.compression.type" is set to NONE
> >in hadoop-site.xml, M/R jobs will not do any compression and there
> >is no way to override it programmatically. As a matter of fact, each
> >worker machine will end up using the value read from the local
> >hadoop conf folder. I like the fact that each worker reads this
> >property locally when creating generic SequenceFile(s), but, IMHO,
> >the behavior of M/R jobs should be set in JobConf only. This issue
> >is very easy to reproduce.
>
> This is a known bug where JobConf is overridden by hadoop-site.xml,
> please see: http://issues.apache.org/jira/browse/HADOOP-785
>
> >3. Non-native GzipCodec: the codec returns Java's
> >java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream
> >when native compression is not available. However, lines 197, 238,
> >299, and 357 of SequenceFile (basically all the createWriter()
> >methods that select a compression codec) will throw an
> >IllegalArgumentException if the GzipCodec is selected but the native
> >library is *not* available. Why is that?
>
> The issue with java.util.zip.GZIPInputStream is that it doesn't let
> you access the underlying decompressor, hence we cannot do a 'reset'
> and reuse it - this is required for SequenceFiles.
>
> See http://issues.apache.org/jira/browse/HADOOP-441#action_12430068
>
> >4. Reduce reported progress when consuming compressed map outputs:
> >it is generally incorrect, with reducers reporting over 220%
> >completion. This is regardless of whether native compression is used
> >or not.
>
> This smells like a bug, please file a jira asap!
> I'm guessing this could be due to the fact that we are checking the
> size of uncompressed key/value pairs rather than the compressed
> sizes. Devaraj?
>
> thanks,
> Arun
>
> >
> >Best,
> >
> >Riccardo
> >
> >
> >On 9/5/07, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Riccardo,
> >>
> >> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
> >> >Thanks Devaraj, good to hear from you.
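On item 2 above, a hypothetical driver snippet showing the per-job
override that the local hadoop-site.xml ends up trumping; the MyJob class
and the "BLOCK" value are illustrative only:

    // org.apache.hadoop.mapred.JobConf, inside a job driver:
    JobConf conf = new JobConf(MyJob.class);
    // With io.seqfile.compression.type set to NONE in a worker's local
    // hadoop-site.xml, this per-job setting is silently ignored, the
    // known JobConf-vs-site.xml bug tracked as HADOOP-785:
    conf.set("io.seqfile.compression.type", "BLOCK");
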
> >> >
> >> >Actually, if you guys are interested, I have been testing Hadoop
> >> >compression (native and non-native) in the last 5 days on a
> >> >cluster of 200 machines (running 0.12.3, with HDFS as file
> >> >system). I have a few insights you guys might be interested in. I
> >> >am just trying to figure out what the proper channels would be,
> >> >that is why I contacted you first. Thanks.
> >> >
> >>
> >> You are absolutely correct. Please file a jira (and a patch if you
> >> are so inclined! *smile*) to request a separate property for the 2
> >> codecs.
> >>
> >> We'd love to hear any insights/opinions/ideas about the
> >> compression stuff you've been working on, please don't hesitate to
> >> mail hadoop-dev@ or file jira issues about any of them...
> >>
> >> thanks!
> >> Arun
> >>
> >> >Riccardo
> >> >
> >> >
> >> >On 9/4/07, Devaraj Das <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> Hi Riccardo,
> >> >> Thanks for contacting me. I am doing good and hope you are
> >> >> doing great too!
> >> >> I am copying this mail to Arun who is our compression expert.
> >> >> Arun, pls respond to the mail.
> >> >> Thanks,
> >> >> Devaraj
> >> >>
> >> >> ------------------------------
> >> >> *From:* Nt Never [mailto:[EMAIL PROTECTED]
> >> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
> >> >> *To:* [EMAIL PROTECTED]
> >> >> *Subject:* map output compression codec setting issue
> >> >>
> >> >> Hi Devaraj,
> >> >>
> >> >> how have you been doing? I finally got around to doing some
> >> >> extensive testing with Hadoop's compression. I am aware of
> >> >> HADOOP-1193 and HADOOP-1545, so I am waiting for the release of
> >> >> 0.15.0 before I do more benchmarks. However, I noticed what
> >> >> seems to be a bug in JobConf. The property
> >> >> "mapred.output.compression.codec" is used when setting and
> >> >> getting the map output compression codec, thus making it
> >> >> impossible to use a different codec for map outputs and overall
> >> >> job outputs. The methods that affect this behavior are in lines
> >> >> 341-371 of JobConf in Hadoop 0.13.0:
> >> >>
> >> >>   /**
> >> >>    * Set the given class as the compression codec for the map
> >> >>    * outputs.
> >> >>    * @param codecClass the CompressionCodec class that will
> >> >>    * compress the map outputs
> >> >>    */
> >> >>   public void setMapOutputCompressorClass(Class<? extends
> >> >>       CompressionCodec> codecClass) {
> >> >>     setCompressMapOutput(true);
> >> >>     setClass("mapred.output.compression.codec", codecClass,
> >> >>              CompressionCodec.class);
> >> >>   }
> >> >>
> >> >>   /**
> >> >>    * Get the codec for compressing the map outputs
> >> >>    * @param defaultValue the value to return if it is not set
> >> >>    * @return the CompressionCodec class that should be used to
> >> >>    * compress the map outputs
> >> >>    * @throws IllegalArgumentException if the class was
> >> >>    * specified, but not found
> >> >>    */
> >> >>   public Class<? extends CompressionCodec>
> >> >>       getMapOutputCompressorClass(Class<? extends
> >> >>       CompressionCodec> defaultValue) {
> >> >>     String name = get("mapred.output.compression.codec");
> >> >>     if (name == null) {
> >> >>       return defaultValue;
> >> >>     } else {
> >> >>       try {
> >> >>         return getClassByName(name).asSubclass(
> >> >>             CompressionCodec.class);
> >> >>       } catch (ClassNotFoundException e) {
> >> >>         throw new IllegalArgumentException("Compression codec "
> >> >>             + name + " was not found.", e);
> >> >>       }
> >> >>     }
> >> >>   }
> >> >>
> >> >> This could easily be fixed by using a different property, for
> >> >> example "map.output.compression.codec". Should I create an
> >> >> issue on JIRA for this? Thanks.
> >> >>
> >> >> Riccardo
> >> >>
> >> >>
> >> >
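For what it's worth, the fix Riccardo proposes would look roughly like
the following. This is a sketch rather than a committed patch, and
"map.output.compression.codec" is the property name suggested in this
thread, not an existing Hadoop key:

    /**
     * Key the map-output codec off its own property so it no longer
     * collides with the job-output codec.
     */
    public void setMapOutputCompressorClass(Class<? extends
        CompressionCodec> codecClass) {
      setCompressMapOutput(true);
      setClass("map.output.compression.codec", codecClass,
               CompressionCodec.class);
    }

    public Class<? extends CompressionCodec>
        getMapOutputCompressorClass(Class<? extends
        CompressionCodec> defaultValue) {
      // Same lookup as the 0.13.0 code, but against the map-specific
      // key, so job-output and map-output codecs can differ:
      String name = get("map.output.compression.codec");
      if (name == null) {
        return defaultValue;
      }
      try {
        return getClassByName(name).asSubclass(CompressionCodec.class);
      } catch (ClassNotFoundException e) {
        throw new IllegalArgumentException("Compression codec " + name
            + " was not found.", e);
      }
    }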

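A footnote on Arun's HADOOP-441 point above, i.e. why SequenceFile
rejects the non-native GzipCodec: SequenceFile wants to reset and reuse
one decompressor across many compressed blocks. A raw
java.util.zip.Inflater supports exactly that, while GZIPInputStream
constructs and hides its own Inflater. A minimal, self-contained sketch
using only the JDK; an illustration, not Hadoop source:

    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class ResetDemo {
      public static void main(String[] args) throws Exception {
        byte[] block1 = deflate("first block".getBytes("UTF-8"));
        byte[] block2 = deflate("second block".getBytes("UTF-8"));

        // One Inflater reused across blocks via reset(), the pattern
        // SequenceFile relies on:
        Inflater inflater = new Inflater();
        System.out.println(inflate(inflater, block1));
        inflater.reset();  // ready for the next block, no reallocation
        System.out.println(inflate(inflater, block2));
        inflater.end();

        // java.util.zip.GZIPInputStream keeps its Inflater private, so
        // there is nothing to reset() and no way to reuse the stream,
        // hence the IllegalArgumentException in createWriter() when
        // GzipCodec is picked without the native library.
      }

      static byte[] deflate(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[256];
        int n = d.deflate(buf);
        d.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
      }

      static String inflate(Inflater inf, byte[] data) throws Exception {
        inf.setInput(data);
        byte[] buf = new byte[256];
        int n = inf.inflate(buf);
        return new String(buf, 0, n, "UTF-8");
      }
    }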