> >4. Reduce reported progress when consuming compressed map outputs: it
> >is generally incorrect, with reducers reporting over 220% completion.
> >This is regardless of whether native compression is used or not.
>
> This smells like a bug, please file a jira asap!
> I'm guessing this could be due to the fact that we are checking the
> size of uncompressed key/value pairs rather than the compressed sizes.
> Devaraj?
>
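A rough sketch of the arithmetic behind that guess, with hypothetical
numbers and variable names rather than actual Hadoop code: if the
numerator accumulates uncompressed key/value bytes while the denominator
is the compressed size of the map outputs, progress overshoots by roughly
the compression ratio:

    long totalCompressedBytes = 1000000L;   // denominator: compressed size
    long uncompressedBytesRead = 2200000L;  // numerator: bytes consumed
    // ~2.2, i.e. the "over 220%" completion the reducers report
    float progress = (float) uncompressedBytesRead / totalCompressedBytes;
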
Riccardo, pls file a jira issue for this one. Thanks, Devaraj.

> -----Original Message-----
> From: Arun C Murthy [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 06, 2007 12:01 AM
> To: Nt Never
> Cc: Devaraj Das; [email protected]
> Subject: Re: map output compression codec setting issue
>
> Riccardo,
>
> On Wed, Sep 05, 2007 at 10:10:31AM -0700, Nt Never wrote:
> >Hi Arun,
> >
> >thanks for your reply, I am CCing this e-mail to hadoop-dev. I will
> >create the appropriate JIRA tickets today. Here are a few insights
> >about my experience with Hadoop compression (all my comments apply
> >to 0.13.0):
> >
>
> Thanks!
>
> >1. Map output compression: besides the issue I mentioned to you guys
> >about choosing two different codecs for map output and overall job
> >output, it works very well for us. I have been using non-native map
> >output compression on jobs that generate over 6 TB of data with no
> >problems. Since I am using 0.13.0, because of HADOOP-1193, I could
> >test native LZO on very small jobs only. Our benchmarks show no
> >degradation in performance whatsoever when using native LZO.
>
> That is good to hear, please keep us posted on things you notice with
> 0.14.* and beyond (i.e. post H-1193).
>
> >2. Compression type configuration: we noticed a small issue with the
> >configuration here. If "io.seqfile.compression.type" is set to NONE
> >in hadoop-site.xml, M/R jobs will not do any compression and there
> >is no way to override it programmatically. As a matter of fact, each
> >worker machine will end up using the value read from the local
> >hadoop conf folder. I like the fact that each worker reads this
> >property locally when creating generic SequenceFile(s), but, IMHO,
> >the behavior of M/R jobs should be set in JobConf only. This issue
> >is very easy to reproduce.
>
> This is a known bug where JobConf is overridden by hadoop-site.xml,
> please see: http://issues.apache.org/jira/browse/HADOOP-785
>
> >3. Non-native GzipCodec: the codec returns Java's
> >java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream
> >when native compression is not available. However, lines 197, 238,
> >299, and 357 of SequenceFile (basically all the createWriter()
> >methods that select a compression codec) will throw an
> >IllegalArgumentException if the GzipCodec is selected but the native
> >library is *not* available. Why is that?
>
> The issue with java.util.zip.GZIPInputStream is that it doesn't let
> you access the underlying decompressor, hence we cannot do a 'reset'
> and reuse it - this is required for SequenceFiles.
>
> See http://issues.apache.org/jira/browse/HADOOP-441#action_12430068
>
> >4. Reduce reported progress when consuming compressed map outputs:
> >it is generally incorrect, with reducers reporting over 220%
> >completion. This is regardless of whether native compression is used
> >or not.
>
> This smells like a bug, please file a jira asap!
> I'm guessing this could be due to the fact that we are checking the
> size of uncompressed key/value pairs rather than the compressed
> sizes. Devaraj?
>
> thanks,
> Arun
>
> >
> >Best,
> >
> >Riccardo
> >
> >
> >On 9/5/07, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Riccardo,
> >>
> >> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
> >> >Thanks Devaraj, good to hear from you.
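On item 2 above, a hypothetical driver snippet showing the per-job
override that the local hadoop-site.xml ends up trumping; the MyJob class
and the "BLOCK" value are illustrative only:

    // org.apache.hadoop.mapred.JobConf, inside a job driver:
    JobConf conf = new JobConf(MyJob.class);
    // With io.seqfile.compression.type set to NONE in a worker's local
    // hadoop-site.xml, this per-job setting is silently ignored, the
    // known JobConf-vs-site.xml bug tracked as HADOOP-785:
    conf.set("io.seqfile.compression.type", "BLOCK");
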
> >> >
> >> >Actually, if you guys are interested, I have been testing Hadoop
> >> >compression (native and non-native) in the last 5 days on a
> >> >cluster of 200 machines (running 0.12.3, with HDFS as file
> >> >system). I have a few insights you guys might be interested in. I
> >> >am just trying to figure out what the proper channels would be,
> >> >that is why I contacted you first. Thanks.
> >> >
> >>
> >> You are absolutely correct. Please file a jira (and a patch if you
> >> are so inclined! *smile*) to request a separate property for the 2
> >> codecs.
> >>
> >> We'd love to hear any insights/opinions/ideas about the
> >> compression stuff you've been working on, please don't hesitate to
> >> mail hadoop-dev@ or file jira issues about any of them...
> >>
> >> thanks!
> >> Arun
> >>
> >> >Riccardo
> >> >
> >> >
> >> >On 9/4/07, Devaraj Das <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> Hi Riccardo,
> >> >> Thanks for contacting me. I am doing good and hope you are
> >> >> doing great too!
> >> >> I am copying this mail to Arun who is our compression expert.
> >> >> Arun, pls respond to the mail.
> >> >> Thanks,
> >> >> Devaraj
> >> >>
> >> >> ------------------------------
> >> >> *From:* Nt Never [mailto:[EMAIL PROTECTED]
> >> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
> >> >> *To:* [EMAIL PROTECTED]
> >> >> *Subject:* map output compression codec setting issue
> >> >>
> >> >> Hi Devaraj,
> >> >>
> >> >> how have you been doing? I finally got around to doing some
> >> >> extensive testing with Hadoop's compression. I am aware of
> >> >> HADOOP-1193 and HADOOP-1545, so I am waiting for the release of
> >> >> 0.15.0 before I do more benchmarks. However, I noticed what
> >> >> seems to be a bug in JobConf. The property
> >> >> "mapred.output.compression.codec" is used when setting and
> >> >> getting the map output compression codec, thus making it
> >> >> impossible to use a different codec for map outputs and overall
> >> >> job outputs. The methods that affect this behavior are in lines
> >> >> 341-371 of JobConf in Hadoop 0.13.0:
> >> >>
> >> >>   /**
> >> >>    * Set the given class as the compression codec for the map
> >> >>    * outputs.
> >> >>    * @param codecClass the CompressionCodec class that will
> >> >>    * compress the map outputs
> >> >>    */
> >> >>   public void setMapOutputCompressorClass(Class<? extends
> >> >>       CompressionCodec> codecClass) {
> >> >>     setCompressMapOutput(true);
> >> >>     setClass("mapred.output.compression.codec", codecClass,
> >> >>              CompressionCodec.class);
> >> >>   }
> >> >>
> >> >>   /**
> >> >>    * Get the codec for compressing the map outputs
> >> >>    * @param defaultValue the value to return if it is not set
> >> >>    * @return the CompressionCodec class that should be used to
> >> >>    * compress the map outputs
> >> >>    * @throws IllegalArgumentException if the class was
> >> >>    * specified, but not found
> >> >>    */
> >> >>   public Class<? extends CompressionCodec>
> >> >>       getMapOutputCompressorClass(Class<? extends
> >> >>       CompressionCodec> defaultValue) {
> >> >>     String name = get("mapred.output.compression.codec");
> >> >>     if (name == null) {
> >> >>       return defaultValue;
> >> >>     } else {
> >> >>       try {
> >> >>         return getClassByName(name).asSubclass(
> >> >>             CompressionCodec.class);
> >> >>       } catch (ClassNotFoundException e) {
> >> >>         throw new IllegalArgumentException("Compression codec "
> >> >>             + name + " was not found.", e);
> >> >>       }
> >> >>     }
> >> >>   }
> >> >>
> >> >> This could easily be fixed by using a different property, for
> >> >> example "map.output.compression.codec". Should I create an
> >> >> issue on JIRA for this? Thanks.
> >> >>
> >> >> Riccardo
> >> >>
> >> >>
> >> >
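For what it's worth, the fix Riccardo proposes would look roughly like
the following. This is a sketch rather than a committed patch, and
"map.output.compression.codec" is the property name suggested in this
thread, not an existing Hadoop key:

    /**
     * Key the map-output codec off its own property so it no longer
     * collides with the job-output codec.
     */
    public void setMapOutputCompressorClass(Class<? extends
        CompressionCodec> codecClass) {
      setCompressMapOutput(true);
      setClass("map.output.compression.codec", codecClass,
               CompressionCodec.class);
    }

    public Class<? extends CompressionCodec>
        getMapOutputCompressorClass(Class<? extends
        CompressionCodec> defaultValue) {
      // Same lookup as the 0.13.0 code, but against the map-specific
      // key, so job-output and map-output codecs can differ:
      String name = get("map.output.compression.codec");
      if (name == null) {
        return defaultValue;
      }
      try {
        return getClassByName(name).asSubclass(CompressionCodec.class);
      } catch (ClassNotFoundException e) {
        throw new IllegalArgumentException("Compression codec " + name
            + " was not found.", e);
      }
    }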

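A footnote on Arun's HADOOP-441 point above, i.e. why SequenceFile
rejects the non-native GzipCodec: SequenceFile wants to reset and reuse
one decompressor across many compressed blocks. A raw
java.util.zip.Inflater supports exactly that, while GZIPInputStream
constructs and hides its own Inflater. A minimal, self-contained sketch
using only the JDK; an illustration, not Hadoop source:

    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class ResetDemo {
      public static void main(String[] args) throws Exception {
        byte[] block1 = deflate("first block".getBytes("UTF-8"));
        byte[] block2 = deflate("second block".getBytes("UTF-8"));

        // One Inflater reused across blocks via reset(), the pattern
        // SequenceFile relies on:
        Inflater inflater = new Inflater();
        System.out.println(inflate(inflater, block1));
        inflater.reset();  // ready for the next block, no reallocation
        System.out.println(inflate(inflater, block2));
        inflater.end();

        // java.util.zip.GZIPInputStream keeps its Inflater private, so
        // there is nothing to reset() and no way to reuse the stream,
        // hence the IllegalArgumentException in createWriter() when
        // GzipCodec is picked without the native library.
      }

      static byte[] deflate(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[256];
        int n = d.deflate(buf);
        d.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
      }

      static String inflate(Inflater inf, byte[] data) throws Exception {
        inf.setInput(data);
        byte[] buf = new byte[256];
        int n = inf.inflate(buf);
        return new String(buf, 0, n, "UTF-8");
      }
    }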