Riccardo,

On Wed, Sep 05, 2007 at 10:10:31AM -0700, Nt Never wrote:
>Hi Arun,
>
>thanks for your reply, I am CCing this e-mail to hadoop-dev. I will create
>the appropriate JIRA tickets today. Here are a few insights about my
>experience with Hadoop compression (all my comments apply to 0.13.0):
>

Thanks!

>1. Map output compression: besides the issue I mentioned to you guys about
>choosing two different codecs for map output and overall job output, it
>works very well for us. I have been using non-native map output compression
>on jobs that generate over 6TB of data with no problems. Since I am using
>0.13.0, because of HADOOP-1193, I could test native LZO on very small jobs
>only. Our benchmarks show no degradation in performance whatsoever when
>using native LZO.

That is good to hear, please keep us posted on things you notice with
0.14.* and beyond (i.e. post H-1193).

>2. Compression type configuration: we noticed a small issue with the
>configuration here. If "io.seqfile.compression.type" is set to NONE in
>hadoop-site.xml, M/R jobs will not do any compression and there is no way
>to override it programmatically. As a matter of fact, each worker machine
>will end up using the value read from the local hadoop conf folder. I like
>the fact that each worker reads this property locally when creating generic
>SequenceFile(s), but, IMHO, the behavior of M/R jobs should be set in
>JobConf only. This issue is very easy to reproduce.

This is a known bug where JobConf is overridden by hadoop-site.xml, please
see: http://issues.apache.org/jira/browse/HADOOP-785
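Both knobs above can be set from client code. A minimal sketch against the
0.13 API (MyJob is a hypothetical stand-in for a real job class; LzoCodec
assumes the native LZO library is deployed on every node, per HADOOP-1193):

    import org.apache.hadoop.io.compress.LzoCodec;
    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);

    // 1. Compress intermediate map outputs only.
    conf.setCompressMapOutput(true);   // also implied by the next call
    conf.setMapOutputCompressorClass(LzoCodec.class);

    // 2. Request block-compressed SequenceFile output. Note that, per
    //    HADOOP-785 above, a NONE in each worker's hadoop-site.xml can
    //    still silently override this value.
    conf.set("io.seqfile.compression.type", "BLOCK");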
>3. Non-native GzipCodec: the codec returns Java's
>java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream when
>native compression is not available. However, lines 197, 238, 299, and 357
>of SequenceFile (basically all the createWriter() methods that select a
>compression codec) will throw an IllegalArgumentException if the GzipCodec
>is selected but the native library is *not* available. Why is that?

The issue with java.util.zip.GZIPInputStream is that it doesn't let us
access the underlying decompressor, hence we cannot do a 'reset' and reuse
it - this is required for SequenceFiles. See
http://issues.apache.org/jira/browse/HADOOP-441#action_12430068
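To make the constraint concrete, a JDK-only sketch (not Hadoop code):
SequenceFile wants to push many compressed blocks through one decompressor,
which a bare java.util.zip.Inflater supports via reset(), while
GZIPInputStream never exposes its internal Inflater:

    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    // One Inflater, many blocks: the reuse pattern SequenceFile needs.
    static void inflateAll(byte[][] blocks) throws DataFormatException {
      byte[] buf = new byte[64 * 1024];
      Inflater inf = new Inflater();
      for (byte[] block : blocks) {      // each block is raw DEFLATE data
        inf.setInput(block);
        while (!inf.finished()) {
          int n = inf.inflate(buf);      // decompress into buf
          if (n == 0) break;             // block exhausted or needs input
          // ... consume buf[0..n) ...
        }
        inf.reset();                     // same object, ready for next block
      }
      inf.end();
    }

GZIPInputStream keeps its Inflater in a protected field with no public
accessor and no reset hook, so "reuse" means allocating a new stream (and
a new Inflater) per block - exactly what SequenceFile tries to avoid.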
>4. Reduce reported progress when consuming compressed map outputs: it is
>generally incorrect, with reducers reporting over 220% completion. This is
>regardless of whether native compression is used or not.

This smells like a bug, please file a jira asap! I'm guessing this could be
due to the fact that we are checking the size of uncompressed key/value
pairs rather than the compressed sizes. Devaraj?

thanks,
Arun

>
>Best,
>
>Riccardo
>
>
>On 9/5/07, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>>
>> Hi Riccardo,
>>
>> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
>> >Thanks Devaraj, good to hear from you.
>> >
>> >Actually, if you guys are interested, I have been testing Hadoop
>> >compression (native and non-native) for the last 5 days on a cluster of
>> >200 machines (running 0.12.3, with HDFS as file system). I have a few
>> >insights you guys might be interested in. I am just trying to figure out
>> >what the proper channels would be, that is why I contacted you first.
>> >Thanks.
>> >
>>
>> You are absolutely correct. Please file a jira (and a patch if you are so
>> inclined! *smile*) to request a separate property for the 2 codecs.
>>
>> We'd love to hear any insights/opinion/ideas about the compression stuff
>> you've been working on, please don't hesitate to mail hadoop-dev@ or file
>> jira issues about any of them...
>>
>> thanks!
>> Arun
>>
>> >Riccardo
>> >
>> >
>> >On 9/4/07, Devaraj Das <[EMAIL PROTECTED]> wrote:
>> >>
>> >> Hi Riccardo,
>> >> Thanks for contacting me. I am doing good and hope you are doing great
>> >> too!
>> >> I am copying this mail to Arun who is our compression expert. Arun,
>> >> pls respond to the mail.
>> >> Thanks,
>> >> Devaraj
>> >>
>> >> ------------------------------
>> >> *From:* Nt Never [mailto:[EMAIL PROTECTED]
>> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
>> >> *To:* [EMAIL PROTECTED]
>> >> *Subject:* map output compression codec setting issue
>> >>
>> >> Hi Devaraj,
>> >>
>> >> how have you been doing? I finally got around to doing some extensive
>> >> testing with Hadoop's compression. I am aware of HADOOP-1193 and
>> >> HADOOP-1545, so I am waiting for the release of 0.15.0 before I do
>> >> more benchmarks. However, I noticed what seems to be a bug in JobConf.
>> >> The property "mapred.output.compression.codec" is used when setting
>> >> and getting the map output compression codec, thus making it
>> >> impossible to use a different codec for map outputs and overall job
>> >> outputs. The methods that affect this behavior are in lines 341-371 of
>> >> JobConf in Hadoop 0.13.0:
>> >>
>> >>   /**
>> >>    * Set the given class as the compression codec for the map outputs.
>> >>    * @param codecClass the CompressionCodec class that will compress
>> >>    *                   the map outputs
>> >>    */
>> >>   public void setMapOutputCompressorClass(Class<? extends
>> >>       CompressionCodec> codecClass) {
>> >>     setCompressMapOutput(true);
>> >>     setClass("mapred.output.compression.codec", codecClass,
>> >>              CompressionCodec.class);
>> >>   }
>> >>
>> >>   /**
>> >>    * Get the codec for compressing the map outputs
>> >>    * @param defaultValue the value to return if it is not set
>> >>    * @return the CompressionCodec class that should be used to
>> >>    *         compress the map outputs
>> >>    * @throws IllegalArgumentException if the class was specified, but
>> >>    *         not found
>> >>    */
>> >>   public Class<? extends CompressionCodec> getMapOutputCompressorClass(
>> >>       Class<? extends CompressionCodec> defaultValue) {
>> >>     String name = get("mapred.output.compression.codec");
>> >>     if (name == null) {
>> >>       return defaultValue;
>> >>     } else {
>> >>       try {
>> >>         return getClassByName(name).asSubclass(CompressionCodec.class);
>> >>       } catch (ClassNotFoundException e) {
>> >>         throw new IllegalArgumentException("Compression codec " + name +
>> >>                                            " was not found.", e);
>> >>       }
>> >>     }
>> >>   }
>> >>
>> >> This could be easily fixed by using a different property, for example,
>> >> "map.output.compression.codec". Should I create an issue on JIRA for
>> >> this? Thanks.
>> >>
>> >> Riccardo
>> >>
>> >>
>>
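For reference, the change proposed above might look roughly like this
against 0.13's JobConf. Two hedges: "map.output.compression.codec" is only
the example name suggested in the mail (the committed patch may pick a
different one), and the fallback to the old shared key is an extra
backward-compatibility idea, not something from the thread:

    /** Set the given class as the compression codec for the map outputs. */
    public void setMapOutputCompressorClass(Class<? extends
        CompressionCodec> codecClass) {
      setCompressMapOutput(true);
      // Map-specific property; no longer collides with the job output
      // codec stored under "mapred.output.compression.codec".
      setClass("map.output.compression.codec", codecClass,
               CompressionCodec.class);
    }

    /** Get the codec for compressing the map outputs. */
    public Class<? extends CompressionCodec> getMapOutputCompressorClass(
        Class<? extends CompressionCodec> defaultValue) {
      // Read the new key first, falling back to the old shared key so
      // existing job configurations keep working.
      String name = get("map.output.compression.codec",
                        get("mapred.output.compression.codec"));
      if (name == null) {
        return defaultValue;
      }
      try {
        return getClassByName(name).asSubclass(CompressionCodec.class);
      } catch (ClassNotFoundException e) {
        throw new IllegalArgumentException("Compression codec " + name +
                                           " was not found.", e);
      }
    }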
