[ http://issues.apache.org/jira/browse/HADOOP-474?page=comments#action_12430621 ]

Owen O'Malley commented on HADOOP-474:
--------------------------------------

> Do we need io.compression.codecs if the codecs provide extensions? If so,
> then what's the point of having the codecs provide extensions?

Good point. I shouldn't encode the information twice. We do want io.compression.codecs so that it is easy to extend the list of potential codecs. I could either make io.compression.codecs a straight list of codec classes or remove the getDefaultExtension method. Thoughts?

> Should the output property instead be named
> mapred.text.output.compression.codec?
> In other words, will we use this same property to name the codecs for other
> output formats, or is this property for text files alone (in which case it
> should have 'text' in its name)?

I would think that the SequenceFileOutputFormat (under HADOOP-441) should use the same property. At least that was my thought.

> support compressed text files as input and output
> -------------------------------------------------
>
>                 Key: HADOOP-474
>                 URL: http://issues.apache.org/jira/browse/HADOOP-474
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.6.0
>
>
> I'd like TextInputFormat and TextOutputFormat to automatically compress and
> uncompress text files when they are read and written. Furthermore, I'd like
> to be able to use custom compressors as defined in HADOOP-441. Therefore, I
> propose:
> Adding a map of compression codecs in the server config files:
> io.compression.codecs = "<suffix>=<codec class>,..."
> so the default would be something like:
> <property>
>   <name>io.compression.codecs</name>
>   <value>.gz=org.apache.hadoop.io.GZipCodec,.Z=org.apache.hadoop.io.ZipCodec</value>
>   <description>A list of file suffixes and the codecs for them.</description>
> </property>
> Note that the suffix can include multiple "." so you could support suffixes
> like ".tar.gz", but they are just treated as literals against the end of the
> filename.
> If the TextInputFormat is dealing with such a file, it:
>   1. makes a single split
>   2. decompresses automatically
> On the output side, if mapred.output.compress is true, then TextOutputFormat
> would use a new property mapred.output.compression.codec that would define
> the codec to use to compress the outputs, defaulting to gzip.
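
For illustration only, the suffix lookup proposed in the issue could be sketched roughly as below. This is not code from any patch: the class CodecSuffixMap and its methods are hypothetical names, the codec is returned as a plain class-name string rather than an instantiated codec, and letting the longest matching suffix win (so ".tar.gz" beats ".gz") is an assumption that the issue text does not specify.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Hadoop): maps file suffixes to codec class names. */
final class CodecSuffixMap {
    // Suffix (e.g. ".gz") -> codec class name, in configuration order.
    private final Map<String, String> codecBySuffix = new LinkedHashMap<>();

    /** Parses a property value of the form "<suffix>=<codec class>,...". */
    static CodecSuffixMap parse(String value) {
        CodecSuffixMap result = new CodecSuffixMap();
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0 || eq == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected <suffix>=<codec class>, got: " + trimmed);
            }
            result.codecBySuffix.put(
                    trimmed.substring(0, eq).trim(),
                    trimmed.substring(eq + 1).trim());
        }
        return result;
    }

    /**
     * Returns the codec class name whose suffix literally matches the end of
     * the file name, or null if none matches. Assumption: when several
     * suffixes match, the longest one wins.
     */
    String codecFor(String fileName) {
        String bestSuffix = "";
        String bestCodec = null;
        for (Map.Entry<String, String> e : codecBySuffix.entrySet()) {
            String suffix = e.getKey();
            if (fileName.endsWith(suffix) && suffix.length() > bestSuffix.length()) {
                bestSuffix = suffix;
                bestCodec = e.getValue();
            }
        }
        return bestCodec;
    }
}
```

With the default value quoted in the issue, CodecSuffixMap.parse(...).codecFor("part-00000.gz") would return the GZipCodec class name, and a file with no configured suffix would return null, which a caller could treat as "not compressed".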