[ http://issues.apache.org/jira/browse/HADOOP-474?page=all ]
Doug Cutting updated HADOOP-474:
--------------------------------

    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

I just committed this. Thanks, Owen!

> support compressed text files as input and output
> -------------------------------------------------
>
>                 Key: HADOOP-474
>                 URL: http://issues.apache.org/jira/browse/HADOOP-474
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.6.0
>
>         Attachments: text-gz-2.patch, text-gz-3.patch, text-gz.patch
>
>
> I'd like TextInputFormat and TextOutputFormat to automatically compress and
> uncompress text files when they are read and written. Furthermore, I'd like
> to be able to use custom compressors as defined in HADOOP-441. Therefore, I
> propose:
> Adding a map of compression codecs in the server config files:
> io.compression.codecs = "<suffix>=<codec class>,..."
> so the default would be something like:
> <property>
>   <name>io.compression.codecs</name>
>   <value>.gz=org.apache.hadoop.io.GZipCodec,.Z=org.apache.hadoop.io.ZipCodec</value>
>   <description>A list of file suffixes and the codecs for them.</description>
> </property>
> Note that a suffix can include multiple "." characters, so you could support
> suffixes like ".tar.gz", but suffixes are just treated as literals matched
> against the end of the filename.
> If the TextInputFormat is dealing with such a file, it:
>   1. makes a single split
>   2. decompresses automatically
> On the output side, if mapred.output.compress is true, then TextOutputFormat
> would use a new property mapred.output.compression.codec that would define
> the codec to use to compress the outputs, defaulting to gzip.
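
For readers who want to see the suffix lookup from the proposal concretely, here is a minimal sketch in Python. It is not the committed Hadoop code: the function names parse_codec_map and codec_for are hypothetical, and the longest-suffix-wins rule is an assumption, since the proposal only says that suffixes are matched as literals against the end of the filename.

```python
def parse_codec_map(value):
    """Parse '<suffix>=<codec class>,...' into a dict of suffix -> class name."""
    codecs = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas
        suffix, _, class_name = entry.partition("=")
        codecs[suffix.strip()] = class_name.strip()
    return codecs


def codec_for(filename, codecs):
    """Return the codec class name whose suffix ends the filename, or None.

    Assumption: when several suffixes match (e.g. '.gz' and '.tar.gz'),
    the longest one wins. The proposal does not specify this.
    """
    for suffix in sorted(codecs, key=len, reverse=True):
        if filename.endswith(suffix):
            return codecs[suffix]
    return None


# The default value suggested in the issue description.
DEFAULT_CODECS = ".gz=org.apache.hadoop.io.GZipCodec,.Z=org.apache.hadoop.io.ZipCodec"
codecs = parse_codec_map(DEFAULT_CODECS)

for name in ("part-00000.gz", "part-00000.Z", "part-00000.txt"):
    print(name, "->", codec_for(name, codecs))
```

A file with no matching suffix yields None, which corresponds to the uncompressed case where the input can still be split normally.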