[jira] Commented: (HADOOP-474) support compressed text files as input and output

Owen O'Malley (JIRA) Fri, 25 Aug 2006 13:51:35 -0700

    [ http://issues.apache.org/jira/browse/HADOOP-474?page=comments#action_12430621 ]

Owen O'Malley commented on HADOOP-474:
--------------------------------------


> Do we need io.compression.codecs if the codecs provide extensions? If so,
> then what's the point of having the codecs provide extensions?

Good point. I shouldn't encode the information twice. We do want the
io.compression.codecs property so that it is easy to extend the list of
potential codecs. I could either make io.compression.codecs a straight list of
codec classes or remove the getDefaultExtension method. Thoughts?
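
To make the first option concrete, here is a rough sketch of what a straight
list of codec classes might look like, with each codec reporting its own
extension. The Codec interface, the two Demo* classes, and the load helper are
hypothetical stand-ins for illustration only, not the interface from
HADOOP-441; only the getDefaultExtension name comes from this discussion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CodecListSketch {
    /** Hypothetical stand-in for the codec interface; only the extension accessor is modeled. */
    interface Codec {
        String getDefaultExtension();
    }

    /** Hypothetical demo codec that claims the ".gz" extension. */
    static class DemoGzipCodec implements Codec {
        public String getDefaultExtension() { return ".gz"; }
    }

    /** Hypothetical demo codec that claims the ".Z" extension. */
    static class DemoZipCodec implements Codec {
        public String getDefaultExtension() { return ".Z"; }
    }

    /** Instantiates each class in a comma-separated list and indexes it by the extension it reports. */
    static Map<String, Codec> load(String classList) {
        Map<String, Codec> byExtension = new LinkedHashMap<>();
        for (String name : classList.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            try {
                Codec codec = (Codec) Class.forName(trimmed)
                        .getDeclaredConstructor().newInstance();
                byExtension.put(codec.getDefaultExtension(), codec);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot load codec " + trimmed, e);
            }
        }
        return byExtension;
    }

    public static void main(String[] args) {
        // The property value would be a plain list of class names, with no suffixes.
        String value = DemoGzipCodec.class.getName() + "," + DemoZipCodec.class.getName();
        System.out.println(load(value).keySet());
    }
}
```

With this shape the suffix lives in exactly one place (the codec), and the
property only controls which codecs are considered.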

> Should the output property instead be named
> mapred.text.output.compression.codec? In other words, will we use this same
> property to name the codecs for other output formats, or is this property
> for text files alone (in which case it should have 'text' in its name)?

I would think that the SequenceFileOutputFormat (under HADOOP-441) should use 
the same property. At least that was my thought.

> support compressed text files as input and output
> -------------------------------------------------
>
>                 Key: HADOOP-474
>                 URL: http://issues.apache.org/jira/browse/HADOOP-474
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.6.0
>
>
> I'd like TextInputFormat and TextOutputFormat to automatically compress and 
> uncompress text files when they are read and written. Furthermore, I'd like 
> to be able to use custom compressors as defined in HADOOP-441. Therefore, I 
> propose:
> Adding a map of compression codecs in the server config files:
> io.compression.codecs = "<suffix>=<codec class>,..."
> so the default would be something like:
> <property>
>   <name>io.compression.codecs</name>
>   <value>.gz=org.apache.hadoop.io.GZipCodec,.Z=org.apache.hadoop.io.ZipCodec</value>
>   <description>A list of file suffixes and the codecs for them.</description>
> </property>
> Note that the suffix can include multiple "." characters, so you could
> support suffixes like ".tar.gz", but they are just treated as literals
> matched against the end of the filename.
> If the TextInputFormat is dealing with such a file, it:
>   1. makes a single split
>   2. decompresses automatically
> On the output side, if mapred.output.compress is true, then TextOutputFormat
> would use a new property mapred.output.compression.codec that would define
> the codec to use to compress the outputs, defaulting to gzip.
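
The "<suffix>=<codec class>,..." map from the description could be handled
roughly as sketched below. This is only an illustration under assumptions: the
SuffixCodecMap class and its parse and codecFor helpers are hypothetical names,
and preferring the longest matching suffix when several match (for example
".gz" and ".tar.gz") is an assumed tie-break that the proposal does not
specify. Codec class names are kept as plain strings; nothing is loaded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SuffixCodecMap {
    /** Parses "<suffix>=<codec class>,..." into an ordered suffix -> class-name map. */
    static Map<String, String> parse(String spec) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0 || eq == trimmed.length() - 1) {
                throw new IllegalArgumentException("Bad codec entry: " + trimmed);
            }
            result.put(trimmed.substring(0, eq).trim(),
                       trimmed.substring(eq + 1).trim());
        }
        return result;
    }

    /**
     * Returns the class name whose suffix literally matches the end of the
     * filename, or null if none does. Longest-suffix-wins is an assumption.
     */
    static String codecFor(Map<String, String> codecs, String filename) {
        String best = null;
        int bestLength = -1;
        for (Map.Entry<String, String> e : codecs.entrySet()) {
            String suffix = e.getKey();
            if (filename.endsWith(suffix) && suffix.length() > bestLength) {
                best = e.getValue();
                bestLength = suffix.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Value taken from the example default in the issue description.
        Map<String, String> codecs = parse(
            ".gz=org.apache.hadoop.io.GZipCodec,.Z=org.apache.hadoop.io.ZipCodec");
        System.out.println(codecFor(codecs, "part-00000.gz"));
        System.out.println(codecFor(codecs, "part-00000.txt"));
    }
}
```

A null result would correspond to treating the file as ordinary uncompressed
text.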

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
