[ http://issues.apache.org/jira/browse/HADOOP-474?page=all ]

Owen O'Malley updated HADOOP-474:
---------------------------------

    Attachment: text-gz.patch

This patch:
  1. Fixes TextInputFormat to work with non-ASCII UTF-8
  2. Adds a gzip codec for reading and writing .gz files
  3. Exposes Text.set(byte[], int, int) so that you can set the Text from a 
byte array at a non-zero offset with an explicit length.
  4. Renames Text.validateUTF to validateUTF8
  5. Adds a CompressionCodecFactory that finds a codec based on a filename 
extension. The factory includes static methods to set/get the list of codecs in 
a Configuration.
  6. Adds test cases for reading gzipped files for TextInputFormat
  7. Adds test cases for reading UTF-8 via TextInputFormat
  8. Adds test cases for the CompressionCodecFactory
  9. InputFormatBase gets a new virtual method to determine whether a file is 
splittable.
  10. TextInputFormat exposes a readLine method that reads bytes until a 
newline.
  11. TextOutputFormat will write compressed text files with a configurable 
compression codec.
  12. Removes an extra loop through the splits to count the number of bytes.
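
To illustrate items 1, 2, and 10 together, here is a minimal sketch that uses only the JDK (java.util.zip), not the Hadoop classes in the patch. The class name GzipLineSketch and its methods are hypothetical and are not taken from text-gz.patch. It splits a decompressed stream on the newline byte and decodes each line as UTF-8 only afterwards, which is safe because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not the code from text-gz.patch.
class GzipLineSketch {

    /** Reads bytes up to (not including) the next '\n'; returns null at end of stream. */
    static byte[] readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null; // nothing left to read
        }
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            line.write(b);
            b = in.read();
        }
        return line.toByteArray();
    }

    /** Compresses text (encoded as UTF-8) into an in-memory gzip stream. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Decompresses the gzip data and returns its lines decoded as UTF-8. */
    static List<String> readLines(byte[] gzipped) {
        List<String> lines = new ArrayList<>();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] raw;
            while ((raw = readLine(in)) != null) {
                // Decode only after splitting, so multi-byte characters stay intact.
                lines.add(new String(raw, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Non-ASCII characters are written as escapes to avoid source-encoding issues.
        System.out.println(readLines(gzip("h\u00e9llo\nw\u00f6rld\n")));
    }
}
```

Reading one byte at a time keeps the sketch short; a real reader would presumably buffer its input.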

> support compressed text files as input and output
> -------------------------------------------------
>
>                 Key: HADOOP-474
>                 URL: http://issues.apache.org/jira/browse/HADOOP-474
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.6.0
>
>         Attachments: text-gz.patch
>
>
> I'd like TextInputFormat and TextOutputFormat to automatically decompress 
> and compress text files when they are read and written. Furthermore, I'd like 
> to be able to use custom compressors as defined in HADOOP-441. Therefore, I 
> propose:
> Adding a map of compression codecs in the server config files:
> io.compression.codecs = "<suffix>=<codec class>,..."
> so the default would be something like:
> <property>
>   <name>io.compression.codecs</name>
>   <value>.gz=org.apache.hadoop.io.GZipCodec,.Z=org.apache.hadoop.io.ZipCodec</value>
>   <description>A list of file suffixes and the codecs for them.</description>
> </property>
> note that the suffix can include multiple "." so you could support suffixes 
> like ".tar.gz", but they are just treated as literals against the end of the 
> filename.
> If the TextInputFormat is dealing with such a file, it:
>   1. makes a single split
>   2. decompresses automatically
> On the output side, if mapred.output.compress is true, then TextOutputFormat 
> would use a new property mapred.output.compression.codec that would define 
> the codec to use to compress the outputs, defaulting to gzip.
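
The suffix mapping proposed above can be sketched in plain Java. This is only an illustration of the "<suffix>=<codec class>,..." format and of matching suffixes as literals against the end of the filename; the class SuffixCodecSketch and its methods are hypothetical, and the codec class names are handled as opaque strings. The rule that the longest matching suffix wins when several match (for example ".gz" and ".tar.gz") is an assumption, since the proposal does not say how to break ties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the proposed io.compression.codecs format.
class SuffixCodecSketch {

    /** Parses "<suffix>=<codec class>,..." into a suffix -> class-name map. */
    static Map<String, String> parse(String spec) {
        Map<String, String> codecs = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0 || eq == trimmed.length() - 1) {
                throw new IllegalArgumentException("Bad codec entry: " + trimmed);
            }
            codecs.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return codecs;
    }

    /**
     * Returns the codec class name whose suffix literally ends the filename,
     * or null if none matches. Assumption: the longest matching suffix wins.
     */
    static String codecFor(Map<String, String> codecs, String filename) {
        String best = null;
        for (String suffix : codecs.keySet()) {
            if (filename.endsWith(suffix)
                    && (best == null || suffix.length() > best.length())) {
                best = suffix;
            }
        }
        return best == null ? null : codecs.get(best);
    }

    public static void main(String[] args) {
        Map<String, String> codecs = parse(
            ".gz=org.apache.hadoop.io.GZipCodec,.Z=org.apache.hadoop.io.ZipCodec");
        System.out.println(codecFor(codecs, "part-00000.gz"));
        System.out.println(codecFor(codecs, "part-00000.txt"));
    }
}
```

Because the match is a literal comparison, it is case-sensitive: with the default above, "data.Z" matches but "data.z" does not.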

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
