HADOOP-1824 | Proposed implementation

Goel, Ankur Wed, 26 Dec 2007 04:11:44 -0800

 
Hi,
   I am working on an InputFormat for zip files, as required by
HADOOP-1824. I would like to propose a simple approach and invite
comments and suggestions from the community on my implementation.


Implementation Approach
-----------------------

1. Implement class ZipInputFormat to extend FileInputFormat.

2. Override the getSplits() method to read each file's
   InputStream and construct a ZipInputStream out of it.

3. Create FileSplits such that each file split has the following
   properties:
      *  FileSplit.start = index of the first zip entry in the split.
      *  FileSplit.length = index of the last zip entry in the split.
      *  FileSplit.file = the zip file.
      *  Sum of compressed sizes of the split's zip entries <= splitSize

   For example, start = 3, length = 6 signifies that zip entries 3 to 6
   will be read from the zip file of this split.

4. Implement class ZipRecordReader to read each zip entry in its split
   using LineRecordReader.
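
To make steps 3 and 4 concrete, here is a rough sketch of the grouping
and reading logic using only java.util.zip, with no Hadoop classes. The
names ZipSplitSketch, EntryRange, plan, readLines and zipOf are
hypothetical and exist only for illustration. EntryRange stands in for
the (start, length) pair of a FileSplit, and a plain BufferedReader
stands in for LineRecordReader. It also assumes the compressed size of
every entry is already known when the splits are planned.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSplitSketch {

    /** Inclusive range of zip entry indices covered by one split (hypothetical type). */
    public static final class EntryRange {
        public final int first;
        public final int last;

        public EntryRange(int first, int last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return "[" + first + ".." + last + "]";
        }
    }

    /** Greedily packs consecutive entries so each range stays within splitSize where possible. */
    public static List<EntryRange> plan(long[] compressedSizes, long splitSize) {
        List<EntryRange> ranges = new ArrayList<>();
        int first = 0;
        long total = 0;
        for (int i = 0; i < compressedSizes.length; i++) {
            // Close the current range if adding this entry would exceed the budget.
            // A range always keeps at least one entry, since an entry cannot be subdivided.
            if (i > first && total + compressedSizes[i] > splitSize) {
                ranges.add(new EntryRange(first, i - 1));
                first = i;
                total = 0;
            }
            total += compressedSizes[i];
        }
        if (compressedSizes.length > 0) {
            ranges.add(new EntryRange(first, compressedSizes.length - 1));
        }
        return ranges;
    }

    /** Reads every line of the entries in the given range, one entry at a time. */
    public static List<String> readLines(byte[] zipBytes, EntryRange range) {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (int index = 0; index <= range.last && zin.getNextEntry() != null; index++) {
                if (index < range.first) {
                    continue; // entry belongs to an earlier split
                }
                // Do not close this reader: that would close the shared zip stream.
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(zin, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Builds a small in-memory zip with one entry per string, for demonstration only. */
    public static byte[] zipOf(String... contents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            for (int i = 0; i < contents.length; i++) {
                zout.putNextEntry(new ZipEntry("entry" + i + ".txt"));
                zout.write(contents[i].getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(plan(new long[] {40, 30, 50, 120, 10}, 100));
        byte[] zip = zipOf("a1\na2\n", "b1\n", "c1\n");
        System.out.println(readLines(zip, new EntryRange(1, 2)));
    }
}
```

Note that in this sketch an entry larger than splitSize ends up in a
split of its own, so the size bound in step 3 holds only when no single
entry exceeds splitSize. Each reader also has to skip over the entries
that precede its range, because a ZipInputStream can only be read
sequentially.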

I think I may need to deal with CompressionCodecFactory and other
classes related to compression. How exactly is not very clear to me,
so any hints here would be useful.

Apart from the above, please let me know if there is anything that I am
missing.

Thanks
-Ankur
