Hi,

I am working on an InputFormat for zip files, as required by HADOOP-1824. I would like to propose a simple approach and invite comments and suggestions from the community on my implementation.

Implementation Approach
-----------------------
1. Implement class ZipInputFormat to extend FileInputFormat.
2. Override the getSplits() method to read each file's InputStream and construct a ZipInputStream from it.
3. Create FileSplits such that each file split has the following properties:
   * FileSplit.start = index of the first zip entry in the split.
   * FileSplit.length = index of the last zip entry in the split.
   * FileSplit.file = the zip file.
   * Sum of the compressed sizes of its zip entries <= splitSize.
   For example, start = 3, length = 6 signifies that zip entries 3 to 6 will be read from the zip file of this split.
4. Implement class ZipRecordReader to read each zip entry in its split using LineRecordReader.

I think I may need to deal with CompressionCodecFactory and other compression-related classes, but exactly how is not clear to me, so any hints here would be useful.

Apart from the above, please let me know if there is anything I am missing.

Thanks
-Ankur
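
P.S. To make step 3 concrete, here is a rough sketch of how the entries might be grouped. It is only an illustration under assumptions, not the final code: ZipSplitPlanner, EntryRange and plan() are hypothetical names, it uses no Hadoop classes, and it assumes the compressed size of every entry is already known and non-negative (ZipEntry.getCompressedSize() can return -1 when the size is unknown). An entry that is larger than splitSize on its own gets a split to itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups consecutive zip entries
// into ranges whose total compressed size stays within splitSize.
final class ZipSplitPlanner {

    /** Inclusive range of zip entry indices covered by one split. */
    static final class EntryRange {
        final int start; // index of the first entry (FileSplit.start in the proposal)
        final int end;   // index of the last entry (FileSplit.length in the proposal)

        EntryRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Walks the entries in order and closes the current range whenever
     * adding the next entry would push the sum above splitSize.
     * Assumes every value in compressedSizes is known and non-negative.
     */
    static List<EntryRange> plan(long[] compressedSizes, long splitSize) {
        List<EntryRange> ranges = new ArrayList<>();
        int start = 0;
        long sum = 0;
        for (int i = 0; i < compressedSizes.length; i++) {
            long size = compressedSizes[i];
            // Only close a range that already holds at least one entry,
            // so an oversized entry still ends up in a split of its own.
            if (i > start && sum + size > splitSize) {
                ranges.add(new EntryRange(start, i - 1));
                start = i;
                sum = 0;
            }
            sum += size;
        }
        // Emit the trailing range, if the zip file had any entries at all.
        if (compressedSizes.length > 0) {
            ranges.add(new EntryRange(start, compressedSizes.length - 1));
        }
        return ranges;
    }
}
```

In getSplits(), each EntryRange would then become one FileSplit on the zip file, with start and end stored in FileSplit.start and FileSplit.length as described in step 3.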