[ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557061#action_12557061 ]
Doug Cutting commented on HADOOP-1824:
--------------------------------------

> 2. Override the getSplits() method to read each file's InputStream

I think getSplits() should construct a split for each element of java.util.zip.ZipFile#entries().

> 3. Create FileSplits [ ... ]

We should probably extend FileSplit or InputSplit specifically for zip files. The fields needed per split are the archive file's path and the path of the file within the archive. I don't think there's much point in supporting splits smaller than a file within the zip archive, so start and end offsets are not required here.

> 4. Implement class ZipRecordReader to read each zip entry in its split
> Using LineRecordReader.

We should be able to use LineRecordReader directly, passing its constructor the result of ZipFile#getInputStream().

> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Doug Cutting
>
> HDFS is inefficient with large numbers of small files. Thus one might pack
> many small files into large, compressed archives. But, for efficient
> map-reduce operation, it is desirable to be able to split inputs into
> smaller chunks, with one or more small original files per split. The zip
> format, unlike tar, permits enumeration of files in the archive without
> scanning the entire archive. Thus a zip InputFormat could efficiently permit
> splitting large archives into splits that contain one or more archived files.
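
For illustration, the per-entry split idea in the comment can be sketched with plain java.util.zip, outside of Hadoop. This is only a rough sketch under several assumptions: ZipSplitSketch, ZipEntrySplit, listSplits and readLines are hypothetical names, not Hadoop classes; the archive is a local file (ZipFile does not read from HDFS directly); entries are treated as UTF-8 text; directory entries are skipped; and a BufferedReader stands in for LineRecordReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipSplitSketch {

    /** Hypothetical split descriptor: archive path plus entry path, no offsets. */
    public static final class ZipEntrySplit {
        public final String archivePath;
        public final String entryName;

        public ZipEntrySplit(String archivePath, String entryName) {
            this.archivePath = archivePath;
            this.entryName = entryName;
        }
    }

    /** Builds one split per file entry by enumerating ZipFile#entries(). */
    public static List<ZipEntrySplit> listSplits(String archivePath) throws IOException {
        List<ZipEntrySplit> splits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archivePath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Assumption: directory entries carry no records, so skip them.
                if (!entry.isDirectory()) {
                    splits.add(new ZipEntrySplit(archivePath, entry.getName()));
                }
            }
        }
        return splits;
    }

    /** Reads the lines of one split from the stream returned by ZipFile#getInputStream(). */
    public static List<String> readLines(ZipEntrySplit split) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipFile zip = new ZipFile(split.archivePath)) {
            ZipEntry entry = zip.getEntry(split.entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + split.entryName);
            }
            // Assumption: UTF-8 text; a real record reader would consume this stream instead.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(zip.getInputStream(entry), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ZipSplitSketch /path/to/archive.zip
        for (ZipEntrySplit split : listSplits(args[0])) {
            System.out.println(split.entryName + ": " + readLines(split).size() + " lines");
        }
    }
}
```

Because each split names a whole entry, the descriptor needs no start or end offsets, which matches the point made above about not splitting below the level of an archived file.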