[ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ankur updated HADOOP-1824:
--------------------------

    Affects Version/s: 0.15.2
               Status: Patch Available  (was: Open)

* This patch does not modify any existing source file and adds 3 new files:
    1. ZipInputFormat.java
    2. ZipSplit.java
    3. TestZipInputFormat.java
* ZipInputFormat creates one split for each zip entry in an input zip file.
* Each split is of type ZipSplit and is read using a LineRecordReader.
* TestZipInputFormat is the unit test that exercises ZipInputFormat with zip files containing different numbers of entries.
* More information is available in the javadoc.

> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack
> many small files into large, compressed archives.  But, for efficient
> map-reduce operation, it is desirable to be able to split inputs into
> smaller chunks, with one or more small original files per split.  The zip
> format, unlike tar, permits enumeration of files in the archive without
> scanning the entire archive.  Thus a zip InputFormat could efficiently permit
> splitting large archives into splits that contain one or more archived files.
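
For readers following the issue, the one-split-per-entry idea can be sketched in plain Java. This is not the code from the attached patch. It is a minimal illustration that assumes the archive is on the local filesystem (java.util.zip.ZipFile cannot open HDFS paths directly), and the names ZipEntrySplits, EntrySplit and planSplits are hypothetical. It relies on ZipFile reading the archive's central directory, so entries are listed without decompressing any data.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical sketch: plans one "split" per file entry in a local zip archive.
final class ZipEntrySplits {

    // Hypothetical stand-in for a split; it is NOT the ZipSplit class from the patch.
    static final class EntrySplit {
        final String entryName;
        final long uncompressedSize; // -1 if the archive does not record it

        EntrySplit(String entryName, long uncompressedSize) {
            this.entryName = entryName;
            this.uncompressedSize = uncompressedSize;
        }
    }

    // Lists the archive's entries from its central directory and returns
    // one EntrySplit per file entry, in archive order. Directories are skipped.
    static List<EntrySplit> planSplits(String zipPath) throws IOException {
        List<EntrySplit> splits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // a directory entry carries no records to read
                }
                splits.add(new EntrySplit(entry.getName(), entry.getSize()));
            }
        }
        return splits;
    }
}
```

Each EntrySplit would then be handed to a reader that opens only that entry. Grouping several small entries into one split, as the issue description suggests, would be a straightforward extension of the loop above.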