[ https://issues.apache.org/jira/browse/MAPREDUCE-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356189#comment-17356189 ]
Sebastien Crocquevieille commented on MAPREDUCE-210:
----------------------------------------------------

[~indrajeetapache], [~cutting] quick ping here.

Any chance of waking up this issue from its deep slumber?

If the previous work done on this issue is too dusty, as [~harisekhon] said, there is a third-party format here:
https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop

With the associated blog post: [http://cutler.io/2012/07/hadoop-processing-zip-files-in-mapreduce/]

We'd all be terribly grateful :)

> want InputFormat for zip files
> ------------------------------
>
>                 Key: MAPREDUCE-210
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-210
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Doug Cutting
>            Assignee: indrajit
>            Priority: Major
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files. Thus one might pack
> many small files into large, compressed archives. But, for efficient
> map-reduce operation, it is desirable to be able to split inputs into
> smaller chunks, with one or more small original files per split. The zip
> format, unlike tar, permits enumeration of files in the archive without
> scanning the entire archive. Thus a zip InputFormat could efficiently permit
> splitting large archives into splits that contain one or more archived files.
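
For readers new to the issue, the idea in the quoted description can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: it is not the attached ZipInputFormat_fixed.patch, not the third-party format linked above, and not a Hadoop API. The class name ZipSplitSketch, the method planSplits, and the targetBytes parameter are all hypothetical. It uses java.util.zip.ZipFile, which lists entries from the archive's central directory rather than decompressing the data, and it only works on a local file; a real InputFormat would have to do the equivalent over a seekable HDFS stream.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical sketch: plan "splits" of archived files without reading their contents. */
public class ZipSplitSketch {

    /**
     * Groups the files in a local zip archive into splits holding roughly
     * targetBytes of compressed data each. Every split gets at least one file,
     * so a file larger than targetBytes ends up in a split of its own.
     */
    public static List<List<String>> planSplits(File zip, long targetBytes) throws IOException {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;

        // ZipFile enumerates entries from the central directory; no entry data is inflated here.
        try (ZipFile zf = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directories carry no data to process
                }
                // getCompressedSize() returns -1 when unknown; treat that as 0 in this sketch.
                long size = Math.max(entry.getCompressedSize(), 0L);
                if (!current.isEmpty() && currentBytes + size > targetBytes) {
                    splits.add(current); // close the current split and start a new one
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(entry.getName());
                currentBytes += size;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ZipSplitSketch <archive.zip> <targetBytes>");
            return;
        }
        for (List<String> split : planSplits(new File(args[0]), Long.parseLong(args[1]))) {
            System.out.println(split);
        }
    }
}
```

Each inner list stands in for what one map task would receive. Grouping by compressed size is just one possible policy; grouping by file count or by uncompressed size would work the same way.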