[ https://issues.apache.org/jira/browse/HADOOP-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daehyun Kim updated HADOOP-4652:
--------------------------------
Attachment: (was: HADOOP-4652-forCONTRIB.patch)
> RAgzip: multiple map tasks for a large gzipped file
> ---------------------------------------------------
>
> Key: HADOOP-4652
> URL: https://issues.apache.org/jira/browse/HADOOP-4652
> Project: Hadoop Core
> Issue Type: Improvement
> Components: io, mapred, native
> Affects Versions: 0.18.2
> Reporter: Daehyun Kim
>
> Currently, Hadoop processes each gzipped file with only one map task.
> We have made a patch that enables multiple map tasks for one large *gzipped*
> file. We call the patch RAgzip.
> To run multiple map tasks over a gzipped file, you can use RAgzip by simply
> changing the InputFormat to RAGZIPInputFormat.
> The options that RAGZIPInputFormat accepts are described in the javadoc of
> RAGZIPInputFormat.
> RAgzip uses zlib's inflatePrime function, which makes random access into a
> gzipped file possible.
> Because inflatePrime was introduced in zlib 1.2.2.4, RAgzip requires
> zlib 1.2.2.4 or higher. (We tested with zlib 1.2.3.)
> RAgzip requires a preprocessing step that creates an access point (.ap)
> file, which serves as an index of the chunks of the gzipped file.
> The access point (.ap) file is stored in the same directory as the gzipped
> file. For example, for "/user/hadoop/test.gz" the .ap file is created as
> "/user/hadoop/test.gz.ap".
> We made two patches.
> 1. The first changes the source of the Hadoop core. This is the main
> patch.
> Use this patch if every machine in the Hadoop cluster has zlib 1.2.2.4 or
> higher.
> 2. Use the other patch if any machine in the Hadoop cluster has a zlib
> version lower than 1.2.2.4.
> This patch links zlib statically, so once you compile it on a machine with
> zlib 1.2.2.4 or higher, RAgzip can be used across the cluster even if some
> machines have an older zlib.
> The second patch produces a jar file
> (build/contrib/ragzip/hadoop-x.xx.x-dev-ragzip.jar) when it is installed.
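
The two mechanical rules in the description above (the .ap file name and the zlib 1.2.2.4 minimum) can be sketched in a few lines of Python. This is only an illustration, not code from either patch: the helper names access_point_path, parse_zlib_version and supports_inflate_prime are hypothetical. The runtime check at the bottom inspects the zlib that the Python interpreter is linked against, which is not necessarily the zlib used by Hadoop's native library on the same machine.

```python
import re
import zlib

# Minimum zlib release that provides inflatePrime, as stated in the issue.
MIN_ZLIB = (1, 2, 2, 4)


def parse_zlib_version(version):
    """Turn a version string such as '1.2.3' into a 4-tuple of ints.

    Only the leading dotted numeric part is used, so vendor suffixes are
    ignored; missing components are padded with zeros.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("unrecognised zlib version: %r" % version)
    parts = [int(p) for p in match.group(0).split(".")][:4]
    return tuple(parts + [0] * (4 - len(parts)))


def supports_inflate_prime(version):
    """True if the given zlib version is 1.2.2.4 or higher."""
    return parse_zlib_version(version) >= MIN_ZLIB


def access_point_path(gz_path):
    """Return the .ap path, which sits next to the gzipped file (hypothetical helper)."""
    return gz_path + ".ap"


if __name__ == "__main__":
    print(access_point_path("/user/hadoop/test.gz"))
    # Checks the zlib linked into this Python interpreter only.
    print(supports_inflate_prime(zlib.ZLIB_RUNTIME_VERSION))
```

A check like this could be run on each node to decide between the two patches: if any node reports a version below 1.2.2.4, the statically linked contrib patch is the one the description recommends.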