[ https://issues.apache.org/jira/browse/HADOOP-6901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rick Weber resolved HADOOP-6901.
--------------------------------
    Resolution: Abandoned

Marking as abandoned. Issue is 14 years old and Dumbo usage is no longer an issue/problem.

> Parsing large compressed files with HADOOP-1722 spawns multiple mappers per
> file
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-6901
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6901
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>         Environment: Hadoop v0.20.2 + HADOOP-1722
>            Reporter: Rick Weber
>            Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> This was originally discovered while using Dumbo to parse a very large (2 GB)
> compressed file. By default, Dumbo will attempt to use AutoInputFormat
> as the input format.
> Here is my use case:
> I have a large (2 GB) compressed file. It is compressed using the default
> codec, which is gzip-based and unsplittable. Despite this, the
> default implementation of AutoInputFormat reports that the file is splittable.
> As a result, the file is split into about 35 parts, and each part is assigned
> to a map task.
> However, since the file itself is unsplittable, each map task winds up parsing
> the file again from the beginning. This effectively gives my job 35x the
> data, so it takes 35x as long to run.
> If I set "-inputformat text", this problem does not appear in Dumbo. If I
> manually invoke the streaming jar and use AutoInputFormat, the
> problem appears.
> Looking at the code in AutoInputFormat, it appears to inherit the default
> isSplittable() method from InputFormat, which indicates everything is
> splittable. I think this class should define its own isSplittable
> method similar to the one in TextInputFormat, which
> essentially says a file is splittable only if it is not compressed.
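A minimal, dependency-free sketch of the check the reporter proposes. In a real fix, AutoInputFormat would override its splittability method and consult Hadoop's CompressionCodecFactory, as TextInputFormat does; here a hypothetical class `AutoSplitCheck` with a simple filename-suffix test stands in for the codec lookup:

```java
// Hypothetical sketch: treat a file as splittable only when no
// whole-stream (gzip-based) compression codec applies to it. The
// suffix list below is an illustrative stand-in for Hadoop's
// CompressionCodecFactory lookup, not the actual API.
public class AutoSplitCheck {

    // Suffixes produced by the default, unsplittable codecs.
    private static final String[] UNSPLITTABLE_SUFFIXES = {".gz", ".deflate"};

    public static boolean isSplittable(String fileName) {
        for (String suffix : UNSPLITTABLE_SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                // A gzip stream must be decoded from the start, so any
                // split other than the first would re-read the whole file.
                return false;
            }
        }
        return true; // uncompressed: safe to split across multiple mappers
    }

    public static void main(String[] args) {
        System.out.println(isSplittable("logs/part-00000")); // true
        System.out.println(isSplittable("logs/big-input.gz")); // false
    }
}
```

With such a guard in place, the 2 GB gzip file would be handed to a single mapper instead of ~35, avoiding the repeated full-file parses described above.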
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org