[ 
https://issues.apache.org/jira/browse/HADOOP-6901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Weber resolved HADOOP-6901.
--------------------------------
    Resolution: Abandoned

Marking as abandoned.  The issue is 14 years old and Dumbo usage is no longer 
an issue/problem.

> Parsing large compressed files with HADOOP-1722 spawns multiple mappers per 
> file
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-6901
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6901
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>         Environment: Hadoop v0.20.2 + HADOOP-1722
>            Reporter: Rick Weber
>            Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> This was originally discovered while using Dumbo to parse a very large 
> (2 GB) compressed file.  By default, Dumbo will attempt to use the 
> AutoInputFormat as the input format.  
> Here is my use case:
> I have a large (2 GB) compressed file.  It's compressed using the default 
> method, which is Gzip based and is unsplittable.  Regardless, the default 
> implementation of AutoInputFormat says that this file is splittable, and 
> because of its size it is split into about 35 parts, each of which is 
> assigned to its own map task.
> However, since the file itself is unsplittable, each map task winds up 
> parsing the file again from the beginning.  This basically means my job 
> processes 35x the data and takes 35x as long to run.
> If I set "-inputformat text", this problem does not appear in Dumbo.  If I 
> manually call the streaming jar and use AutoInputFormat, this problem 
> appears.
> Looking at the code in AutoInputFormat, it appears that it uses the default 
> isSplitable() method from FileInputFormat, which indicates everything is 
> splittable.  I think that this class should define its own isSplitable() 
> method, similar to what is defined in the TextInputFormat class, which 
> basically says a file is splittable only if it's not compressed.
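
The check proposed above can be sketched roughly as follows.  This is only an
illustration under stated assumptions, not the actual Hadoop code: it does not
depend on Hadoop, the class name, method names and suffix list are
hypothetical, and the suffix lookup stands in for the codec lookup that
TextInputFormat performs.  The split arithmetic is likewise simplified, and
the 64 MB split size is an assumed value.

```java
/**
 * Sketch only: models the "splittable only if not compressed" decision
 * without depending on Hadoop. All names here are hypothetical.
 */
final class SplitabilitySketch {

    // Assumed stand-in for a compression codec lookup by file extension.
    private static final String[] UNSPLITTABLE_SUFFIXES = {".gz", ".deflate"};

    /** Returns true when no known compression suffix matches the file name. */
    static boolean isSplitable(String fileName) {
        String lower = fileName.toLowerCase(java.util.Locale.ROOT);
        for (String suffix : UNSPLITTABLE_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Simplified map task count: ceiling of fileBytes / splitBytes when the
     * file is splittable, otherwise a single task for the whole file.
     */
    static long mapTaskCount(String fileName, long fileBytes, long splitBytes) {
        if (!isSplitable(fileName)) {
            return 1;
        }
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    public static void main(String[] args) {
        long twoGiB = 2L * 1024 * 1024 * 1024;
        long splitSize = 64L * 1024 * 1024; // assumed split size
        System.out.println(mapTaskCount("input.gz", twoGiB, splitSize));  // prints 1
        System.out.println(mapTaskCount("input.txt", twoGiB, splitSize)); // prints 32
    }
}
```

With a check like this in place, the compressed file would be read once by a
single map task instead of once per split.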



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

