[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531362#comment-14531362
 ] 

Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

I understand you want the error message to be 'clean'. Normally I would do that 
too.
This message can however only appear if you are using (or have been using for a 
long time) a (usually custom) FileInputFormat that has been corrupting your 
results, perhaps for years (note that I created this bug report about 4.5 
years ago).
I think it is important to clarify the impact of the problem the author of the 
custom code introduced themselves so they immediately understand what went 
wrong.
And yes... this message is a bit over the top ...

At least let's make the message explain clearly what went wrong and make the 
user aware of the historical implications.
How about {{"A split was attempted for a file that is being decompressed by " + 
codec.getClass().getSimpleName() + " which does not support splitting. Note 
that this would have corrupted the data in older Hadoop versions."}}
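
To show where such a message would be raised, here is a minimal, self-contained 
sketch. It is not the attached patch: {{Codec}}, {{SplittableCodec}}, 
{{GzipLikeCodec}}, {{Bzip2LikeCodec}} and {{checkSplit}} are hypothetical 
stand-ins, loosely modelled on Hadoop's CompressionCodec and 
SplittableCompressionCodec, so that the example runs without Hadoop on the 
classpath.

```java
import java.io.IOException;

public class SplitGuardSketch {
    /** Hypothetical stand-in for a compression codec. */
    interface Codec {}

    /** Hypothetical marker for codecs that can start reading mid-file. */
    interface SplittableCodec extends Codec {}

    static final class GzipLikeCodec implements Codec {}
    static final class Bzip2LikeCodec implements SplittableCodec {}

    /** Splitable when uncompressed (null codec) or when the codec supports splitting. */
    static boolean isSplitable(Codec codec) {
        return codec == null || codec instanceof SplittableCodec;
    }

    /** Fails loudly when a partial split is requested for a non-splittable codec. */
    static void checkSplit(Codec codec, long start, long length, long fileLength)
            throws IOException {
        boolean partial = start != 0 || length < fileLength;
        if (partial && !isSplitable(codec)) {
            throw new IOException(
                "A split was attempted for a file that is being decompressed by "
                + codec.getClass().getSimpleName()
                + " which does not support splitting. Note that this would have"
                + " corrupted the data in older Hadoop versions.");
        }
    }

    public static void main(String[] args) throws IOException {
        // Reading the whole file through a non-splittable codec is fine.
        checkSplit(new GzipLikeCodec(), 0L, 1024L, 1024L);
        try {
            // A partial split of the same file must be rejected.
            checkSplit(new GzipLikeCodec(), 64L, 64L, 1024L);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The check is deliberately placed where the split is consumed rather than where 
it is computed, because a custom input format that keeps returning true from 
isSplitable would otherwise never see the error.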

> org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements 
> unsafe default behaviour that is different from the documented behaviour.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2094
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>            Reporter: Niels Basjes
>            Assignee: Niels Basjes
>              Labels: BB2015-05-TBR
>         Attachments: MAPREDUCE-2094-2011-05-19.patch, 
> MAPREDUCE-2094-20140727-svn-fixed-spaces.patch, 
> MAPREDUCE-2094-20140727-svn.patch, MAPREDUCE-2094-20140727.patch, 
> MAPREDUCE-2094-2015-05-05-2328.patch, 
> MAPREDUCE-2094-FileInputFormat-docs-v2.patch
>
>
> When implementing a custom derivative of FileInputFormat we ran into the 
> effect that a large Gzipped input file would be processed several times. 
> A file of nearly 1 GiB would be processed around 36 times in its entirety, 
> producing garbage results and taking up a lot more CPU time than needed.
> It took a while to figure out, and what we found is that the default 
> implementation of the isSplitable method in 
> [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
> http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
>  ] is simply "return true;". 
> This is a very unsafe default and is in contradiction with the JavaDoc of the 
> method which states: "Is the given filename splitable? Usually, true, but if 
> the file is stream compressed, it will not be. " . The actual implementation 
> effectively does "Is the given filename splitable? Always true, even if the 
> file is stream compressed using an unsplittable compression codec. "
> For our situation (where we always have Gzipped input) we took the easy way 
> out and simply implemented an isSplitable in our class that does "return 
> false; "
> Now there are essentially 3 ways I can think of for fixing this (in order of 
> what I would find preferable):
> # Implement something that looks at the compression used for the file (i.e. 
> migrate the implementation from TextInputFormat to FileInputFormat). This 
> would make the method do what the JavaDoc describes.
> # "Force" developers to think about it and make this method abstract.
> # Use a "safe" default (i.e. return false)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
