[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5572:
--------------------------------

    Description: 
The custom RecordReader class in MultiFileWordCount (MultiFileLineRecordReader) 
has been replaced in newer examples by a better implementation based on 
CombineFileInputFormat, which does not have this bug.  The bug nevertheless 
still exists in the 1.x versions of MultiFileWordCount, which rely on the 
mapred API.


The older MultiFileWordCount implementation defines getPos() as follows:

long currentOffset = currentStream == null ? 0 : currentStream.getPos();
...

This is meant to prevent errors when the underlying stream is null, but it is 
not guaranteed to work: RawLocalFileSystem, for example, currently closes the 
underlying file stream once it is consumed, so currentStream throws a 
NullPointerException when it tries to access the null stream.

This is only seen when running in a context where the MapTask class, which is 
relevant only to the mapred.* API, calls getPos() twice in tandem, before and 
after reading a record.

This custom record reader should be guarded, or else eliminated, since it 
assumes something that is not in the FileSystem contract: that getPos() will 
always return an integral value.
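
A minimal sketch of one way such a guard could look, written in plain Java 
without Hadoop on the classpath. PositionedStream, FakeStream and GuardedReader 
are hypothetical names invented for illustration; FakeStream only imitates a 
stream whose internal state becomes null once it is closed and is not how 
RawLocalFileSystem is implemented. The reader caches the last offset it managed 
to read and falls back to it when the stream can no longer report a position:

```java
import java.io.IOException;

class GuardedPosSketch {

    /** Hypothetical stand-in for a stream that can report its offset. */
    interface PositionedStream {
        long getPos() throws IOException;
    }

    /** Hypothetical stream whose internal state becomes null after close(). */
    static class FakeStream implements PositionedStream {
        private long[] state = {17L};

        @Override
        public long getPos() {
            // Throws NullPointerException once state has been set to null.
            return state[0];
        }

        void close() {
            state = null;
        }
    }

    /** Hypothetical reader that remembers the last offset it could read. */
    static class GuardedReader {
        private final PositionedStream currentStream;
        private long lastKnownPos = 0L;

        GuardedReader(PositionedStream currentStream) {
            this.currentStream = currentStream;
        }

        long getPos() {
            if (currentStream != null) {
                try {
                    lastKnownPos = currentStream.getPos();
                } catch (IOException | RuntimeException e) {
                    // Assumption: the stream was closed underneath us,
                    // so the cached offset is the best answer available.
                }
            }
            return lastKnownPos;
        }
    }

    public static void main(String[] args) {
        FakeStream stream = new FakeStream();
        GuardedReader reader = new GuardedReader(stream);
        long before = reader.getPos(); // stream still open
        stream.close();                // simulate the consumed, closed stream
        long after = reader.getPos();  // falls back to the cached offset
        System.out.println(before + " " + after); // prints "17 17"
    }
}
```

The two calls in main imitate the before-and-after getPos() pattern described 
above: the second call returns the cached offset instead of propagating the 
NullPointerException. Catching RuntimeException is deliberately broad for this 
sketch; a real fix might instead track whether the stream has been closed.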



  was:
The custom RecordReader class defines the getPos() as follows:

long currentOffset = currentStream == null ? 0 : currentStream.getPos();
...

This is meant to prevent errors when the underlying stream is null, but it is 
not guaranteed to work: RawLocalFileSystem, for example, currently closes the 
underlying file stream once it is consumed, so currentStream throws a 
NullPointerException when it tries to access the null stream.

This is only seen when running in a context where the MapTask class, which is 
relevant only to the mapred.* API, calls getPos() twice in tandem, before and 
after reading a record.

This custom record reader should be guarded, or else eliminated, since it 
assumes something that is not in the FileSystem contract: that getPos() will 
always return an integral value.

        Summary: Provide alternative logic for getPos() implementation in 
custom RecordReader of mapred implementation of MultiFileWordCount  (was: 
Provide alternative logic for getPos() implementation in custom RecordReader)

> Provide alternative logic for getPos() implementation in custom RecordReader 
> of mapred implementation of MultiFileWordCount
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5572
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5572
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: examples
>    Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.3, 1.2.1, 1.2.2
>            Reporter: jay vyas
>            Priority: Minor
>
> The custom RecordReader class in MultiFileWordCount 
> (MultiFileLineRecordReader) has been replaced in newer examples by a better 
> implementation based on CombineFileInputFormat, which does not have this 
> bug.  The bug nevertheless still exists in the 1.x versions of 
> MultiFileWordCount, which rely on the mapred API.
> The older MultiFileWordCount implementation defines getPos() as follows:
> long currentOffset = currentStream == null ? 0 : currentStream.getPos();
> ...
> This is meant to prevent errors when the underlying stream is null, but it 
> is not guaranteed to work: RawLocalFileSystem, for example, currently closes 
> the underlying file stream once it is consumed, so currentStream throws a 
> NullPointerException when it tries to access the null stream.
> This is only seen when running in a context where the MapTask class, which 
> is relevant only to the mapred.* API, calls getPos() twice in tandem, 
> before and after reading a record.
> This custom record reader should be guarded, or else eliminated, since it 
> assumes something that is not in the FileSystem contract: that getPos() 
> will always return an integral value.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
