[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768433#comment-13768433
 ] 

jay vyas commented on MAPREDUCE-5511:
-------------------------------------

Another note: The newer implementations of MultiFileWordCount in mapreduce.* 
that don't provide a RecordReader.getPos() implementation don't have this 
problem.

So this issue is really also about support for the MultiFileWordCount class.

With new filesystem implementations which mapreduce can work on top of, it is 
important to define the expected semantics of getPos() for FSInputStreams.


                
> Multifilewc and the mapred.* API:  Is the use of getPos() valid?
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-5511
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: examples
>            Reporter: jay vyas
>            Priority: Minor
>
> The MultiFileWordCount class in the Hadoop examples libraries uses a record 
> reader which switches between files.  This behaviour can cause the 
> RawLocalFileSystem to break in a concurrent environment because of the way 
> buffering works (in RawLocalFileSystem, switching between streams results in 
> a temporarily "null" inner stream, and that inner stream is called by the 
> getPos() implementation in the custom RecordReader for MultiFileWordCount). 
> There are basically 2 ways to handle this:
> 1) Wrap the getPos() implementation in the object returned by open() in the 
> RawLocalFileSystem to cache the value of getPos() every time it is called, so 
> that calls to getPos() can return a valid long even if the underlying stream 
> is null. OR
> 2) Update the RecordReader in multifilewc to not rely on the inner input 
> stream and cache the position / return 0 if the stream cannot return a valid 
> value. 
> The final question here is:  Is the RecordReader for MultiFileWordCount doing 
> the right thing?  Or is it breaking the contract of getPos()?  And really, 
> what SHOULD getPos() return if the underlying stream has already been 
> consumed? 
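
For illustration, option 2 above could look roughly like the sketch below. It 
is not taken from the Hadoop sources: PositionedStream and 
CachedPositionTracker are hypothetical names, and the checked IOException that 
the real getPos() methods declare is left out to keep the example short. The 
tracker returns the live position while a stream is open and falls back to the 
last cached value (0 if nothing was ever read) while the inner stream is null.

```java
/** Hypothetical stand-in for a stream that can report its byte offset. */
interface PositionedStream {
    long getPos();
}

/**
 * Sketch of option 2: a reader-side position tracker that never
 * dereferences a missing stream. Illustrative only, not Hadoop API.
 */
class CachedPositionTracker {
    private PositionedStream current;  // null while switching between files
    private long lastKnownPos = 0L;    // 0 until a stream has reported a position

    /** Called when the reader moves to the next file (null between files). */
    void setCurrent(PositionedStream next) {
        current = next;
    }

    /** Returns the live position if a stream is open, else the cached value. */
    long getPos() {
        if (current != null) {
            lastKnownPos = current.getPos();
        }
        return lastKnownPos;
    }
}
```

Whether returning a stale cached value is acceptable is exactly the contract 
question raised above; the sketch only shows that the null dereference can be 
avoided on the reader side.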

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
