[ https://issues.apache.org/jira/browse/MAPREDUCE-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jay vyas updated MAPREDUCE-5511:
--------------------------------
    Affects Version/s: 1.0.0
                       1.2.0

> Multifilewc and the mapred.* API: Is the use of getPos() valid?
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-5511
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: examples
>    Affects Versions: 1.0.0, 1.2.0
>            Reporter: jay vyas
>            Priority: Minor
>
> The MultiFileWordCount class in the Hadoop examples library uses a record
> reader that switches between files. This behaviour can cause the
> RawLocalFileSystem to break in a concurrent environment because of the way
> buffering works: in RawLocalFileSystem, switching between streams leaves the
> inner stream temporarily null, and that inner stream is what the getPos()
> implementation in the custom RecordReader for MultiFileWordCount calls.
>
> There are basically two ways to handle this:
>
> 1) Wrap the getPos() implementation in the object returned by open() in
> RawLocalFileSystem so that it caches the value of getPos() every time it is
> called. Calls to getPos() can then return a valid long even if the
> underlying stream is null.
>
> OR
>
> 2) Update the RecordReader in MultiFileWordCount so that it does not rely on
> the inner input stream, and instead caches the position (or returns 0) when
> the stream cannot return a valid value.
>
> The final question here is: is the RecordReader for MultiFileWordCount doing
> the right thing, or is it breaking the contract of getPos()? And what SHOULD
> getPos() return if the underlying stream has already been consumed?
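
Option 2 above could be sketched roughly as follows. This is a minimal, Hadoop-free illustration only: the class name CachedPositionReader and its methods are hypothetical and are not the actual MultiFileWordCount RecordReader or any Hadoop API. It assumes the reader counts the bytes it consumes itself and that the offset is per file, so getPos() never has to touch the inner stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical reader (not a Hadoop class) that caches its own position. */
class CachedPositionReader {
    private InputStream in;     // null before open() and after close()
    private long lastKnownPos;  // bytes read from the current/last stream

    /** Start reading a new stream; the cached offset restarts at 0 (assumption). */
    void open(InputStream stream) {
        in = stream;
        lastKnownPos = 0L;
    }

    /** Read one byte, updating the cached offset; returns -1 if no stream or EOF. */
    int read() {
        if (in == null) {
            return -1;
        }
        try {
            int b = in.read();
            if (b >= 0) {
                lastKnownPos++;
            }
            return b;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Close and drop the inner stream; the cached offset is kept. */
    void close() {
        if (in == null) {
            return;
        }
        try {
            in.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            in = null;
        }
    }

    /** Never dereferences the inner stream, so it is safe while that stream is null. */
    long getPos() {
        return lastKnownPos;
    }
}

class CachedPositionDemo {
    public static void main(String[] args) {
        CachedPositionReader reader = new CachedPositionReader();
        System.out.println(reader.getPos()); // 0: nothing has been opened yet

        reader.open(new ByteArrayInputStream("ab".getBytes(StandardCharsets.US_ASCII)));
        while (reader.read() >= 0) {
            // consume the whole stream
        }
        reader.close();
        System.out.println(reader.getPos()); // 2: cached value survives close()
    }
}
```

Whether returning the last cached offset (rather than 0, or an exception) is the right answer for a consumed stream is exactly the contract question raised in the issue; the sketch only shows that the null-stream failure can be avoided on the reader side.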