Tim, it's pretty interesting to read; I once dug into this for another user around here. Check out this archive post: http://search-hadoop.com/m/cRmJ3gTtN32 - Make sure to also read the LineReader sources (a layer under the LineRecordReader explained above), where we can also see the beyond-block-boundary fetch happen at the byte level :)
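Not Hadoop's actual code, but a toy sketch of the behavior described above: each split's reader skips the partial first line (unless the split starts at byte 0) and reads past its split's end until the newline that terminates its last record, so every line is read by exactly one mapper. The class name, split size, and sample data are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation (NOT Hadoop source) of how LineRecordReader assigns
// lines to byte-range splits: a line belongs to the split that contains
// its first byte, and the last line may be fetched beyond the boundary.
public class SplitLineDemo {

    // Read the records belonging to the split [start, end).
    static List<String> readSplit(byte[] data, long start, long end) {
        List<String> records = new ArrayList<>();
        int pos = (int) start;
        // Unless we start the file, skip to the byte just after the next
        // '\n' -- the tail of that line belongs to the previous split.
        if (start != 0) {
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        // Emit every line whose first byte lies inside the split; the
        // last line may extend past 'end' (the beyond-boundary fetch).
        while (pos < data.length && pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, lineStart, pos - lineStart));
            pos++; // step over the '\n'
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes();
        int splitSize = 8; // deliberately misaligned with line boundaries
        for (long start = 0; start < data.length; start += splitSize) {
            long end = Math.min(start + splitSize, data.length);
            System.out.println("split [" + start + "," + end + ") -> "
                    + readSplit(data, start, end));
        }
    }
}
```

Running it shows each line landing in exactly one split even though the 8-byte split boundaries cut through lines, which is the point Harsh makes below: splits are pure byte ranges, and the record reader restores line integrity.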
On Wed, Sep 19, 2012 at 10:03 PM, Tim Robertson <timrobertson...@gmail.com> wrote:
> Thanks for the explanation HJ - I always meant to look into that bit of code
> to work out how it did it.
>
> Tim
>
> On Wed, Sep 19, 2012 at 6:24 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Tim,
>>
>> Splits don't look at newlines, in the TextInputFormat at least. So,
>> since the computed splits > default map numbers, I think a perfect
>> file of 10 blocks will spawn only 10 mappers. The mapper's record
>> reader is the one that reads until a newline (even past the end of
>> its block-length bytes).
>>
>> On Wed, Sep 19, 2012 at 9:16 PM, Tim Robertson
>> <timrobertson...@gmail.com> wrote:
>> > I think the splitting recognises the end of line, so you might get 11,
>> > but otherwise that looks correct.
>> >
>> > On Wed, Sep 19, 2012 at 5:42 PM, Pedro Sá da Costa <psdc1...@gmail.com>
>> > wrote:
>> >>
>> >> If I have an input file of 640 MB in size and a split size of 64 MB,
>> >> this file will be partitioned into 10 splits, and each split will be
>> >> processed by a map task, right?
>> >>
>> >> --
>> >> Best regards,
>>
>> --
>> Harsh J

--
Harsh J
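The split count discussed in the quoted thread is just ceiling division of file size by split size; the class and method names below are illustrative, not Hadoop's API, but the arithmetic mirrors what FileInputFormat's split computation comes down to for a plain text file.

```java
// Sketch of the split-count arithmetic from the thread above:
// number of splits = ceil(fileSize / splitSize), each handled by one mapper.
public class SplitCountDemo {

    // Integer ceiling division, avoiding floating point.
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 640 MB file, 64 MB splits -> 10 splits -> 10 map tasks.
        System.out.println(numSplits(640 * mb, 64 * mb));
    }
}
```

Note that a file even one byte over a split boundary yields an extra (small) split, but a newline straddling a boundary does not: the record reader, not the split computation, resolves line boundaries.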