Thanks for the clarification.

On 23 April 2012 12:52, Harsh J <ha...@cloudera.com> wrote:
> Dan,
>
> Split and reading a whole file as a chunk are two slightly different
> things. The former controls whether your files ought to be split across
> mappers (useful if there are multiple blocks of the file in HDFS). The
> latter needs to be achieved differently.
>
> The TextInputFormat provides by default a LineRecordReader, which, as
> its name suggests, reads whatever stream is provided to it line by line.
> This is regardless of the file's block splits (a very different thing
> from line splits).
>
> You need to implement your own "RecordReader" and return it from your
> InputFormat to do what you want it to - i.e. read the whole stream
> into an object and then pass it out to the Mapper.
>
> On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew <wirefr...@googlemail.com>
> wrote:
> > I require each input file to be processed by each mapper as a whole.
> >
> > I subclass o.a.h.mapreduce.lib.input.TextInputFormat and override
> > isSplitable() to invariably return false.
> >
> > The job is configured to use this subclass as the input format class via
> > setInputFormatClass(). The job runs without error, yet the logs reveal
> > files are still processed line by line by the mappers.
> >
> > Any help would be greatly appreciated,
> > Thanks
>
> --
> Harsh J
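To make Harsh's point concrete: overriding isSplitable() only stops a file from being divided across mappers; TextInputFormat still hands each mapper a LineRecordReader, so records remain lines. A custom RecordReader returned from createRecordReader() must buffer the whole stream into a single record. Here is a minimal, framework-free Java sketch of just that buffering step (class and method names are hypothetical; in a real Hadoop RecordReader you would read from the split's FSDataInputStream inside nextKeyValue(), with the byte count taken from FileSplit.getLength()):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Core of a "whole file" record reader: buffer the entire stream into one
// record instead of letting a line-oriented reader split it on newlines.
public class WholeStreamReader {

    // Reads exactly `length` bytes from the stream into a single byte array.
    // DataInputStream.readFully keeps reading until the buffer is full,
    // mirroring what a Hadoop RecordReader would do with the file's stream.
    public static byte[] readWholeStream(InputStream in, long length) throws IOException {
        byte[] contents = new byte[(int) length];
        new DataInputStream(in).readFully(contents);
        return contents;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "line1\nline2\nline3\n".getBytes("UTF-8");
        byte[] record = readWholeStream(new ByteArrayInputStream(data), data.length);
        // The whole multi-line content arrives as one record, not three.
        System.out.println(new String(record, "UTF-8").equals("line1\nline2\nline3\n"));
    }
}
```

In the actual job you would wrap this logic in a RecordReader subclass (emitting, say, a NullWritable key and a BytesWritable value holding the buffered bytes) and return it from your InputFormat's createRecordReader(), alongside the isSplitable() override you already have.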