Thanks for the clarification.

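For anyone finding this thread later, the whole-file pattern Harsh describes below can be sketched roughly as follows. This is a self-contained illustration only: the class names (`WholeFileSketch`, `WholeStreamReader`) are made up, and the Hadoop-specific pieces (extending `org.apache.hadoop.mapreduce.RecordReader`, `FileSplit`, `BytesWritable`) are summarized in comments so the sketch runs on plain Java. The core idea is that `nextKeyValue()` reads the entire stream into a single value exactly once.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the whole-file pattern. In a real MapReduce job this logic
// lives in a class extending
// org.apache.hadoop.mapreduce.RecordReader<NullWritable, BytesWritable>:
// initialize() opens the FileSplit's path, nextKeyValue() reads the whole
// stream into one BytesWritable exactly once, and the matching InputFormat
// overrides isSplitable() to return false so the file is never split.
public class WholeFileSketch {

    // Stand-in for RecordReader.nextKeyValue(): returns true the first
    // time (one record holding the whole stream), false afterwards.
    static class WholeStreamReader {
        private final InputStream in;
        private boolean processed = false;
        private byte[] value;

        WholeStreamReader(InputStream in) {
            this.in = in;
        }

        boolean nextKeyValue() throws IOException {
            if (processed) {
                return false; // the single whole-file record was already emitted
            }
            value = in.readAllBytes(); // whole stream -> one record
            processed = true;
            return true;
        }

        byte[] getCurrentValue() {
            return value;
        }
    }

    public static void main(String[] args) throws IOException {
        // Three "lines" that a LineRecordReader would emit as three records;
        // the whole-stream reader emits them as one.
        String file = "line1\nline2\nline3\n";
        WholeStreamReader reader =
                new WholeStreamReader(new ByteArrayInputStream(file.getBytes()));

        int records = 0;
        while (reader.nextKeyValue()) {
            records++;
            System.out.println("record bytes: " + reader.getCurrentValue().length);
        }
        System.out.println("records: " + records);
    }
}
```

The same `processed` flag trick is what keeps a real whole-file RecordReader from looping: after the single record is consumed, `nextKeyValue()` reports that there is nothing left.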
On 23 April 2012 12:52, Harsh J <ha...@cloudera.com> wrote:

> Dan,
>
> Splitting and reading a whole file as a single chunk are two slightly
> different things. The former controls whether your files ought to be
> split across mappers (useful when a file spans multiple blocks in
> HDFS). The latter needs to be achieved differently.
>
> The TextInputFormat provides a LineRecordReader by default, which, as
> its name suggests, reads whatever stream is provided to it
> line-by-line. This is regardless of the file's block splits (a very
> different thing from line splits).
>
> You need to implement your own "RecordReader" and return it from your
> InputFormat to do what you want - i.e. read the whole stream into a
> single object and then pass it out to the Mapper.
>
> On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew <wirefr...@googlemail.com>
> wrote:
> > I require each input file to be processed as a whole by a single
> > mapper.
> >
> > I subclass org.apache.hadoop.mapreduce.lib.input.TextInputFormat and
> > override isSplitable() to always return false.
> >
> > The job is configured to use this subclass as the input format class via
> > setInputFormatClass(). The job runs without error, yet the logs reveal
> > that the files are still processed line by line by the mappers.
> >
> > Any help would be greatly appreciated,
> > Thanks
>
>
>
> --
> Harsh J
>
