Re: isSplitable() problem

2012-04-24 Thread Dan Drew
I have chosen to use Jay's suggestion as a quick workaround and am pleased
to report that it seems to work well on small test inputs.

My question now is, are the mappers guaranteed to receive the file's lines
in order?

Browsing the source suggests this is so, but I just want to make sure, as my
understanding of Hadoop internals is still insubstantial.

Thank you for your patience in answering my questions.

On 23 April 2012 14:28, Harsh J ha...@cloudera.com wrote:

 Jay,

 On Mon, Apr 23, 2012 at 6:43 PM, JAX jayunit...@gmail.com wrote:
  Curious: it seems like you could aggregate the results in the mapper as a
 local variable or list of strings --- is there a way to know that your
 mapper has just read the LAST line of an input split?

 True. That can be one way to do it (unless aggregation of 'records' needs
 to happen live and you don't wish to store it all in memory).

  Is there a cleanup or finalize method in mappers that is run at the
 end of a whole stream read, to support these sorts of chunked, in-memory
 map/reduce operations?

 Yes there is. See:

 Old API:
 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Mapper.html
 (See Closeable's close())

 New API:
 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html#cleanup(org.apache.hadoop.mapreduce.Mapper.Context)
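 A minimal sketch of the pattern being discussed, assuming the new
 (org.apache.hadoop.mapreduce) API: buffer records in the mapper and emit
 them from cleanup(), which the framework calls once after the last
 record of the split has been passed to map(). Class and field names here
 are illustrative, not from this thread, and buffering the whole split is
 only safe when it fits in memory.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative class name; not from the original thread.
public class AggregatingMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  // Buffers every line of the split in memory; only safe for small splits.
  private final List<String> lines = new ArrayList<String>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    lines.add(value.toString());
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Runs once per task, after the last map() call for this split.
    StringBuilder joined = new StringBuilder();
    for (String line : lines) {
      joined.append(line).append('\n');
    }
    context.write(new Text("aggregated"), new Text(joined.toString()));
  }
}
```

 In the old (org.apache.hadoop.mapred) API, the equivalent hook is the
 close() method inherited from Closeable, per the javadoc link above.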


 --
 Harsh J



isSplitable() problem

2012-04-23 Thread Dan Drew
I require each input file to be processed by each mapper as a whole.

I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
isSplitable() to invariably return false.

The job is configured to use this subclass as the input format class via
setInputFormatClass(). The job runs without error, yet the logs reveal
files are still processed line by line by the mappers.

Any help would be greatly appreciated,
Thanks
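For reference, a sketch of the setup described above: subclass
TextInputFormat and force isSplitable() to return false. The class name is
illustrative. As the replies explain, this only prevents a file from being
split across mappers; the default LineRecordReader still delivers the
file's contents to the mapper one line at a time.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative name: an input format that never splits its files.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // one InputSplit per file, regardless of block count
  }
}
```

It would then be wired into the job with
job.setInputFormatClass(NonSplittableTextInputFormat.class), as the post
describes.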


Re: isSplitable() problem

2012-04-23 Thread Dan Drew
Thanks for the clarification.

On 23 April 2012 12:52, Harsh J ha...@cloudera.com wrote:

 Dan,

 Split and reading a whole file as a chunk are two slightly different
 things. The former controls if your files ought to be split across
 mappers (useful if there are multiple blocks of file in HDFS). The
 latter needs to be achieved differently.

 The TextInputFormat provides by default a LineRecordReader, which, as
 its name suggests, reads whatever stream is provided to it line by line.
 This is regardless of the file's block splits (a very different thing
 from line splits).

 You need to implement your own RecordReader and return it from your
 InputFormat to do what you want it to - i.e. read the whole stream
 into an object and then pass it out to the Mapper.
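 A sketch of the approach described above, assuming the new
 (org.apache.hadoop.mapreduce) API: a custom FileInputFormat whose
 RecordReader emits the entire file as a single record. The class names
 are illustrative; this is the well-known whole-file-reader pattern, not
 code from this thread.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // keep each file in a single split
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  // Reads the whole split (i.e. the whole file) as one record.
  public static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;  // only one record per file
      }
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
  }
}
```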

 On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew wirefr...@googlemail.com
 wrote:
  I require each input file to be processed by each mapper as a whole.
 
  I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
  isSplitable() to invariably return false.
 
  The job is configured to use this subclass as the input format class via
  setInputFormatClass(). The job runs without error, yet the logs reveal
  files are still processed line by line by the mappers.
 
  Any help would be greatly appreciated,
  Thanks



 --
 Harsh J