Counting records

Peter Marron Mon, 23 Jul 2012 07:25:41 -0700

Hi,

I am a complete noob with Hadoop and MapReduce and I have a question that
is probably silly, but I still don't know the answer.


For the purposes of discussion I'll assume that I'm using a standard 
TextInputFormat.
(I don't think that this changes things too much.)

To simplify (a fair bit) I want to count all the records that meet specific 
criteria.
I would like to use MapReduce because I anticipate large sources and I want to
get the performance and reliability that MapReduce offers.

So the obvious and simple approach is to have my Mapper check whether each
record meets the criteria and emit a 0 or a 1. Then I could use a combiner
which accumulates (like a LongSumReducer) and use this as a reducer as well, 
and I am
sure that that would work fine.

However it seems massive overkill to have all those "1"s and "0"s emitted and 
stored on disc.
It seems tempting to have the Mapper accumulate the count for all of the 
records that it
sees and then just emit once at the end the total value. This seems simple 
enough, except
that the Mapper doesn't seem to have any easy way to know when it is presented 
with the
last record.

Now I could just make the Mapper take a copy of the OutputCollector for each 
record called and then in the
close method it could do a single emit. However, although, this looks like it 
would work
with the current implementation, there seem to be no guarantees that the 
collector is valid
at the time that the close is called. This just seems ugly.

Or I could get the Mapper to record the first offset that it sees and read the 
split length
using report.getInputSplit().getLength() and then it could monitor how far it 
is through the
split and it should be able to detect the last record. It looks like the 
MapRunner class creates
a Mapper object and uses it to process a split, and so it looks like it's safe 
to store state in
the mapper class between invocations of the map method. (But is this just an 
implementation
artefact? Is the mapper class supposed to be completely stateless?)

Maybe I should have a custom InputFormat class and have it flag the last record
by placing some extra information in the key? (Assuming that the InputFormant
has enough information from the split to be able to detect the last record,
which seems reasonable enough.)

Is there some "blessed" way to do this? Or am I barking up the wrong tree 
because I should
really just generate all those "1"s and "0"s and accept the overhead?

Regards,

Peter Marron
Trillium Software UK Limited

Counting records

Reply via email to