Yeah, I thought about using counters but I was worried about
what happens if a Mapper task fails. Does the counter get adjusted to
remove any contributions that the failed Mapper made before
a replacement Mapper is started? Otherwise, in the case of any
Mapper failure, I'm going to get an overcount, am I not?
Or is there some way to make sure that counters have
the correct semantics in the face of failures?
Peter Marron
-----Original Message-----
From: Dave Shine
[mailto:Dave.Shine@channelintelligence.com]
Sent: 23 July 2012 15:35
To: common-user@hadoop.apache.org
Subject: RE: Counting records
You could just use a counter and never
emit anything from the map() method. Use
getCounter(MyRecords, RecordTypeToCount).increment(1)
whenever you find the type of record you
are looking for. Never call
output.collect(). Configure the job with
setNumReduceTasks(0). When the job finishes,
you can programmatically get the values
of all counters, including the one you
create in the map() method.
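
The counter approach described above can be sketched in plain Java. This is only an illustration under stated assumptions: `CounterSink` and `InMemoryCounters` are hypothetical stand-ins for Hadoop's `Reporter` and the job's counter store, so that the sketch runs without Hadoop on the classpath, and the `matches` criterion is invented for the example. In a real old-API job the increment would go through the `Reporter` passed to `map()`, and the totals would be read from the finished job's counters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Reporter (assumption, not the real class).
interface CounterSink {
    void incrCounter(String group, String name, long amount);
}

// In-memory counter store used only for this illustration.
class InMemoryCounters implements CounterSink {
    private final Map<String, Long> values = new HashMap<>();

    @Override
    public void incrCounter(String group, String name, long amount) {
        values.merge(group + ":" + name, amount, Long::sum);
    }

    long get(String group, String name) {
        return values.getOrDefault(group + ":" + name, 0L);
    }
}

// Map-only "mapper": bumps a counter for matching records and emits nothing.
class CountingMapper {
    static final String GROUP = "MyRecords";
    static final String NAME = "RecordTypeToCount";

    // Hypothetical criterion, chosen only for the example.
    static boolean matches(String record) {
        return record.startsWith("ERROR");
    }

    static void map(String record, CounterSink reporter) {
        if (matches(record)) {
            reporter.incrCounter(GROUP, NAME, 1);
        }
        // Deliberately no output.collect() call: the job writes no map output.
    }
}
```

With zero reduce tasks there is no shuffle or sort, so the only result of the job is the counter value itself.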
Dave Shine
Sr. Software Engineer
-----Original Message-----
From: Peter Marron
[mailto:Peter.Marron@trilliumsoftware.com]
Sent: Monday, July 23, 2012 10:25 AM
To: common-user@hadoop.apache.org
Subject: Counting records
Hi,
I am a complete noob with Hadoop and
MapReduce and I have a question that is
probably silly, but I still don't know the
answer.
For the purposes of discussion I'll assume
that I'm using a standard
TextInputFormat.
(I don't think that this changes things too
much.)
To simplify (a fair bit) I want to count all
the records that meet specific criteria.
I would like to use MapReduce because I
anticipate large sources and I want to
get the performance and reliability that
MapReduce offers.
So the obvious and simple approach is to
have my Mapper check whether each
record meets the criteria and emit a 0 or
a 1. Then I could use a combiner which
accumulates (like a LongSumReducer)
and use this as a reducer as well, and I
am sure that that would work fine.
However it seems massive overkill to
have all those 1s and 0s emitted and
stored on disc.
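
As a rough illustration of the emit-and-sum approach just described, here is a plain-Java model. It is a sketch, not Hadoop code: the class `EmitAndSum` and the `matches` criterion are hypothetical, and lists stand in for map output. It shows that one value is produced per input record, and that the same summing function can serve as both combiner and reducer because addition is associative and commutative.

```java
import java.util.ArrayList;
import java.util.List;

class EmitAndSum {
    // Hypothetical criterion, chosen only for the example.
    static boolean matches(String record) {
        return record.startsWith("ERROR");
    }

    // Map phase: one value (1 or 0) per input record.
    static List<Long> map(List<String> split) {
        List<Long> out = new ArrayList<>();
        for (String record : split) {
            out.add(matches(record) ? 1L : 0L);
        }
        return out;
    }

    // Used as the combiner (per map task) and as the reducer (across tasks).
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }
}
```

The combiner collapses each task's output to a single number before the shuffle, but the per-record values are still produced on the map side first, which is the overhead in question.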
It seems tempting to have the Mapper
accumulate the count for all of the
records that it sees and then just emit
once at the end the total value. This
seems simple enough, except that the
Mapper doesn't seem to have any easy
way to know when it is presented with
the last record.
Now I could just make the Mapper keep a
reference to the OutputCollector passed
to each map call and then in the close
method it could do a single emit.
However, although this looks like it
would work with the current
implementation, there seem to be no
guarantees that the collector is valid at
the time that close is called. This just
seems ugly.
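
The accumulate-then-emit-in-close idea can be sketched as follows. This is a plain-Java model under stated assumptions: `Collector` is a hypothetical stand-in for Hadoop's `OutputCollector`, and `AccumulatingMapper` and its `matches` criterion are invented for illustration. It does not settle whether the real collector is guaranteed to be usable inside `close()`, which is exactly the open question above.

```java
// Hypothetical stand-in for Hadoop's OutputCollector (assumption).
interface Collector {
    void collect(String key, long value);
}

class AccumulatingMapper {
    private long count = 0;
    private Collector saved; // reference captured from the most recent map() call

    // Hypothetical criterion, chosen only for the example.
    static boolean matches(String record) {
        return record.contains("foo");
    }

    // Called once per record: remember the collector, update the running count.
    void map(String record, Collector output) {
        saved = output;
        if (matches(record)) {
            count++;
        }
    }

    // Called once after the last record: emit the single total.
    void close() {
        if (saved != null) {
            saved.collect("matches", count);
        }
        // If map() was never called (empty split), nothing is emitted.
    }
}
```

Note that the sketch relies on the mapper instance holding state between `map()` calls, so it only works if one instance processes the whole split.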
Or I could get the Mapper to record the
first offset that it sees and read the split
length using
reporter.getInputSplit().getLength() and
then it could monitor how far it is
through the split and it should be able to
detect the last record. It looks like the
MapRunner class creates a Mapper
object and uses it to process a split, and
so it looks like it's safe to store state in
the mapper class between invocations of
the map method. (But is this just an
implementation artefact? Is the mapper
class supposed to be completely
stateless?)
Maybe I should have a custom
InputFormat class and have it flag the
last record by placing some extra
information in the key? (Assuming that
the InputFormat has enough
information from the split to be able to
detect the last record, which seems
reasonable enough.)
Is there some blessed way to do this?
Or am I barking up the wrong tree
because I should really just generate all
those 1s and 0s and accept the
overhead?
Regards,
Peter Marron
Trillium Software UK Limited