If the task fails, the counters for that task attempt are not used.

So if you have speculative execution turned on and the JobTracker (JT) kills the
duplicate task attempt, it won't affect your end results.

Again, the only major caveat is that counters are held in memory, so if you
create a very large number of counters you can run into memory problems.
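
To make the failure semantics concrete, here is a small Python simulation (not
Hadoop code; all names here are hypothetical) of how counter values are
aggregated only from successful task attempts, so failed or speculatively
killed attempts never contribute to the job totals:

```python
from collections import Counter

def aggregate_counters(attempts):
    """Sum counters across task attempts, keeping only successful
    attempts -- mirroring how the framework discards counter
    contributions from failed or killed (e.g. speculative) attempts."""
    totals = Counter()
    for attempt in attempts:
        if attempt["status"] == "SUCCEEDED":
            totals.update(attempt["counters"])
    return dict(totals)

# Task 0 fails once (its partial count of 40 is discarded) and its retry
# succeeds; task 1's speculative duplicate is killed by the JT.
attempts = [
    {"task": 0, "status": "FAILED",    "counters": {"MyRecords.RecordTypeToCount": 40}},
    {"task": 0, "status": "SUCCEEDED", "counters": {"MyRecords.RecordTypeToCount": 100}},
    {"task": 1, "status": "SUCCEEDED", "counters": {"MyRecords.RecordTypeToCount": 55}},
    {"task": 1, "status": "KILLED",    "counters": {"MyRecords.RecordTypeToCount": 30}},
]

print(aggregate_counters(attempts))  # only the two SUCCEEDED attempts count
```

The failed and killed attempts contribute nothing, so the total is 155 rather
than an overcount of 225.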

On Jul 23, 2012, at 4:52 PM, Peter Marron wrote:

> Yeah, I thought about using counters, but I was worried about what happens
> if a Mapper task fails. Does the counter get adjusted to remove any
> contributions that the failed Mapper made before a replacement Mapper is
> started? Otherwise, in the case of any Mapper failure, I'm going to get an
> overcount, am I not?
> 
> Or is there some way to make sure that counters have the correct semantics
> in the face of failures?
> 
> Peter Marron
> 
>> -----Original Message-----
>> From: Dave Shine [mailto:Dave.Shine@channelintelligence.com]
>> Sent: 23 July 2012 15:35
>> To: common-user@hadoop.apache.org
>> Subject: RE: Counting records
>> 
>> You could just use a counter and never emit anything from the map() method.
>> Use reporter.getCounter("MyRecords", "RecordTypeToCount").increment(1)
>> whenever you find the type of record you are looking for.  Never call
>> output.collect().  Run the job with setNumReduceTasks(0).  When the job
>> finishes, you can programmatically get the values of all counters,
>> including the one you create in the map() method.
>> 
>> 
>> Dave Shine
>> Sr. Software Engineer
>> 321.939.5093 direct | 407.314.0122 mobile
>> CI Boost(tm) - Clients Outperform Online(tm) - www.ciboost.com
>> 
>> 
>> -----Original Message-----
>> From: Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>> Sent: Monday, July 23, 2012 10:25 AM
>> To: common-user@hadoop.apache.org
>> Subject: Counting records
>> 
>> Hi,
>> 
>> I am a complete noob with Hadoop and MapReduce and I have a question that
>> is probably silly, but I still don't know the answer.
>> 
>> For the purposes of discussion I'll assume that I'm using a standard
>> TextInputFormat. (I don't think that this changes things too much.)
>> 
>> To simplify (a fair bit): I want to count all the records that meet
>> specific criteria. I would like to use MapReduce because I anticipate
>> large sources and I want to get the performance and reliability that
>> MapReduce offers.
>> 
>> So the obvious and simple approach is to have my Mapper check whether each
>> record meets the criteria and emit a 0 or a 1. Then I could use a combiner
>> which accumulates (like a LongSumReducer) and use this as a reducer as
>> well, and I am sure that that would work fine.
>> 
>> However, it seems massive overkill to have all those "1"s and "0"s emitted
>> and stored on disc. It seems tempting to have the Mapper accumulate the
>> count for all of the records that it sees and then just emit the total
>> value once at the end. This seems simple enough, except that the Mapper
>> doesn't seem to have any easy way to know when it is presented with the
>> last record.
>> 
>> Now I could just make the Mapper take a copy of the OutputCollector for
>> each record, and then in the close method it could do a single emit.
>> However, although this looks like it would work with the current
>> implementation, there seem to be no guarantees that the collector is still
>> valid at the time that close is called. This just seems ugly.
>> 
>> Or I could get the Mapper to record the first offset that it sees and read
>> the split length using reporter.getInputSplit().getLength(), and then it
>> could monitor how far it is through the split and it should be able to
>> detect the last record. It looks like the MapRunner class creates a Mapper
>> object and uses it to process a split, and so it looks like it's safe to
>> store state in the mapper class between invocations of the map method.
>> (But is this just an implementation artefact? Is the mapper class supposed
>> to be completely stateless?)
>> 
>> Maybe I should have a custom InputFormat class and have it flag the last
>> record by placing some extra information in the key? (Assuming that the
>> InputFormat has enough information from the split to be able to detect the
>> last record, which seems reasonable enough.)
>> 
>> Is there some "blessed" way to do this? Or am I barking up the wrong tree
>> because I should really just generate all those "1"s and "0"s and accept
>> the overhead?
>> 
>> Regards,
>> 
>> Peter Marron
>> Trillium Software UK Limited
>> 
> 
> 
> 
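
The counting pattern Dave suggests above (increment a counter in the mapper,
emit nothing, run with zero reducers, read the aggregated counter once the job
finishes) can be sketched language-neutrally. A minimal Python simulation of
that flow — hypothetical helper names, not the Hadoop API:

```python
from collections import Counter

def count_only_mapper(record, counters):
    """Mapper that emits no output; it only increments a counter when the
    record matches the criteria (here, for illustration: contains 'ERROR')."""
    if "ERROR" in record:
        counters["MyRecords.RecordTypeToCount"] += 1
    return []  # never collect any map output

def run_map_only_job(records, num_tasks=3):
    """Simulate a map-only job (zero reducers): split the input across
    tasks, run the mapper on each split, and aggregate the per-task
    counters at the end, as the framework does on task success."""
    total = Counter()
    for task_id in range(num_tasks):
        task_counters = Counter()
        for record in records[task_id::num_tasks]:  # this task's split
            count_only_mapper(record, task_counters)
        total.update(task_counters)
    return total

records = ["ok", "ERROR: disk", "ok", "ERROR: net", "ok", "ERROR: cpu"]
print(run_map_only_job(records)["MyRecords.RecordTypeToCount"])  # -> 3
```

No map output ever hits disc; the only thing that crosses task boundaries is
the counter values, which is exactly why this avoids the "1"s-and-"0"s
overhead discussed in the thread.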
