Counting records

2012-07-23 Thread Peter Marron
Hi,

I am a complete noob with Hadoop and MapReduce and I have a question that
is probably silly, but I still don't know the answer.

For the purposes of discussion I'll assume that I'm using a standard 
TextInputFormat.
(I don't think that this changes things too much.)

To simplify (a fair bit) I want to count all the records that meet specific 
criteria.
I would like to use MapReduce because I anticipate large sources and I want to
get the performance and reliability that MapReduce offers.

So the obvious and simple approach is to have my Mapper check whether each
record meets the criteria and emit a 0 or a 1. Then I could use a combiner
which accumulates (like a LongSumReducer), and use this as the reducer as well,
and I am sure that would work fine.
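For what it's worth, a minimal sketch of that approach in the old (mapred) API might look like the following. The matches() predicate here is a hypothetical stand-in for the real criteria, and emitting only on a match (rather than a 0 or a 1 for every record) already trims the output a fair bit:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MatchCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    // Hypothetical predicate standing in for the real criteria.
    static boolean matches(String line) {
        return line.contains("ERROR");
    }

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<NullWritable, LongWritable> out,
                    Reporter reporter) throws IOException {
        if (matches(line.toString())) {
            out.collect(NullWritable.get(), ONE);  // one 1 per matching record
        }
    }
}
```

With conf.setCombinerClass(LongSumReducer.class) and conf.setReducerClass(LongSumReducer.class) (org.apache.hadoop.mapred.lib.LongSumReducer), the 1s are summed map-side before they ever hit disk, so the overhead is smaller than the naive description suggests.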

However it seems massive overkill to have all those 1s and 0s emitted and 
stored on disc.
It seems tempting to have the Mapper accumulate the count for all of the
records that it sees and then just emit the total value once at the end. This
seems simple enough, except that the Mapper doesn't seem to have any easy way
to know when it is presented with the last record.

Now I could just have the Mapper keep a reference to the OutputCollector passed
in with each record, and then in the close method it could do a single emit.
However, although this looks like it would work with the current implementation,
there seem to be no guarantees that the collector is still valid at the time
that close is called. This just seems ugly.
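In the newer mapreduce API this pattern is actually sanctioned: Mapper exposes setup(), map() and cleanup(), all taking the same Context, so accumulating in map() and emitting once from cleanup() is safe (the "in-mapper combining" pattern). A sketch, again with a hypothetical matches() predicate:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper
        extends Mapper<LongWritable, Text, NullWritable, LongWritable> {

    private long count = 0;  // per-task state; one Mapper instance per split

    // Hypothetical predicate standing in for the real criteria.
    static boolean matches(String line) {
        return line.contains("ERROR");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        if (matches(line.toString())) {
            count++;  // no per-record emit at all
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // One output record per map task instead of one per input record.
        context.write(NullWritable.get(), new LongWritable(count));
    }
}
```

A single reducer (org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer works) then adds up one number per split, which is about as little intermediate data as a MapReduce job can produce.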

Or I could get the Mapper to record the first offset that it sees, read the
split length using reporter.getInputSplit().getLength(), and then monitor how
far it is through the split so that it can detect the last record. It looks
like the MapRunner class creates a single Mapper object and uses it to process
a whole split, and so it looks like it's safe to store state in the mapper
class between invocations of the map method. (But is this just an
implementation artefact? Is the mapper class supposed to be completely
stateless?)

Maybe I should have a custom InputFormat class and have it flag the last record
by placing some extra information in the key? (Assuming that the InputFormat
has enough information from the split to be able to detect the last record,
which seems reasonable enough.)

Is there some blessed way to do this? Or am I barking up the wrong tree 
because I should
really just generate all those 1s and 0s and accept the overhead?

Regards,

Peter Marron
Trillium Software UK Limited



Re: Counting records

2012-07-23 Thread Kai Voigt
Hi,

An additional idea is to use the counter API inside the framework.

http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/ has 
a good example.

Kai

Am 23.07.2012 um 16:25 schrieb Peter Marron:

 I am a complete noob with Hadoop and MapReduce and I have a question that
 is probably silly, but I still don't know the answer.
 
 To simplify (a fair bit) I want to count all the records that meet specific 
 criteria.
 I would like to use MapReduce because I anticipate large sources and I want to
 get the performance and reliability that MapReduce offers.

-- 
Kai Voigt
k...@123.org






RE: Counting records

2012-07-23 Thread Dave Shine
You could just use a counter and never emit anything from the map().  Use
getCounter("MyRecords", "RecordTypeToCount").increment(1) whenever you find the
type of record you are looking for.  Never call output.collect().  Call the job
with reduceTasks(0).  When the job finishes, you can programmatically get the
values of all counters, including the one you create in the map() method.
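Sketched out with the new mapreduce API (group/counter names and the matches() predicate are made up; on Hadoop 1.x, new Job(conf, "count-matching") replaces Job.getInstance):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class CounterOnlyJob {

    // Hypothetical predicate standing in for the real criteria.
    static boolean matches(String line) {
        return line.contains("ERROR");
    }

    public static class CountingMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            if (matches(line.toString())) {
                // Dynamic (string-named) counter; nothing is ever emitted.
                context.getCounter("MyRecords", "Matched").increment(1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "count-matching");
        job.setJarByClass(CounterOnlyJob.class);
        job.setMapperClass(CountingMapper.class);
        job.setNumReduceTasks(0);                       // map-only job
        job.setOutputFormatClass(NullOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.waitForCompletion(true);

        // Read the aggregated counter back on the client side.
        long matched = job.getCounters()
                          .findCounter("MyRecords", "Matched").getValue();
        System.out.println("Matched records: " + matched);
    }
}
```

Note that the framework only aggregates counters from successful task attempts into the job totals, which bears on the failure question raised later in this thread.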


Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-Original Message-
From: Peter Marron [mailto:peter.mar...@trilliumsoftware.com]
Sent: Monday, July 23, 2012 10:25 AM
To: common-user@hadoop.apache.org
Subject: Counting records

[original message quoted in full; snipped]


The information contained in this email message is considered confidential and 
proprietary to the sender and is intended solely for review and use by the 
named recipient. Any unauthorized review, use or distribution is strictly 
prohibited. If you have received this message in error, please advise the 
sender by reply email and delete the message.


Re: Counting records

2012-07-23 Thread Michael Segel
Look at using a dynamic counter. 

You don't need to set up or declare an enum. 
The only caveat is that counters are passed back to the JobTracker by each task
and are stored in memory. 
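For contrast, a sketch of the two flavours side by side (new-API, names hypothetical); the dynamic form just replaces the declared enum with two strings:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EnumCounterMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    // Declared counter: grouped under the enum's class name in the job UI.
    public enum MyRecords { MATCHED }

    // Hypothetical predicate standing in for the real criteria.
    static boolean matches(String line) {
        return line.contains("ERROR");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        if (matches(line.toString())) {
            // Enum form, declared up front:
            context.getCounter(MyRecords.MATCHED).increment(1);
            // Dynamic form, no enum needed -- just two strings:
            // context.getCounter("MyRecords", "Matched").increment(1);
        }
    }
}
```

The in-memory caveat matters most for dynamic counters: every distinct group/name pair costs memory on the JobTracker, so deriving counter names from high-cardinality data (one counter per key, say) is a bad idea.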


On Jul 23, 2012, at 9:32 AM, Kai Voigt wrote:

 http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/



RE: Counting records

2012-07-23 Thread Peter Marron
Yeah, I thought about using counters, but I was worried about
what happens if a Mapper task fails. Does the counter get adjusted to
remove any contributions that the failed Mapper made before
a replacement Mapper is started? Otherwise, in the case of any
Mapper failure, I'm going to get an overcount, am I not?

Or is there some way to make sure that counters have
the correct semantics in the face of failures?

Peter Marron

 -Original Message-
 From: Dave Shine [mailto:Dave.Shine@channelintelligence.com]
 Sent: 23 July 2012 15:35
 To: common-user@hadoop.apache.org
 Subject: RE: Counting records
 
 [quoted messages snipped]




Re: Counting records

2012-07-23 Thread Michael Segel
If the task fails, the counters for that task are not used. 

So if you have speculative execution turned on and the JT kills a task, it 
won't affect your end results.  

Again, the only major caveat is that the counters are held in memory, so if you
have a lot of counters... 

On Jul 23, 2012, at 4:52 PM, Peter Marron wrote:

 Yeah, I thought about using counters but I was worried about
 what happens if a Mapper task fails. Does the counter get adjusted to
 remove any contributions that the failed Mapper made before
 another replacement Mapper is started? Otherwise in the case of any
 Mapper failure I'm going to get an overcount am I not?
 
 Or is there some way to make sure that counters have
 the correct semantics in the face of failures?
 
 Peter Marron
 
 [earlier quoted messages snipped]