RE: Counting records

2012-07-23 Thread Dave Shine
You could just use a counter and never emit anything from the map().  Call 
getCounter("MyRecords", "RecordTypeToCount").increment(1) on the Reporter 
whenever you find the type of record you are looking for.  Never call 
output.collect().  Run the job with setNumReduceTasks(0).  When the job 
finishes, you can programmatically read the values of all counters, including 
the one you incremented in the map() method.
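
Roughly like this (a sketch against the old org.apache.hadoop.mapred API; the 
"MyRecords"/"RecordTypeToCount" names and the matches() check are just 
placeholders):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;

public class CounterOnlyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, NullWritable> output,
                  Reporter reporter) throws IOException {
    if (matches(value)) {
      // Count the record; nothing is ever emitted.
      reporter.getCounter("MyRecords", "RecordTypeToCount").increment(1);
    }
  }

  private boolean matches(Text record) {
    return record.toString().contains("needle");  // placeholder criterion
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CounterOnlyMapper.class);
    conf.setJobName("count-records");
    conf.setMapperClass(CounterOnlyMapper.class);
    conf.setNumReduceTasks(0);  // map-only job
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    RunningJob job = JobClient.runJob(conf);
    // Read the counter back once the job has finished.
    long count = job.getCounters()
        .findCounter("MyRecords", "RecordTypeToCount").getCounter();
    System.out.println("matching records: " + count);
  }
}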


Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-----Original Message-----
From: Peter Marron [mailto:peter.mar...@trilliumsoftware.com]
Sent: Monday, July 23, 2012 10:25 AM
To: common-user@hadoop.apache.org
Subject: Counting records

Hi,

I am a complete noob with Hadoop and MapReduce and I have a question that is 
probably silly, but I still don't know the answer.

For the purposes of discussion I'll assume that I'm using a standard 
TextInputFormat.
(I don't think that this changes things too much.)

To simplify (a fair bit) I want to count all the records that meet specific 
criteria.
I would like to use MapReduce because I anticipate large sources and I want to 
get the performance and reliability that MapReduce offers.

So the obvious and simple approach is to have my Mapper check whether each 
record meets the criteria and emit a 0 or a 1. Then I could use an 
accumulating combiner (like LongSumReducer) and use it as the reducer as 
well, and I am sure that would work fine.
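
Concretely, I mean something like this (a sketch against the old mapred API; 
the matches() predicate is a placeholder):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OneOrZeroMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Text COUNT_KEY = new Text("count");
  private static final LongWritable ONE = new LongWritable(1);
  private static final LongWritable ZERO = new LongWritable(0);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    // Emit 1 for a match, 0 otherwise; the combiner/reducer sums them.
    output.collect(COUNT_KEY, matches(value) ? ONE : ZERO);
  }

  private boolean matches(Text record) {
    return record.toString().contains("needle");  // placeholder criterion
  }
}

with LongSumReducer wired in as both combiner and reducer in the driver:

conf.setCombinerClass(org.apache.hadoop.mapred.lib.LongSumReducer.class);
conf.setReducerClass(org.apache.hadoop.mapred.lib.LongSumReducer.class);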

However, it seems massive overkill to have all those 1s and 0s emitted and 
stored on disc.
It seems tempting to have the Mapper accumulate the count for all of the 
records that it sees and then just emit the total once, at the end. This 
seems simple enough, except that the Mapper doesn't seem to have any easy way 
to know when it is presented with the last record.

Now I could just have the Mapper keep a reference to the OutputCollector 
passed in with each record, and then in the close method it could do a single 
emit. However, although this looks like it would work with the current 
implementation, there seems to be no guarantee that the collector is still 
valid at the time that close is called. This just seems ugly.
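
To be explicit, the pattern I mean is this (sketch, old mapred API, matches() 
as before):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class AccumulatingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private OutputCollector<Text, LongWritable> out;  // cached per task
  private long count = 0;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    out = output;  // remember the collector so close() can use it
    if (matches(value)) {
      count++;
    }
  }

  @Override
  public void close() throws IOException {
    // One emit per task; relies on the collector still being usable here.
    if (out != null) {
      out.collect(new Text("count"), new LongWritable(count));
    }
  }

  private boolean matches(Text record) {
    return record.toString().contains("needle");  // placeholder criterion
  }
}

(For what it's worth, the newer org.apache.hadoop.mapreduce API does sanction 
this pattern: Mapper.cleanup(Context) runs once after the last record, and 
the Context is still valid for writing there.)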

Or I could get the Mapper to record the first offset that it sees and read 
the split length using reporter.getInputSplit().getLength(); then it could 
monitor how far it is through the split, and it should be able to detect the 
last record. It looks like the MapRunner class creates a single Mapper object 
and uses it to process a whole split, so it looks like it's safe to store 
state in the mapper class between invocations of the map method. (But is this 
just an implementation artefact? Is the mapper class supposed to be 
completely stateless?)
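
The check I have in mind is roughly this (sketch only; it assumes 
TextInputFormat's byte-offset keys, one mapper per FileSplit, and that no 
record straddles the split boundary, none of which may be guaranteed):

// Inside map(), old mapred API (uses org.apache.hadoop.mapred.FileSplit):
FileSplit split = (FileSplit) reporter.getInputSplit();
long splitEnd = split.getStart() + split.getLength();
// key is the byte offset of this line, value.getLength() its byte length.
boolean probablyLast = key.get() + value.getLength() + 1 >= splitEnd;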

Maybe I should have a custom InputFormat class and have it flag the last 
record by placing some extra information in the key? (Assuming that the 
InputFormat has enough information from the split to be able to detect the 
last record, which seems reasonable enough.)

Is there some blessed way to do this? Or am I barking up the wrong tree 
because I should really just generate all those 1s and 0s and accept the 
overhead?

Regards,

Peter Marron
Trillium Software UK Limited




RE: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Dave Shine
I've just started playing with the Fair Scheduler.  To specify the pool at 
job submission time, set the mapred.fairscheduler.pool property on the 
JobConf to the name of the pool you want the job to use.
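
Something like this (sketch; "myPool" and the rest of the job setup are 
placeholders):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitToPool {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitToPool.class);
    // Tell the Fair Scheduler which pool this job belongs to.
    conf.set("mapred.fairscheduler.pool", "myPool");
    // ... mapper class, input/output paths, etc. ...
    JobClient.runJob(conf);
  }
}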

Dave


-----Original Message-----
From: Merto Mertek [mailto:masmer...@gmail.com]
Sent: Thursday, March 01, 2012 9:33 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool

From the fairscheduler docs I assume the following should work:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

which means that the default pool will be the group of the user that 
submitted the job. In your case I think that allocations.xml is correct. If 
you want to explicitly assign a job to a specific pool from your 
allocations.xml file, you can do it as follows:

Configuration conf3 = conf;
conf3.set("pool.name", "pool3"); // conf.set("property.name", "value")

Let me know if it works..


On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote:

 How can I set the fair scheduler such that all jobs submitted from a
 particular user group go to a pool with the group name?

 I have setup fair scheduler and I have two users: A and B (belonging
 to the user group hadoop)

 When these users submit hadoop jobs, the jobs from A go to a pool
 named A and the jobs from B go to a pool named B.
 I want them to go to a pool with their group name, so I tried adding
 the following to mapred-site.xml:

 <property>
   <name>mapred.fairscheduler.poolnameproperty</name>
   <value>group.name</value>
 </property>

 But instead the jobs now go to the default pool.
 I want the jobs submitted by A and B to go to the pool named hadoop.
 How do I do that?
 Also, how can I explicitly set a job to any specified pool?

 I have set the allocation file (fair-scheduler.xml) like this:

 <allocations>
   <pool name="hadoop">
     <minMaps>1</minMaps>
     <minReduces>1</minReduces>
     <maxMaps>3</maxMaps>
     <maxReduces>3</maxReduces>
   </pool>
   <userMaxJobsDefault>5</userMaxJobsDefault>
 </allocations>

 Any help is greatly appreciated.
 Thanks,
 Austin

