The counter for reduce input records is updated as the reduces consume
records, so reading it from within a reduce is meaningless. Relying on
counters for job correctness (particularly counters from the same job)
is risky at best. Neither maps nor reduces are given a set number of
records unless this is specifically engineered. If you want the total
number of records output from the maps to be available in the reduce,
you'll probably need to write a MapRunnable, track records output using
a filter OutputCollector, and, before you exit the run method, emit
that count once for each reduce, with key/value types matching those of
your job (you'll likely need a custom partitioner to recognize that
each count record is special and belongs to a particular reduce, or
well-designed hashCode functions). You'll also need a custom comparator
that sorts count records before data records while preserving the
ordering of your data. Finally, you'll want a grouping comparator so
that all the counts arrive at the front of your reduce as a single set
of records. The reduce can then collapse these records into the total
map output record count before processing the data that follows. Note
that this count may not match the number of reduce input records if
you're running with a combiner.
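The filter-collector and comparator steps described above can be sketched
in plain Java. This is a simplified, self-contained illustration, not
working Hadoop code: the `Collector` interface stands in for
`org.apache.hadoop.mapred.OutputCollector`, and the class and field names
(`CountingCollector`, `MarkedKey`, `isCount`) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal stand-in for Hadoop's OutputCollector so the sketch is
// self-contained; a real job would use the Hadoop interface instead.
interface Collector<K, V> {
    void collect(K key, V value);
}

// The "filter" OutputCollector: wraps the real collector and counts every
// record the map emits. In the MapRunnable's run(), you would emit the
// final count as a special record, once per reduce, before returning.
class CountingCollector<K, V> implements Collector<K, V> {
    private final Collector<K, V> wrapped;
    private long count = 0;

    CountingCollector(Collector<K, V> wrapped) {
        this.wrapped = wrapped;
    }

    public void collect(K key, V value) {
        count++;
        wrapped.collect(key, value);
    }

    long getCount() {
        return count;
    }
}

// A map output key that can be flagged as a "count" record. The comparator
// sorts count records before all data records while preserving the data's
// natural order, mirroring the custom comparator described above; a
// grouping comparator would additionally treat all count keys as equal so
// the reduce sees the counts as one set of values.
class MarkedKey {
    final boolean isCount;
    final String key;

    MarkedKey(boolean isCount, String key) {
        this.isCount = isCount;
        this.key = key;
    }

    static final Comparator<MarkedKey> COMPARATOR = (a, b) -> {
        if (a.isCount != b.isCount) {
            return a.isCount ? -1 : 1; // count records sort first
        }
        return a.key.compareTo(b.key); // data keeps its ordering
    };
}

class FilterCollectorDemo {
    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        CountingCollector<String, Integer> out =
            new CountingCollector<>((k, v) -> sink.add(k + "=" + v));

        // The "map" emits a few records through the filter collector.
        out.collect("apple", 1);
        out.collect("banana", 1);
        out.collect("cherry", 1);
        System.out.println("records emitted: " + out.getCount());

        // Count records sort ahead of data records.
        List<MarkedKey> keys = new ArrayList<>();
        keys.add(new MarkedKey(false, "banana"));
        keys.add(new MarkedKey(true, "#count"));
        keys.add(new MarkedKey(false, "apple"));
        keys.sort(MarkedKey.COMPARATOR);
        for (MarkedKey k : keys) {
            System.out.println((k.isCount ? "COUNT " : "DATA  ") + k.key);
        }
    }
}
```

In the real job, the comparator and grouping comparator would be set on the
JobConf and operate on your Writable key type rather than a plain class.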
If your job is deterministic, you can write a version of your map that
emits counts, follow the wordcount example to get a map output record
count, read and store the result in the config, and start your own job
with this information.
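The hand-off step (store the count from the first job, read it in the
second) can be sketched roughly as below. This uses java.util.Properties
as a stand-in for Hadoop's JobConf (a real driver would call something
like conf.setLong/getLong on the second job's configuration), and the
property name "map.output.records" is made up for illustration.

```java
import java.util.Properties;

// Stand-in for JobConf: the driver of the counting job stores the map
// output record count, and the driver of the real job reads it back.
class ConfigHandoff {
    static final Properties conf = new Properties();

    // Called after the counting job finishes (count name is hypothetical).
    static void storeCount(long mapOutputRecords) {
        conf.setProperty("map.output.records",
                         Long.toString(mapOutputRecords));
    }

    // Called when configuring the second job.
    static long readCount() {
        return Long.parseLong(conf.getProperty("map.output.records", "0"));
    }
}
```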
If it's only a sanity check and not part of your computation, you
probably want to query and verify it from your driver when the job is
finished. An example is in TestReduceFetch. -C
On Sep 25, 2008, at 4:52 PM, Sandy wrote:
Thanks, Lohit.
I took a look at the Task_Counter.properties file and figured out that
I would like to use REDUCE_INPUT_RECORDS. I want to access this within
my reduce function, just to check the value.
In order to do this, I tried to include
import org.apache.hadoop.mapred.Task;
and added the following to my reduce function code:
reduce() {
    ...
    Counters counter = new Counters();
    long reduce_records =
        counter.getCounter(Task.Counter.REDUCE_INPUT_RECORDS);
    ...
}
However, it seems that I'm not allowed to do this. I get the following
when I try to compile:
error 1: org.apache.hadoop.mapred.Task is not public in
org.apache.hadoop.mapred; cannot be accessed from outside package
error 2: package Task does not exist
I suspect that the second error is a result of the first. All I want to
do is access the value of this counter. Could you please explain how
this can be accomplished?
Thanks in advance,
-SM
On Wed, Sep 24, 2008 at 7:40 PM, lohit <[EMAIL PROTECTED]> wrote:
Yes, take a look at
src/mapred/org/apache/hadoop/mapred/Task_Counter.properties
Those are all the counters available for a task.
-Lohit
----- Original Message ----
From: Sandy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, September 24, 2008 5:09:39 PM
Subject: counter for number of mapper records
If I understand correctly, each mapper is sent a set number of records.
Is there a counter or variable that tells you how many records are sent
to a particular mapper?
Likewise, is there a similar thing for reducers?
Thanks in advance.
-SM