The counter for reduce input records is updated as the reduces consume records, so reading it from inside a reduce is meaningless. Relying on counters for job correctness (particularly counters from the same job) is risky at best.

Neither maps nor reduces are given a set number of records, unless this is specifically engineered. If you want the total number of records output from the maps available in the reduce, you'll probably need to write a MapRunnable, track records output using a filtering OutputCollector, and, before you exit the run method, emit that count for each reduce, with key/value types matching those of your job (you'll likely need a custom partitioner to recognize that each count record is special and belongs to a particular reduce, or well-designed hashCode functions). You'll also need a custom comparator that sorts counts before data, while preserving the ordering of your data. Finally, you'll want a grouping comparator so all the counts arrive at the front of your reduce as a single set of records. The reduce will collapse these records into the total map output record count before processing the data following it. Note that this may not match the number of reduce input records if you're running with a combiner.
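The filtering-OutputCollector part of the above can be sketched as follows. This is a standalone toy, not the real Hadoop interface: OutputCollector is stubbed inline (with the same shape as Hadoop's) so the counting logic is visible without a cluster, and the class names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stub with the same shape as org.apache.hadoop.mapred.OutputCollector,
// so this sketch is self-contained.
interface OutputCollector<K, V> {
    void collect(K key, V value);
}

// Wraps a real collector and counts every record passed through it.
class CountingCollector<K, V> implements OutputCollector<K, V> {
    private final OutputCollector<K, V> inner;
    private long count = 0;

    CountingCollector(OutputCollector<K, V> inner) {
        this.inner = inner;
    }

    public void collect(K key, V value) {
        count++;
        inner.collect(key, value);
    }

    long getCount() {
        return count;
    }
}

public class CountingCollectorDemo {
    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        CountingCollector<String, Integer> out =
            new CountingCollector<>((k, v) -> sink.add(k + "=" + v));
        out.collect("a", 1);
        out.collect("b", 2);
        out.collect("c", 3);
        // In a MapRunnable, this is the count you would emit as a final
        // special record per reduce before run() returns.
        System.out.println(out.getCount()); // prints 3
    }
}
```

In the real thing you would pass this wrapper to your mapper from inside the MapRunnable's run method, then emit one count record per reduce at the end.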

If your job is deterministic, you can write a version of your map that emits counts, follow the wordcount example to get a map output record count, read and store the result in the config, and start your own job with this information.
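Passing the count from the first job into the second is just a configuration property. A minimal sketch against the old org.apache.hadoop.mapred API; the property name "my.map.record.count" is made up for illustration:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CountPassing {

    // Driver side: after the counting job finishes, stash the total
    // in the config of the real job before submitting it.
    static void storeCount(JobConf conf, long totalMapRecords) {
        conf.setLong("my.map.record.count", totalMapRecords);
    }

    // Reducer side: pick the value up in configure().
    public static class CountAwareReducer extends MapReduceBase {
        private long expectedRecords;

        public void configure(JobConf job) {
            expectedRecords = job.getLong("my.map.record.count", -1L);
        }
    }
}
```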

If it's only a sanity check and not part of your computation, you probably want to query and verify it from your driver once the job has finished. There's an example in TestReduceFetch. -C
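A driver-side check looks roughly like this with the old mapred API. A sketch only: the counter group/name strings are assumptions to verify against your Hadoop version's Task_Counter.properties, and this won't compile without the Hadoop jars on the classpath.

```java
// Sketch: query job counters from the driver after the job completes,
// using the old org.apache.hadoop.mapred API.
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CounterCheck {
    public static void runAndVerify(JobConf conf) throws Exception {
        RunningJob job = JobClient.runJob(conf); // blocks until completion
        Counters counters = job.getCounters();
        // Group name assumed; check it against your version's properties file.
        Counters.Group taskCounters =
            counters.getGroup("org.apache.hadoop.mapred.Task$Counter");
        long mapOut = taskCounters.getCounter("MAP_OUTPUT_RECORDS");
        long reduceIn = taskCounters.getCounter("REDUCE_INPUT_RECORDS");
        // Sanity check: without a combiner these should match.
        if (mapOut != reduceIn) {
            throw new IllegalStateException(
                "map output " + mapOut + " != reduce input " + reduceIn);
        }
    }
}
```

Note that this sidesteps the access problem below: the counter is read by group and name strings from the driver, not via the package-private Task class from inside a task.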

On Sep 25, 2008, at 4:52 PM, Sandy wrote:

Thanks, Lohit.

I took a look at the Task_Counter.properties file and figured out that I
would like to use REDUCE_INPUT_RECORDS.

I want to access this within my reduce function, just to check the value.

In order to do this, I tried to include
import org.apache.hadoop.mapred.Task;

and I added the following to my reduce function code:

reduce() {
    ...
    Counters counter = new Counters();
    long reduce_records = counter.getCounter(Task.Counter.REDUCE_INPUT_RECORDS);
    ...
}

However, it seems that I'm not allowed to do this. I get the following when
I try to compile:
error 1: org.apache.hadoop.mapred.Task is not public in
org.apache.hadoop.mapred; cannot be accessed from outside package
error 2: package Task does not exist

I suspect that the second error is a result from the first. All I want to do is access the value of this counter. Could you please explain how this can
be accomplished?

Thanks in advance,

-SM

On Wed, Sep 24, 2008 at 7:40 PM, lohit <[EMAIL PROTECTED]> wrote:

Yes, take a look at
src/mapred/org/apache/hadoop/mapred/Task_Counter.properties

Those are all the counters available for a task.
-Lohit



----- Original Message ----
From: Sandy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, September 24, 2008 5:09:39 PM
Subject: counter for number of mapper records

If I understand correctly, each mapper is sent a set number of records. Is there a counter or variable that tells you how many records are sent to a
particular mapper?
Likewise, is there a similar thing for reducers?

Thanks in advance.

-SM


