The counter for reduce input records is updated as the reduces consume
records, so reading it from within a reduce is meaningless. Relying on
counters for job correctness (particularly counters from the same job)
is risky at best. Neither maps nor reduces are given a set number of
records unless this is specifically engineered. If you want the total
number of records output from the maps to be available in the reduce,
you'll probably need to write a MapRunnable, track records output using
a filter OutputCollector, and, before you exit the run method, emit
that count once for each reduce, with key/value types matching those of
your job (you'll likely need a custom partitioner to recognize that
each count record is special and belongs to a particular reduce, or
well-designed hashCode functions). You'll also need a custom comparator
that sorts count records before data records while preserving the
ordering of your data. Finally, you'll want a grouping comparator so
that all the counts arrive at the front of your reduce as a single set
of records. The reduce can then collapse these records into the total
map output record count before processing the data that follows. Note
that this count may not match the number of reduce input records if
you're running with a combiner.
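The filter-collector and comparator steps described above can be sketched
in plain Java. This is a simplified, self-contained illustration, not
working Hadoop code: the `Collector` interface stands in for
`org.apache.hadoop.mapred.OutputCollector`, and the class and field names
(`CountingCollector`, `MarkedKey`, `isCount`) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal stand-in for Hadoop's OutputCollector so the sketch is
// self-contained; a real job would use the Hadoop interface instead.
interface Collector<K, V> {
    void collect(K key, V value);
}

// The "filter" OutputCollector: wraps the real collector and counts every
// record the map emits. In the MapRunnable's run(), you would emit the
// final count as a special record, once per reduce, before returning.
class CountingCollector<K, V> implements Collector<K, V> {
    private final Collector<K, V> wrapped;
    private long count = 0;

    CountingCollector(Collector<K, V> wrapped) {
        this.wrapped = wrapped;
    }

    public void collect(K key, V value) {
        count++;
        wrapped.collect(key, value);
    }

    long getCount() {
        return count;
    }
}

// A map output key that can be flagged as a "count" record. The comparator
// sorts count records before all data records while preserving the data's
// natural order, mirroring the custom comparator described above; a
// grouping comparator would additionally treat all count keys as equal so
// the reduce sees the counts as one set of values.
class MarkedKey {
    final boolean isCount;
    final String key;

    MarkedKey(boolean isCount, String key) {
        this.isCount = isCount;
        this.key = key;
    }

    static final Comparator<MarkedKey> COMPARATOR = (a, b) -> {
        if (a.isCount != b.isCount) {
            return a.isCount ? -1 : 1; // count records sort first
        }
        return a.key.compareTo(b.key); // data keeps its ordering
    };
}

class FilterCollectorDemo {
    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        CountingCollector<String, Integer> out =
            new CountingCollector<>((k, v) -> sink.add(k + "=" + v));

        // The "map" emits a few records through the filter collector.
        out.collect("apple", 1);
        out.collect("banana", 1);
        out.collect("cherry", 1);
        System.out.println("records emitted: " + out.getCount());

        // Count records sort ahead of data records.
        List<MarkedKey> keys = new ArrayList<>();
        keys.add(new MarkedKey(false, "banana"));
        keys.add(new MarkedKey(true, "#count"));
        keys.add(new MarkedKey(false, "apple"));
        keys.sort(MarkedKey.COMPARATOR);
        for (MarkedKey k : keys) {
            System.out.println((k.isCount ? "COUNT " : "DATA  ") + k.key);
        }
    }
}
```

In the real job, the comparator and grouping comparator would be set on the
JobConf and operate on your Writable key type rather than a plain class.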
If your job is deterministic, you can write a version of your map that
emits counts, follow the wordcount example to get a map output record
count, read and store the result in the config, and start your own job
with this information.
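The hand-off step (store the count from the first job, read it in the
second) can be sketched roughly as below. This uses java.util.Properties
as a stand-in for Hadoop's JobConf (a real driver would call something
like conf.setLong/getLong on the second job's configuration), and the
property name "map.output.records" is made up for illustration.

```java
import java.util.Properties;

// Stand-in for JobConf: the driver of the counting job stores the map
// output record count, and the driver of the real job reads it back.
class ConfigHandoff {
    static final Properties conf = new Properties();

    // Called after the counting job finishes (count name is hypothetical).
    static void storeCount(long mapOutputRecords) {
        conf.setProperty("map.output.records",
                         Long.toString(mapOutputRecords));
    }

    // Called when configuring the second job.
    static long readCount() {
        return Long.parseLong(conf.getProperty("map.output.records", "0"));
    }
}
```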
If it's only a sanity check and not part of your computation, you
probably want to query and verify it from your driver when the job is
finished. An example is in TestReduceFetch. -C
On Sep 25, 2008, at 4:52 PM, Sandy wrote:
Thanks, Lohit.
I took a look at the Task_Counter.properties file and figured out that
I would like to use REDUCE_INPUT_RECORDS. I want to access this within
my reduce function, just to check the value.
In order to do this, I tried to include
import org.apache.hadoop.mapred.Task;
and added the following to my reduce function code:
reduce() {
    ...
    Counters counter = new Counters();
    long reduce_records =
        counter.getCounter(Task.Counter.REDUCE_INPUT_RECORDS);
    ...
}
However, it seems that I'm not allowed to do this. I get the following
when I try to compile:
error 1: org.apache.hadoop.mapred.Task is not public in
org.apache.hadoop.mapred; cannot be accessed from outside package
error 2: package Task does not exist
I suspect that the second error is a result of the first. All I want to
do is access the value of this counter. Could you please explain how
this can be accomplished?
Thanks in advance,
-SM
On Wed, Sep 24, 2008 at 7:40 PM, lohit <[EMAIL PROTECTED]> wrote:
Yes, take a look at
src/mapred/org/apache/hadoop/mapred/Task_Counter.properties
Those are all the counters available for a task.
-Lohit
----- Original Message ----
From: Sandy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, September 24, 2008 5:09:39 PM
Subject: counter for number of mapper records
If I understand correctly, each mapper is sent a set number of records.
Is there a counter or variable that tells you how many records are sent
to a particular mapper?
Likewise, is there a similar thing for reducers?
Thanks in advance.
-SM