Hi,
I was running the Sort example in Hadoop 0.20.2 (hadoop-0.20.2-examples.jar)
over an input data size of 100GB (generated using randomwriter) with
800 mappers (I was using a 128MB HDFS block size) and 4 reducers, on a
3-machine cluster with 2 slave nodes. While the input and output were each
100GB, I would like to clarify my earlier question: I found that each reducer
reports FILE_BYTES_READ as around 78GB, HDFS_BYTES_WRITTEN as 25GB, and
REDUCE_SHUFFLE_BYTES as 25GB. So, why is FILE_BYTES_READ 78GB and not
just 25GB?
Thanks,
Virajith
On Wed, Jun 29, 2011 at 10:29 AM, Virajith Jalaparti wrote: ...
Virajith,
The FILE_BYTES_READ counter also counts all the reads of spilled records
done while sorting and merging the map outputs between the map and reduce
phases.
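As a rough back-of-the-envelope check (my own reading of your numbers,
assuming default merge settings): each reducer shuffles about
100GB / 4 = 25GB, which matches your REDUCE_SHUFFLE_BYTES and
HDFS_BYTES_WRITTEN. If the on-disk merge runs in multiple passes, the same
intermediate data is written to and re-read from local disk once per pass,
so a FILE_BYTES_READ of ~78GB would be consistent with roughly three read
passes over the ~25GB of shuffled data (3 x 25GB = 75GB, plus some
overhead).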
On Wed, Jun 29, 2011 at 6:30 PM, Virajith Jalaparti wrote:
> I would like to clarify my earlier question: I found that each reducer
> reports FILE_BYTES_READ as around 78GB and HDFS_BYTES_WRITTEN as 25GB ...
Great, that makes a lot of sense now! Thanks a lot Harsh!
A related question: what does REDUCE_SHUFFLE_BYTES represent? Is it the size
of the sorted output of the shuffle phase?
Thanks,
Virajith
On Wed, Jun 29, 2011 at 2:10 PM, Harsh J wrote:
> Virajith,
>
> The FILE_BYTES_READ counter also counts all the reads of spilled records
> done while sorting ...
So I have a custom Key which is used for a join. It contains two fields, a
boolean (an is-primary-key flag) and an int (the key). hashCode() only looks
at the key field, so that all records with the same key get sent to the same
reducer. compare() places the primary key at the top of the list (if sorted
using compare). This works nicely, except that the values do not end up
grouped together in the reducer the way I expect.
You probably need to implement a custom comparator that you use as the
grouping comparator: one that compares the primary key and then, if those
are the same, compares the int part of the key.
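Something like this minimal sketch (JoinKey and its accessors are made-up
names here, adjust to your own class): the key hash-codes on the int key
only, sorts the primary record first, and the grouping comparator looks
only at the int key so the primary record and its matches all arrive in
the same reduce() call.

// JoinKey.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class JoinKey implements WritableComparable<JoinKey> {
  private boolean isPrimary;
  private int key;

  public JoinKey() {}

  public int getKey() { return key; }

  public void write(DataOutput out) throws IOException {
    out.writeBoolean(isPrimary); // 1 byte
    out.writeInt(key);           // 4 bytes
  }

  public void readFields(DataInput in) throws IOException {
    isPrimary = in.readBoolean();
    key = in.readInt();
  }

  public int compareTo(JoinKey o) {
    if (key != o.key) return (key < o.key) ? -1 : 1;
    if (isPrimary == o.isPrimary) return 0;
    return isPrimary ? -1 : 1; // primary record sorts first in its group
  }

  public int hashCode() { return key; } // partition on the int key only

  public boolean equals(Object o) {
    return (o instanceof JoinKey) && compareTo((JoinKey) o) == 0;
  }
}

// JoinKeyGroupingComparator.java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class JoinKeyGroupingComparator extends WritableComparator {
  public JoinKeyGroupingComparator() {
    super(JoinKey.class, true); // true => deserialize keys for compare()
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // Group on the int key only; ignore the is-primary flag.
    int k1 = ((JoinKey) a).getKey();
    int k2 = ((JoinKey) b).getKey();
    return (k1 < k2) ? -1 : ((k1 == k2) ? 0 : 1);
  }
}

Then wire it in with
conf.setOutputValueGroupingComparator(JoinKeyGroupingComparator.class)
on your JobConf (the old 0.20 API).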
--Aaron
So that kind of makes sense, but why would it not group the other values
then? There are a bunch of records with the exact same key (only 1 primary
record, so only 1 that is different per set), and it is my understanding
that they would be grouped together (without the primary key) if I didn't
do anything differently.
I dunno, I just know that when I use a separate comparator for my custom key
(does something similar to yours, although 2 or 3 additional secondary fields
to group on) it works as it should.
--Aaron
So after I created the RawComparator for the key it worked as expected.
Thanks.
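For anyone finding this thread later, a minimal sketch of that kind of
RawComparator, assuming the key serializes as a 1-byte boolean flag
followed by a 4-byte int (adjust the offsets to the actual write() order):

import org.apache.hadoop.io.WritableComparator;

public class JoinKeyRawGroupingComparator extends WritableComparator {
  public JoinKeyRawGroupingComparator() {
    super(JoinKey.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Skip the 1-byte is-primary flag and compare the raw 4-byte int key.
    int k1 = readInt(b1, s1 + 1);
    int k2 = readInt(b2, s2 + 1);
    return (k1 < k2) ? -1 : ((k1 == k2) ? 0 : 1);
  }
}

Comparing the serialized bytes directly avoids deserializing every key
during the sort, which is why Hadoop favors RawComparators for speed.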
-Trevor
On Wed, Jun 29, 2011 at 2:47 PM, Aaron Baff wrote:
> I dunno, I just know that when I use a separate comparator for my custom
> key (does something similar to yours, although 2 or 3 additional secondary
> fields to group on) it works as it should.
Hi,
I was wondering what scheduling algorithm is used in Hadoop (version
0.20.2 in particular) for a ReduceTask to determine in what order it is
supposed to read the map outputs from the various mappers that have been
run. In particular, suppose we have 10 maps called map1, map2, ..., map10.
On 06/29/2011 05:28 PM, Virajith Jalaparti wrote:
> Hi,
> I was wondering what scheduling algorithm is used in Hadoop (version
> 0.20.2 in particular) for a ReduceTask to determine in what order it is
> supposed to read the map outputs from the various mappers that have been
> run. In particular, suppose ...
Hi,
I guess I did not frame my question properly. What I actually meant was
this: after the map phase, the output of each map is partitioned based on
the key value and written to disk as a single file. Now, the ReduceTask
starts up and has to read the intermediate values from the partitions
produced by each of the maps. In what order does it read them?
I am trying to run an application where I try to generate the cartesian
product of two potentially large data sets. In reality I only need the
cartesian product of values in the set with a particular integer key. I am
considering a design where the first mappers run through the values of set
A and emit each value keyed by its integer key ...
Exact same bucket is possible; exact same machine (if that is what you had
in mind), probably not. The partitioner breaks the data up for the
reducers, so if two records map to the same partition they will be
processed by the same reducer. If you can partition the data such that the
output of one reducer ...
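A minimal sketch of that partitioning idea (hypothetical names, old 0.20
mapred API, and assuming the map output key is the integer join key as an
IntWritable):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class JoinKeyPartitioner implements Partitioner<IntWritable, Text> {
  public void configure(JobConf job) {}

  public int getPartition(IntWritable key, Text value, int numPartitions) {
    // Mask the sign bit so negative keys still give a valid partition.
    return (key.get() & Integer.MAX_VALUE) % numPartitions;
  }
}

Records from set A and set B that share the same integer key then land in
the same partition, and therefore at the same reducer, which is where the
per-key cartesian product can be formed.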