Intermediate data size of Sort example

2011-06-29 Thread Virajith Jalaparti
Hi, I was running the Sort example in Hadoop 0.20.2 (hadoop-0.20.2-examples.jar) over 100GB of input data (generated using randomwriter), with 800 mappers (I was using a 128MB HDFS block size) and 4 reducers, on a 3-machine cluster with 2 slave nodes. While the input and output were both 100GB, the intermediate data reported by the job counters was much larger than I expected.
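(For reference, the mapper count follows directly from the input and block sizes: 100GB / 128MB = 800 input splits, hence 800 map tasks.)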

Re: Intermediate data size of Sort example

2011-06-29 Thread Virajith Jalaparti
I would like to clarify my earlier question: I found that each reducer reports FILE_BYTES_READ of around 78GB, HDFS_BYTES_WRITTEN of 25GB, and REDUCE_SHUFFLE_BYTES of 25GB. So why is FILE_BYTES_READ 78GB and not just 25GB? Thanks, Virajith
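For readers following along: a minimal sketch of dumping the same counters from a finished job on the 0.20-era API. The group/name strings are the ones Hadoop 0.20 used internally; treat them, and the class name, as assumptions.

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CounterDump {
  public static void dump(JobConf conf) throws Exception {
    RunningJob job = JobClient.runJob(conf); // blocks until the job completes
    Counters c = job.getCounters();
    // File system and task counters, looked up by their 0.20 group/name strings:
    System.out.println("FILE_BYTES_READ      = "
        + c.findCounter("FileSystemCounters", "FILE_BYTES_READ").getCounter());
    System.out.println("HDFS_BYTES_WRITTEN   = "
        + c.findCounter("FileSystemCounters", "HDFS_BYTES_WRITTEN").getCounter());
    System.out.println("REDUCE_SHUFFLE_BYTES = "
        + c.findCounter("org.apache.hadoop.mapred.Task$Counter",
                        "REDUCE_SHUFFLE_BYTES").getCounter());
  }
}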

Re: Intermediate data size of Sort example

2011-06-29 Thread Harsh J
Virajith, FILE_BYTES_READ also counts all the reads of spilled records done during the sorting and merging of the various outputs between the MR phases.
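Since those extra reads come from spill-and-merge passes, the relevant 0.20 knobs are the map-side sort buffer and the merge factor. A hedged sketch (the property names are the 0.20 ones; the values are purely illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class SpillTuning {
  public static JobConf tune(JobConf conf) {
    conf.setInt("io.sort.mb", 200);                // map-side sort buffer in MB (default 100)
    conf.setFloat("io.sort.spill.percent", 0.80f); // buffer fill fraction that triggers a spill
    conf.setInt("io.sort.factor", 100);            // streams merged in one pass (default 10)
    return conf;
  }
}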

Re: Intermediate data size of Sort example

2011-06-29 Thread Virajith Jalaparti
Great, that makes a lot of sense now! Thanks a lot, Harsh! A related question: what does REDUCE_SHUFFLE_BYTES represent? Is it the size of the sorted output of the shuffle phase? Thanks, Virajith

Reduce method called same key twice

2011-06-29 Thread Trevor Adams
So I have a custom Key which is used for a join. It contains two fields: a boolean (is primary record) and an int (the join key). hashCode() only looks at the key field, so that all records for a key get sent to the same reducer. compareTo() places the primary record at the top of the list (if sorted using it). This works nicely, except that reduce() sometimes gets called twice for what looks like the same key.
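For concreteness, a hypothetical reconstruction of the key described above (the class and field names are invented):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class JoinKey implements WritableComparable<JoinKey> {
  boolean primary; // is this the primary record?
  int key;         // the join key

  public void write(DataOutput out) throws IOException {
    out.writeBoolean(primary);
    out.writeInt(key);
  }

  public void readFields(DataInput in) throws IOException {
    primary = in.readBoolean();
    key = in.readInt();
  }

  // Hash only the join key so primary and secondary records
  // land on the same reducer.
  public int hashCode() {
    return key;
  }

  public boolean equals(Object o) {
    if (!(o instanceof JoinKey)) return false;
    JoinKey other = (JoinKey) o;
    return key == other.key && primary == other.primary;
  }

  // Sort by join key; within a key, the primary record sorts first.
  public int compareTo(JoinKey o) {
    if (key != o.key) return key < o.key ? -1 : 1;
    if (primary == o.primary) return 0;
    return primary ? -1 : 1;
  }
}

Note that with only this class, grouping falls back on compareTo(), which still distinguishes the primary record from the rest; that is exactly how one join key can produce two reduce() calls.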

RE: Reduce method called same key twice

2011-06-29 Thread Aaron Baff
You probably need to implement a custom comparator that you use as the grouping comparator: it compares the primary key and then, if those are the same, compares the int part of the key. --Aaron
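A sketch of such a grouping comparator for the hypothetical JoinKey above. One caveat: for grouping (as opposed to sorting), the comparator should compare only the shared int join key and ignore the primary flag, otherwise the primary record still ends up in its own group; this matches the RawComparator fix Trevor reports below.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class JoinKeyGroupingComparator extends WritableComparator {
  public JoinKeyGroupingComparator() {
    super(JoinKey.class, true); // true => instantiate keys for deserialization
  }

  // Group purely on the int join key; ignore the primary flag.
  public int compare(WritableComparable a, WritableComparable b) {
    int ka = ((JoinKey) a).key;
    int kb = ((JoinKey) b).key;
    return ka == kb ? 0 : (ka < kb ? -1 : 1);
  }
}

On the 0.20 API this would be registered with conf.setOutputValueGroupingComparator(JoinKeyGroupingComparator.class).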

Re: Reduce method called same key twice

2011-06-29 Thread Trevor Adams
So, that kind of makes sense, but why would it not group the other values then? There are a bunch of records with the exact same key (only 1 primary record, so only 1 that is different per set), and it is my understanding that they would be grouped together (without the primary key) if I didn't do anything different.

RE: Reduce method called same key twice

2011-06-29 Thread Aaron Baff
I dunno, I just know that when I use a separate comparator for my custom key (it does something similar to yours, although with 2 or 3 additional secondary fields to group on), it works as it should. --Aaron

Re: Reduce method called same key twice

2011-06-29 Thread Trevor Adams
So after I created the RawComparator for the key, it worked as expected. Thanks. -Trevor
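A byte-level sketch along those lines, assuming the serialized JoinKey layout from earlier (one boolean byte followed by a 4-byte big-endian int):

import org.apache.hadoop.io.WritableComparator;

public class JoinKeyRawGroupingComparator extends WritableComparator {
  public JoinKeyRawGroupingComparator() {
    super(JoinKey.class);
  }

  // Skip the 1-byte primary flag and compare the serialized ints directly,
  // without deserializing the keys.
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int k1 = readInt(b1, s1 + 1);
    int k2 = readInt(b2, s2 + 1);
    return k1 == k2 ? 0 : (k1 < k2 ? -1 : 1);
  }
}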

How does a ReduceTask determine which MapTask output to read?

2011-06-29 Thread Virajith Jalaparti
Hi, I was wondering what scheduling algorithm is used in Hadoop (version 0.20.2 in particular) for a ReduceTask to determine in what order it is supposed to read the map outputs from the various mappers that have been run. In particular, suppose we have 10 maps called map1, map2, ..., map10.

Re: How does a ReduceTask determine which MapTask output to read?

2011-06-29 Thread David Rosenstrauch
On 06/29/2011 05:28 PM, Virajith Jalaparti wrote: > [original question snipped]

Re: How does a ReduceTask determine which MapTask output to read?

2011-06-29 Thread Virajith Jalaparti
Hi, I guess I did not frame my question properly. What I actually meant was this: after the map phase, the output of each map is partitioned based on the key value and written to disk as a single file. Now the ReduceTask starts up and has to read its intermediate data from the right partition of each map's output.
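For context, the mapping is direct: the partition index a key falls into is the index of the reduce task that will fetch it, so reducer i simply pulls partition i from every completed map output. A sketch equivalent to the default hash partitioning (old 0.20 API; the class name is invented):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class ModPartitioner<K, V> implements Partitioner<K, V> {
  public void configure(JobConf conf) {}

  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask the sign bit so the result is a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}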

Is there a way to ensure that different jobs have the same number of reducers

2011-06-29 Thread Steve Lewis
I am trying to run an application where I generate the Cartesian product of two potentially large data sets. In reality I only need the Cartesian product of the values in the sets with a particular integer key. I am considering a design where the first mappers run through the values of set A and emit them keyed by that integer.

Re: Is there a way to ensure that different jobs have the same number of reducers

2011-06-29 Thread Trevor Adams
The exact same bucket is possible; the exact same machine (if that is what you had in mind), probably not. The partitioner breaks the data up for the reducers, so if two records map to the same partition they will be handled by the same reducer. If you can partition the data so that the output of one job's reducer partitions lines up with the next job's partitions, the keys will stay together across jobs.
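In other words, partition assignment is a pure function of the key, the partitioner class, and the reducer count, so pinning the latter two makes the bucketing repeatable across jobs. A sketch (ModPartitioner is the hypothetical partitioner sketched earlier; 16 reducers is purely illustrative):

import org.apache.hadoop.mapred.JobConf;

public class AlignedJobs {
  // Apply to every job that must bucket keys identically.
  public static JobConf align(JobConf conf) {
    conf.setNumReduceTasks(16);                     // same reducer count in every job
    conf.setPartitionerClass(ModPartitioner.class); // same deterministic partitioner
    return conf;
  }
}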