Hmm. Thanks, Miles. I'm going to mull that over for the evening.
As an aside, I took another look at the job history file for the second
phase, and I noticed something interesting. Even though I set the number of
map tasks to 8 using JobConf, and set the maximum number of map tasks to 8
in my hadoop-site.xml file, I see the following:

Task Summary
============================
Kind    Total  Successful  Failed  Killed  StartTime             FinishTime
Map     787    787         0       0       18-Sep-2008 18:55:14  19-Sep-2008 04:13:46 (9hrs, 18mins, 32sec)
Reduce  4      4           0       0       18-Sep-2008 18:55:19  19-Sep-2008 07:31:33 (12hrs, 36mins, 14sec)
============================

This could be another reason for the lack of speedup: 787 maps are being
spawned. But how can this be? I specifically called setNumMapTasks(8) in my
code, and also set mapred.tasktracker.tasks.maximum to 8 in hadoop-site.xml;
how is it possible for more than 8 maps to show up?

Thanks again,

-SM

On Fri, Sep 19, 2008 at 4:04 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> for the word count example, you would have each mapper work out the
> counts of all words, but in memory (e.g. have a hash table and
> increment a word's count each time you see it). the merge phase would
> then simply add together the word-count pairs which share the same
> word. more concretely, the mapper would emit the 2-perm count pairs
> and the reducer would simply sum them up.
>
> continuing with word counting, the basic MR way of doing it maximises
> disk space usage and minimises memory usage. what i imagine is that
> you can have stages in between -- use more memory than the pure MR
> stateless approach, but gain in speed.
>
> returning to your example, you may not need to read in that 48 GB of
> data if your counts are fairly dense (you see many instances of a
> given 2-perm entry). if they are sparse, then you are probably doomed.
>
> Miles
>
> 2008/9/19 Sandy <[EMAIL PROTECTED]>:
> > Miles,
> >
> > Thanks for your response.
> >
> > I think I understand: basically, I'm adding a combiner class that
> > computes the partial results in phase 2, correct (just like in the
> > word count example)?
> >
> > However, even if I do that, I don't think it gets rid of the overhead
> > of reading 48 GB from disk back into memory, which I think may be a
> > large part of the problem. Would you agree that that may be the
> > largest source of overhead?
> >
> > Is there something in Hadoop that will allow me to prevent the output
> > of the first reduce phase from being written to disk, and instead
> > just place it directly as input for the second reduce?
> >
> > Thanks,
> >
> > -SM
> >
> > On Fri, Sep 19, 2008 at 3:13 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> >
> >> if each mapper only sees a relatively small chunk of the data, then
> >> why not have each one compute the counts of 2-perms in memory.
> >>
> >> you would then get the reducer to merge these partial results together.
> >>
> >> (details are left to the reader ...)
> >>
> >> Miles
> >>
> >> 2008/9/19 Sandy <[EMAIL PROTECTED]>:
> >> > Hi,
> >> >
> >> > I have a serious problem that I'm not sure how to fix. I have two
> >> > M/R phases that calculate a matrix in parallel. It works... but
> >> > it's slower than the serial version (by about 100 times).
> >> >
> >> > Here is a toy program that works similarly to my application. In
> >> > this example I'm having different random numbers generated per
> >> > line of input, and then creating an n x n matrix that counts how
> >> > many of these random numbers were shared.
> >> >
> >> > -------------------
> >> > first map phase() {
> >> >   input: key = offset, value = line of text (with embedded line number ln)
> >> >   generate k random numbers, k1 .. kn
> >> >   emit: <ki, ln>
> >> > }
> >> >
> >> > first reduce phase() {
> >> >   input: key = ki, value = list(ln)
> >> >   if list size is greater than one:
> >> >     for every 2-permutation p:
> >> >       emit: <p, 1>
> >> >   // example: list = 1 2 3
> >> >   // emit: <(1,2), 1>
> >> >   // emit: <(2,3), 1>
> >> >   // emit: <(1,3), 1>
> >> > }
> >> >
> >> > second map phase() {
> >> >   input: key = offset, value = (i, j) 1
> >> >   // dummy function: acts as a transition to the reduce
> >> >   parse value into two tokens, [(i,j)] and [1]
> >> >   emit: <(i,j), 1>
> >> > }
> >> >
> >> > second reduce() {
> >> >   input: key = (i,j), value = list(1)
> >> >   // wordcount
> >> >   sum up the list of ones
> >> >   emit: <(i,j), sum(list(1))>
> >> > }
> >> > ------------------
> >> >
> >> > Now here's the problem:
> >> >
> >> > Let's suppose the file is 27 MB.
> >> > The time it takes for the first map phase is about 3 minutes.
> >> > The time it takes for the first reduce phase is about 1 hour.
> >> > The size of the intermediate files produced by this first M/R
> >> > phase is 48 GB.
> >> >
> >> > The time it takes for the second map phase is 9 hours (and this
> >> > function is just a dummy function!!).
> >> > The time it takes for the second reduce phase is 12 hours.
> >> >
> >> > I have been trying to change the number of map and reduce tasks,
> >> > but that doesn't seem to really chip away at the massive number of
> >> > 2-permutations that need to be taken care of in the second M/R
> >> > phase. At least not on my current machine.
> >> >
> >> > Has anyone implemented a matrix in parallel using MapReduce? If
> >> > so, is this normal or expected behavior? I do realize that I have
> >> > a small input file, and that this may impact speedup. The most
> >> > powerful machine I have to run this M/R implementation is a Mac
> >> > Pro that has two processors, each with four cores, and 4 hard
> >> > disks of 1 TB each.
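(As a side note on the toy program above: its counting logic can be checked outside Hadoop with a short plain-Java simulation. This is only a sketch; the class and method names are invented, fixed value lists stand in for the random numbers, and all Hadoop plumbing is omitted.)

```java
import java.util.*;

// Plain-Java sketch of the first-reduce / second-reduce counting logic
// from the toy program. Names are invented for illustration.
public class PairCountSketch {

    // first reduce: for each random number seen on more than one line,
    // emit every unordered 2-permutation of its line numbers once.
    static List<String> firstReduce(Collection<List<Integer>> groups) {
        List<String> pairs = new ArrayList<String>();
        for (List<Integer> lns : groups) {
            if (lns.size() < 2) continue; // nothing shared on this key
            for (int a = 0; a < lns.size(); a++)
                for (int b = a + 1; b < lns.size(); b++)
                    pairs.add("(" + lns.get(a) + "," + lns.get(b) + ")");
        }
        return pairs;
    }

    // second reduce: plain word-count over the emitted pairs.
    static Map<String, Integer> secondReduce(List<String> pairs) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String p : pairs) {
            Integer c = counts.get(p);
            counts.put(p, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // toy input: k1 was seen on lines 1,2,3 and k2 on lines 1,2
        List<List<Integer>> groups = new ArrayList<List<Integer>>();
        groups.add(Arrays.asList(1, 2, 3));
        groups.add(Arrays.asList(1, 2));
        // pair (1,2) is emitted under both keys
        System.out.println(secondReduce(firstReduce(groups)).get("(1,2)")); // prints 2
    }
}
```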
> >> > Does anyone have any suggestions on what I can change (either
> >> > with the approach or the cluster setup -- do I need more
> >> > machines?) in order to make this faster? I am currently running 8
> >> > map tasks and 4 reduce tasks. I am going to change it to 10 map
> >> > tasks and 9 reduce tasks and see if that helps any, but I'm
> >> > seriously wondering whether this will give me much of a change,
> >> > since I only have one machine.
> >> >
> >> > Any insight is greatly appreciated.
> >> >
> >> > Thanks!
> >> >
> >> > -SM
> >> >
> >>
> >> --
> >> The University of Edinburgh is a charitable body, registered in
> >> Scotland, with registration number SC005336.
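The in-memory counting Miles describes in the thread (a hash table in each mapper, incremented per record, flushed as partial sums) might look roughly like the sketch below. All names are invented and the Hadoop wiring is omitted; this is an illustration of the idea, not the poster's code.

```java
import java.util.*;

// Sketch of in-mapper combining: keep a hash table of partial counts
// and emit <pair, partialSum> once at the end of the map task, so the
// reducer only sums a few partial counts per pair instead of reading
// huge numbers of individual <pair, 1> records from disk.
public class InMapperCounts {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // called once per record, instead of emitting <pairKey, 1> immediately
    public void map(String pairKey) {
        Integer c = counts.get(pairKey);
        counts.put(pairKey, c == null ? 1 : c + 1);
    }

    // called once when the map task finishes: the partial sums are
    // what would actually be emitted and shuffled to the reducers
    public Map<String, Integer> flush() {
        return counts;
    }
}
```

In real 0.18-era Hadoop code this would sit inside a Mapper implementation, with the flush happening in the mapper's close() hook via an OutputCollector saved during map(); a separate combiner class set with setCombinerClass() achieves a similar reduction. The trade-off is the one Miles mentions: more memory per mapper in exchange for far less intermediate data on disk.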