if each mapper only sees a relatively small chunk of the data, then why not have each one compute the counting of 2-perms in memory.
you would then get the reducer to merge these partial results together. (details are left to the reader ...) Miles 2008/9/19 Sandy <[EMAIL PROTECTED]>: > Hi, > > I have a serious problem that I'm not sure how to fix. I have two M/R phases > that calculates a matrix in parallel. It works... but it's slower than the > serial version (by about 100 times). > > Here is a toy program that works similarly to my application. In this > example I'm having different random numbers being generated, per given line > of input, and then creating a n x n matrix that counts how many of these > random numbers were shared. > > ------------------- > first map phase() { > input: key = offset, value = line of text (embedded line number), ln > generate k random numbers, k1 .. kn > emit: <ki, ln > > } > > first reduce phase() { > input: key = ki, value = list(ln) > if list size is greater than one: > for every 2-permutation p: > emit: <p, 1> > //example: list = 1 2 3 > //emit: <(1,2), 1> > //emit: <(2,3), 1> > //emit: <(1,3), 1> > } > > second map phase() { > input: key = offset, value = (i, j) 1 > //dummy function. acts as a transition to reduce > parse value into two tokens [(i,j)] and [1] > emit: <(i,j), 1> > } > > second reduce() { > input: key = (i,j) value = list(1) > //wordcount > sum up the list of ones > emit: <(i,j), sum(list(1))> > } > ------------------ > > Now here's the problem: > > Let's suppose the file is 27MB. > The time it takes for the first map phase is about 3 minutes. > The time it takes for the first reduce phase is about 1 hour. > The size of the intermediary files that are produced by this first M/R phase > is 48GB. > > The time it takes for the second map phase is 9 hours (and this function is > just a dummy funtion!!) > The time it takes for the second reduce phase is 12 hours > > I have been trying to change the number of maps and reduce tasks, but that > doesn't seem to really chip away at the massive number of 2-permutations > that need to be taken care of in the second M/R phase. At least not on my > current machine. > > > Has anyone implemented a matrix in parallel using MapReduce? If so, Is this > normal or expected behavior? I do realize that I have a small input file, > and that this may impact speedup. The most powerful machine I have to run > this M/R implementation is a MacPro that has two processors, each with four > cores, and 4 different hard disks of 1 TB each. > > Does anyone have any suggestions on what I can change (either with the > approach or the cluster setup -- do I need more machines?) in order to make > this faster? I am current running 8 map tasks and 4 reduce tasks. I am going > to change it 10 map tasks and 9 reduce tasks and see if that helps any, but > I'm seriously wondering if this is not going to give me much of a change > since I only have one machine. > > > Any insight is greatly appreciated. > > Thanks! > > -SM > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.