On Tue, Feb 10, 2009 at 5:31 PM, James Marca <[email protected]> wrote:
> I have a situation where I want to run two different reduce functions
> on the output of a single map function. Like suppose I want one
> reduce function to get the count of all objects in each group (for
> example, documents with or without attachments), and another reduce to
> compute some other aggregate, like the average and standard deviation
> of a value, (like the average size of attached documents). (Yes, I
> know this is a stupid example, as the averaging reduce function will
> also have the count, but my real case is too complicated to write
> easily).
>
> Should one strive for a minimal set of reduce functions per map (one
> reduce for all three count, average, std deviation), or does it make
> sense to identically copy the maps and make multiple reduce functions
> (one reduce _each_ for count, mean, std dev)? (again, ignore the fact
> that you compute count and mean when computing std dev, etc)
>
> I have a feeling from reading the various docs that identical map
> functions are only executed once in CouchDB. If that is true, then is
> it _also_ true that having lots of reduce functions for one map is not
> any more expensive (in terms of space and computational speed) than
> trying for a minimal set of map-reduce pairs. Any advice on this?
>
> Thanks in advance,
> James
>

Your reading of the docs is spot on. If you have byte-identical map
functions, only a single btree is used for both maps. At the moment,
the only way to reuse a single btree with multiple reduce functions is
to do exactly what you suggested: copy your maps and attach your reduce
functions as necessary.

Before I go on, I should mention that the best way to figure this out
would be to set up a couple of benchmarks and measure whether there's
any noticeable difference between having multiple reduce functions and
having one complex one.

That said, with each reduce function you're adding a round trip through
the view server every time a reduce is called. I would cautiously lean
towards thinking that this isn't going to be as much overhead as you
might think. I.e., I find it more likely that view generation is going
to be dominated by other things.

The space requirement should be roughly related to the size of the
output that either method would produce. I.e., having multiple reduce
functions isn't in and of itself going to cause you to run into
problems. The only overhead I can think of is a bit more Erlang
serialization, since the term format differs slightly between the two
cases.

HTH,
Paul Davis
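
For illustration only, here is a minimal sketch of the "copy the map, attach a different reduce" layout discussed above. Everything named here is an assumption made for the example, not something from the thread: the design document id, the view names `count` and `stats`, the `size` field on each document, and the choice to carry count, sum, and sum of squares so a client can derive the mean and standard deviation. Whether a given CouchDB version accepts a named function's source text as a reduce is also an assumption; test it against your own server.

```javascript
// Hypothetical map: group documents by whether they carry attachments.
// Kept as a single string so both views get byte-identical source.
var mapSource =
  "function(doc) { emit(doc._attachments ? 'with' : 'without', doc.size || 0); }";

// Reduce 1: count rows per group.
// On rereduce, the incoming values are partial counts, so add them up.
function countReduce(keys, values, rereduce) {
  if (!rereduce) {
    return values.length;
  }
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Reduce 2: carry count, sum, and sum of squares.
// Mean and standard deviation can be derived from these by the client.
function statsReduce(keys, values, rereduce) {
  var out = { count: 0, sum: 0, sumsqr: 0 };
  for (var i = 0; i < values.length; i++) {
    var v = values[i];
    if (rereduce) {
      // v is a partial result produced by an earlier call.
      out.count += v.count;
      out.sum += v.sum;
      out.sumsqr += v.sumsqr;
    } else {
      // v is a raw emitted value.
      out.count += 1;
      out.sum += v;
      out.sumsqr += v * v;
    }
  }
  return out;
}

// Hypothetical design document: two views, same map source, different reduces.
var designDoc = {
  _id: "_design/attachments",
  language: "javascript",
  views: {
    count: { map: mapSource, reduce: countReduce.toString() },
    stats: { map: mapSource, reduce: statsReduce.toString() }
  }
};
```

Because both views reference the same `mapSource` string, the two map definitions cannot drift apart by a stray space or newline, which matters if sharing depends on the maps being byte identical.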
