Cooccurrence analysis is commonly used in recommendations. These jobs produce large intermediate outputs.
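To give a rough feel for why the intermediates get large, here is a toy in-memory sketch in Python. It is not Mahout or Hadoop code; the function names and the sample baskets are made up for illustration. It assumes the map step emits one record per unordered item pair in each basket, so a basket of n items emits n*(n-1)/2 records.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(basket):
    """Map step (hypothetical): emit one (item_a, item_b) record per
    unordered pair of distinct items in a single basket."""
    items = sorted(set(basket))
    return list(combinations(items, 2))


def count_cooccurrences(baskets):
    """Reduce step (hypothetical): sum the emitted pairs over all baskets.
    Also returns how many intermediate records were emitted in total."""
    counts = Counter()
    emitted = 0
    for basket in baskets:
        pairs = cooccurrence_pairs(basket)
        emitted += len(pairs)
        counts.update(pairs)
    return counts, emitted


# Made-up sample data, only to compare input size with intermediate size.
baskets = [
    ["milk", "bread", "eggs", "butter"],
    ["bread", "butter", "jam"],
    ["milk", "eggs", "bread", "jam", "tea"],
]
counts, emitted = count_cooccurrences(baskets)
input_items = sum(len(b) for b in baskets)
print("input items:", input_items, "intermediate pair records:", emitted)
```

The gap grows quadratically with basket size, so a handful of very large baskets (or very active users) can dominate the shuffle.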
Come on over to the Mahout project if you would like to talk to a bunch of people who work on these problems.

On Fri, Apr 29, 2011 at 9:31 PM, elton sky <eltonsky9...@gmail.com> wrote:

> Thank you for suggestions:
>
> Weblog analysis, market basket analysis and generating search index.
>
> I guess for these applications we need more reduces than maps, for handling
> large intermediate output, isn't it. Besides, the input split for map should
> be smaller than usual, because the workload for spill and merge on map's
> local disk is heavy.
>
> -Elton
>
> On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley <omal...@apache.org> wrote:
>
> > On Fri, Apr 29, 2011 at 5:02 AM, elton sky <eltonsky9...@gmail.com> wrote:
> >
> > > For my benchmark purpose, I am looking for some non-trivial, real life
> > > applications which creates *bigger* output than its input. Trivial example I
> > > can think about is cross join...
> >
> > As you say, almost all cross join jobs have that property. The other case
> > that almost always fits into that category is generating an index. For
> > example, if your input is a corpus of documents and you want to generate the
> > list of documents that contain each word, the output (and especially the
> > shuffle data) is much larger than the input.
> >
> > -- Owen
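The index case Owen describes in the quoted message can be sketched the same way. This is a toy in-memory Python version, not the Hadoop API; the function names, the document ids, and the two-document corpus are assumptions made up for illustration. The map step emits one (word, doc_id) record per distinct word in a document, so every word in the shuffle carries a document id with it.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step (hypothetical): emit one (word, doc_id) record per
    distinct word in the document."""
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]


def build_index(corpus):
    """Collect all map output (the 'shuffle'), then group by word.
    Returns the inverted index and the raw shuffle records."""
    shuffle = []
    for doc_id, text in corpus.items():
        shuffle.extend(map_document(doc_id, text))
    index = defaultdict(list)
    for word, doc_id in sorted(shuffle):  # sort by word, then doc id
        index[word].append(doc_id)
    return dict(index), shuffle


# Made-up corpus, only to compare input bytes with shuffle bytes.
corpus = {
    "doc-0001": "hadoop map reduce shuffle",
    "doc-0002": "reduce shuffle spill merge",
}
index, shuffle = build_index(corpus)
input_bytes = sum(len(text) for text in corpus.values())
shuffle_bytes = sum(len(word) + len(doc_id) for word, doc_id in shuffle)
print("input bytes:", input_bytes, "shuffle bytes:", shuffle_bytes)
```

Even on this tiny corpus the shuffle is roughly twice the input, because each short word is paired with a longer document id. Real jobs that also emit positions or term frequencies would be larger still.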