Thank you for the suggestions: weblog analysis, market basket analysis, and generating a search index.
I guess these applications need more reduces than maps to handle the large intermediate output, right? Besides, the input split for each map should be smaller than usual, because the spill and merge workload on the map's local disk is heavy.

-Elton

On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley <omal...@apache.org> wrote:

> On Fri, Apr 29, 2011 at 5:02 AM, elton sky <eltonsky9...@gmail.com> wrote:
>
> > For my benchmark purpose, I am looking for some non-trivial, real life
> > applications which creates *bigger* output than its input. Trivial example
> > I can think about is cross join...
>
> As you say, almost all cross join jobs have that property. The other case
> that almost always fits into that category is generating an index. For
> example, if your input is a corpus of documents and you want to generate the
> list of documents that contain each word, the output (and especially the
> shuffle data) is much larger than the input.
>
> -- Owen
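
To make the indexing case quoted above concrete, here is a minimal in-memory sketch in plain Python (not Hadoop code) of building the word-to-documents list. The document ids, the sample text, and the function names `map_phase`, `reduce_phase`, and `size_in_bytes` are all made up for illustration. The point it tries to show is that every (word, document id) pair emitted by the map step repeats the document id, so the intermediate "shuffle" data can easily outgrow the input, at least when ids are long relative to the words.

```python
from collections import defaultdict


def map_phase(docs):
    """Emit one (word, doc_id) pair per distinct word in each document."""
    for doc_id, text in docs.items():
        for word in sorted(set(text.lower().split())):
            yield word, doc_id


def reduce_phase(pairs):
    """Group the emitted pairs by word into a sorted posting list."""
    index = defaultdict(list)
    for word, doc_id in pairs:
        index[word].append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def size_in_bytes(strings):
    """Total UTF-8 size of a sequence of strings (a rough stand-in for I/O volume)."""
    return sum(len(s.encode("utf-8")) for s in strings)


# Hypothetical corpus: the ids are deliberately longer than the words.
docs = {
    "http://example.org/a": "hadoop map reduce shuffle",
    "http://example.org/b": "map output spill merge",
    "http://example.org/c": "reduce merge sort output",
}

pairs = list(map_phase(docs))
index = reduce_phase(pairs)

# Input counts each id once plus its text; the shuffle repeats the id per word.
input_bytes = size_in_bytes(docs.keys()) + size_in_bytes(docs.values())
shuffle_bytes = sum(size_in_bytes([word, doc_id]) for word, doc_id in pairs)

print("input bytes:  ", input_bytes)
print("shuffle bytes:", shuffle_bytes)
```

On a real cluster the ratio depends on the key and value encoding, on compression, and on whether a combiner is used, so treat this only as an illustration of why the shuffle grows, not as a benchmark.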