Cooccurrence analysis is commonly used in recommendations.  Such jobs produce
large intermediate data.
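
To make that concrete, here is a minimal sketch of why the intermediate
data grows. It is not Mahout's actual implementation: the function names
and the sample baskets are made up for illustration, and plain Python
stands in for the map and reduce phases. The map side emits one record per
unordered pair of items in a basket, so a basket of n items yields
n*(n-1)/2 records.

```python
from collections import Counter
from itertools import combinations


def map_cooccurrences(baskets):
    """Map phase (hypothetical): emit ((a, b), 1) for each unordered item pair."""
    for basket in baskets:
        items = sorted(set(basket))  # dedupe and order so (a, b) is canonical
        for pair in combinations(items, 2):
            yield pair, 1


def reduce_cooccurrences(records):
    """Reduce phase (hypothetical): sum the counts for each pair."""
    counts = Counter()
    for pair, n in records:
        counts[pair] += n
    return counts


# Made-up sample data: three small market baskets.
baskets = [
    ["milk", "bread", "eggs", "butter", "jam"],
    ["milk", "bread", "eggs", "butter"],
    ["bread", "eggs", "jam"],
]

intermediate = list(map_cooccurrences(baskets))
counts = reduce_cooccurrences(intermediate)
input_items = sum(len(b) for b in baskets)

print("input items:         ", input_items)
print("intermediate records:", len(intermediate))
print("bread/eggs count:    ", counts[("bread", "eggs")])
```

The number of intermediate records grows quadratically with basket size,
while the input grows only linearly, which is why the shuffle dominates
such jobs.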

Come on over to the Mahout project if you would like to talk to a bunch of
people who work on these problems.

On Fri, Apr 29, 2011 at 9:31 PM, elton sky <eltonsky9...@gmail.com> wrote:

> Thank you for the suggestions:
>
> Weblog analysis, market basket analysis and generating a search index.
>
> I guess these applications need more reduces than maps to handle the
> large intermediate output, don't they? Besides, the input split for each
> map should be smaller than usual, because the workload for spill and merge
> on the map's local disk is heavy.
>
> -Elton
>
> On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley <omal...@apache.org> wrote:
>
> > On Fri, Apr 29, 2011 at 5:02 AM, elton sky <eltonsky9...@gmail.com> wrote:
> >
> > > For benchmarking purposes, I am looking for some non-trivial, real-life
> > > applications that create *bigger* output than their input. A trivial
> > > example I can think of is a cross join...
> > >
> >
> > As you say, almost all cross join jobs have that property. The other case
> > that almost always fits into that category is generating an index. For
> > example, if your input is a corpus of documents and you want to generate
> > the list of documents that contain each word, the output (and especially
> > the shuffle data) is much larger than the input.
> >
> > -- Owen
> >
>
