Combiners can be called zero or more times.  That can happen on the map
side or on the reduce side.
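
To make that concrete, below is a minimal driver-plus-job sketch (Hadoop 2.x
mapreduce API; the class names are made up for the example, this is not code
from Mahout). The point is that the combiner is registered purely as an
optimization: summing is associative and commutative, so applying it zero,
one, or several times, on either side of the shuffle, gives the same result.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinerDemo {

      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              context.write(new Text(token), ONE);
            }
          }
        }
      }

      // Used both as combiner and reducer: summing is safe to apply
      // zero or more times, so the job is correct even if the framework
      // never invokes the combiner at all.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-demo");
        job.setJarByClass(CombinerDemo.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // may run on map side and/or reduce side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }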

On Thu, Sep 27, 2012 at 4:56 AM, Sigurd Spieckermann <
sigurd.spieckerm...@gmail.com> wrote:

> @Jake: Could you please elaborate on how exactly the combiner can be called
> before the reducer gets the data? Do you mean the combiner is called at the
> datanode that instantiates reducer tasks? I thought the combiner is just
> called after the map task has finished and still on that datanode.
>
> 2012/9/26 Jake Mannix <jake.man...@gmail.com>
>
> > It should also be noted that the Combiner does not run only for the
> > mappers - it can be applied one (or more) times after mapping, and
> > then one or more times before the reducer gets the results. It's not
> > as simple as saying that combiners are used only (and always) on the
> > outputs of each map task.
> >
> > Also, your technique of doing "map-side caching" is in fact fairly
> > common, but don't bother trying to reuse JVM-level data across tasks:
> > that hack will be ugly and not worth it. Simply store local variables
> > in your Mapper instance and emit the collected/aggregated data once,
> > during the close() method of the Mapper.
> >
> > We do this in e.g. our LDA implementation (see for example
> > o.a.m.clustering.lda.cvb.CachingCVB0Mapper ).
> >
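
For anyone following along, here is a minimal sketch of the in-mapper
combining pattern Jake describes, written against the newer mapreduce API,
where cleanup() plays the role of close() in the old mapred API. The types
and the word-count-style aggregation are placeholders, not Mahout's actual
CachingCVB0Mapper code.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      // Per-task state: lives across all records seen by this map task.
      private final Map<String, Integer> counts = new HashMap<String, Integer>();

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        // Aggregate in memory instead of emitting one pair per token.
        for (String token : value.toString().split("\\s+")) {
          Integer old = counts.get(token);
          counts.put(token, old == null ? 1 : old + 1);
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        // Emit once, after the task has processed its entire input split.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
      }
    }
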
> > On Wed, Sep 26, 2012 at 7:50 AM, Sigurd Spieckermann <
> > sigurd.spieckerm...@gmail.com> wrote:
> >
> > > I see. The description of the SVD implementation made me wonder
> > > whether I had a wrong understanding of Hadoop or of Mahout, and
> > > whether I could potentially tune my code. When I was thinking about
> > > how to combine across multiple map tasks (given it was/is not
> > > configurable in Hadoop), I considered reusing the JVM and delaying
> > > the emits until a few (or all) map tasks had completed, but I ran
> > > into difficulties ensuring that I actually emit at the end of the
> > > very last map task on the datanode. It would have ended up as a hack
> > > that stores data in a singleton class or a static variable and emits
> > > at some later point. But I see your point about the trade-off: maybe
> > > more performance, but potentially not, e.g. if a datanode fails after
> > > some map tasks have finished but others haven't.
> > >
> > > I guess my questions have been answered. Thank you very much, guys,
> > > for the clarifying discussion!
> > >
> > > 2012/9/26 Sebastian Schelter <s...@apache.org>
> > >
> > > > I think this comes down to scalability vs. performance.
> > > >
> > > > If you were to combine the outputs of several map tasks (that run in
> > > > different VMs), you would have to make these tasks dependent on each
> > > > other, which might cause lots of problems: straggling tasks will slow
> > > > down all the other tasks on the machine, you might have to restart
> > > > all tasks if one fails, etc.
> > > >
> > > > --sebastian
> > > >
> > > >
> > > > On 26.09.2012 16:22, Sigurd Spieckermann wrote:
> > > > > OK, we've been on the same page then, up to the point where you
> > > > > say one split stores multiple columns/rows, not just one. In that
> > > > > case, if there are N splits on each datanode, there will still be
> > > > > N (big, but probably sparse) matrices transferred over the network,
> > > > > because only the emits of a single split can be combined. Also, the
> > > > > size of a matrix is bounded by the HDFS block size and the number
> > > > > of nonzero elements per column/row, right? So the bigger the
> > > > > matrix, the smaller the gain from the combiner, because fewer
> > > > > stripes fit into one split. Is this not a problem in practice? And
> > > > > regarding my particular application: is there any way to combine
> > > > > the outputs of multiple map tasks, not just the results of a single
> > > > > map task?
> > > > >
> > > > > 2012/9/26 Sebastian Schelter <s...@apache.org>
> > > > >
> > > > >> Hi Sigurd,
> > > > >>
> > > > >> I think that's the misconception then: "each stripe (column/row)
> > > > >> is stored in a single file".
> > > > >>
> > > > >> Each split contains (IntWritable, VectorWritable) tuples; for the
> > > > >> first matrix these represent the columns, for the second they
> > > > >> represent the rows.
> > > > >>
> > > > >> In order to compute the outer products, these two inputs are
> > > > >> joined via a map-side join conducted by Hadoop's composite input
> > > > >> format. This is very effective because you can exploit data
> > > > >> locality: if you have two matching input splits on the same
> > > > >> machine, there is no network traffic involved in joining them.
> > > > >>
> > > > >> Note that this approach only works if both inputs are partitioned
> > > > >> and sorted in the same way.
> > > > >>
> > > > >> --sebastian
> > > > >>
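
For reference, a rough sketch of how such a map-side join is wired up with
Hadoop's composite input format (the old org.apache.hadoop.mapred API, where
that format lives). The class name and paths are hypothetical, and this is
not the actual Mahout driver code.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MatrixJoinConfig {
      // Configures an inner join over two SequenceFiles of
      // (IntWritable, VectorWritable) pairs. This only works if both inputs
      // are partitioned identically and sorted by key.
      public static JobConf configureJoin(Path columnsOfA, Path rowsOfB) {
        JobConf conf = new JobConf(MatrixJoinConfig.class);
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class, columnsOfA, rowsOfB));
        // The mapper then receives (IntWritable key, TupleWritable value);
        // value.get(0) is the column of the first matrix, value.get(1) the
        // matching row of the second, ready for the local outer product.
        return conf;
      }
    }
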
> > > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>
