Combiners can be called zero or more times. That can happen on the map side or on the reduce side.
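Because the framework makes no guarantee about how often (or where) the combine step runs, a combiner must be associative and commutative, and its output type must match its input type. A minimal, Hadoop-free sketch of why that contract matters (class and method names are illustrative, not from Mahout):

```java
import java.util.*;
import java.util.stream.*;

public class CombinerContract {
    // A combiner folds the partial values for one key into fewer values.
    // Here: summing counts. Summation is associative and commutative, so
    // applying the combiner zero, one, or many times cannot change the
    // final reduced result.
    static List<Integer> combine(List<Integer> values) {
        return List.of(values.stream().mapToInt(Integer::intValue).sum());
    }

    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> mapOutputs = List.of(1, 1, 1, 1); // four map emits for one key

        int noCombine   = reduce(mapOutputs);                   // combiner ran 0 times
        int oneCombine  = reduce(combine(mapOutputs));          // ran once (map side)
        int twoCombines = reduce(combine(combine(mapOutputs))); // map side + reduce side

        System.out.println(noCombine + " " + oneCombine + " " + twoCombines); // 4 4 4
    }
}
```

A combiner that is not associative/commutative (e.g. one that computes a mean of its inputs directly) would give different answers depending on how many times it happened to run.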
On Thu, Sep 27, 2012 at 4:56 AM, Sigurd Spieckermann <sigurd.spieckerm...@gmail.com> wrote:

> @Jake: Could you please elaborate on how exactly the combiner can be
> called before the reducer gets the data? Do you mean the combiner is
> called on the datanode that instantiates the reducer tasks? I thought the
> combiner is only called after the map task has finished, still on that
> datanode.
>
> 2012/9/26 Jake Mannix <jake.man...@gmail.com>
>
> > It should also be noted that the Combiner does not run only for the
> > mappers: it can be applied one (or more) times after mapping, and then
> > one or more times before the reducer gets the results. It's not quite
> > so simple as saying that combiners run only (and always) on the output
> > of each map task.
> >
> > Also, your technique of "map-side caching" is in fact fairly common,
> > but don't bother trying to reuse JVM-level data; that hack will be ugly
> > and not worth it. Simply store local variables in your Mapper instance
> > and emit the collected/aggregated data once in the close() method of
> > the Mapper.
> >
> > We do this in e.g. our LDA implementation (see for example
> > o.a.m.clustering.lda.cvb.CachingCVB0Mapper).
> >
> > On Wed, Sep 26, 2012 at 7:50 AM, Sigurd Spieckermann
> > <sigurd.spieckerm...@gmail.com> wrote:
> >
> > > I see. The description of the SVD implementation made me wonder
> > > whether I had a wrong understanding of Hadoop or of Mahout, and
> > > whether I could tune my code. When I was thinking about how to
> > > combine across multiple map tasks (given that it was/is not
> > > configurable in Hadoop), I considered reusing the JVM and delaying
> > > the emits until a few (or all) map tasks had completed, but in the
> > > implementation I had difficulty ensuring that I would at least emit
> > > at the end of the very last map task on the datanode.
> > > It would have ended up as some hack storing data in a singleton
> > > class or a static variable and emitting at some later point. But I
> > > see your point about the trade-off I would be making: possibly more
> > > performance, but possibly not, e.g. if a datanode fails after some
> > > map tasks have finished but others haven't.
> > >
> > > I guess my questions have been answered. Thank you very much, guys,
> > > for the clarifying discussion!
> > >
> > > 2012/9/26 Sebastian Schelter <s...@apache.org>
> > >
> > > > I think this comes down to scalability vs. performance.
> > > >
> > > > If you were to combine the outputs of several map tasks (which run
> > > > in different VMs), you would have to make these tasks dependent on
> > > > each other, which could cause lots of problems: straggling tasks
> > > > would slow down all the other tasks on the machine, you might have
> > > > to restart all tasks if one fails, etc.
> > > >
> > > > --sebastian
> > > >
> > > > On 26.09.2012 16:22, Sigurd Spieckermann wrote:
> > > > > OK, we've been on the same page then, up to the point where you
> > > > > say one split stores multiple columns/rows, not just one. In
> > > > > that case, if there are N splits on each datanode, there will
> > > > > still be N (big, but probably sparse) matrices transferred over
> > > > > the network, because only the emits of one split can be
> > > > > combined. Also, the size of a matrix is bounded above by the
> > > > > HDFS block size and the number of nonzero elements per
> > > > > column/row, right? So the bigger the matrix, the smaller the
> > > > > gain from the combiner, because fewer stripes fit into one
> > > > > split. Is this no problem in practice? And regarding my
> > > > > particular application: is there any way to combine the outputs
> > > > > of multiple map tasks, not just the results of a single map
> > > > > task?
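The pattern Jake recommends above (aggregate in the Mapper instance, emit once when the task ends) is often called in-mapper combining. A minimal, Hadoop-free sketch of the idea; the class is illustrative, and in real Hadoop code map() and close() would be the Mapper lifecycle methods (close() in the old API, cleanup() in the new one), with close() calling output.collect() instead of filling a map:

```java
import java.util.*;

public class InMapperCombining {
    // Simulates in-mapper combining: instead of emitting (word, 1) for
    // every token, keep a per-task map of partial sums and emit each
    // (word, partialSum) exactly once when the task finishes.
    static class WordCountMapper {
        private final Map<String, Integer> counts = new HashMap<>();
        // TreeMap stands in for the task's output collector (sorted for
        // deterministic printing).
        private final Map<String, Integer> emitted = new TreeMap<>();

        void map(String line) {
            for (String token : line.split("\\s+")) {
                counts.merge(token, 1, Integer::sum); // local aggregation
            }
        }

        // Flush the accumulated state once, at end of task.
        void close() {
            emitted.putAll(counts);
            counts.clear();
        }

        Map<String, Integer> output() { return emitted; }
    }

    public static void main(String[] args) {
        WordCountMapper mapper = new WordCountMapper();
        mapper.map("a b a");
        mapper.map("b c");
        mapper.close();
        System.out.println(mapper.output()); // {a=2, b=2, c=1}
    }
}
```

This only combines within one map task, which is exactly the limitation Sigurd runs into: combining across tasks would couple independent JVMs, the trade-off Sebastian explains below.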
> > > > >
> > > > > 2012/9/26 Sebastian Schelter <s...@apache.org>
> > > > >
> > > > > > Hi Sigurd,
> > > > > >
> > > > > > I think that's the misconception then: "each stripe
> > > > > > (column/row) is stored in a single file".
> > > > > >
> > > > > > Each split contains (IntWritable, VectorWritable) tuples; for
> > > > > > the first matrix these represent the columns, for the second
> > > > > > they represent the rows.
> > > > > >
> > > > > > To compute the outer products, the two inputs are joined via a
> > > > > > map-side join conducted by Hadoop's composite input format.
> > > > > > This is very effective because you can exploit data locality:
> > > > > > if you have two matching input splits on the same machine, no
> > > > > > network traffic is involved in joining them.
> > > > > >
> > > > > > Note that this approach only works if both inputs are
> > > > > > partitioned and sorted in the same way.
> > > > > >
> > > > > > --sebastian
> >
> > --
> > -jake
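The join semantics Sebastian describes can be sketched without Hadoop: both inputs are sequences of (index, vector) tuples, partitioned and sorted on the same integer key, so matching splits can be zipped together locally and the outer products accumulated. This is only a simulation of the data flow; in the actual job, Hadoop's composite input format performs the join over (IntWritable, VectorWritable) sequence files, and the method names here are illustrative:

```java
import java.util.*;

public class MapSideJoinSketch {
    // Merge-joins two inputs keyed on the same sorted integer index.
    // For each joined pair, forms the outer product column_i * row_i^T;
    // summing these over all i yields the matrix product A * B, where the
    // first input holds the columns of A and the second the rows of B.
    static double[][] joinAndAccumulate(SortedMap<Integer, double[]> columns,
                                        SortedMap<Integer, double[]> rows,
                                        int m, int n) {
        double[][] sum = new double[m][n];
        for (Map.Entry<Integer, double[]> e : columns.entrySet()) {
            double[] row = rows.get(e.getKey()); // join on the shared index
            if (row == null) continue;
            double[] col = e.getValue();
            for (int i = 0; i < m; i++)
                for (int j = 0; j < n; j++)
                    sum[i][j] += col[i] * row[j]; // accumulate outer product
        }
        return sum;
    }

    public static void main(String[] args) {
        // A is 2x2 stored as columns, B is 2x2 stored as rows.
        SortedMap<Integer, double[]> cols = new TreeMap<>();
        cols.put(0, new double[]{1, 3}); // first column of A
        cols.put(1, new double[]{2, 4}); // second column of A
        SortedMap<Integer, double[]> rows = new TreeMap<>();
        rows.put(0, new double[]{5, 6}); // first row of B
        rows.put(1, new double[]{7, 8}); // second row of B

        double[][] c = joinAndAccumulate(cols, rows, 2, 2);
        System.out.println(Arrays.deepToString(c)); // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

The "partitioned and sorted in the same way" requirement is what makes the zip valid: each pair of matching splits covers the same index range, so the join needs no shuffle at all.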