Yeah, that's right -- he said 20 features, oops. And yes, he says he's talking
about the recs only too, so that's not right either. That seems way too
long relative to factorization. And the factorization seems quite fast; how
many machines, and how many iterations?

I thought the shape of the computation was to cache B' (yes, whose columns are
the rows of B) and multiply against the rows of A. Then again, that's probably
wrong given the latest timing info.
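
For concreteness, here is roughly the shape I had in mind -- a sketch only, not
the actual Mahout code, with A as the user factors and B as the item factors
(the method and argument names are mine):

    // Sketch: B holds the item factors (nItems x k); its rows are the
    // columns of B'.  Cache B once, then walk it row by row for every user
    // so that all access is row-major.
    static void scoreAllUsers(double[][] A, double[][] B) {
      for (double[] a : A) {                    // one row of A, length k
        double[] scores = new double[B.length];
        for (int i = 0; i < B.length; i++) {
          double s = 0.0;
          for (int f = 0; f < a.length; f++) {
            s += a[f] * B[i][f];                // row-major on both operands
          }
          scores[i] = s;
        }
        // ... top-n selection over scores goes here ...
      }
    }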


On Wed, Mar 6, 2013 at 10:25 AM, Josh Devins <h...@joshdevins.com> wrote:

> So the 80-hour estimate is _only_ for the U*M' top-n calculation and not
> the factorization. Factorization is on the order of 2 hours. For the
> interested, here's the pertinent code from the ALS `RecommenderJob`:
>
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148
>
> I'm sure this can be optimised, but by an order of magnitude? Something to
> try out; I'll report back if I find anything concrete.
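
(Not the code at that link, just an illustrative sketch of the per-user step it
performs conceptually: score every item against the user's factor vector and
keep the n best in a bounded min-heap. Names and shapes are my assumptions.)

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // u is one user's factor vector (length k), M the item factors (nItems x k).
    static int[] topN(double[] u, double[][] M, int n) {
      PriorityQueue<double[]> heap =            // min-heap keyed on score
          new PriorityQueue<>(n, Comparator.comparingDouble((double[] e) -> e[0]));
      for (int item = 0; item < M.length; item++) {
        double score = 0.0;
        for (int f = 0; f < u.length; f++) {
          score += u[f] * M[item][f];
        }
        if (heap.size() < n) {
          heap.offer(new double[] {score, item});
        } else if (score > heap.peek()[0]) {    // beats the current worst
          heap.poll();
          heap.offer(new double[] {score, item});
        }
      }
      int[] best = new int[heap.size()];        // item indexes, highest score first
      for (int i = best.length - 1; i >= 0; i--) {
        best[i] = (int) heap.poll()[1];
      }
      return best;
    }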
>
>
>
> On 6 March 2013 11:13, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Well, it would definitely not be the first time I counted incorrectly.
> >  Anytime I do arithmetic, the result should be considered suspect.  I do
> > think my numbers are correct, but then again, I always do.
> >
> > But the OP did say 20 dimensions, which gives me back 5x.
> >
> > Inclusion of learning time is a good suspect.  On the other side of the
> > ledger, if the multiply is doing any column-wise access, it is a likely
> > performance bug.  The computation is AB'.  Perhaps you refer to rows of B,
> > which are the columns of B'.
> >
> > Sent from my sleepy thumbs set to typing on my iPhone.
> >
> > On Mar 6, 2013, at 4:16 AM, Sean Owen <sro...@gmail.com> wrote:
> >
> > > If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops
> > > -- I think you're missing an "M", and the features by an order of
> > > magnitude. That's still 1 day on an 8-core machine by this rule of thumb.
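
(Spelling that rule of thumb out: 2.6M x 2.8M x 100 = 7.28e14 flops; at roughly
1 Gflop/s per core that is ~7.3e5 core-seconds, about 200 core-hours, so around
25 hours on an 8-core machine.)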
> > >
> > > The 80 hours is the model building time too (right?), not the time to
> > > multiply U*M'. This is dominated by iterations when building from scratch,
> > > and I expect took 75% of that 80 hours. So if the multiply was 20 hours --
> > > on 10 machines -- on Hadoop, then that's still slow but not out of the
> > > question for Hadoop, given it's usually a 3-6x slowdown over a parallel
> > > in-core implementation.
> > >
> > > I'm pretty sure what exists in Mahout here can be optimized further at the
> > > Hadoop level; I don't know that it's doing the multiply badly though. In
> > > fact I'm pretty sure it's caching cols in memory, which is a bit of
> > > 'cheating' to speed up by taking a lot of memory.
> > >
> > >
> > > On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > >
> > >> Hmm... each user's recommendations seem to be about 2.8M x 20 Flops = 60M
> > >> Flops.  You should get about a Gflop per core in Java, so this should take
> > >> about 60 ms.  You can make this faster with more cores or by using ATLAS.
> > >>
> > >> Are you expecting 3 million unique people every 80 hours?  If not, then it
> > >> is probably more efficient to compute the recommendations on the fly.
> > >>
> > >> How many recommendations per second are you expecting?  If you have 1
> > >> million uniques per day (just for grins) and we assume 20,000 s/day to
> > >> allow for peak loading, you have to do 50 queries per second peak.  This
> > >> seems to require 3 cores.  Use 16 to be safe.
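
(Checking that: 1M uniques over ~20,000 effective seconds is 50 queries/s; at
~60 ms of compute each, that is 50 x 0.06 = 3 core-seconds of work per
wall-clock second, hence about 3 cores, with 16 leaving plenty of headroom.)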
> > >>
> > >> Regarding the 80 hours, 3 million x 60 ms = 180,000 seconds = 50 hours.  I
> > >> think that your map-reduce is underperforming by about a factor of 10.
> > >> This is quite plausible with bad arrangement of the inner loops.  I think
> > >> that you would have highest performance computing the recommendations for
> > >> a few thousand items by a few thousand users at a time.  It might be just
> > >> about as fast to do all items against a few users at a time.  The reason
> > >> for this is that dense matrix multiply requires c (n x k + m x k) memory
> > >> ops, but n x k x m arithmetic ops.  If you can re-use data many times, you
> > >> can balance memory channel bandwidth against CPU speed.  Typically you
> > >> need 20 or more re-uses to really make this fly.
> > >>
> > >>
> >
>
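
On the blocking point at the end there, a minimal sketch of what "a few thousand
items by a few thousand users at a time" could look like (the method name and
block sizes are my own, not Mahout code); each block of item factors gets
re-used across every user in the block before it leaves cache:

    // A: user factors (nUsers x k), B: item factors (nItems x k); scores = A * B'.
    static void blockedScores(double[][] A, double[][] B, int userBlock, int itemBlock) {
      int k = A[0].length;
      for (int u0 = 0; u0 < A.length; u0 += userBlock) {
        int u1 = Math.min(u0 + userBlock, A.length);
        for (int i0 = 0; i0 < B.length; i0 += itemBlock) {
          int i1 = Math.min(i0 + itemBlock, B.length);
          // Each row of this B block is touched (u1 - u0) times per load,
          // trading memory traffic for arithmetic as described above.
          for (int u = u0; u < u1; u++) {
            for (int i = i0; i < i1; i++) {
              double s = 0.0;
              for (int f = 0; f < k; f++) {
                s += A[u][f] * B[i][f];
              }
              // ... feed s into the per-user top-n selection ...
            }
          }
        }
      }
    }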
