So the good news is that the patch runs ;) The bad news is that it's
slower, going from 1600-1800ms to ~2500ms to calculate a single user's top-k
recommendations. For kicks, I ran a couple other experiments, progressively
removing code to isolate the problem area. Results are detailed here:
I've created a small benchmark to play around with the way the dot
product is computed. It tries to mimic Josh's use case: multiplying 2.5M
dense item vectors of size 20 by a dense user vector of size 20.
https://gist.github.com/sscdotopen/5108988
I compared using the .dot() method from the
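A minimal sketch of a comparable micro-benchmark over plain double[] arrays (illustrative only; the actual gist uses Mahout vectors and the full 2.5M items, and names here are made up):

```java
// Sketch of the benchmark idea: many dense item vectors (size 20)
// dotted against one dense user vector, using a hand-rolled loop over
// primitive arrays rather than a general Vector.dot() implementation.
public class DotBenchmark {

  // Direct dot product over primitive arrays: no iterators, no boxing.
  static double dot(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    int numItems = 100_000; // scaled down from 2.5M for a quick run
    int numFeatures = 20;
    java.util.Random random = new java.util.Random(42);

    double[] user = new double[numFeatures];
    for (int f = 0; f < numFeatures; f++) {
      user[f] = random.nextDouble();
    }
    double[][] items = new double[numItems][numFeatures];
    for (double[] item : items) {
      for (int f = 0; f < numFeatures; f++) {
        item[f] = random.nextDouble();
      }
    }

    long start = System.nanoTime();
    double checksum = 0.0; // prevents the JIT from dropping the loop
    for (double[] item : items) {
      checksum += dot(item, user);
    }
    long micros = (System.nanoTime() - start) / 1000;
    System.out.println("checksum=" + checksum + " time=" + micros + "us");
  }
}
```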
Hi Josh,
I made another attempt today. It directly computes the dot products,
introduces a mutable version of RecommendedItem and uses Lucene's
PriorityQueue to keep the top k.
I hope this gives you some improvements.
Here's the patch (must be applied against trunk):
I'm running a job right now that uses your static `dot` method from your
previous post, on top of v0.7 (nothing from trunk). This has cut the time
down by about 1/3 but it's still around 500ms per user. I'll give your
latest patch a go hopefully tomorrow and report back.
We're working on another
Well, it would definitely not be the first time I counted incorrectly. Anytime I
do arithmetic the result should be considered suspect. I do think my numbers
are correct, but then again, I always do.
But the OP did say 20 dimensions which gives me back 5x.
Inclusion of learning time is a
So the 80 hour estimate is _only_ for the U*M' top-n calculation and not
the factorization. Factorization is on the order of 2 hours. For the
interested, here's the pertinent code from the ALS `RecommenderJob`:
Hi Josh,
The factorization should be quite a bit faster with the current trunk,
as we reworked the QR decomposition used for solving the least squares
problems of ALS.
I think we can also remove a lot of object instantiations in
ParallelALSFactorizationJob.
/s
On 06.03.2013 11:25, Josh Devins
Yeah that's right, he said 20 features, oops. And yes he says he's talking
about the recs only too, so that's not right either. That seems way too
long relative to factorization. And the factorization seems quite fast; how
many machines, and how many iterations?
I thought the shape of the
The factorization at 2 hours is kind of a non-issue (certainly fast
enough). It was run with (if I recall correctly) 30 reducers across a 35
node cluster, with 10 iterations.
I was a bit shocked at how long the recommendation step took and will throw
some timing debug in to see where the problem
Btw, all important jobs in ALS are map-only, so it's the number of map
slots that counts.
On 06.03.2013 12:11, Sean Owen wrote:
OK, that's reasonable on 35 machines. (You can probably turn it up to 70
reducers, as most machines can handle 2 reducers at once).
I think the recommendation step
OK and he mentioned that 10 mappers were running, when it ought to be able
to use several per machine. The # of mappers is a function of the input
size really, so probably needs to turn down the max file split size to
induce more mappers?
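The mapper count follows from the split math: roughly input size divided by the max split size. A back-of-envelope sketch (numbers are illustrative, not from the thread; the knob itself is the old-API Hadoop property `mapred.max.split.size`):

```java
// Lowering the max split size makes Hadoop create more, smaller input
// splits, and hence schedule more map tasks over the same input.
public class SplitEstimate {

  // Number of splits is approximately ceil(inputBytes / maxSplitBytes).
  static long numSplits(long inputBytes, long maxSplitBytes) {
    return (inputBytes + maxSplitBytes - 1) / maxSplitBytes;
  }

  public static void main(String[] args) {
    long input = 4L * 1024 * 1024 * 1024; // say, 4 GB of input
    // 128 MB splits -> 32 mappers; capping splits at 32 MB -> 128 mappers
    System.out.println(numSplits(input, 128L * 1024 * 1024)); // prints 32
    System.out.println(numSplits(input, 32L * 1024 * 1024));  // prints 128
  }
}
```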
On Wed, Mar 6, 2013 at 11:16 AM, Sebastian Schelter
I would recommend against a mutable object on maintenance grounds.
Better is to keep the threshold that a new score must meet and only
construct the object when needed. That cuts the allocation down to negligible
levels.
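A hypothetical sketch of that "check the threshold first, allocate only on need" idea, using a bounded min-heap whose head is the k-th best score so far (java.util.PriorityQueue here for self-containment; the patch under discussion uses Lucene's PriorityQueue, and the names below are illustrative, not Mahout's):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopK {

  static final class Scored {
    final long itemId;
    final double score;
    Scored(long itemId, double score) {
      this.itemId = itemId;
      this.score = score;
    }
  }

  // Returns the k highest-scoring items, best first. A candidate is
  // compared against the heap's head (the current threshold) as a plain
  // primitive comparison; an object is only instantiated for candidates
  // that actually make the top k.
  static List<Scored> topK(long[] itemIds, double[] scores, int k) {
    PriorityQueue<Scored> heap =
        new PriorityQueue<>(k, Comparator.comparingDouble((Scored s) -> s.score));
    for (int i = 0; i < itemIds.length; i++) {
      if (heap.size() < k) {
        heap.add(new Scored(itemIds[i], scores[i]));
      } else if (scores[i] > heap.peek().score) { // no allocation on this path
        heap.poll();
        heap.add(new Scored(itemIds[i], scores[i]));
      }
    }
    List<Scored> result = new ArrayList<>(heap);
    result.sort((a, b) -> Double.compare(b.score, a.score));
    return result;
  }

  public static void main(String[] args) {
    List<Scored> top =
        topK(new long[] {1, 2, 3, 4}, new double[] {0.5, 2.0, 1.0, 3.0}, 2);
    for (Scored s : top) {
      System.out.println(s.itemId + " -> " + s.score);
    }
    // prints:
    // 4 -> 3.0
    // 2 -> 2.0
  }
}
```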
On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen sro...@gmail.com wrote:
OK, that's
That sounds way too long. How is the U matrix stored? What type?
On Wed, Mar 6, 2013 at 6:44 AM, Josh Devins h...@joshdevins.com wrote:
First bit of feedback. The `M.forEachPair` loop is about 1600-1800 millis
per user (recall the size is ~2.6M users x ~2.8M items). There doesn't
appear to
That too, even better. Isn't that already done? Could be in one place but
not another. IIRC there were also cases where it was a lot easier to pass
around an object internally and mutability solved the performance issue,
without much risk since it was only internal. You can (nay, must) always
copy
I already committed a fix in that direction. I modified our
FixedSizePriorityQueue to allow inspection of its head for direct
comparison. This obviates the need to instantiate a Comparable and offer
it to the queue.
/s
On 06.03.2013 17:01, Ted Dunning wrote:
I would recommend against a mutable
Got sidetracked today but I'll run Sebastian's version in trunk tomorrow
and report back.
On 6 March 2013 17:07, Sebastian Schelter s...@apache.org wrote:
I already committed a fix in that direction. I modified our
FixedSizePriorityQueue to allow inspection of its head for direct
comparison.
Hi all,
I have a conceptually simple problem. A user-item matrix, A, whose
dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with
20 features reduces this in the usual way to A = UM'. Trying to generate
top-n (where n=100) recommendations for all users in U is quite a long
Without any tricks, yes you have to do this much work to really know which
are the largest values in UM' for every row. There's not an obvious twist
that speeds it up.
(Do you really want to compute all user recommendations? how many of the
2.6M are likely to be active soon, or, ever?)
First,
Thanks Sean, at least I know I'm mostly on the right track ;)
So in our case (a large, social, consumer website), this is already a small
subset of all users (and items for that matter) and is really only the
active users. In fact, in future iterations, the number of users will
likely grow by
Ah OK, so this is quite a big problem. Still, it is quite useful to be able
to make recommendations in real-time, or near-real-time. It saves the
relatively large cost of precomputing, and lets you respond
immediately to new data. If the site has a lot of occasional or new users,
that can
Hmm... each user's recommendations cost about 2.8M x 20 ≈ 60M flops. You
should get about a Gflop per core in Java, so this should take about
60 ms. You can make this faster with more cores or by using ATLAS.
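Checking that arithmetic (assumptions as in the post: ~2.8M items, 20 features, ~1 Gflop/s per core, counting one multiply-add as one op):

```java
// Back-of-envelope estimate of per-user scoring time: ops per user
// divided by ops per second, converted to milliseconds.
public class FlopEstimate {

  static double millisPerUser(double items, double features, double opsPerSecond) {
    return items * features / opsPerSecond * 1000.0;
  }

  public static void main(String[] args) {
    // 2.8M x 20 = 56M ops; at ~1 Gflop/s per core that is ~56 ms,
    // consistent with the rough "about 60 ms" figure above.
    System.out.printf("~%.0f ms per user%n", millisPerUser(2.8e6, 20, 1e9));
  }
}
```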
Are you expecting 3 million unique people every 80 hours? If no, then it
is