Thanks Sean. This is the information I was looking for and I'll start digging. Yes, I do understand at this point the distributed vs. non-distributed pros and cons. The distributed ones allow you to scale well, given you have enough well-connected nodes, and don't have a memory constraint; however, there's a performance price to pay. The non-distributed ones are faster but bounded by the memory available on the machine.

Out of curiosity - sorry if this has been answered before - would it be possible to combine the two approaches, so you could break the data set into batches that fit in memory, use a non-distributed algorithm to produce results for each batch, and then use Hadoop to merge the results in a sensible way? This would improve performance while scaling (this is different from the pseudo-distributed approach, where you simply distribute the work on the same model). I haven't given it much thought, but I think it might work in some limited cases; a rough sketch follows.
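For what it's worth, here is a minimal, hypothetical Java sketch of one case where the batch-then-merge idea works cleanly: item co-occurrence counting. Partial counts computed per in-memory batch are additive, so the merge step (what a reducer would do) just sums them. The class and method names are made up for illustration; this is not Mahout code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compute item-item co-occurrence counts for one
// in-memory batch of users, then merge the partial counts by addition.
public final class BatchCooccurrenceSketch {

  // One batch: item lists for a subset of users small enough to fit in memory.
  // Assumes each user's item list is in a consistent (e.g. sorted) order
  // so that a given pair is always keyed the same way.
  static Map<String,Integer> countBatch(List<List<String>> userItemLists) {
    Map<String,Integer> counts = new HashMap<String,Integer>();
    for (List<String> items : userItemLists) {
      for (int i = 0; i < items.size(); i++) {
        for (int j = i + 1; j < items.size(); j++) {
          String pair = items.get(i) + '\t' + items.get(j);
          Integer old = counts.get(pair);
          counts.put(pair, old == null ? 1 : old + 1);
        }
      }
    }
    return counts;
  }

  // The merge step -- what a Hadoop reducer would do: sum partial counts.
  static void mergeInto(Map<String,Integer> total, Map<String,Integer> partial) {
    for (Map.Entry<String,Integer> e : partial.entrySet()) {
      Integer old = total.get(e.getKey());
      total.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
    }
  }
}

Co-occurrence counts merge trivially because they are just sums; for algorithms whose per-batch results aren't additive, the merge step is where this idea would get harder.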
-qf

--- On Wed, 5/5/10, Sean Owen <[email protected]> wrote:

From: Sean Owen <[email protected]>
Subject: Re: Algorithm scalability
To: [email protected]
Received: Wednesday, May 5, 2010, 6:56 AM

On Wed, May 5, 2010 at 11:38 AM, First Qaxy <[email protected]> wrote:
> With regards to the use of the in-memory algorithms - I was under the
> impression that those would not work on this model. Is there a rule of thumb
> that connects the model characteristics to the resources needed to run an
> in-memory algorithm? In this case I assume that the 10 million significant
> occurrences come from a much larger item-to-item matrix after applying
> a min_support threshold or similar.

My crude guidance is that with up to 100M data points (preferences), you can probably get the entire data set into memory. Meaning, this is roughly what fits in a very large but not massive heap (like 4GB). It's not reasonable to fit a couple billion into memory unless you really want to play with a 64GB heap or something (which you could!).

But I'm suggesting algorithms that don't need the whole data model in memory, like slope-one. All it really needs in memory to perform well are the item-item average preference diffs. One diff per item-item pair may be just fine to fit in memory if the number of items is not too large -- and this is a situation where you can throw away diffs without really compromising the computation much. The diffs take a lot of work to compute, but you can distribute that part and recompute it periodically (there is already a MapReduce available for this). It still potentially needs access to the data to compute a recommendation, so the data would need to be accessible from a database or something, not in memory. If that's feasible, this works.

> Is this size of the item-to-item matrix determining the memory requirements
> for the algorithm? Also, is memory needed to process the full item-to-item
> matrix or only the final one with the threshold applied? If I had 1 billion
> items in the matrix, what would the algorithm's memory footprint be? 20GB?
> Again, if there's a best practice available to link the characteristics of a
> model with the algorithms' viability - that would be extremely useful.

We're still talking about non-distributed possibilities here? Because in the distributed version there is no memory issue: the computation is chopped up so much that no one piece of it needs all the data at once. (This has its own price, of course.)

In general, like my slope-one suggestion above, yes, there are several non-distributed possibilities of this form. You can always leave all your data in, say, a database. However, the algorithms are so data-intensive that it will be unreasonably slow to leave it at that. You're examining item-based algorithms, it seems, of which co-occurrence-based approaches are a particular type, and which are a close cousin to slope-one. In those, almost all of the data access comes from computing item-item values (co-occurrence, similarity, average pref diff, etc.). If you can precompute those values, prune them, and store them in memory, then performance gets reasonable again, even without all the data in memory. This sort of approach is feasible when the number of items is relatively low, or else the number of item-item pairs gets too large and again you run out of memory.
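To make that concrete, a small sketch against the non-distributed Taste API in Mahout, as I recall it: only precomputed, pruned item-item similarities are held in memory via GenericItemSimilarity, while the preference data lives in whatever DataModel you point at it. The file name, item IDs, user ID and similarity values below are placeholders, and the similarity pairs would in reality come from the output of a precomputation job.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public final class PrecomputedItemSimilarityExample {
  public static void main(String[] args) throws Exception {
    // FileDataModel keeps the sketch self-contained; for the database-backed
    // setup described above, a JDBC DataModel (e.g. MySQLJDBCDataModel)
    // would take its place.
    DataModel model = new FileDataModel(new File("prefs.csv"));

    // Only the precomputed, pruned item-item similarities sit in memory.
    // These placeholder pairs stand in for the output of a Hadoop job.
    List<GenericItemSimilarity.ItemItemSimilarity> sims =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 102L, 0.8));
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 103L, 0.3));

    GenericItemSimilarity similarity = new GenericItemSimilarity(sims);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    for (RecommendedItem item : recommender.recommend(1L, 5)) {
      System.out.println(item);
    }
  }
}

GenericItemSimilarity only knows about the pairs you hand it, so the pruning happens in whatever produces that list; treat this purely as an illustration of the "item-item values in memory, preferences elsewhere" split.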
> Currently I'm storing the full item-to-item matrix to support future
> incremental update of the model. Could this somehow be done in Mahout or is a
> full run required every time? (Here we're talking distributed again?)

If you're talking about a co-occurrence matrix, yeah, I imagine it's not hard to incrementally update it. There's no M/R for that, but it's easy to write. It operates on the co-occurrence matrix produced by UserToCooccurrenceReducer, which is just a SequenceFile full of itemIDs mapped to Vectors of co-occurrences.
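An untested sketch of what such an incremental update could look like: read the existing SequenceFile, bump the co-occurrence counts touched by one new user's items, and write a new file. The key/value classes (IntWritable item index, VectorWritable) and the paths are assumptions on my part -- check them against what the job actually writes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

import java.util.Arrays;
import java.util.Collection;

// Untested sketch: apply one new user's items to an existing co-occurrence
// SequenceFile by adding 1 to each affected item-item cell, writing the
// result to a new file. Items not already present in the matrix would need
// new rows appended; that part is omitted here.
public final class IncrementalCooccurrenceUpdate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path("cooccurrence/part-00000");            // placeholder paths
    Path out = new Path("cooccurrence-updated/part-00000");

    // Item indices the new user interacted with (placeholder values).
    Collection<Integer> newUserItems = Arrays.asList(3, 17, 42);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, VectorWritable.class);
    try {
      IntWritable itemIndex = new IntWritable();
      VectorWritable row = new VectorWritable();
      while (reader.next(itemIndex, row)) {
        Vector cooccurrences = row.get();
        if (newUserItems.contains(itemIndex.get())) {
          // This item was seen by the new user: bump its count against
          // every other item the user also saw.
          for (int other : newUserItems) {
            if (other != itemIndex.get()) {
              cooccurrences.set(other, cooccurrences.get(other) + 1.0);
            }
          }
        }
        writer.append(itemIndex, new VectorWritable(cooccurrences));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

In practice you'd batch many new users at once and write this as the M/R Sean mentions, but the per-row update itself is just vector addition.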
