Thanks Sean. This is the information I was looking for and I'll start digging. Yes, I do understand at this point the distributed vs. non-distributed pros and cons. The distributed ones allow you to scale well, given you have enough well-connected nodes, and don't have a memory constraint; however, there's a performance price to pay. The non-distributed ones are faster but bounded by the memory available on the machine.

Out of curiosity - sorry if this has been answered before - would it be possible to combine the two approaches, so you could break the data set into batches that fit in memory, use a non-distributed algorithm to produce results for each batch, and then use Hadoop to merge the results in a sensible way? This would improve performance while scaling (this is different from the pseudo-distributed approach, where you simply distribute the work on the same model). I haven't given it much thought, but I think it might work in some limited cases; a rough sketch follows.
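For what it's worth, here is a minimal, hypothetical Java sketch of one case where the batch-then-merge idea works cleanly: item co-occurrence counting. Partial counts computed per in-memory batch are additive, so the merge step (what a reducer would do) just sums them. The class and method names are made up for illustration; this is not Mahout code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compute item-item co-occurrence counts for one
// in-memory batch of users, then merge the partial counts by addition.
public final class BatchCooccurrenceSketch {

  // One batch: item lists for a subset of users small enough to fit in memory.
  // Assumes each user's item list is in a consistent (e.g. sorted) order
  // so that a given pair is always keyed the same way.
  static Map<String,Integer> countBatch(List<List<String>> userItemLists) {
    Map<String,Integer> counts = new HashMap<String,Integer>();
    for (List<String> items : userItemLists) {
      for (int i = 0; i < items.size(); i++) {
        for (int j = i + 1; j < items.size(); j++) {
          String pair = items.get(i) + '\t' + items.get(j);
          Integer old = counts.get(pair);
          counts.put(pair, old == null ? 1 : old + 1);
        }
      }
    }
    return counts;
  }

  // The merge step -- what a Hadoop reducer would do: sum partial counts.
  static void mergeInto(Map<String,Integer> total, Map<String,Integer> partial) {
    for (Map.Entry<String,Integer> e : partial.entrySet()) {
      Integer old = total.get(e.getKey());
      total.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
    }
  }
}

Co-occurrence counts merge trivially because they are just sums; for algorithms whose per-batch results aren't additive, the merge step is where this idea would get harder.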
-qf

--- On Wed, 5/5/10, Sean Owen <[email protected]> wrote:

From: Sean Owen <[email protected]>
Subject: Re: Algorithm scalability
To: [email protected]
Received: Wednesday, May 5, 2010, 6:56 AM

On Wed, May 5, 2010 at 11:38 AM, First Qaxy <[email protected]> wrote:
> With regards to the use of the in-memory algorithms - I was under the
> impression that those would not work on this model. Is there a rule of thumb
> that connects the model characteristics to the resources needed to run an
> in-memory algorithm? In this case I assume that the 10 million significant
> occurrences come from a much larger item-to-item matrix after applying
> a min_support threshold or similar.

My crude guidance is that with up to 100M data points (preferences), you can probably get the entire data set into memory. Meaning, this is roughly what fits in a very large but not massive heap (like 4GB). It's not reasonable to fit a couple billion into memory unless you really want to play with a 64GB heap or something (which you could!).

But I'm suggesting algorithms that don't need the whole data model in memory, like slope-one. All it really needs in memory to perform well are the item-item average preference diffs. One diff per item-item pair may be just fine to fit in memory if the number of items is not too large -- and this is a situation where you can throw away diffs without really compromising the computation much. The diffs take a lot of work to compute, but you can distribute that part and recompute it periodically (there is already a MapReduce available for this). It still potentially needs access to the data to compute a recommendation, so the data would need to be accessible from a database or something, not in memory. If that's feasible, this works.

> Is this size of the item-to-item matrix determining the memory requirements
> for the algorithm? Also, is memory needed to process the full item-to-item
> matrix or only the final one with the threshold applied? If I had 1 billion
> items in the matrix, what would the algorithm's memory footprint be? 20GB?
> Again, if there's a best practice available to link the characteristics of a
> model with the algorithms' viability - that would be extremely useful.

We're still talking about non-distributed possibilities here? Because in the distributed version there is no memory issue: the computation is chopped up so much that no one piece of it needs all the data at once. (This has its own price, of course.)

In general, like my slope-one suggestion above, yes, there are several non-distributed possibilities of this form. You can always leave all your data in, say, a database. However, the algorithms are so data-intensive that it will be unreasonably slow to leave it at that. You're examining item-based algorithms, it seems, of which co-occurrence-based approaches are a particular type, and which are a close cousin to slope-one. In those, almost all of the data access comes from computing item-item values (co-occurrence, similarity, average pref diff, etc.). If you can precompute those values, prune them, and store them in memory, then performance gets reasonable again, even without all the data in memory. This sort of approach is feasible when the number of items is relatively low, or else the number of item-item pairs gets too large and again you run out of memory.
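To make that concrete, a small sketch against the non-distributed Taste API in Mahout, as I recall it: only precomputed, pruned item-item similarities are held in memory via GenericItemSimilarity, while the preference data lives in whatever DataModel you point at it. The file name, item IDs, user ID and similarity values below are placeholders, and the similarity pairs would in reality come from the output of a precomputation job.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public final class PrecomputedItemSimilarityExample {
  public static void main(String[] args) throws Exception {
    // FileDataModel keeps the sketch self-contained; for the database-backed
    // setup described above, a JDBC DataModel (e.g. MySQLJDBCDataModel)
    // would take its place.
    DataModel model = new FileDataModel(new File("prefs.csv"));

    // Only the precomputed, pruned item-item similarities sit in memory.
    // These placeholder pairs stand in for the output of a Hadoop job.
    List<GenericItemSimilarity.ItemItemSimilarity> sims =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 102L, 0.8));
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 103L, 0.3));

    GenericItemSimilarity similarity = new GenericItemSimilarity(sims);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    for (RecommendedItem item : recommender.recommend(1L, 5)) {
      System.out.println(item);
    }
  }
}

GenericItemSimilarity only knows about the pairs you hand it, so the pruning happens in whatever produces that list; treat this purely as an illustration of the "item-item values in memory, preferences elsewhere" split.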
> Currently I'm storing the full item-to-item matrix to support future
> incremental update of the model. Could this somehow be done in Mahout or is a
> full run required every time? (Here we're talking distributed again?)

If you're talking about a co-occurrence matrix, yeah, I imagine it's not hard to incrementally update it. There's no M/R for that, but it's easy to write. It operates on the co-occurrence matrix produced by UserToCooccurrenceReducer, which is just a SequenceFile full of itemIDs mapped to Vectors of co-occurrences.
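An untested sketch of what such an incremental update could look like: read the existing SequenceFile, bump the co-occurrence counts touched by one new user's items, and write a new file. The key/value classes (IntWritable item index, VectorWritable) and the paths are assumptions on my part -- check them against what the job actually writes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

import java.util.Arrays;
import java.util.Collection;

// Untested sketch: apply one new user's items to an existing co-occurrence
// SequenceFile by adding 1 to each affected item-item cell, writing the
// result to a new file. Items not already present in the matrix would need
// new rows appended; that part is omitted here.
public final class IncrementalCooccurrenceUpdate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path("cooccurrence/part-00000");            // placeholder paths
    Path out = new Path("cooccurrence-updated/part-00000");

    // Item indices the new user interacted with (placeholder values).
    Collection<Integer> newUserItems = Arrays.asList(3, 17, 42);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, VectorWritable.class);
    try {
      IntWritable itemIndex = new IntWritable();
      VectorWritable row = new VectorWritable();
      while (reader.next(itemIndex, row)) {
        Vector cooccurrences = row.get();
        if (newUserItems.contains(itemIndex.get())) {
          // This item was seen by the new user: bump its count against
          // every other item the user also saw.
          for (int other : newUserItems) {
            if (other != itemIndex.get()) {
              cooccurrences.set(other, cooccurrences.get(other) + 1.0);
            }
          }
        }
        writer.append(itemIndex, new VectorWritable(cooccurrences));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

In practice you'd batch many new users at once and write this as the M/R Sean mentions, but the per-row update itself is just vector addition.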
