Great discussion. My take-aways are that the current implementation is more or less the same as a matrix-based implementation. It's pretty specialized and probably runs as fast as or faster than a simple, clean matrix implementation. But we need a matrix-based implementation since that is distributable, and the current implementation can't really be distributed, so it won't really work past a scale of maybe 100M ratings. A matrix-based implementation is likely to be relatively efficient, even given the MapReduce overhead, and can incorporate familiar concepts like log-likelihood similarity, etc.
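To make the "matrix-based" point concrete, here is a rough sketch of the idea (not Mahout code; the class name, matrix values, and preference vector are made up for illustration): item-based recommendation boils down to multiplying an item-item similarity matrix (co-occurrence counts, log-likelihood scores, whatever) by a user's preference vector. Each row's dot product is independent of the others, which is exactly what makes the computation easy to split across MapReduce tasks.

// Illustrative sketch only: recommendation as a matrix-vector product.
public class MatrixRecommendationSketch {
    public static void main(String[] args) {
        // Hypothetical 3-item similarity matrix (symmetric, zero diagonal).
        double[][] similarity = {
            {0.0, 0.8, 0.1},
            {0.8, 0.0, 0.4},
            {0.1, 0.4, 0.0}
        };
        // Hypothetical user preference vector: the user rated items 0 and 2.
        double[] preferences = {5.0, 0.0, 3.0};

        // Score for item i is the dot product of row i with the preferences.
        // Each row can be computed independently, hence distributable.
        double[] scores = new double[similarity.length];
        for (int i = 0; i < similarity.length; i++) {
            for (int j = 0; j < preferences.length; j++) {
                scores[i] += similarity[i][j] * preferences[j];
            }
        }
        for (int i = 0; i < scores.length; i++) {
            System.out.println("item " + i + " score " + scores[i]);
        }
    }
}

In a MapReduce version the similarity matrix rows and the preference vectors would live in HDFS rather than in memory, but the per-row arithmetic is the same.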
I'd like to put this on my to-do list, for after two things happen:

1. Hadoop gets sorted out. Right now I can't really make progress on Hadoop 0.20.0, period.
2. Our matrix implementation is finalized -- I understand we're probably switching to some other library?

On Thu, Sep 10, 2009 at 8:39 AM, Gökhan Çapan <[email protected]> wrote:
> So, these algorithms are nearly same, in terms of pattern of computation,
> and this speed up is not enough, right?
>
> --
> Gökhan Çapan
>
