I suggest that, for the purposes of the project, we build implementations of Recommender that can consume output from Hadoop on HDFS, like a SequenceFile. It shouldn't be hard at all. This sort of hybrid approach is already what happens with slope-one: I wrote some jobs to build its diffs, and you can then load that output into SlopeOneRecommender, which works online from there.
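For concreteness, here's a toy sketch of the shape of that hybrid -- in Python rather than Mahout's Java, with made-up names, and standing in for the real thing: an "offline job" precomputes slope-one diffs (the role the Hadoop job and its SequenceFile output would play), and an "online" predictor just loads that precomputed structure and answers queries from it.

```python
# Toy sketch of the offline/online hybrid (not Mahout code; names illustrative).
from collections import defaultdict

def build_diffs(ratings):
    """Offline step: ratings is {user: {item: rating}}.
    Returns ({(i, j): avg rating diff}, {(i, j): count of co-raters})."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        items = list(prefs)
        for i in items:
            for j in items:
                if i != j:
                    sums[(i, j)] += prefs[i] - prefs[j]
                    counts[(i, j)] += 1
    diffs = {k: sums[k] / counts[k] for k in sums}
    return diffs, counts

def predict(diffs, counts, user_prefs, target):
    """Online step: weighted slope-one estimate for `target`,
    computed purely from the precomputed diffs and counts."""
    num = den = 0.0
    for item, rating in user_prefs.items():
        key = (target, item)
        if key in diffs:
            num += (rating + diffs[key]) * counts[key]
            den += counts[key]
    return num / den if den else None
```

The point is only the division of labor: everything expensive lives in build_diffs(), which could run anywhere and serialize its result, while predict() is cheap enough to serve online.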
At least then the "hybrid" offline/online recommenders aren't yet another, third species of recommender in the framework. Perhaps there isn't even a need for fully offline recommenders -- just jobs that produce supporting intermediate output for online recommenders? That would be tidier still.

If I may digress: I wonder how important these implementations are for the project at all, which sounds like a bit of heresy -- surely Mahout needs to support recommendation over huge amounts of data? I think the answer is yes, but consider: LinkedIn, Netflix, Apple, and most organizations with huge data sets to recommend from have already developed sophisticated, customized solutions. Organizations with fewer than about 100M data points to process don't need distributed architectures; they can use Mahout as-is with its online, non-distributed recommenders pretty well. Ten lines of code, one big server, and a day of tinkering, and they have a full-on simple recommender engine, online or offline. And I'd argue that describes about 90% of the project's users who want recommendations.

So who are the organizations that have enough data (say, 1B+ data points) to need the kind of rocket science LinkedIn needs, but can't or haven't already developed that capability in-house? I guess that's why I've been reluctant to engineer and complicate the framework to fit in offline distributed recommendation -- this can become as complex as we like -- since I wonder about the 'market' for it. But it seems inevitable that this must exist, even if just as a nice, clean, simple reference implementation of the idea. So perhaps I won't go overboard designing something complex here at the moment.

On Sun, Dec 6, 2009 at 12:43 AM, Jake Mannix <[email protected]> wrote:
> But having a nice api for *outputting* the precomputed matrices which
> are pretty big into a format where online "queries"/recommendation
> requests can be computed I think is really key here. We should think
> much more about what makes the most sense here.
