On Mon, May 31, 2010 at 11:42 PM, Jake Mannix <[email protected]> wrote:

> +1 from me on this: anywhere we use JDBC / SQL, we could be using
> a noSQL data store more scalably, because I don't *think* we rely on any
> aggregate fancy SQL joining or grouping or anything.
It's not too complicated, except for slope-one, where it's pretty useful to leverage SQL. So I don't know that SQL should go away, if only because for every business with Cassandra set up, there are 1,000 with a relational database, and for every user with huge data, there are 100 with medium-sized data. I don't want to lose sight of supporting the common case along the way.

One larger point I'm making, which I don't believe anyone's disputing, is that there are more than just huge item-based recommenders in play here. What's below makes total sense for one algorithm, though.

> One thing I wonder, Sean, is if you used say, Voldemort to store rows
> of the ItemSimilarity matrix and the user-item preferences matrix,
> computing recommendations for an item-based recommender on the
> fly by doing a get() based on all the items a user had rated and then
> a multiGet() based on all the keys returned, then recommending in
> the usual fashion... it's two remote calls to a key-value store during
> the course of producing recommendations, but something like Voldemort
> or Cassandra would have their responses available in basically no time
> (it's all in memory usually), and from experience, the latency you'd
> get would be pretty reasonable.

Yes, this is more or less how the item-based recommender works now, when paired with a JDBC-backed ItemSimilarity. (It could use a "bulk query" method to get all those similarities in one go, though; that's not too hard to weave in.)

I'd be a little concerned about whether this fits comfortably in memory. The similarity matrix is potentially dense -- big rows -- and you're loading one row per item the user has rated. That could run to tens of megabytes for a single query, which is why the distributed version dares not do this. But it's worth a try in principle.

> Seems like it would be a nice integration to try out. The advantage of
> that kind of recommender (on the fly) is that you could change the way
> you compute recommendations (i.e. the model) on a per-query basis
> if need be, and if new items were added to the user's list of things
> they'd rated, that user's row in the key-value store could be updated
> on the fly too (the ItemSimilarity matrix would drift out of date, sure,
> but it could be batch updated periodically).

+1 for someone to try it. I'm currently more interested in the Hadoop end of the spectrum myself.
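For concreteness, here's a minimal sketch of that two-call pattern: one get() for the user's preference row, one multiGet() for the similarity rows of the rated items, then the usual weighted-sum estimate. Plain HashMaps stand in for the remote key-value store, and all class and method names here are illustrative -- this is not the Mahout or Voldemort API, just the shape of the idea.

```java
import java.util.*;

// Hypothetical sketch: HashMaps play the role of Voldemort/Cassandra tables.
public class KvItemBasedSketch {
    // user -> (item -> rating), and item -> (other item -> similarity)
    static Map<String, Map<String, Double>> userPrefs = new HashMap<>();
    static Map<String, Map<String, Double>> simRows = new HashMap<>();

    // Remote call 1: fetch the user's preference row.
    static Map<String, Double> get(String userId) {
        return userPrefs.getOrDefault(userId, Collections.emptyMap());
    }

    // Remote call 2: fetch similarity rows for all rated items in one round trip.
    static Map<String, Map<String, Double>> multiGet(Set<String> itemIds) {
        Map<String, Map<String, Double>> rows = new HashMap<>();
        for (String item : itemIds) {
            rows.put(item, simRows.getOrDefault(item, Collections.emptyMap()));
        }
        return rows;
    }

    // Weighted-sum item-based estimate for each unrated candidate:
    // sum(sim * rating) / sum(|sim|) over the user's rated items.
    static Map<String, Double> recommend(String userId) {
        Map<String, Double> prefs = get(userId);
        Map<String, Map<String, Double>> rows = multiGet(prefs.keySet());
        Map<String, Double> num = new HashMap<>();
        Map<String, Double> den = new HashMap<>();
        for (Map.Entry<String, Double> pref : prefs.entrySet()) {
            for (Map.Entry<String, Double> sim : rows.get(pref.getKey()).entrySet()) {
                String candidate = sim.getKey();
                if (prefs.containsKey(candidate)) continue; // already rated
                num.merge(candidate, sim.getValue() * pref.getValue(), Double::sum);
                den.merge(candidate, Math.abs(sim.getValue()), Double::sum);
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (String c : num.keySet()) {
            scores.put(c, num.get(c) / den.get(c));
        }
        return scores;
    }

    public static void main(String[] args) {
        userPrefs.put("u1", Map.of("A", 5.0, "B", 3.0));
        simRows.put("A", Map.of("B", 0.9, "C", 0.8));
        simRows.put("B", Map.of("A", 0.9, "C", 0.5));
        // C scores (0.8*5 + 0.5*3) / (0.8 + 0.5)
        System.out.println(recommend("u1"));
    }
}
```

The memory concern above shows up here as the size of the multiGet() result: with dense similarity rows, rows holds one full row per rated item, so a batch fetch doesn't reduce the footprint, only the round trips.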
