On Mon, May 31, 2010 at 11:42 PM, Jake Mannix <[email protected]> wrote:
> +1 from me on this: anywhere we use JDBC / SQL, we could be using
> a noSQL data store more scalably, because I don't *think* we rely on any
> aggregate fancy SQL joining or grouping or anything.

The SQL isn't too complicated, except for slope-one, where it's
pretty useful to leverage SQL.

So I don't know that SQL should go away, if only because, for every
business with Cassandra set up, there are 1000 with a relational
database. And for every user with huge data, there are 100 with
medium-sized data. I don't want to lose sight of supporting the
common case along the way.

One larger point I'm making, which I don't believe anyone's disputing,
is that there are more than just huge item-based recommenders in play
here. What's below makes total sense, for one algorithm, though.


> One thing I wonder, Sean, is if you used say, Voldemort to store rows
> of the ItemSimilarity matrix and the user-item preferences matrix,
> computing recommendations for an item-based recommender on the
> fly by doing a get() based on all the items a user had rated and then
> a multiGet() based on all the keys returned, then recommending in
> the usual fashion... it's two remote calls to a key-value store during
> the course of producing recommendations, but something like Voldemort
> or Cassandra would have their responses available in basically no time
> (it's all in memory usually), and from experience, the latency you'd
> get would be pretty reasonable.

Yes, this is more or less how the item-based recommender works now,
when paired up with a JDBC-backed ItemSimilarity. (It could use a
"bulk query" method to get all those similarities in one go though;
not too hard to weave in.)
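
For concreteness, here is a rough sketch of the two-lookup flow
described above. Everything in it is an assumption for illustration:
KeyValueStore, InMemoryStore and recommend() are hypothetical names,
not Voldemort, Cassandra or Mahout APIs, and the scoring is a plain
similarity-weighted average of the user's ratings, which is one common
item-based formulation rather than necessarily what any existing
implementation does. An in-memory map stands in for the remote store
so the example runs on its own.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OnTheFlyItemBased {

    /** Hypothetical minimal client; a real store's API will differ. */
    interface KeyValueStore<K, V> {
        V get(K key);
        Map<K, V> multiGet(Collection<K> keys);
    }

    /** In-memory stand-in so the sketch runs without a remote store. */
    static class InMemoryStore<K, V> implements KeyValueStore<K, V> {
        private final Map<K, V> data = new HashMap<>();

        void put(K key, V value) {
            data.put(key, value);
        }

        @Override
        public V get(K key) {
            return data.get(key);
        }

        @Override
        public Map<K, V> multiGet(Collection<K> keys) {
            Map<K, V> result = new HashMap<>();
            for (K key : keys) {
                V value = data.get(key);
                if (value != null) {
                    result.put(key, value);
                }
            }
            return result;
        }
    }

    /**
     * prefs: userID -> (itemID -> rating).
     * similarities: itemID -> (other itemID -> similarity), one row per item.
     */
    static List<Map.Entry<Long, Double>> recommend(
            long userID,
            int howMany,
            KeyValueStore<Long, Map<Long, Float>> prefs,
            KeyValueStore<Long, Map<Long, Float>> similarities) {
        // Remote call 1: everything this user has rated.
        Map<Long, Float> userPrefs = prefs.get(userID);
        if (userPrefs == null) {
            return new ArrayList<>();
        }
        // Remote call 2: one similarity row per rated item, fetched in bulk.
        Map<Long, Map<Long, Float>> rows = similarities.multiGet(userPrefs.keySet());

        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> weightTotal = new HashMap<>();
        for (Map.Entry<Long, Float> pref : userPrefs.entrySet()) {
            Map<Long, Float> row = rows.get(pref.getKey());
            if (row == null) {
                continue;
            }
            float rating = pref.getValue();
            for (Map.Entry<Long, Float> sim : row.entrySet()) {
                Long candidate = sim.getKey();
                if (userPrefs.containsKey(candidate)) {
                    continue; // skip items the user has already rated
                }
                float similarity = sim.getValue();
                weightedSum.merge(candidate, (double) similarity * rating, Double::sum);
                weightTotal.merge(candidate, (double) Math.abs(similarity), Double::sum);
            }
        }

        // Estimated preference = similarity-weighted average of the user's ratings.
        List<Map.Entry<Long, Double>> scored = new ArrayList<>();
        for (Map.Entry<Long, Double> e : weightedSum.entrySet()) {
            double total = weightTotal.get(e.getKey());
            if (total > 0.0) {
                scored.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue() / total));
            }
        }
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(scored.subList(0, Math.min(howMany, scored.size())));
    }
}
```

Changing the model per query would then amount to swapping the scoring
loop, since both lookups are independent of it.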

I'd be a little concerned about whether this fits comfortably in
memory. The similarity matrix is potentially dense -- big rows -- and
you're loading one row per item the user has rated. It could get into
tens of megabytes for one query. The distributed version dares not do
this. But, worth a try in principle.
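
To make "tens of megabytes" concrete, here is a back-of-the-envelope
calculation. The inputs are illustrative assumptions, not
measurements: a user with 200 rated items, 10,000 non-zero
similarities per row, and 12 bytes per entry (an 8-byte item ID plus a
4-byte float), ignoring any serialization or object overhead. The
class and method names are made up for the example.

```java
class SimilarityRowFootprint {

    /** Raw payload size of the similarity rows fetched for one query. */
    static long estimateBytes(int itemsRated, int nonZeroEntriesPerRow, int bytesPerEntry) {
        // Widen to long first so large inputs do not overflow int arithmetic.
        return (long) itemsRated * nonZeroEntriesPerRow * bytesPerEntry;
    }

    public static void main(String[] args) {
        // Assumed figures: 200 rated items, 10,000 entries per row, 12 bytes each.
        long bytes = estimateBytes(200, 10_000, 12);
        System.out.println(bytes + " bytes per query");
    }
}
```

With those assumed inputs the rows for one query come to 24,000,000
bytes, about 24 MB; a sparser similarity matrix or a user with fewer
ratings shrinks that proportionally.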


> Seems like it would be a nice integration to try out.  The advantage of
> that kind of recommender (on the fly) is that you could change the way
> you compute recommendations (i.e. the model) on a per-query basis
> if need be, and if new items were added to the user's list of things
> they'd rated, that user's row in the key-value store could be updated
> on the fly too (the ItemSimilarity matrix would drift out of date, sure,
> but it could be batch updated periodically).

+1 for someone to try it. I'm currently more interested in the Hadoop
end of the spectrum myself.
