On Jun 6, 2008, at 5:56 PM, Randall Leeds wrote:

How do we get a random subset of a query?
When writing a view for CouchDB, each document is run through a map function to generate the view. The map function is supposed to emit a key/value combination (though these can each be complex types) for each document. If we want a random subset of documents, we can make the key (or some element of a complex key) a random integer and then just take the first n results.
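A minimal sketch of such a map function, written as it might look in a Python view server (where map functions yield key/value pairs; the JavaScript version would use emit() instead):

```python
import random

def map_random(doc):
    # Emit the document under a random integer key; the view index is then
    # ordered randomly, so querying with limit=n returns an (approximately)
    # random subset. Note the key is fixed at index time, so repeated
    # queries return the same "random" n documents until reindexing.
    yield random.randint(0, 2**31 - 1), doc['_id']
```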

This works so long as it's a dynamic view that's POSTed in, but in that case I think CouchDB will literally visit every document (although trivially) and perform a sort before giving back a result. Even distributing this onto a cluster, it feels like an awfully heavyweight solution to a (seemingly) simple query...



How do we use CouchDB to distribute our filtering onto a cluster?
Unless I'm mistaken (it may be that only a subset of JavaScript is available to a map function, in which case maybe we can use a Python view server), we should be able to calculate scores for each document through a view. The steps needed to make this happen, as I see them, are as follows:

The filter needs to be accessible from the map function.
Either we:
- build our own View Server in Python and include our filter modules to call directly when calculating scores
- implement a RESTful pattern for calling filter modules via HTTP/JSON (a happy side effect is the possibility of off-site filters)
- maybe do both
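The first option could be sketched as an adapter that wraps a filter module as a view-server map function. This assumes a filter exposes a score(doc) method; the KeywordFilter below is an illustrative stand-in, not the real melk.filter code:

```python
# Stand-in filter: the score(doc) -> float interface is an assumption
# about what the melk.filter modules look like.
class KeywordFilter:
    def __init__(self, keywords):
        self.keywords = set(k.lower() for k in keywords)

    def score(self, doc):
        # Fraction of the filter's keywords that appear in the article text.
        words = set(doc.get('text', '').lower().split())
        return len(words & self.keywords) / float(len(self.keywords))

def make_map_fun(filter_):
    # Adapter: wraps a filter as a view-server map function that emits
    # (article id, score) for each document.
    def map_fun(doc):
        yield doc['_id'], filter_.score(doc)
    return map_fun
```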

Sounds good, I think this could work quite well in practice and is a good place to start. Like you mentioned later there is an existing python view server implementation in the couchdb-python package. I would start with something trivial to work out the kinks and then try driving the existing python filters in the melk.filter package with some kind of adapter. Maybe a simple question, but is there an existing mechanism for requesting that a view be regenerated (eg for things that need to be re-run periodically as opposed to being driven by document updates)?

IMO, doing a RESTful pattern is probably not hard if it makes sense at some point. The interface to a filter is fairly trivial, so implementing a general proxy filter that happens to chat over HTTP to somewhere else should not be a very big deal.
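Such a proxy filter might look like the sketch below. The endpoint URL and the {"score": ...} reply shape are assumptions; the HTTP transport is injected as a callable so the class can be exercised without a live server:

```python
import json

class ProxyFilter:
    # Hypothetical proxy filter: POSTs the document as JSON to a remote
    # scoring endpoint and expects a reply like {"score": 0.75}. The
    # transport is a callable (url, body_bytes) -> reply_bytes, so a real
    # deployment could pass an HTTP client while tests pass a fake.
    def __init__(self, url, transport):
        self.url = url
        self.transport = transport

    def score(self, doc):
        body = json.dumps(doc).encode('utf-8')
        reply = self.transport(self.url, body)
        return json.loads(reply.decode('utf-8'))['score']
```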



Is there anything we can do to improve our measurements of "goodness" and our results? There has been discussion about the recommendation/rating algorithm for Melkjug:
http://tinyurl.com/5s8tdh
http://tinyurl.com/5gawhk

If we wanted to get really nutty (read: awesome), it seems feasible to implement a closest-n articles (by Euclidean distance or dot product) view in a way which distributes. A Python view server could have views which leverage NumPy for doing the linear algebra for us. We could generate a view for each user which is updated whenever the user's preferences change. This view would, for each score document (a document which stores the scores for a given article against all filters), calculate the distance we require against the user's preference vector. Since we cannot pass arguments to CouchDB views, we can simply update the view when a user changes their filtering preferences.
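The distance calculation itself is a one-liner in NumPy. A sketch, assuming each article's scores form a row in a matrix whose columns correspond to the same filters as the user's preference vector:

```python
import numpy as np

def nearest_articles(score_rows, preference, n):
    # score_rows: per-article rows of per-filter scores (articles x filters);
    # preference: the user's preference vector over the same filters.
    # Returns the indices of the n articles closest by Euclidean distance.
    scores = np.asarray(score_rows, dtype=float)
    pref = np.asarray(preference, dtype=float)
    dists = np.linalg.norm(scores - pref, axis=1)
    return np.argsort(dists)[:n]
```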


I'm somewhat skeptical about how much we can do at interactive speeds without a pretty hefty cluster, but I think the idea of generating a view function for a filter configuration is totally legit. One way of providing interactive speeds might be to maintain two databases, one with a small recent (and partly random?) subset of articles that is used to respond to dynamic views as the user is literally tuning a filter in the UI, and a second more extensive database which is used when a user has saved a particular configuration and the system has some time to mull over the larger set of articles.
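The small "tuning" database could be populated along these lines. This is only illustrative: the recent/random split and its parameters are assumptions, not a worked-out design:

```python
import random

def pick_subset(doc_ids, recent_count, random_count, rng=random):
    # Choose contents for the small interactive database: the newest
    # recent_count ids plus a random sample of the older remainder.
    # Assumes doc_ids is ordered oldest to newest; the split between
    # recent and random is illustrative.
    recent = doc_ids[-recent_count:] if recent_count else []
    rest = doc_ids[:len(doc_ids) - recent_count]
    sampled = rng.sample(rest, min(random_count, len(rest)))
    return recent + sampled
```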


- Luke
