On Jun 6, 2008, at 5:56 PM, Randall Leeds wrote:
How do we get a random subset of a query?
When writing a view for CouchDB each document is run through a map
function to generate the view. The map function is supposed to emit
a key/value combination (though these can each be complex types) for
each document. If we want a random subset of documents, we can make
the key (or some element of a complex key) a random integer. Then we
can just take the first n results.
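To make that concrete, here is a sketch of what such a map function
could look like in a Python view server, along with a simulation of
taking the first n rows in key order (the document shape and field
names here are made up for illustration):

```python
import random

def random_key_map(doc):
    """Map function that emits a random integer key, so a range query
    over the view returns a pseudo-random subset of documents.
    (Note: the key is fixed at index time, so the 'random' ordering
    only changes when the view is rebuilt.)"""
    yield random.randint(0, 2**31 - 1), doc.get('_id')

# Simulate the view: run the map over some fake documents, sort by
# key as CouchDB would, and take the first 10 rows.
docs = [{'_id': 'doc%d' % i} for i in range(100)]
rows = sorted(row for doc in docs for row in random_key_map(doc))
subset = [value for _key, value in rows[:10]]
```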
This works so long as it's a dynamic (temporary) view that's POSTed
in, but in that case I think CouchDB will literally visit every
document (if only trivially) and perform a sort before returning a
result -- even distributing this onto a cluster, it feels like an
awfully heavyweight solution to a (seemingly) simple query...
How do we use CouchDB to distribute our filtering onto a cluster?
Unless I'm mistaken (or only a subset of JavaScript is available to
a map function, in which case maybe we can use a Python view
server), we should be able to calculate scores for each document
through a view. The steps I see as needed to make this happen are as
follows:
The filter needs to be accessible from the map function.
Either we:
- build our own View Server in Python and include our filter
modules to call directly for calculating scores.
- implement a RESTful pattern for calling filter modules via HTTP/
JSON (a happy side effect is the possibility of off-site filters)
- maybe do both
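The first option -- calling filter modules directly from a map
function in a Python view server -- might look something like this
(the filter here is a toy stand-in, not the real melk.filter API):

```python
def keyword_filter(doc, keyword='couchdb'):
    """Toy filter: score 1.0 if the keyword appears in the article
    text, else 0.0. A real filter module would be imported and
    called here instead."""
    return 1.0 if keyword in doc.get('text', '').lower() else 0.0

def score_map(doc):
    """Map function emitting (filter_name, score) keyed by document
    id, so a per-article score document can be assembled from the
    view output."""
    yield doc['_id'], ('keyword:couchdb', keyword_filter(doc))

rows = list(score_map({'_id': 'a1', 'text': 'CouchDB views'}))
```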
Sounds good, I think this could work quite well in practice and is a
good place to start. Like you mentioned later there is an existing
python view server implementation in the couchdb-python package. I
would start with something trivial to work out the kinks and then try
driving the existing python filters in the melk.filter package with
some kind of adapter. Maybe a simple question, but is there an
existing mechanism for requesting that a view be regenerated (e.g.
for things that need to be re-run periodically as opposed to being
driven by document updates)?
IMO, doing a RESTful pattern is probably not hard if it makes sense at
some point. The interface to a filter is fairly trivial, so
implementing a general proxy filter that happens to chat over HTTP to
somewhere else should not be a very big deal.
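A proxy filter along those lines could be as simple as POSTing the
document as JSON and reading back a score (the endpoint URL and
payload shape below are assumptions, not an existing interface):

```python
import json
from urllib.request import Request, urlopen

def build_request(url, doc):
    """Build a JSON POST request carrying the document to score."""
    body = json.dumps({'doc': doc}).encode('utf-8')
    return Request(url, data=body,
                   headers={'Content-Type': 'application/json'})

def remote_score(url, doc):
    """Proxy filter: defer scoring to a remote filter service and
    return the score it reports (assumes a {'score': ...} reply)."""
    with urlopen(build_request(url, doc)) as resp:
        return json.loads(resp.read())['score']

# Build (but don't send) a request against a hypothetical endpoint.
req = build_request('http://filters.example/score', {'_id': 'a1'})
```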
Is there anything we can do to improve our measurements of
"goodness" and our results?
There has been discussion about the recommendation/rating algorithm
for Melkjug:
http://tinyurl.com/5s8tdh
http://tinyurl.com/5gawhk
If we wanted to get really nutty (read: awesome), it seems feasible
to implement a closest-n articles view (by Euclidean distance or dot
product) in a way that distributes. A Python view server could have
views which leverage Numpy to do the linear algebra for us.
We could generate a view for each user which is updated whenever the
user's preferences change. For each score document (a document which
stores the scores for a given article against all filters), this
view would calculate the distance we require against the user's
preference vector. Since we cannot pass arguments to CouchDB views,
we can simply regenerate the view when a user changes their
filtering preferences.
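The distance computation Numpy would handle for us might look like
this (a sketch with made-up vectors; in practice the score rows
would come from the per-article score documents in CouchDB):

```python
import numpy as np

# User preference vector: one weight per filter (illustrative).
prefs = np.array([0.9, 0.1, 0.5])

# One row per article: that article's score against each filter.
scores = np.array([[0.8, 0.2, 0.4],
                   [0.1, 0.9, 0.9],
                   [0.9, 0.0, 0.5]])

# Euclidean distance from each article's score vector to the
# user's preferences, then the indices of the closest 2 articles.
dists = np.linalg.norm(scores - prefs, axis=1)
closest = np.argsort(dists)[:2]
```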
I'm somewhat skeptical about how much we can do at interactive speeds
without a pretty hefty cluster, but I think the idea of generating a
view function for a filter configuration is totally legit. One way of
providing interactive speeds might be to maintain two databases, one
with a small recent (and partly random?) subset of articles that is
used to respond to dynamic views as the user is literally tuning a
filter in the UI, and a second more extensive database which is used
when a user has saved a particular configuration and the system has
some time to mull over the larger set of articles.
- Luke