On Jun 6, 2008, at 5:56 PM, Randall Leeds wrote:
How do we get a random subset of a query?
When writing a view for CouchDB each document is run through a map
function to generate the view. The map function is supposed to emit
a key/value combination (though these can each be complex types) for
each document. If we want a random subset of documents, we can make
the key (or some element of a complex key) a random integer. Then we
can just take the first n results.
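To make that concrete, here is a sketch of what such a map function
could look like in a Python view server, along with a simulation of
taking the first n rows in key order (the document shape and field
names here are made up for illustration):

```python
import random

def random_key_map(doc):
    """Map function that emits a random integer key, so a range query
    over the view returns a pseudo-random subset of documents.
    (Note: the key is fixed at index time, so the 'random' ordering
    only changes when the view is rebuilt.)"""
    yield random.randint(0, 2**31 - 1), doc.get('_id')

# Simulate the view: run the map over some fake documents, sort by
# key as CouchDB would, and take the first 10 rows.
docs = [{'_id': 'doc%d' % i} for i in range(100)]
rows = sorted(row for doc in docs for row in random_key_map(doc))
subset = [value for _key, value in rows[:10]]
```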
This works so long as it's a dynamic (temporary) view that's POSTed
in, but in that case I think CouchDB will literally visit every
document (if only trivially) and perform a sort before returning a
result -- even distributing this onto a cluster, it feels like an
awfully heavyweight solution to a (seemingly) simple query...
How do we use CouchDB to distribute our filtering onto a cluster?
Unless I'm mistaken (or only a subset of JavaScript is available to
a map function, in which case maybe we can use a Python view
server), we should be able to calculate scores for each document
through a view. The steps I see as needed to make this happen are as
follows:
The filter needs to be accessible from the map function.
Either we:
- build our own View Server in Python and include our filter
modules to call directly for calculating scores.
- implement a RESTful pattern for calling filter modules via HTTP/
JSON (a happy side effect is the possibility of off-site filters)
- maybe do both
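The first option -- calling filter modules directly from a map
function in a Python view server -- might look something like this
(the filter here is a toy stand-in, not the real melk.filter API):

```python
def keyword_filter(doc, keyword='couchdb'):
    """Toy filter: score 1.0 if the keyword appears in the article
    text, else 0.0. A real filter module would be imported and
    called here instead."""
    return 1.0 if keyword in doc.get('text', '').lower() else 0.0

def score_map(doc):
    """Map function emitting (filter_name, score) keyed by document
    id, so a per-article score document can be assembled from the
    view output."""
    yield doc['_id'], ('keyword:couchdb', keyword_filter(doc))

rows = list(score_map({'_id': 'a1', 'text': 'CouchDB views'}))
```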
Sounds good, I think this could work quite well in practice and is a
good place to start. Like you mentioned later there is an existing
python view server implementation in the couchdb-python package. I
would start with something trivial to work out the kinks and then try
driving the existing python filters in the melk.filter package with
some kind of adapter. Maybe a simple question, but is there an
existing mechanism for requesting that a view be regenerated (e.g.
for things that need to be re-run periodically as opposed to being
driven by document updates)?
IMO, doing a RESTful pattern is probably not hard if it makes sense at
some point. The interface to a filter is fairly trivial, so
implementing a general proxy filter that happens to chat over HTTP to
somewhere else should not be a very big deal.
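A proxy filter along those lines could be as simple as POSTing the
document as JSON and reading back a score (the endpoint URL and
payload shape below are assumptions, not an existing interface):

```python
import json
from urllib.request import Request, urlopen

def build_request(url, doc):
    """Build a JSON POST request carrying the document to score."""
    body = json.dumps({'doc': doc}).encode('utf-8')
    return Request(url, data=body,
                   headers={'Content-Type': 'application/json'})

def remote_score(url, doc):
    """Proxy filter: defer scoring to a remote filter service and
    return the score it reports (assumes a {'score': ...} reply)."""
    with urlopen(build_request(url, doc)) as resp:
        return json.loads(resp.read())['score']

# Build (but don't send) a request against a hypothetical endpoint.
req = build_request('http://filters.example/score', {'_id': 'a1'})
```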
Is there anything we can do to improve our measurements of
"goodness" and our results?
There has been discussion about the recommendation/rating algorithm
for Melkjug:
http://tinyurl.com/5s8tdh
http://tinyurl.com/5gawhk
If we wanted to get really nutty (read: awesome), it seems feasible
to implement a closest-n articles view (by Euclidean distance or dot
product) in a way that distributes. A Python view server could have
views which leverage Numpy to do the linear algebra for us.
We could generate a view for each user which is updated whenever the
user's preferences change. For each score document (a document which
stores the scores for a given article against all filters), this
view would calculate the distance we require against the user's
preference vector. Since we cannot pass arguments to CouchDB views,
we can simply regenerate the view when a user changes their
filtering preferences.
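The distance computation Numpy would handle for us might look like
this (a sketch with made-up vectors; in practice the score rows
would come from the per-article score documents in CouchDB):

```python
import numpy as np

# User preference vector: one weight per filter (illustrative).
prefs = np.array([0.9, 0.1, 0.5])

# One row per article: that article's score against each filter.
scores = np.array([[0.8, 0.2, 0.4],
                   [0.1, 0.9, 0.9],
                   [0.9, 0.0, 0.5]])

# Euclidean distance from each article's score vector to the
# user's preferences, then the indices of the closest 2 articles.
dists = np.linalg.norm(scores - prefs, axis=1)
closest = np.argsort(dists)[:2]
```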
I'm somewhat skeptical about how much we can do at interactive speeds
without a pretty hefty cluster, but I think the idea of generating a
view function for a filter configuration is totally legit. One way of
providing interactive speeds might be to maintain two databases, one
with a small recent (and partly random?) subset of articles that is
used to respond to dynamic views as the user is literally tuning a
filter in the UI, and a second more extensive database which is used
when a user has saved a particular configuration and the system has
some time to mull over the larger set of articles.
- Luke