No, I think Brian is right that the growth is exponential, rather than factorial. In his example, I think for instance >> 10 x [a b] (a <= b) means the pairs [1 2] [1 3] [1 4] [1 5] [2 3] [2 4] [2 5] [3 4] [3 5] [4 5] A
On Sat, Apr 25, 2009 at 12:48 PM, Paul Davis <[email protected]> wrote: > On Sat, Apr 25, 2009 at 3:40 PM, Brian Candler <[email protected]> wrote: >> On Sat, Apr 25, 2009 at 10:27:33AM -0700, Adam Wolff wrote: >>> Hi list, >>> I have a little query problem that would be easy to solve in SQL, but >>> seems hard in CouchDB. I have data that looks like this: >>> doc1: { >>> keys : [1,2,3], >>> data : .... >>> } >>> doc2: { >>> keys : [1,3,5], >>> data : .... >>> } >>> ... >>> >>> I'd like a view of my documents that lets me filter on multiple keys, >>> so a query of keys=[1] yields doc1,doc2, as does a query of >>> keys=[1,3]. >> >> What do you mean by keys=[1,3]? Does it mean all documents which contains >> either key 1 or 3, or all documents which contains both keys 1 and 3? >> >> In the first case, you can POST to the view with {"keys":[1,3]} >> >> In the latter case, the simplest answer is to fetch the views for key=1 and >> key=3, and intersect them on the client. Actually you can be smarter than >> that: fetch whichever result set is smaller (keys=1 or key=3), and then >> filter the result set on the client to discard those entries which don't >> have the other key. >> >> (This is exactly what a SQL database would do, except it hides it from you >> by doing it all server-side) >> >>> In couch, the only way I can think of to truly index all the >>> combinations of the keys is to emit the n! combinations of the keys, >>> which seems like a bad idea. (Maybe it's not though?) >> >> I think it would be (2^n)-1 rather than n!, since you can sort them first. >> >> e.g. for 5 keys there would be 31 combinations: >> 1 x [] -- useless, ignore >> 5 x [a] >> 10 x [a b] (a <= b) >> 10 x [a b c] (a <= b <= c) >> 5 x [a b c d] >> 1 x [a b c d e] >> >> Actually you can halve it again, by realising that if you have emitted >> [1,2,3,4,5] then you don't need to emit any prefixes of this, e.g. [1,2,3,4] >> or [1,2,3] or [1,2] or [1], because a startkey-endkey query can be used to >> find them. >> > > I've contemplated something like this, but I'm pretty sure it breaks > when you start contemplating that a key may end up being [a, e] in > which case your querying needs to account for all possible values > between a and e. > > In other news, if someone wants to write an inverted index > implementation that could solve this class of queries I'm sure there'd > probably be a couple beers in it. > > HTH, > Paul Davis > >> Mind you, exponential growth still ain't good :-) It might be acceptable if >> you have a limited number of tags, and you really want an instantaneous >> answer to your multi-tag query across millions of documents. >> >> Other tradeoffs are possible. For example, it might make sense just to emit >> all single tags plus all combinations of two tags; that would be (n^2+n)/2 >> combinations. Then your query can instantly find all docs matching two of >> the tags requested, and if the user has asked for three or more, filter >> those at the client. But you still have to choose which 2 tags to query on, >> and it may well not be worth it in practice. >> >> Regards, >> >> Brian. >> >
