Your requirements as stated would be well met by something like
Lucene.
However, another possible way to go about this is to permute the key
sets into key arrays and emit each. The number of keys would normally
be (N!)/2, where N is the number of fields you are indexing. But we
can use view collation to do range lookups, which allows us to ignore
the differing array key suffixes. That reduces the number of key
arrays emitted per document to 2^N. If each document has 10 fields,
then the number of combinations would be 2^10, or 1024 keys emitted
per doc.
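For illustration, a map function along these lines could emit one key
array per subset of a doc's fields, with the chosen subset first (in a
fixed canonical order) and the remaining fields trailing as the suffix
that range lookups skip over. Field names here are made up:

    function (doc) {
      // Illustrative field names; a real index would use the doc's own.
      var fields = ["brand", "color", "size"];
      var n = fields.length;
      // One key array per subset (2^N of them): the subset's
      // [field, value] pairs first, in canonical order, then the
      // remaining pairs as an ignorable suffix.
      for (var mask = 0; mask < (1 << n); mask++) {
        var head = [], tail = [];
        for (var i = 0; i < n; i++) {
          var pair = [fields[i], doc[fields[i]]];
          if (mask & (1 << i)) { head.push(pair); }
          else { tail.push(pair); }
        }
        emit(head.concat(tail), 1);
      }
    }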
To build that index for 50,000 documents would take an on-disk view
index of roughly 50,000,000 rows (50,000 docs x 1024 keys each).
Building it will take a very long time and a lot of disk space. But
once built, it should then be possible to do the categorized,
drill-down searches that show you relevant sub-categories and their
counts to further narrow the search, and to do so pretty efficiently.
This is very much the kind of thing Endeca does for online retailers.
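For example, continuing the made-up field names above (the design doc
and view names here are hypothetical), narrowing to color=red and
asking for counts of the next field would be a range query with
grouping and a _count-style reduce:

    GET /db/_design/facets/_view/by_fields
        ?startkey=[["color","red"]]
        &endkey=[["color","red"],{}]
        &group_level=2

Each level-2 group is then a sub-category under color=red together
with the number of matching documents, which is the drill-down count
display.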
I don't know if CouchDB views are up to it yet, but it might be worth
experimenting.
-Damien
On Sep 26, 2008, at 2:11 PM, Paul Davis wrote:
> code. This feels to me like something a database should take care of,
> and might become problematic when you have your webpage code talk
> with couchdb directly.
Be very wary of yourself when you think such things. Generally it's a
sign (at least for me) that you're not realizing how deeply your SQL
brainwashing runs. And generally, when I get to this point, if I just
step back I realize there's probably a decent way to do it with couch.
Though, in this particular case, you have come to an area where couch
is somewhat lacking: its ability to handle dynamic queries like this.
And, just a thought: whenever multiget and include_docs land, you'd be
able to do this pretty easily as:

    get the set of documents for the first tag
    for each remaining tag:
        multiget that set against the tag, keeping only docs that match

It'd be an iterative weeding-out of docs. At least, I think that'd
work....
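A minimal sketch of that weeding, assuming a view that emits
([tag, doc._id], null) for each tag on a doc, and a hypothetical
queryView(options) helper that returns the matching rows (its "keys"
option standing in for the multiget):

    async function idsWithAllTags(tags) {
      // Seed with every doc carrying the first tag.
      var rows = await queryView({ startkey: [tags[0]],
                                   endkey: [tags[0], {}] });
      var ids = rows.map(function (r) { return r.key[1]; });
      for (var i = 1; i < tags.length; i++) {
        var tag = tags[i];
        // Multiget: probe only the surviving ids under the next tag,
        // keeping the ones that come back.
        rows = await queryView({
          keys: ids.map(function (id) { return [tag, id]; })
        });
        ids = rows.map(function (r) { return r.key[1]; });
      }
      return ids; // fetch the docs themselves with include_docs
    }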
Paul