Your requirements as stated would be well met by something like Lucene.

However, another possible way to go about this is to permute the key sets into key arrays and emit each permutation. The number of keys would normally be (N!)/2, where N is the number of fields you are indexing. However, we can use view collation to do range lookups, which allows us to ignore the differing array key suffixes. That reduces the number of key arrays emitted per document to 2^N. If each document has 10 fields, then the number of keys emitted per doc would be 2^10, or 1024.
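Concretely, the map function could look something like this rough sketch. It assumes the fields to index live in a flat doc.fields object (my invention for illustration, not part of the suggestion above), and it skips the empty subset, so it emits 2^N - 1 rows per doc:

// Sketch: emit one key per non-empty subset of the document's fields.
// Field names are sorted first so every combination has exactly one
// canonical key form, which is what makes range lookups work.
function(doc) {
  if (!doc.fields) return;
  var names = [];
  for (var n in doc.fields) names.push(n);
  names.sort();

  for (var mask = 1; mask < (1 << names.length); mask++) {
    var key = [];
    for (var i = 0; i < names.length; i++) {
      if (mask & (1 << i)) key.push([names[i], doc.fields[names[i]]]);
    }
    emit(key, 1);
  }
}

Pair that with a summing reduce (function(keys, values) { return sum(values); }) and each distinct field/value combination gets a countable row.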

To build that index for 50,000 documents would take an on-disk view index of roughly 50,000,000 rows. Building it will take a very long time, and it will take a lot of disk space. But once built, it should then be possible to do categorized, drill-down searches that show you the relevant sub-categories and their counts to further narrow the search, and to do so pretty efficiently. This is very much the kind of thing Endeca does for online retailers.
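As an example of the drill-down (the design doc and view names here are made up), counting the docs that have both color=red and size=large becomes a single reduce lookup on the exact combination key, since each qualifying doc emits that subset exactly once:

// Count of docs with color=red AND size=large: one grouped reduce row.
var key = [["color", "red"], ["size", "large"]];
var url = "/db/_design/facets/_view/combos" +
  "?key=" + encodeURIComponent(JSON.stringify(key)) +
  "&group=true";

And to enumerate the sub-categories under color=red, you can range-scan from startkey=[["color","red"]] to endkey=[["color","red"],{}] (objects collate after arrays, so {} acts as a high-key sentinel) and group the rows; that is where the collation-based range lookups earn their keep.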

I don't know if CouchDB views are up to it yet, but it might be worth experimenting.

-Damien


On Sep 26, 2008, at 2:11 PM, Paul Davis wrote:

> code. This feels to me like something a database should take care of,
> and might become problematic when you have your webpage code talk with
> couchdb directly.

Be very wary of yourself when you think such things. Generally it's a
sign (at least for me) that you're not realizing how deeply your SQL
brainwashing runs. And generally, when I get to this point, if I just
step back I realize there's probably a decent way to do it with couch.

Though, in this particular case you have come to an area where couch is
somewhat lacking: its ability to handle dynamic queries like this.

And, just a thought: whenever multiget and include_docs land, you'd
be able to do this pretty easily as:

get the set of docs for the first tag
for each remaining tag:
   multiget that set and keep only the docs that also have the tag

It'd be an iterative weeding-out of docs.
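Something like this sketch, say; docsForTag and multiget here are
made-up stand-ins for the eventual API, and docs are assumed to carry a
tags array:

// Iteratively intersect: seed with the first tag's matches, then keep
// only the docs that also carry each remaining tag.
function docsWithAllTags(tags, docsForTag, multiget) {
  var ids = docsForTag(tags[0]); // seed set of doc ids
  for (var i = 1; i < tags.length && ids.length > 0; i++) {
    var docs = multiget(ids);    // hypothetical bulk fetch, include_docs
    ids = [];
    for (var j = 0; j < docs.length; j++) {
      if (docs[j].tags.indexOf(tags[i]) >= 0) ids.push(docs[j]._id);
    }
  }
  return ids;
}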

At least, I think that'd work...

Paul
