I store documents submitted by users, with optional tags (lists of strings):
CREATE TABLE doc (
user_id uuid,
date text, // part of partition key, to distribute data better
doc_id uuid,
tags list<text>,
contents text,
PRIMARY KEY((user_id, date), doc_id)
);
What is the best way to implement tag filtering? A user can select a
list of tags and get documents with the tags. I thought about:
1) Full denormalization - include tags in the primary key and insert a
doc for each subset of specified tags. This will however lead to large
disk space usage, because there are 2**n subsets (for 10 tags and a 1MB
doc 1000MB would be written).
2) Secondary index on 'tags' collection, and using queries like:
SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS=? AND
tags CONTAINS=? ...
Since I will supply partition key value, I assume there will be no
problems with contacting multiple nodes. But how well will it work for
hundreds of thousands of results? I think intersection of tag matches
needs to be performed in memory so it will not scale well.
3) Partial denormalization - do inserts for each single tag and then
manually compute intersection. However in the worst case it can lead to
scanning almost the whole table.
4) Full denormalization but without contents. I would get correct
doc_ids fast, then I would need to use '... WHERE doc_id IN ?' with
potentially a very large list of doc_ids.
What's Cassandra's way to implement this?