Tag filtering data model

Artur Siekielski Fri, 11 Sep 2015 00:54:01 -0700

I store documents submitted by users, with optional tags (lists of strings):


CREATE TABLE doc (
  user_id uuid,
  date text, // part of partition key, to distribute data better
  doc_id uuid,
  tags list<text>,
  contents text,
  PRIMARY KEY((user_id, date), doc_id)
);

What is the best way to implement tag filtering? A user can select a list of tags and get the documents that have all of those tags. I thought about:

1) Full denormalization - include tags in the primary key and insert a doc for each subset of the specified tags. This will however lead to large disk space usage, because there are 2**n subsets (for 10 tags and a 1 MB doc, about 1000 MB would be written).


2) Secondary index on 'tags' collection, and using queries like:

SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS ? AND tags CONTAINS ? ...

Since I will supply the partition key value, I assume there will be no problems with contacting multiple nodes. But how well will it work for hundreds of thousands of results? I think the intersection of tag matches needs to be performed in memory, so it will not scale well.

3) Partial denormalization - do an insert for each single tag and then manually compute the intersection. However, in the worst case this can lead to scanning almost the whole table.

4) Full denormalization but without contents. I would get the correct doc_ids fast, but then I would need to use '... WHERE doc_id IN ?' with a potentially very large list of doc_ids.
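
To make the costs of options 1 and 3 concrete, here is a rough Python sketch of the client-side logic. It does not talk to Cassandra. The helper names (tag_subsets, intersect_doc_ids) and the sample doc_ids are made up for illustration, and the per-tag results are assumed to have already been fetched with one query per tag.

```python
from itertools import combinations


def tag_subsets(tags):
    """Yield every non-empty subset of tags as a sorted tuple (option 1)."""
    ordered = sorted(set(tags))
    for size in range(1, len(ordered) + 1):
        yield from combinations(ordered, size)


def intersect_doc_ids(ids_per_tag):
    """Intersect per-tag doc_id collections on the client (option 3)."""
    sets = sorted((set(ids) for ids in ids_per_tag), key=len)
    if not sets:
        return set()
    # Start from the smallest set so intermediate results stay small.
    result = sets[0]
    for other in sets[1:]:
        result = result & other
        if not result:
            break  # no document can match all tags any more
    return result


# Option 1: rows written for one document that has 10 tags.
rows_written = sum(1 for _ in tag_subsets(f"tag{i}" for i in range(10)))
print(rows_written)  # 1023 non-empty subsets, i.e. 2**10 - 1

# Option 3: hypothetical doc_ids returned by three single-tag queries.
matches = intersect_doc_ids([{"d1", "d2", "d3"}, {"d2", "d3"}, {"d3", "d4"}])
print(matches)  # {'d3'}
```

The 1023 rows are on top of the untagged row, which gives the 2**n total from option 1. In option 3 the intersection itself is cheap, but every per-tag doc_id set has to be read in full first, which is where the near-full-table scan comes from when a tag is very common.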



What's Cassandra's way to implement this?
