Clint,

> CREATE TABLE events (
>     id text,
>     date text, // Could also use year+month here or year+week or something else
>     event_time timestamp,
>     event blob,
>     PRIMARY KEY ((id, date), event_time))
> WITH CLUSTERING ORDER BY (event_time DESC);
>
> The downside of this approach is that we can no longer do a simple
> continuous scan to get all of the events for a given user. Some users
> may log lots and lots of interactions every day, while others may
> interact with our application infrequently, so I'd like a quick way to
> get the most recent interaction for a given user.
>
> Has anyone used different approaches for this problem?
One idea is to provide additional manual partitioning, like:

    CREATE TABLE events (
        user_id text,
        partition int,
        event_time timeuuid,
        event_json text,
        PRIMARY KEY ((user_id, partition), event_time)
    ) WITH
        CLUSTERING ORDER BY (event_time DESC) AND
        compaction = {'class': 'DateTieredCompactionStrategy'};

Here "partition" is a random integer from 0 to (N*M), where N = nodes in
the cluster and M = an arbitrary multiplier. Read performance is going to
suffer a little because you need to query N*M times as many partition keys
for each read, but it should be constant enough that it comes down to
increasing the cluster's hardware and scaling out as need be. The
multi-key reads can be done with a SELECT…IN query, or better yet with
parallel reads (less pressure on the coordinator at the expense of extra
network calls).

Starting with M=1, you have the option to increase it over time if the
number of rows in any user's partitions gets too high. (We do¹ something
similar for storing all raw events in our enterprise platform, but because
the data is not user-centric the initial partition key is minute-by-minute
time buckets, and M has remained at 1 the whole time.)

This approach is better than using an order-preserving partitioner
(really, don't do that).

I would also consider replacing "event blob" with "event text", choosing
json over any binary serialisation. We've learnt the hard way the value of
data transparency, and I'm guessing the storage cost is small given C*
compression.

Otherwise the advice here is largely repeating what Jens has already said.

~mck

¹ slides 19+20 from https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/
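P.S. For anyone wanting a concrete starting point, here is a rough,
untested sketch of the write path and the parallel fan-out read using the
python cassandra-driver. The keyspace name and the N/M values are made-up
assumptions for illustration, not part of the schema above:

    import random
    import uuid
    from cassandra.cluster import Cluster

    N, M = 3, 1                    # nodes in cluster, arbitrary multiplier
    NUM_PARTITIONS = N * M

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('events_ks')   # assumed keyspace name

    insert = session.prepare(
        "INSERT INTO events (user_id, partition, event_time, event_json) "
        "VALUES (?, ?, ?, ?)")
    select = session.prepare(
        "SELECT event_time, event_json FROM events "
        "WHERE user_id = ? AND partition = ? LIMIT ?")

    def write_event(user_id, event_json):
        # spread writes evenly across the N*M sub-partitions
        p = random.randrange(NUM_PARTITIONS)
        session.execute(insert, (user_id, p, uuid.uuid1(), event_json))

    def recent_events(user_id, limit=10):
        # one async query per sub-partition, merged client-side;
        # each partition is already clustered by event_time DESC
        futures = [session.execute_async(select, (user_id, p, limit))
                   for p in range(NUM_PARTITIONS)]
        rows = [row for future in futures for row in future.result()]
        # order by the v1 uuid's embedded timestamp, newest first
        rows.sort(key=lambda r: r.event_time.time, reverse=True)
        return rows[:limit]

The per-partition LIMIT keeps each individual read cheap, so the
client-side merge never handles more than N*M*limit rows.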