Don't really see a difference in two options. Won't the partitioner run on user id and create a hash for you? Unless your hash function is better than partitioner.
On Fri, 18 Sep 2020, 21:33 Attila Wind, <attilaw@swf.technology> wrote: > Hey guys, > > I'm curious about your experiences regarding a data modeling question we > are facing with. > At the moment we see 2 major different approaches in terms of how to build > the tables > But I'm googling around already for days with no luck to find any useful > material explaining to me how a Map (as collection datatype) works on the > storage engine, and what could surprise us later if we . So decided to ask > this question... (If someone has some nice pointers here maybe that is also > much appreciated!) > > So > *To describe the problem* in a simplified form > > - Imagine you have users (everyone is identified with a UUID), > - and we want to answer a simple question: "have we seen this guy > before?" > - we "just" want to be able to answer this question for a limited time > - let's say for 3 months > - ....but... there are lots of lots of users we run into... many > millions / each day... > - ....and ~15-20% of them are returning users only - so many guys we > just might see once > > We are thinking about something like a big big Map, in a form of > userId => lastSeenTimestamp > > Obviously if we would have something like that then answering the above > question is simply: > if(map.get(userId) != null) => TRUE - we have seen the guy before > > Regarding the 2 major modelling approaches I mentioned above > > *Approach 1* > Just simply use a table, something like this > > CREATE TABLE IF NOT EXISTS users ( > user_id varchar, > last_seen int, -- a UNIX timestamp is enough, > thats why int > > PRIMARY KEY (user_id) > ) .... > AND default_time_to_live = <3 months of seconds>; > > > *Approach 2 *to do not produce that much rows, "cluster" the guys a bit > together (into 1 row) so > introduce a hashing function over the userId, producing a value btw [0; > 10000] > and go with a table like > > CREATE TABLE IF NOT EXISTS users ( > user_id_hash int, > users_seen map<text, int>, -- this is a userId => > last timestamp map > PRIMARY KEY (user_id_hash) > ) .... > AND default_time_to_live = <3 months of seconds>; -- yes, its > clearly not a good enough way ... > > > In theory: > > - on a WRITE path both representation gives us a way to do the write > without the need of read > - even the READ path is pretty efficient in both cases > - Approach2 is worse definitely when we come to the cleanup - "remove > info if older than 3 month" > - Approach2 might affect the balance of the cluster more - thats clear > (however not that much due to the "law of large number" and really enough > random factors) > > And what we are struggling around is: what do you think > *Which approach would be better over time? *So will slow down the cluster > less considering in compaction etc etc > > As far as we can see the real question is: > > which hurts more? > > - much more rows, but very small rows (regarding data size), or > - much less rows, but much bigger rows (regarding data size) > > ? > > Any thoughts, comments, pointers to some related case studies, articles, > etc is highly appreciated!! :-) > > thanks! > -- > Attila Wind > > http://www.linkedin.com/in/attilaw > Mobile: +49 176 43556932 > > >