I used Cassandra Set (no experience with map ), and one thing for sure is that 
with Cassandra collections you are limited to a few thousands entry per row 
(less than 10K for better performance)


Sent using https://www.zoho.com/mail/




---- On Fri, 18 Sep 2020 20:33:21 +0430 Attila Wind <attilaw@swf.technology> 
wrote ----


Hey guys,

I'm curious about your experiences regarding a data modeling
      question we are facing with. 
 At the moment we see 2 major different approaches in terms of how
      to build the tables
 But I'm googling around already for days with no luck to find any
      useful material explaining to me how a Map (as collection
      datatype) works on the storage engine, and what could surprise us
      later if we . So decided to ask this question... (If someone has
      some nice pointers here maybe that is also much appreciated!)
So
 To describe the problem in a simplified form
Imagine you have users (everyone is identified with a UUID),

and we want to answer a simple question: "have we seen this
        guy before?"

we "just" want to be able to answer this question for a
        limited time - let's say for 3 months

....but... there are lots of lots of users we run into... many
        millions / each day... 

....and ~15-20% of them are returning users only - so many
        guys we just might see once


We are thinking about something like a big big Map, in a form of
     userId => lastSeenTimestamp
 
 Obviously if we would have something like that then answering the
      above question is simply:
     if(map.get(userId) != null)  => TRUE - we have seen the guy
      before

Regarding the 2 major modelling approaches I mentioned above
 
 Approach 1
 Just simply use a table, something like this
 
 CREATE TABLE IF NOT EXISTS users (
     user_id            varchar,
     last_seen        int,                -- a UNIX timestamp is
      enough, thats why int
    PRIMARY KEY (user_id)
 ) ....
 AND default_time_to_live = <3 months of seconds>;
Approach 2
 to do not produce that much rows, "cluster" the guys a bit
      together (into 1 row) so
 introduce a hashing function over the userId, producing a value
      btw [0; 10000]
 and go with a table like
CREATE TABLE IF NOT EXISTS users (
     user_id_hash    int,
     users_seen        map<text, int>,            -- this is
      a userId => last timestamp map

    PRIMARY KEY (user_id_hash)
 ) ....
 AND default_time_to_live = <3 months of seconds>;        --
    yes, its clearly not a good enough way ...
 
In theory:
on a WRITE path both representation gives us a way to do the
        write without the need of read

even the READ path is pretty efficient in both cases

Approach2 is worse definitely when we come to the cleanup -
        "remove info if older than 3 month"

Approach2 might affect the balance of the cluster more - thats
        clear (however not that much due to the "law of large number"
        and really enough random factors)


And what we are struggling around is: what do you think
 Which approach would be better over time? So will slow
      down the cluster less considering in compaction etc etc
As far as we can see the real question is: 

which hurts more?

much more rows, but very small rows (regarding data size), or

much less rows, but much bigger rows (regarding data size)


?

Any thoughts, comments, pointers to some related case studies,
      articles, etc is highly appreciated!! :-)

thanks!

-- 
 Attila Wind
 
  http://www.linkedin.com/in/attilaw
 Mobile: +49 176 43556932

Reply via email to