Don't really see a difference in two options. Won't the partitioner run on
user id and create a hash for you? Unless your hash function is better than
partitioner.

On Fri, 18 Sep 2020, 21:33 Attila Wind, <attilaw@swf.technology> wrote:

> Hey guys,
>
> I'm curious about your experiences regarding a data modeling question we
> are facing with.
> At the moment we see 2 major different approaches in terms of how to build
> the tables
> But I'm googling around already for days with no luck to find any useful
> material explaining to me how a Map (as collection datatype) works on the
> storage engine, and what could surprise us later if we . So decided to ask
> this question... (If someone has some nice pointers here maybe that is also
> much appreciated!)
>
> So
> *To describe the problem* in a simplified form
>
>    - Imagine you have users (everyone is identified with a UUID),
>    - and we want to answer a simple question: "have we seen this guy
>    before?"
>    - we "just" want to be able to answer this question for a limited time
>    - let's say for 3 months
>    - ....but... there are lots of lots of users we run into... many
>    millions / each day...
>    - ....and ~15-20% of them are returning users only - so many guys we
>    just might see once
>
> We are thinking about something like a big big Map, in a form of
>     userId => lastSeenTimestamp
>
> Obviously if we would have something like that then answering the above
> question is simply:
>     if(map.get(userId) != null)  => TRUE - we have seen the guy before
>
> Regarding the 2 major modelling approaches I mentioned above
>
> *Approach 1*
> Just simply use a table, something like this
>
> CREATE TABLE IF NOT EXISTS users (
>     user_id            varchar,
>     last_seen        int,                -- a UNIX timestamp is enough,
> thats why int
>
>     PRIMARY KEY (user_id)
> ) ....
> AND default_time_to_live = <3 months of seconds>;
>
>
> *Approach 2 *to do not produce that much rows, "cluster" the guys a bit
> together (into 1 row) so
> introduce a hashing function over the userId, producing a value btw [0;
> 10000]
> and go with a table like
>
> CREATE TABLE IF NOT EXISTS users (
>     user_id_hash    int,
>     users_seen        map<text, int>,            -- this is a userId =>
> last timestamp map
>     PRIMARY KEY (user_id_hash)
> ) ....
> AND default_time_to_live = <3 months of seconds>;        -- yes, its
> clearly not a good enough way ...
>
>
> In theory:
>
>    - on a WRITE path both representation gives us a way to do the write
>    without the need of read
>    - even the READ path is pretty efficient in both cases
>    - Approach2 is worse definitely when we come to the cleanup - "remove
>    info if older than 3 month"
>    - Approach2 might affect the balance of the cluster more - thats clear
>    (however not that much due to the "law of large number" and really enough
>    random factors)
>
> And what we are struggling around is: what do you think
> *Which approach would be better over time? *So will slow down the cluster
> less considering in compaction etc etc
>
> As far as we can see the real question is:
>
> which hurts more?
>
>    - much more rows, but very small rows (regarding data size), or
>    - much less rows, but much bigger rows (regarding data size)
>
> ?
>
> Any thoughts, comments, pointers to some related case studies, articles,
> etc is highly appreciated!! :-)
>
> thanks!
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
>

Reply via email to