Hi, I'm struggling with the following issue. I need to build a cube with 6 dimensions for app usage, for example:

-------+-------+------+------+------+------
 user  |  app  |  d3  |  d4  |  d5  |  d6
-------+-------+------+------+------+------
 u1    |  a1   |  x   |  y   |  z   |  5
-------+-------+------+------+------+------
 u2    |  a1   |  a   |  b   |  c   |  6
-------+-------+------+------+------+------
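To make the problem concrete, here is a toy Python illustration of the exact per-combination distinct-user count (the sample rows and column names come from the table above; the script is illustrative only, not the actual Spark job):

```python
from collections import defaultdict

# Toy rows shaped like the table above: (user, app, d3, d4, d5, d6)
rows = [
    ("u1", "a1", "x", "y", "z", 5),
    ("u2", "a1", "a", "b", "c", 6),
    ("u2", "a1", "x", "y", "z", 5),
]

# Exact distinct-user count per full dimension combination:
# keep a set of users per (app, d3, d4, d5, d6) key.
users_per_combo = defaultdict(set)
for user, *dims in rows:
    users_per_combo[tuple(dims)].add(user)

distinct_users = {combo: len(users) for combo, users in users_per_combo.items()}
print(distinct_users[("a1", "x", "y", "z", 5)])  # 2 distinct users for this combination
```

At scale this means holding a user set (or re-scanning 30 days of raw rows) per combination, which is exactly what makes the exact approach expensive.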
The dimension combinations generate ~100M rows daily. For each row, I need to calculate the unique monthly active users, weekly active users and daily active users, along with some other data (that can simply be added up).

I can load the data of the last 30 days, each day, and calculate a cube with countDistinct('userId), but this requires a huge cluster and is quite expensive.

I tried to use HyperLogLog: store the byte array of the HLL of the previous day, deserialize it, add the users of the current day, calculate the new distinct count, and serialize the byte array for the next day. However, to get 5% error accuracy with HLL, the byte array has to be 4K long, which makes the 100M rows ~4000 times bigger, and I ended up requiring a lot more resources.

I wonder if one of you can think of a better solution.

Thanks,
Tal

--
*Tal Grynbaum* / *CTO & co-founder*
m# +972-54-7875797
mobile retention done right
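For reference, the serialize/merge cycle described above can be sketched in a few lines of plain Python: each day's users are hashed into a register array, yesterday's stored array is merged in by taking the register-wise maximum, and the result is serialized for the next day. The class, names and register count are illustrative assumptions, and this is a bare-bones HLL (one byte per register, no sparse encoding or bias correction), not the library actually used:

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch: one byte per register, no sparse mode."""

    def __init__(self, p=12):
        self.p = p              # 2**p registers; std error ~ 1.04 / sqrt(2**p)
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, item):
        # 64-bit hash: top p bits choose a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leftmost 1-bit position
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches = register-wise max; this is the daily roll-up.
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

    def serialize(self):
        return bytes(self.registers)  # stored per cube row, loaded the next day

    @classmethod
    def deserialize(cls, data, p=12):
        sketch = cls(p)
        sketch.registers = bytearray(data)
        return sketch

# Daily cycle: yesterday's serialized sketch + today's users -> new sketch.
yesterday = HLL()
for i in range(10_000):
    yesterday.add(f"user{i}")
stored = yesterday.serialize()

today = HLL()
for i in range(5_000, 15_000):
    today.add(f"user{i}")

merged = HLL.deserialize(stored)
merged.merge(today)
print(round(merged.estimate()))  # roughly 15,000 distinct users across both days
```

With one byte per register this stores 2**p bytes per row (4096 bytes at p=12, standard error ~1.6%); production implementations typically pack 6-bit registers and use a sparse encoding for small cardinalities, which shrinks the per-row sketch considerably.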