Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-11 Thread Georg Heiler
For grouping by each of these: look into grouping sets: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-multi-dimensional-aggregation.html

On Tue, 11 Jun 2019 at 06:09, Rishi Shah <rishishah.s...@gmail.com> wrote:
> Thank you both for your input!
>
> To calculate moving average
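For readers following the link, here is a minimal PySpark sketch of the grouping-sets idea, assuming an events DataFrame with user_id and event_date columns (both names are illustrative): one pass over the data yields daily, weekly, and monthly active-user counts.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("grouping-sets-sketch").getOrCreate()

# Illustrative schema: one row per user event.
events = spark.createDataFrame(
    [("u1", "2019-06-01"), ("u2", "2019-06-01"), ("u1", "2019-06-08")],
    ["user_id", "event_date"],
).withColumn("event_date", F.to_date("event_date"))

events.select(
    "user_id",
    F.col("event_date").alias("day"),
    F.date_trunc("week", "event_date").alias("week"),
    F.date_trunc("month", "event_date").alias("month"),
).createOrReplaceTempView("activity")

# One scan produces counts at three granularities; in each output row the
# grouping columns outside the active grouping set come back as NULL.
spark.sql("""
    SELECT month, week, day, COUNT(DISTINCT user_id) AS active_users
    FROM activity
    GROUP BY month, week, day GROUPING SETS ((month), (week), (day))
""").show()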

Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-10 Thread Rishi Shah
Thank you both for your input!

To calculate a moving average of active users, could you comment on whether to go for an RDD-based implementation or DataFrames? If DataFrames, will a window function work here? In general, how would Spark behave when working with a DataFrame with date, week, month, quarter,
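On the DataFrame-vs-RDD question: a window function over a DataFrame does handle this, and DataFrames generally benefit from Catalyst and Tungsten optimizations that plain RDD code does not. A minimal sketch, assuming a pre-aggregated daily_active DataFrame with one row per day (illustrative names):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("moving-average-sketch").getOrCreate()

daily_active = spark.createDataFrame(
    [("2019-06-01", 120), ("2019-06-02", 95), ("2019-06-03", 143)],
    ["day", "active_users"],
).withColumn("day", F.to_date("day"))

# rowsBetween(-6, 0): the current row and the six preceding rows, i.e. a
# 7-row trailing window. It counts rows, not calendar days, so gaps in the
# data skew it; a rangeBetween window over a numeric day key avoids that.
# Note: no partitionBy means Spark collects everything into one partition,
# which is fine for a small pre-aggregated daily table.
w = Window.orderBy("day").rowsBetween(-6, 0)

daily_active.withColumn("avg_7d", F.avg("active_users").over(w)).show()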

Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread Jörn Franke
Depending on what accuracy is needed, HyperLogLog can be an interesting alternative: https://en.m.wikipedia.org/wiki/HyperLogLog

On 09.06.2019 at 15:59, big data wrote:
> In my opinion, a bitmap is the best solution for calculating active
> users. Other solutions are mostly based on
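In Spark itself this is exposed as approx_count_distinct, which is backed by a HyperLogLog++ sketch: you trade a bounded relative error for much less memory and shuffle than an exact count(distinct). A minimal sketch with an illustrative events schema:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("hll-sketch").getOrCreate()

events = spark.createDataFrame(
    [("u1", "2019-06-01"), ("u2", "2019-06-01"), ("u1", "2019-06-02")],
    ["user_id", "event_date"],
)

# rsd = maximum allowed relative standard deviation (default 0.05, i.e. 5%);
# tighter rsd costs more memory in the sketch.
events.groupBy("event_date").agg(
    F.approx_count_distinct("user_id", rsd=0.01).alias("approx_dau")
).show()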

Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread big data
In my opinion, a bitmap is the best solution for calculating active users. Other solutions are mostly based on a count(distinct) calculation, which is slower. If you have implemented a bitmap solution, including how to build the bitmap and how to load it, then a bitmap is the best choice.
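For readers unfamiliar with the technique, here is a minimal pure-Python sketch (not Spark-specific, with made-up integer user ids) of why bitmaps are attractive: a day's activity collapses to one bitmap, and any longer period is a bitwise OR plus a popcount, with no per-period distinct count.

# Each day's active users become a bitmap keyed by integer user id.
daily_events = {
    "2019-06-01": [1, 2, 5],
    "2019-06-02": [2, 3],
    "2019-06-03": [5, 8],
}

def to_bitmap(user_ids):
    """Set bit i for each active user id i (a Python int works as a bitset)."""
    bm = 0
    for uid in user_ids:
        bm |= 1 << uid
    return bm

daily_bitmaps = {day: to_bitmap(ids) for day, ids in daily_events.items()}

# Daily active users: popcount of one day's bitmap.
dau = {day: bin(bm).count("1") for day, bm in daily_bitmaps.items()}

# Weekly/monthly active users: OR the daily bitmaps, then popcount once.
period = 0
for bm in daily_bitmaps.values():
    period |= bm
wau = bin(period).count("1")

print(dau)  # {'2019-06-01': 3, '2019-06-02': 2, '2019-06-03': 2}
print(wau)  # 5 distinct users across the period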

[Pyspark 2.4] Best way to define activity within different time window

2019-06-05 Thread Rishi Shah
Hi All,

Is there a best practice around calculating daily, weekly, monthly, quarterly, and yearly active users?

One approach is to create a daily bitmap of activity and aggregate it by period later. However, I was wondering if anyone has a better approach to tackling this problem.

--
Regards,
Rishi Shah
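A hedged PySpark sketch of the "build per-day state once, roll it up by period later" approach described in the question, with an illustrative events schema: materialize one small row of distinct users per day, then combine those rows per month. flatten and array_distinct are available as of Spark 2.4.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("daily-rollup-sketch").getOrCreate()

events = spark.createDataFrame(
    [("u1", "2019-06-01"), ("u2", "2019-06-01"), ("u1", "2019-06-08")],
    ["user_id", "event_date"],
).withColumn("event_date", F.to_date("event_date"))

# Step 1: one small row per day holding that day's set of active users.
daily_sets = events.groupBy("event_date").agg(
    F.collect_set("user_id").alias("users")
)

# Step 2: roll daily sets up to months without rescanning the raw events.
monthly = daily_sets.groupBy(
    F.date_trunc("month", "event_date").alias("month")
).agg(
    F.size(F.array_distinct(F.flatten(F.collect_list("users")))).alias("mau")
)
monthly.show()

Note that exact per-day sets grow with the user base, which is what motivates the bitmap and HyperLogLog suggestions in the replies above.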