For both simplicity and efficiency, I'd recommend making the mausummary table partitioned on date and generate the MAU data each day. There is no reason to generate MAU data for a given day more than once (unless you find some problems with the source data or something).
On Fri, Oct 12, 2012 at 1:08 PM, Igor Tatarinov <i...@decide.com> wrote: > You just need to put the join condition in the WHERE clause. That way Hive > will do a cartesian product followed by a filter. > > On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <t...@z2live.com> wrote: >> >> I think I see what you're saying about the temp table with start/end dates >> (30x expansion makes sense) and it sounds like it should work. I just need >> to figure out a good way to generate the table. Thanks! >> >> Tom >> >> On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <i...@decide.com> wrote: >>> >>> If you have a lot of data, you might have to write a custom reducer (in >>> python) to keep track of the moving date window. >>> >>> If you don't have that much data, you might want to use a temp table >>> <start_date, end_date> such that datediff(end_date, start_date) < 30. To >>> create such a table, you can self-join a table of unique dates using the >>> above condition. Then, you would join your data with that table on >>> start_date and group by end_date counting distinct user_ids. Hope I got that >>> right :) >>> >>> The latter approach will essentially multiply the number of rows by 30. >>> >>> igor >>> decide.com >>> >>> >>> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <t...@z2live.com> wrote: >>>> >>>> I'm trying to compute the number of active users in the previous 30 days >>>> for each day over a date range. I can't think of any way to do it directly >>>> within Hive so I'm wondering if you guys have any ideas. >>>> >>>> Basically the algorithm is something like: >>>> >>>> For each day in date range: >>>> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - >>>> logins.day < 30; >>>> >>>> Thanks for your help! >>>> >>>> Tom >>>> >>> >> >