Re: Rolling MAU computation

2012-10-12 Thread Vijay
For both simplicity and efficiency, I'd recommend making the mausummary table partitioned on date and generate the MAU data each day. There is no reason to generate MAU data for a given day more than once (unless you find some problems with the source data or something). On Fri, Oct 12, 2012 at 1:

Re: Rolling MAU computation

2012-10-12 Thread Igor Tatarinov
You just need to put the join condition in the WHERE clause. That way Hive will do a cartesian product followed by a filter. On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina wrote: > I think I see what you're saying about the temp table with start/end dates > (30x expansion makes sense) and it sounds

Re: Rolling MAU computation

2012-10-12 Thread Tom Hubina
I think I see what you're saying about the temp table with start/end dates (30x expansion makes sense) and it sounds like it should work. I just need to figure out a good way to generate the table. Thanks! Tom On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov wrote: > If you have a lot of data,

Re: Rolling MAU computation

2012-10-12 Thread Tom Hubina
The problem is that "day" is the value in the for loop. I've tried doing a join with a table that contains the set of days, but the problem is that you can't do a join on a range ... Hive only support equality in the join. For example: INSERT OVERWRITE TABLE mausummary SELECT day, COUNT(DISTINCT(

Re: Rolling MAU computation

2012-10-10 Thread Igor Tatarinov
If you have a lot of data, you might have to write a custom reducer (in python) to keep track of the moving date window. If you don't have that much data, you might want to use a temp table such that datediff(end_date, start_date) < 30. To create such a table, you can self-join a table of unique

Re: Rolling MAU computation

2012-10-10 Thread MiaoMiao
How about SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day < 30 GROUP BY day; On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina wrote: > I'm trying to compute the number of active users in the previous 30 days for > each day over a date range. I can't think of any way to do it dir

Re: Rolling MAU computation

2012-10-10 Thread Tom Hubina
An example would be awesome .. I've never used a map side join (though I'm searching on that now .. ) Tom On Wed, Oct 10, 2012 at 3:59 PM, Roberto Sanabria wrote: > I've done this with a map side join using a table that stores days of the > week. I use that to drive the day im calculating the co

Re: Rolling MAU computation

2012-10-10 Thread Roberto Sanabria
I've done this with a map side join using a table that stores days of the week. I use that to drive the day im calculating the count for. Let me know if you need an example. Cheers, R On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina wrote: > I'm trying to compute the number of active users in the pr

Rolling MAU computation

2012-10-10 Thread Tom Hubina
I'm trying to compute the number of active users in the previous 30 days for each day over a date range. I can't think of any way to do it directly within Hive so I'm wondering if you guys have any ideas. Basically the algorithm is something like: For each day in date range: SELECT day, COUNT(