For both simplicity and efficiency, I'd recommend making the
mausummary table partitioned on date and generate the MAU data each
day. There is no reason to generate MAU data for a given day more than
once (unless you find some problems with the source data or
something).
On Fri, Oct 12, 2012 at 1:
You just need to put the join condition in the WHERE clause. That way Hive
will do a cartesian product followed by a filter.
On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina wrote:
> I think I see what you're saying about the temp table with start/end dates
> (30x expansion makes sense) and it sounds
I think I see what you're saying about the temp table with start/end dates
(30x expansion makes sense) and it sounds like it should work. I just need
to figure out a good way to generate the table. Thanks!
Tom
On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov wrote:
> If you have a lot of data,
The problem is that "day" is the value in the for loop.
I've tried doing a join with a table that contains the set of days, but the
problem is that you can't do a join on a range ... Hive only support
equality in the join. For example:
INSERT OVERWRITE TABLE mausummary SELECT day, COUNT(DISTINCT(
If you have a lot of data, you might have to write a custom reducer (in
python) to keep track of the moving date window.
If you don't have that much data, you might want to use a temp table
such that datediff(end_date, start_date) < 30. To
create such a table, you can self-join a table of unique
How about
SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
< 30 GROUP BY day;
On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina wrote:
> I'm trying to compute the number of active users in the previous 30 days for
> each day over a date range. I can't think of any way to do it dir
An example would be awesome .. I've never used a map side join (though I'm
searching on that now .. )
Tom
On Wed, Oct 10, 2012 at 3:59 PM, Roberto Sanabria
wrote:
> I've done this with a map side join using a table that stores days of the
> week. I use that to drive the day im calculating the co
I've done this with a map side join using a table that stores days of the
week. I use that to drive the day im calculating the count for. Let me know
if you need an example.
Cheers,
R
On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina wrote:
> I'm trying to compute the number of active users in the pr
I'm trying to compute the number of active users in the previous 30 days
for each day over a date range. I can't think of any way to do it directly
within Hive so I'm wondering if you guys have any ideas.
Basically the algorithm is something like:
For each day in date range:
SELECT day, COUNT(