I have three tables:

Table 1: record when and who visited gas station or not, this contains all
the users of interest, name all the users as a set A
date                         |      user name    |   visited gas station?
2013-09-01                         tom                             yes
2013-09-02                         tom                             yes
2013-09-01                         hanks                          yes
2013-09-03                         tomy                            yes
....                                           ...
       ...

Table 2: record when and who visited Bestbuy,   the user in set A appear
here, but not all users of A will appear in this table, also table 2 has
users does not belong to set A
date                         |      user name    |   visited Bestbuy?
2013-09-01                         tom                             yes
2013-09-02                         jacob                           yes
2013-09-01                         hanks                          yes
2013-09-03                         michael                       yes
....                                           ...
       ...

Table 3: record when and who arrives one of three destinations: CA, NY and
DC, the users in table 3 has similar situation as users in table 2
regarding set A.
date                         |      user name    |   visited Bestbuy or not

2013-09-01                         tom                             CA
2013-09-02                         tom                             NY
2013-09-01                         hanks                          DC
2013-09-03                         tomy                            CA
....                                           ...
       ...

Now we want to know,  within a 90 days period, what are the following
numbers:
(1) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, after that date, user went to
Bestbuy, finally arrive CA
(2) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then did not visit Bestbuy, finally
arrive CA

(3) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then went Bestbuy, finally arrive NY
(4) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then did not visit Bestbuy,
finally arrive NY

(5) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then went Bestbuy, finally arrive DC
(6) for any given day, for the users in table 1, how many of them has a
path like:   visited Gas station first, then did not visit Bestbuy,
finally arrive DC

It is possible in a given day, a user may visit Bestbuy many times for
table 2, this will be simply regarded as a status 'visited Bestbuy'. One
user will be counted once in a given day.
>From the day of user visiting gas station, within next 90 days, this user
has to arrive one of three final destinations in table 3, and arrives only
one station. no multiple arrived stations.

For example,
a user tom may visit gas station on 9/1, then went to Bestbuy on 9/5,
finally went to CA on 9/30, then this user tom can be counted as 1 on 9/1
for path (1).
a user hanks may visit gas station on 9/1, then went to Bestbuy on 9/5 and
on 9/8, finally went to CA on 9/30, then this user tom count as 1 on 9/1
for path (1).
a user ruby may visit gas station on 9/1, and tables 2 does not have
records to show ruby visit Bestbuy till 11/30, and ruby arrives DC in table
3 before 11/30, then ruby contribute 1 for path (6) for the day 9/1.

How to write Hive query to get those numbers for the six paths?
a sample output will be :
9/1: 100 for path (1), 90 for path (2), ...., etc
9/2: 60 for path (1), 80 for path (2), ...., etc
9/3: ................

Any suggestions or suggested reference/readings will be deeply appreciated.

Thanks!
Qiao

Reply via email to