What's the right data storage/representation?

Jon Palmer Tue, 15 May 2012 05:12:38 -0700

All,

I'm a relative newcomer to Hadoop/Hive. We have a very standard setup of 
multiple webapp servers backed by a mySql database. We are evaluating Hive as a 
high scale solution for our relatively sophisticated reporting and analytics 
needs. However, it's not clear what the best practices are around storing and 
representing the data our application generates. Probably best explained with 
an example:


We imagine a Hive deployment that is importing Apache logs and MySql data from 
the application db (probably via Sqoop). We would run our analysis daily and 
output the results somewhere (flat files in s3 or another MySql reporting 
database). We have users that have a) a status (Basic or Premium) and b) a 
location (a Zip code). We'd like to be able to ask questions like "How many 
premium users did we have within ten miles of zip 02110 on Jan 3rd 2012?" 
Computing these numbers for all dates across all zip codes and for a number of 
radi on a very large set of users seems like a pretty good use of Hadoop/Hive.

However users can move location and change status. The application database 
only really cares about the current location and status of a user and not the 
history of those fields. This presents a challenge to the analytics process. If 
we run the analysis every day we will naturally pick up the changes in status 
and location. However, if we were to try to recomputed our entire analysis for 
all dates we would get different results for users that moved location or 
changed status. The Apache logs are like not of much use as they are unlikely 
to contain member ids to deduce the requests which resulted in the change of 
status or location for a user.

How is this type of problem typically solved with Hive?

I can see a few potential solutions:

1.       Don't solve it. Accept that you have some artifacts in your reporting 
data that cannot be recovered from the source data.

2.       Create status and location history tables in the application db and 
use that during the analytics process.

3.       Log the status and location change 'events' to some other log file and 
use those logs in the Hive analysis.

Are there any 'best practices' around these kinds of problems and in particular 
suggestions for the simplest implementation of the extra logging and analysis 
required by 3.?

Thanks
Jon



This email is intended for the person(s) to whom it is addressed and may 
contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, 
distribution, copying, or disclosure by any person other than the addressee(s) 
is strictly prohibited. If you have received this email in error, please notify 
the sender immediately by return email and delete the message and any 
attachments from your system.

What's the right data storage/representation?

Reply via email to