All, I'm a relative newcomer to Hadoop/Hive. We have a very standard setup of multiple webapp servers backed by a mySql database. We are evaluating Hive as a high scale solution for our relatively sophisticated reporting and analytics needs. However, it's not clear what the best practices are around storing and representing the data our application generates. Probably best explained with an example:
We imagine a Hive deployment that is importing Apache logs and MySql data from the application db (probably via Sqoop). We would run our analysis daily and output the results somewhere (flat files in s3 or another MySql reporting database). We have users that have a) a status (Basic or Premium) and b) a location (a Zip code). We'd like to be able to ask questions like "How many premium users did we have within ten miles of zip 02110 on Jan 3rd 2012?" Computing these numbers for all dates across all zip codes and for a number of radi on a very large set of users seems like a pretty good use of Hadoop/Hive. However users can move location and change status. The application database only really cares about the current location and status of a user and not the history of those fields. This presents a challenge to the analytics process. If we run the analysis every day we will naturally pick up the changes in status and location. However, if we were to try to recomputed our entire analysis for all dates we would get different results for users that moved location or changed status. The Apache logs are like not of much use as they are unlikely to contain member ids to deduce the requests which resulted in the change of status or location for a user. How is this type of problem typically solved with Hive? I can see a few potential solutions: 1. Don't solve it. Accept that you have some artifacts in your reporting data that cannot be recovered from the source data. 2. Create status and location history tables in the application db and use that during the analytics process. 3. Log the status and location change 'events' to some other log file and use those logs in the Hive analysis. Are there any 'best practices' around these kinds of problems and in particular suggestions for the simplest implementation of the extra logging and analysis required by 3.? Thanks Jon This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.
