We do things a little differently than some of the responses I've seen so far.  
Our client software pings a group of Apache servers with specific URLs/query 
strings at 15-20 points during its lifecycle, coinciding with "interesting" 
events during the course of the user's experience.  No data is returned; we 
just store the request in the Apache log for later consumption.  Each request 
contains a UUID specific to that client's current session.
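For illustration (the path, event names, and parameter names here are made up, 
not our actual scheme), one of those beacon requests might show up in the 
access log looking something like:

```
10.1.2.3 - - [14/Dec/2009:13:00:02 -0800] "GET /beacon?event=video_start&sid=550e8400-e29b-41d4-a716-446655440000 HTTP/1.1" 204 0
```

The query string carries the event name and the session UUID; the empty 
response body is the "no data returned" part.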

We parse the hourly Apache logs using Cascading to join up all the various 
requests on the UUID, giving us a session-level view of the data.  We do a 
few more basic transforms of the data, and then write it to HDFS as a set of 
SequenceFiles.  We then use Hive to create an external table pointed at the 
data's location.  This lets us run a quick validation query.  If the query 
passes, we load the data into a new partition on our fact table for that date 
and hour.
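In HiveQL, that last step looks roughly like the following -- table names, 
column names, and paths are invented for illustration, and the exact 
validation query and partition-load mechanics will vary:

```sql
-- External table over the hour's Cascading output (SequenceFiles on HDFS)
CREATE EXTERNAL TABLE staging_sessions (
  session_id STRING,
  ip         STRING
)
STORED AS SEQUENCEFILE
LOCATION '/data/sessions/2009/12/14/13';

-- Quick sanity check before committing the hour
SELECT COUNT(*) FROM staging_sessions;

-- If validation passes, expose the hour as a new partition of the fact table
ALTER TABLE fact_sessions
  ADD PARTITION (dt='2009-12-14', hr='13')
  LOCATION '/data/sessions/2009/12/14/13';
```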

Here's where Hive has really helped us.  Our primary fact table contains 
something on the order of 20-30 different fields, most of whose values are 
derived by applying business logic.  Some fields are taken directly from the 
underlying beacons, such as IP address, but others are, say, the timestamp 
difference between two events.  When we first started off, we executed this 
business logic during the ETL process and stored the results in the Hive 
table.  We quickly saw that this would be a problem if we ever changed the 
definition of any of the fields: we would need to rerun ETL for the entire 
dataset, which could take days.  So we decided instead to take all that 
business logic out of the ETL process and put it in a custom SerDe.

ETL now does only a few transforms, mostly to get the beacons aggregated to a 
session grain as mentioned above. The SerDe defines the fields in the fact 
table, and defines an implementing class/method for each.  The first time the 
data is deserialized and a field requested, the implementing method executes 
the business logic and caches and returns the result.  So now if a definition 
changes, we simply update our SerDe and release the new build to our users.  No 
rerun necessary.
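Sketched without the actual Hive SerDe2 interfaces (which the real 
implementation would plug into), the compute-once-and-cache pattern for a 
derived field looks something like this; the class, field names, and business 
logic here are invented for illustration:

```java
// Minimal sketch of a lazily computed, cached derived field.  In the real
// SerDe this would sit behind the deserialization path; here it's a plain
// class so the pattern stands alone.
public class SessionRow {
    private final long firstEventTs;   // raw values parsed from the beacon data
    private final long lastEventTs;

    private Long sessionDurationMs;    // derived field, computed on first access

    public SessionRow(long firstEventTs, long lastEventTs) {
        this.firstEventTs = firstEventTs;
        this.lastEventTs = lastEventTs;
    }

    // The business logic lives here: change the definition, redeploy the
    // SerDe, and every subsequent query sees the new value -- no ETL rerun.
    public long getSessionDurationMs() {
        if (sessionDurationMs == null) {
            sessionDurationMs = lastEventTs - firstEventTs;  // computed once, then cached
        }
        return sessionDurationMs;
    }
}
```

The point of the cache is that a query touching the field several times only 
pays for the computation once per row.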

We're very happy with how it's all worked out and, as another poster said, very 
appreciative of all the help the mailing list has provided.

Jason


On 12/14/09 1:00 PM, "Vijay" <[email protected]> wrote:

Can anyone share their ETL workflow experiences with Hive? For example, how do 
you transform data from log files to Hive tables? Do you use hive with 
map/reduce scripts or do you use hive programmatically? Or do you do something 
entirely different? I haven't found any samples or details about the 
programmatic usage of hive.

Thanks in advance,
Vijay
