We do things a little differently than some of the responses I've seen so far. Our client software pings a group of Apache servers with specific URLs/query strings at 15-20 points during its lifecycle, coinciding with "interesting" events in the course of the user's experience. No data is returned; we just store the request in the Apache log for consumption. Each request contains a UUID specific to that client's current session.
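The beacon-to-session step can be sketched in a few lines of plain Python (the real pipeline uses Cascading over hourly logs). The beacon path, query parameter names, and event names below are hypothetical, not the actual scheme described here:

```python
import re
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

# Hypothetical beacon requests in Apache combined-log style.
LOG_LINES = [
    '10.0.0.1 - - [14/Dec/2009:13:00:01 -0800] "GET /beacon?uuid=abc-123&event=start HTTP/1.1" 204 0',
    '10.0.0.1 - - [14/Dec/2009:13:00:09 -0800] "GET /beacon?uuid=abc-123&event=play HTTP/1.1" 204 0',
    '10.0.0.2 - - [14/Dec/2009:13:00:11 -0800] "GET /beacon?uuid=def-456&event=start HTTP/1.1" 204 0',
]

REQUEST_RE = re.compile(r'"GET (\S+) HTTP')

def sessions(lines):
    """Group beacon requests by session UUID, giving a session-level view."""
    by_uuid = defaultdict(list)
    for line in lines:
        m = REQUEST_RE.search(line)
        if not m:
            continue
        qs = parse_qs(urlparse(m.group(1)).query)
        by_uuid[qs["uuid"][0]].append(qs["event"][0])
    return dict(by_uuid)

print(sessions(LOG_LINES))
# {'abc-123': ['start', 'play'], 'def-456': ['start']}
```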
We parse the hourly Apache logs using Cascading to join up all the various requests on the UUID, giving us a session-level view of the data. We do a few more basic transforms and then write the data to HDFS as a set of SequenceFiles. We then use Hive to create an external table pointed at the data's location, which lets us run a quick validation query. If the query passes, we load the data into a new partition on our fact table for that date and hour.

Here's where Hive has really helped us. Our primary fact table contains something on the order of 20-30 different fields, most of whose values are arrived at by applying business logic. Some fields are taken directly from the underlying beacons, such as the IP address, but others are derived, say, the timestamp difference between two events.

When we first started off, we executed this business logic during the ETL process and stored the results in the Hive table. We quickly saw that this would be a problem if we ever changed the definition of any of the fields: we would need to rerun ETL for the entire dataset, which could take days.

So we decided instead to take all that business logic out of the ETL process and put it in a custom SerDe. ETL now does only a few transforms, mostly to get the beacons aggregated to a session grain as mentioned above. The SerDe defines the fields in the fact table, along with an implementing class/method for each. The first time the data is deserialized and a field is requested, the implementing method executes the business logic, caches the result, and returns it. Now if a definition changes, we simply update our SerDe and release the new build to our users. No rerun necessary.

We're very happy with how it's all worked out and, as another poster said, very appreciative of all the help the mailing list has provided.

Jason

On 12/14/09 1:00 PM, "Vijay" <[email protected]> wrote:

Can anyone share their ETL workflow experiences with Hive?
For example, how do you transform data from log files into Hive tables? Do you use Hive with map/reduce scripts, or do you use Hive programmatically? Or do you do something entirely different? I haven't found any samples or details about the programmatic usage of Hive.

Thanks in advance,
Vijay
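The compute-on-first-access-and-cache behavior Jason describes for his custom SerDe can be sketched outside of Hive. This plain-Python class stands in for the SerDe's row object; the field names and the business logic behind them are hypothetical, and a real SerDe would implement Hive's Java interfaces instead:

```python
class SessionRow:
    """Deserialized session record whose derived fields are computed lazily."""

    def __init__(self, beacons):
        self._beacons = beacons  # raw session-grain data, e.g. from a SequenceFile
        self._cache = {}

    def _field(self, name, compute):
        # First access runs the business logic; later accesses hit the cache.
        if name not in self._cache:
            self._cache[name] = compute()
        return self._cache[name]

    @property
    def ip_address(self):
        # Taken directly from the underlying beacons.
        return self._field("ip_address", lambda: self._beacons["ip"])

    @property
    def startup_delay(self):
        # Derived field: timestamp difference between two events.
        return self._field(
            "startup_delay",
            lambda: self._beacons["play_ts"] - self._beacons["start_ts"],
        )

row = SessionRow({"ip": "10.0.0.1", "start_ts": 100, "play_ts": 108})
print(row.startup_delay)  # 8
```

Changing a field's definition then means changing only its implementing method; the stored data and the ETL that produced it are untouched.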
