Wow, this turned out to be a great discussion! Thanks everyone for providing detailed feedback. As has already been said many times before, this mailing list has been immensely helpful.
Please do keep responding as you can. I think information like this will be tremendously helpful for people and teams evaluating Hadoop/Hive or in the initial design phases!

On Tue, Dec 15, 2009 at 4:03 PM, Jason Michael <[email protected]> wrote:

> We do things a little differently than some of the responses I’ve seen so far. Our client software pings a group of Apache servers with specific URLs/query strings at 15-20 points during its lifecycle, coinciding with “interesting” events during the course of the user’s experience. No data is returned; we just store the request in the Apache log for consumption. Each request contains a UUID specific to that client’s current session.
>
> We parse the hourly Apache logs using Cascading to join up all the various requests on the UUID, providing us a session-level view of the data. We do a few more basic transforms of the data and then write it to HDFS as a set of SequenceFiles. We then use Hive to create an external table pointed at the data’s location. This lets us do a quick validation query. If the query passes, we load the data into a new partition on our fact table for that date and hour.
>
> Here’s where Hive has really helped us. Our primary fact table contains something on the order of 20-30 different fields, the values of which are in most cases arrived at by applying business logic. For example, some fields are taken directly from the underlying beacons, such as IP address. But others are, say, the timestamp difference between two events. When we first started off, we executed this business logic during the ETL process and stored the results in the Hive table. We quickly saw, however, that this would be a problem if we changed the definition of any of the fields: we would need to rerun ETL for the entire dataset, which could take days. So we decided instead to take all that business logic out of the ETL process and put it in a custom SerDe.
>
> ETL now does only a few transforms, mostly to get the beacons aggregated to a session grain as mentioned above. The SerDe defines the fields in the fact table, along with an implementing class/method for each. The first time the data is deserialized and a field requested, the implementing method executes the business logic, caches the result, and returns it. So now if a definition changes, we simply update our SerDe and release the new build to our users. No rerun necessary.
>
> We’re very happy with how it’s all worked out and, as another poster said, very appreciative of all the help the mailing list has provided.
>
> Jason
>
> On 12/14/09 1:00 PM, "Vijay" <[email protected]> wrote:
>
> Can anyone share their ETL workflow experiences with Hive? For example, how do you transform data from log files to Hive tables? Do you use Hive with map/reduce scripts or do you use Hive programmatically? Or do you do something entirely different? I haven't found any samples or details about the programmatic usage of Hive.
>
> Thanks in advance,
> Vijay
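The core idea in Jason's SerDe approach (compute each derived field only when it is first requested, then cache the result) can be illustrated without the Hive SerDe plumbing. The sketch below is not Jason's code and does not implement the SerDe interface itself; the row type, the field name (sessionDuration), and its business rule are hypothetical examples of the lazy-evaluate-and-cache pattern he describes.

// A minimal sketch of "compute on first access, then cache" for derived
// fact-table fields. Hypothetical names; not the actual SerDe from the thread.
import java.util.HashMap;
import java.util.Map;

public class LazySessionRow {

    // Raw values from the underlying beacons, already joined on the
    // session UUID by the ETL step.
    private final Map<String, Long> rawEventTimestamps;
    private final String ipAddress;

    // Cache of derived field values, filled in lazily.
    private final Map<String, Object> derived = new HashMap<String, Object>();

    public LazySessionRow(Map<String, Long> rawEventTimestamps, String ipAddress) {
        this.rawEventTimestamps = rawEventTimestamps;
        this.ipAddress = ipAddress;
    }

    // A field taken directly from the beacons: no business logic involved.
    public String getIpAddress() {
        return ipAddress;
    }

    // A derived field: the business logic runs only the first time the field
    // is requested; later requests return the cached result. Changing the
    // definition means changing this method, not rerunning ETL.
    public Long getSessionDuration() {
        if (!derived.containsKey("sessionDuration")) {
            Long start = rawEventTimestamps.get("sessionStart");
            Long end = rawEventTimestamps.get("sessionEnd");
            Long duration = (start == null || end == null) ? null : Long.valueOf(end - start);
            derived.put("sessionDuration", duration);
        }
        return (Long) derived.get("sessionDuration");
    }
}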

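On Vijay's question about using Hive programmatically: one option (besides map/reduce scripts, the Thrift client, or shelling out to the hive CLI) is the JDBC driver that ships with HiveServer. The sketch below shows the external-table, validation-query, partition-load flow Jason describes, driven from Java over JDBC. It is only a sketch under assumptions: a HiveServer instance on localhost:10000, and placeholder table names, columns, and HDFS paths that would need to match the real SequenceFile layout and SerDe.

// A sketch of the hourly load flow over the legacy Hive JDBC driver.
// Assumes HiveServer on localhost:10000; schema and paths are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HourlyLoad {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // 1. External table over the hour's SequenceFiles written by ETL.
        //    (Statements go through executeQuery here; the early Hive JDBC
        //    driver did not implement all java.sql.Statement methods.)
        stmt.executeQuery("CREATE EXTERNAL TABLE IF NOT EXISTS staging_sessions ("
                + " session_uuid STRING, ip_address STRING, session_duration BIGINT)"
                + " STORED AS SEQUENCEFILE"
                + " LOCATION '/data/sessions/2009-12-15/00'");

        // 2. Quick validation query before committing the data.
        ResultSet res = stmt.executeQuery("SELECT COUNT(1) FROM staging_sessions");
        res.next();
        long rows = Long.parseLong(res.getString(1));
        if (rows == 0) {
            throw new IllegalStateException("Validation failed: no rows in staging data");
        }

        // 3. Load the validated data into the fact table's date/hour partition
        //    (a real job would parameterize the date and hour).
        stmt.executeQuery("INSERT OVERWRITE TABLE fact_sessions"
                + " PARTITION (dt='2009-12-15', hr='00')"
                + " SELECT session_uuid, ip_address, session_duration FROM staging_sessions");

        con.close();
    }
}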