We do a couple of different things. First, we have a bunch of logs that are just key/value pairs: a transaction-id mapped to server events in JSON form. I have scripts which add a new partition for each day's log data. To Hive it's just a two-column table, although the second column is one huge JSON field. To make that column queryable, I created a bunch of UDFs which understand our log format, so I can do stuff like:

select count(tid) from txns where UserAgent(txn) like '%Chrome%' and FooBar(txn) = 'baz';
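The daily partition-adding script can be sketched in Python as a small HQL generator. This is a minimal sketch, assuming the table is partitioned by a string column `dt` and that each day's logs land under a matching HDFS path; the column name and path layout are illustrative, not from the original post:

```python
from datetime import date

def add_partition_hql(table: str, day: date) -> str:
    """Build the HQL that registers one day's log directory as a partition.

    Assumes a `dt` partition column and a /logs/<table>/dt=<date> HDFS
    layout -- both hypothetical names for illustration.
    """
    dt = day.isoformat()
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}') "
        f"LOCATION '/logs/{table}/dt={dt}'"
    )

print(add_partition_hql("txns", date(2009, 12, 14)))
```

The generated statement would then be fed to the Hive CLI (or to a wrapper like the execute_hql() mentioned below).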

There are also simple columnar tables which get generated from that transactions table.
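Those derived tables are typically populated by a scheduled INSERT OVERWRITE that flattens the JSON column through the UDFs. A hedged sketch, reusing `UserAgent` and `FooBar` from the query above; the `browser_stats` rollup table and the `dt` partition column are hypothetical names:

```python
def build_daily_rollup_hql(dt: str) -> str:
    """Build HQL that flattens one day's raw txns into a columnar table.

    The rollup table name and partition scheme are assumptions for
    illustration; only the UDFs come from the original example.
    """
    return (
        "INSERT OVERWRITE TABLE browser_stats "
        f"PARTITION (dt='{dt}') "
        "SELECT UserAgent(txn), FooBar(txn), count(tid) "
        f"FROM txns WHERE dt='{dt}' "
        "GROUP BY UserAgent(txn), FooBar(txn)"
    )
```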

I've written python wrappers for many Hive commands: execute_hql(), add_partition(), drop_partition(), etc.
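A wrapper like that can be as thin as a subprocess call into the Hive CLI. This is a minimal sketch of the idea, not the author's actual code; it assumes a `hive` binary on the PATH and uses the CLI's `-S` (silent) and `-e` (execute inline) flags:

```python
import subprocess

def hive_command(hql: str, hive_bin: str = "hive") -> list:
    # -S suppresses Hive's progress chatter; -e runs the statement inline.
    return [hive_bin, "-S", "-e", hql]

def execute_hql(hql: str) -> str:
    """Run one HQL statement through the Hive CLI and return its stdout.

    A real wrapper would add logging, retries, and error parsing.
    """
    result = subprocess.run(
        hive_command(hql), capture_output=True, text=True, check=True
    )
    return result.stdout

def add_partition(table: str, dt: str) -> None:
    execute_hql(
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (dt='{dt}')"
    )

def drop_partition(table: str, dt: str) -> None:
    execute_hql(
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{dt}')"
    )
```

The higher-level helpers (add_partition, drop_partition) then just format HQL strings and delegate to execute_hql().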

Bobby   

On Dec 14, 2009, at 1:00 PM, Vijay wrote:

Can anyone share their ETL workflow experiences with Hive? For example, how do you transform data from log files to Hive tables? Do you use hive with map/reduce scripts or do you use hive programmatically? Or do you do something entirely different? I haven't found any samples or details about the programmatic usage of hive.

Thanks in advance,
Vijay
