We do a couple of different things. First, we have a bunch of logs
that are just key/value pairs mapping a transaction ID to a server
event in JSON form. I have scripts which add a new partition for
every day's log data. To Hive, it's just a two-column table, although
the second column is a huge JSON field. To make that column
queryable, I created a bunch of UDFs which understand our log format,
so I can do stuff like:
select count(tid) from txns where UserAgent(txn) like '%Chrome%' and
FooBar(txn) = 'baz';
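Hive UDFs like the ones above are implemented in Java, but the logic they apply to the JSON column is easy to sketch. The following Python function shows roughly what a UDF such as UserAgent(txn) would do; the field name "user_agent" and the exact log format are assumptions, not Bobby's actual schema:

```python
import json

def user_agent(txn_json):
    """Sketch of UserAgent(txn): pull one field out of the JSON event column.
    The "user_agent" key is a hypothetical field name."""
    try:
        return json.loads(txn_json).get("user_agent")
    except ValueError:
        # Malformed events map to None, much as a Hive UDF would return NULL.
        return None
```

A real deployment would wrap equivalent extraction logic in a Java class registered with CREATE TEMPORARY FUNCTION so it can be called directly in HQL.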
There are also tables which get generated from that transactions
table; those are simple columnar tables.
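Derived tables like that are typically populated with an INSERT OVERWRITE ... SELECT over the raw table. A minimal sketch, built as an HQL string the way a Python wrapper would generate it; the table name browser_counts and the ds partition key are hypothetical, and UserAgent is the UDF from the query above:

```python
def build_daily_rollup_hql(day):
    """Build the HQL that rebuilds one day's partition of a derived
    columnar table (browser_counts is an illustrative name)."""
    return ("INSERT OVERWRITE TABLE browser_counts PARTITION (ds='%s') "
            "SELECT UserAgent(txn), count(tid) FROM txns "
            "WHERE ds='%s' GROUP BY UserAgent(txn)" % (day, day))
```

A statement like this can then be handed to the Hive CLI once per day after the raw partition lands.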
I've written Python wrappers for many Hive commands: execute_hql(),
add_partition(), drop_partition(), etc.
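Those wrappers can be reconstructed roughly as follows: build the HQL string, then shell out to the Hive CLI. This is a hypothetical sketch, not Bobby's actual code; the ds partition key, the path layout, and the `hive -e` invocation are all assumptions:

```python
import subprocess

def execute_hql(hql):
    """Run one HQL statement via the Hive CLI and return its stdout."""
    out = subprocess.run(["hive", "-e", hql],
                         capture_output=True, text=True, check=True)
    return out.stdout

def add_partition_hql(table, day, location):
    # One partition per day's log data, keyed on a hypothetical ds column.
    return ("ALTER TABLE %s ADD PARTITION (ds='%s') LOCATION '%s'"
            % (table, day, location))

def drop_partition_hql(table, day):
    return "ALTER TABLE %s DROP PARTITION (ds='%s')" % (table, day)

def add_partition(table, day, location):
    return execute_hql(add_partition_hql(table, day, location))

def drop_partition(table, day):
    return execute_hql(drop_partition_hql(table, day))
```

Keeping the string-building separate from the CLI call makes the generated HQL easy to log and test before anything touches the cluster.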
Bobby
On Dec 14, 2009, at 1:00 PM, Vijay wrote:
Can anyone share their ETL workflow experiences with Hive? For
example, how do you transform data from log files to Hive tables? Do
you use Hive with map/reduce scripts, or do you use Hive
programmatically? Or do you do something entirely different? I
haven't found any samples or details about the programmatic usage of
Hive.
Thanks in advance,
Vijay