I've just started using Hive, but I'll share my experience with loading data. We have raw log files in HDFS that we want to keep and not change. They sometimes also have more fields than we want available in Hive tables, so here's how we import that data: a Hive script creates an external table (e.g., real_table_stg) with its location pointing to the raw data file to be imported. The script then runs "insert into real_table select <stuff> from real_table_stg", where <stuff> might be as simple as "*" or might filter or cleanse the raw data from the external table.
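A minimal sketch of that staging pattern in HiveQL. All names here (real_table, real_table_stg, the columns, and the HDFS path) are placeholders I made up for illustration, and I'm using INSERT OVERWRITE, which is the load syntax Hive supported at the time:

```sql
-- Hypothetical external table over the raw logs; the raw files stay
-- untouched in place because the table is EXTERNAL.
CREATE EXTERNAL TABLE real_table_stg (
  ts STRING,
  user_id STRING,
  action STRING,
  extra_field STRING   -- present in the raw logs but not wanted downstream
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/logs/raw/2009-12-14';

-- Copy only the wanted, cleansed columns into the managed table.
INSERT OVERWRITE TABLE real_table
SELECT ts, user_id, lower(action)
FROM real_table_stg
WHERE user_id IS NOT NULL;
```

Because the staging table is external, dropping it afterwards removes only the table metadata, not the raw files.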
One notable issue I've had is the lack of variable substitution in Hive scripts. This makes every Hive script a template requiring preprocessing to replace values that change with each run: locations, partition names, etc. I'm currently using Groovy scripts to perform the template processing and run the Hive jobs. I have also found it convenient to use Hive's map/reduce support in cases where expressing a transformation in SQL is hard.

-Bryan

On Dec 14, 2009, at 1:00 PM, Vijay wrote:

> Can anyone share their ETL workflow experiences with Hive? For example, how
> do you transform data from log files to Hive tables? Do you use Hive with
> map/reduce scripts, or do you use Hive programmatically? Or do you do
> something entirely different? I haven't found any samples or details about
> the programmatic usage of Hive.
>
> Thanks in advance,
> Vijay
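To illustrate the templating approach I described above: a sketch of what one of our script templates looks like before preprocessing. The ${DT} and ${RAW_DIR} placeholder syntax and the table names are my own invented conventions (the preprocessing step substitutes them before handing the script to hive), not anything Hive itself interprets:

```sql
-- Script template; ${RAW_DIR} and ${DT} are filled in per run by the
-- external template processor (Groovy, in my case) before execution.
ALTER TABLE real_table_stg SET LOCATION '${RAW_DIR}/dt=${DT}';

INSERT OVERWRITE TABLE real_table PARTITION (dt='${DT}')
SELECT ts, user_id, action
FROM real_table_stg;
```

The Groovy driver just does string substitution on this file and then shells out to the hive CLI with the resulting script.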
