I've just started using Hive, but I'll share my experience with loading data. We have raw log files in HDFS that we want to keep and not change. They sometimes also have more fields than we want available in Hive tables, so here's how we import that data: a Hive script creates an external table (e.g., real_table_stg) with its location pointing to the raw data file to be imported. The script then runs "insert into real_table select <stuff> from real_table_stg", where <stuff> might be as simple as "*" or might filter or cleanse the raw data from the external table.
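A minimal sketch of that staging pattern in HiveQL. All names here (real_table, real_table_stg, the columns, and the HDFS path) are placeholders I made up for illustration, and I'm using INSERT OVERWRITE, which is the load syntax Hive supported at the time:

```sql
-- Hypothetical external table over the raw logs; the raw files stay
-- untouched in place because the table is EXTERNAL.
CREATE EXTERNAL TABLE real_table_stg (
  ts STRING,
  user_id STRING,
  action STRING,
  extra_field STRING   -- present in the raw logs but not wanted downstream
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/logs/raw/2009-12-14';

-- Copy only the wanted, cleansed columns into the managed table.
INSERT OVERWRITE TABLE real_table
SELECT ts, user_id, lower(action)
FROM real_table_stg
WHERE user_id IS NOT NULL;
```

Because the staging table is external, dropping it afterwards removes only the table metadata, not the raw files.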
One notable issue I've had is the lack of variable substitution in Hive scripts. This makes every Hive script a template requiring preprocessing to replace values that change with each run: locations, partition names, etc. I'm currently using Groovy scripts to perform the template processing and run the Hive jobs. I have also found it convenient to use Hive's map/reduce support in cases where expressing a transformation in SQL is hard.

-Bryan

On Dec 14, 2009, at 1:00 PM, Vijay wrote:

> Can anyone share their ETL workflow experiences with Hive? For example, how
> do you transform data from log files to Hive tables? Do you use Hive with
> map/reduce scripts, or do you use Hive programmatically? Or do you do
> something entirely different? I haven't found any samples or details about
> the programmatic usage of Hive.
>
> Thanks in advance,
> Vijay
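To illustrate the templating approach I described above: a sketch of what one of our script templates looks like before preprocessing. The ${DT} and ${RAW_DIR} placeholder syntax and the table names are my own invented conventions (the preprocessing step substitutes them before handing the script to hive), not anything Hive itself interprets:

```sql
-- Script template; ${RAW_DIR} and ${DT} are filled in per run by the
-- external template processor (Groovy, in my case) before execution.
ALTER TABLE real_table_stg SET LOCATION '${RAW_DIR}/dt=${DT}';

INSERT OVERWRITE TABLE real_table PARTITION (dt='${DT}')
SELECT ts, user_id, action
FROM real_table_stg;
```

The Groovy driver just does string substitution on this file and then shells out to the hive CLI with the resulting script.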
