Hi Raj,

there is no way to append new data to an existing file in HDFS as long as the append functionality is not available. Adding new "records" to a Hive table therefore means creating a new file that contains those records. You do this in the "staging" table, which can get inefficient for large data sets, especially if you run MR jobs on it: at one file per day, after two years you will see more than 700 small files.

To get all records into one file, you run an aggregation step with the SELECT statement you mentioned. The SELECT * reads all the small files, and depending on the number of reducers running (which should be exactly one in this case) a single output file will contain all records of "finaltable"; see the first sketch below. The same could be done with an MR job that uses the identity mapper, the identity reducer, and numReducers = 1.

Populating the staging table just means adding the new file with each day's records to the HDFS folder that contains the table data; the second sketch below shows one way to do that.
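In HiveQL the compaction step could look like this. It is only a sketch: the table names are taken from your mail, "mapred.reduce.tasks" is the classic MapReduce property name from that era, and Hive may compile a plain SELECT * into a map-only job, in which case the file-merge settings (hive.merge.mapfiles) rather than the reducer count decide how many output files you get.

    -- Ask for a single reducer so the rewrite produces one output file
    -- (classic MapReduce property name).
    SET mapred.reduce.tasks=1;

    -- Rewrite finaltable in one pass over all the small staging files.
    INSERT OVERWRITE TABLE finaltable
    SELECT * FROM staging;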
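For the daily load itself, LOAD DATA is enough; it only moves the file into the table's folder, nothing gets rewritten. The path and file name below are just placeholders for wherever your daily extract lands:

    -- Append today's file to the staging table's HDFS folder;
    -- existing files in the folder are left untouched.
    LOAD DATA LOCAL INPATH '/tmp/transactions_2014-02-10.txt'
    INTO TABLE staging;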
Best wishes
Mirko

2014-02-10 3:45 GMT+01:00 Raj Hadoop <hadoop...@yahoo.com>:
>
> Hi,
>
> My requirement is a typical Datawarehouse and ETL requirement. I need to
> accomplish
>
> 1) Daily Insert transaction records to a Hive table or a HDFS file. This
> table or file is not a big table (approximately 10 records per day). I
> don't want to Partition the table / file.
>
> I am reading a few articles on this. It was being mentioned that we need
> to load to a staging table in Hive. And then insert like the below:
>
> insert overwrite table finaltable select * from staging;
>
> I am not getting this logic. How should I populate the staging table
> daily.
>
> Thanks,
> Raj