I don't know what made me think back to this thread, but it is a rather cool idea. I guess NoSQL is now more attractive for this, but if you wanted to do it...
a1
pk  first  last     lastmod  deleted
1   A      lincoln  1        0
2   bob    barker   2        0

a2
pk  first  last     lastmod  deleted
1   A      Lincoln  4        0
2   X      X        6        1

select transform(pk, first, last, lastmod, deleted)
  using 'latest_not_delete.pl'
  as (pk1, first1, last1, lastmod1, deleted1)
from (
  select * from (
    select * from a1
    union all
    select * from a2
  ) u
  distribute by pk
  sort by pk, lastmod asc
) s;

Where 'latest_not_delete.pl' would be a script that accepts N rows per key and returns only the most recent. Has anyone ever tried something like this?

On Wed, Jun 9, 2010 at 6:32 PM, atreju <n.atr...@gmail.com> wrote:
> As an ideal solution, I have a suggestion for Hive contributors that would make it
> look like Hive is doing insert/update/delete:
>
> This would require a special type of table-creation syntax; I will call it an
> "UPDATABLE TABLE". The table would have three special columns defined in
> the create script:
> 1. The primary key column (let's say: col_pk).
> 2. A BIGINT date column holding the milliseconds from Jan 1st, 1970 to the
> actual data-manipulation date/time in the RDBMS (dml_date).
> 3. A TINYINT or BOOLEAN column that stores 0 if the record is deleted and 1
> if it is inserted or updated (dml_action).
>
> This requires the RDBMS table to have a PK and a last-update-date column,
> with deletes recorded in some other table by pk and date.
>
> On Day 1, the entire table is put into Hadoop, with the addition of the two
> extra columns: dml_date (bigint) and dml_action.
>
> On Day 2, we first find the max dml_date in the Hive table. Then we query
> the RDBMS for the inserts/updates/deletes since that date/time and write
> them into a file with the correct dml_date/dml_action. The file goes into
> the same folder our Hive table is in.
>
> Right now, if on Day 1 we had 100 rows and on Day 2, 10 rows, a regular Hive
> table would show 110 rows.
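The thread names a Perl script but never shows it; as a rough sketch of the same idea (in Python rather than Perl, assuming tab-separated rows arriving on stdin already sorted by pk and lastmod, with deleted = 0 meaning the row is live, as in the a1/a2 example), latest_not_delete could look something like:

```python
import sys

def latest_not_deleted(lines):
    """For tab-separated rows sorted by (pk, lastmod), keep only the most
    recent row of each pk, and drop the pk entirely if that row is a
    delete (deleted flag != "0")."""
    last = None  # most recent row seen for the current pk
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if last is not None and row[0] != last[0]:
            # pk changed: emit the previous pk's latest row if it is live
            if last[4] == "0":
                yield "\t".join(last)
        last = row
    if last is not None and last[4] == "0":
        yield "\t".join(last)

if __name__ == "__main__":
    for out in latest_not_deleted(sys.stdin):
        print(out)
```

Fed the a1/a2 rows above, this would emit only the updated Lincoln row: pk 2's most recent row has deleted = 1, so it disappears.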
> But since this is a special table (UPDATABLE TABLE), every time it is
> queried, Hive first runs a map-reduce that finds the most recent
> (max(dml_date)) row per pk (group by col_pk) that is not deleted
> (dml_action != 0) and uses that output in the user's query. That is the
> big idea!!
>
> Hive could have insert/update/delete commands that do nothing but create a
> file of manipulated rows with the correct date and action.
>
> There could be a special "flush" kind of command that runs the MR and
> replaces all files in the table directory with a single file. That could
> run weekly, monthly, or maybe after each batch of dml data received from
> the RDBMS.
>
> Sqoop could have a Hive interface that saves certain table attributes (pk
> column, RDBMS connection info, ...) so that with one command from Hive,
> the Hive table gets updated from the RDBMS...
>
> What do you think?
>
> On Tue, Jun 8, 2010 at 3:58 PM, Aaron Kimball <aa...@cloudera.com> wrote:
>>
>> I think this might be the way to go. In general, folding updates and
>> deletes into datasets is a difficult problem due to the append-only
>> nature of datasets.
>>
>> Something that might help you here is to partition your tables in Hive by
>> some well-distributed key. Then, if a relatively small number of
>> partitions is affected by an incremental import (perhaps more
>> recently-imported records are more likely to be updated? In that case,
>> partition the tables by the month/week you imported them), you only have
>> to perform the fold-in of the new deltas on the affected partitions. This
>> should be much faster than a full table scan.
>>
>> Have you seen the Sqoop tool? It handles imports and exports between HDFS
>> (and Hive) and RDBMS systems -- but it can currently only import new
>> records (and subsequent INSERTs); it can't handle updates/deletes.
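Atreju's UPDATABLE TABLE resolution step, the latest row per pk with pks dropped when their latest dml_action is 0, amounts to a group-and-max. A hypothetical in-memory Python illustration (not how Hive would actually implement it, and assuming rows arrive as (pk, dml_date, dml_action, payload) tuples):

```python
def resolve_updatable(rows):
    """Given (pk, dml_date, dml_action, payload) tuples in any order,
    return {pk: payload} for the most recent row of each pk, dropping
    pks whose most recent action is a delete (dml_action == 0)."""
    latest = {}
    for pk, dml_date, dml_action, payload in rows:
        cur = latest.get(pk)
        if cur is None or dml_date > cur[0]:
            latest[pk] = (dml_date, dml_action, payload)
    return {
        pk: payload
        for pk, (dml_date, dml_action, payload) in latest.items()
        if dml_action != 0
    }
```

The proposed "flush" command would be exactly this pass, materialized back into a single file in the table directory.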
>> Sqoop is available at http://github.com/cloudera/sqoop -- it doesn't run
>> on Apache Hadoop 0.20.3, but works on CDH (Cloudera's Distribution for
>> Hadoop) and Hadoop 0.21/trunk.
>>
>> This sort of capability is something I'm really interested in adding to
>> Sqoop. If you've got a well-run process for doing this, I'd really
>> appreciate your help adding this feature :) Send me an email off-list if
>> you're interested. At the very least, I'd urge you to try out the tool.
>>
>> Cheers,
>> - Aaron Kimball
>>
>> On Tue, Jun 8, 2010 at 8:54 PM, atreju <n.atr...@gmail.com> wrote:
>>>
>>> To generate smart output from base data we need to copy some base tables
>>> from a relational database into Hadoop. Some of them are big. Dumping
>>> the entire table into Hadoop every day is not an option, since there are
>>> 30+ tables and each would take several hours.
>>>
>>> The methodology we took is to get an entire table dump first. Then, each
>>> day or every 4-6 hours, we get only the inserts/updates/deletes since
>>> the last copy from the RDBMS (based on a date field in the table). Using
>>> Hive, we outer-join + union the new data with the existing data and
>>> write it into a new file. For example, if there are 100 rows in Hadoop,
>>> and since the last Hadoop copy 3 records were inserted, 2 updated, and 1
>>> deleted in the RDBMS, then the Hive query takes the 97 unchanged rows +
>>> 3 inserts + 2 updates and writes them into a new file. Other
>>> applications like Pig or Hive then pick the most recent file when
>>> selecting/loading data from those base-table data files.
>>>
>>> This logic works fine in lower environments for small tables. With
>>> production data, for a table of about 30GB, the incremental
>>> re-generation of the file in Hadoop still takes several hours. I tried
>>> using a zipped version and it took even longer.
>>> I am not convinced this is the best we can do to handle updates and
>>> deletes, since we had to re-write the 29GB of unchanged data in the 30GB
>>> file into a new file... and this is not the biggest table.
>>>
>>> I am thinking this must be a problem for many companies. What other
>>> approaches are there for applying updates and deletes from base tables
>>> to the Hadoop data files?
>>>
>>> We have 4 data nodes and are using version 0.20.3.
>>>
>>> Thanks!
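For reference, the fold-in atreju describes (97 unchanged rows + 3 inserts + 2 updates written out, with the delete dropped) is logically just a keyed merge of a base snapshot with a delta batch. A minimal Python sketch of the semantics, assuming deltas are tagged 'I'/'U'/'D' (the expensive part in practice is exactly what the sketch makes visible: the unchanged rows are copied into the new snapshot too):

```python
def fold_in(base, deltas):
    """Merge a base snapshot with a batch of deltas, producing the new
    snapshot that would be written out as the next file.
    base:   {pk: row}
    deltas: list of (pk, op, row), op in {'I', 'U', 'D'}."""
    merged = dict(base)  # every unchanged row is rewritten as well
    for pk, op, row in deltas:
        if op == "D":
            merged.pop(pk, None)
        else:
            # insert and update both just replace the row for that pk
            merged[pk] = row
    return merged
```

With 100 base rows, 3 inserts, 2 updates, and 1 delete, the result has 102 rows, which is why the 30GB table forces a near-full rewrite each cycle.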