When Hive is running the map-reduce job, how do we handle concurrent updates/deletions/insertions?
On Wed, Jun 9, 2010 at 3:32 PM, atreju <[email protected]> wrote:

> As an ideal solution, I have a suggestion to the Hive contributors to make it look like Hive is doing insert/update/delete:
>
> This would require a special type of table creation syntax; I will call it an "UPDATABLE TABLE". The table has three special columns that are defined in the create script:
>
> 1. The primary key column (let's say col_pk).
> 2. A BIGINT date column that holds the milliseconds from Jan 1st, 1970 to the actual data manipulation date/time in the RDBMS (dml_date).
> 3. A TINYINT or BOOLEAN column that stores 0 if the record is deleted and 1 if it is inserted or updated (dml_action).
>
> This requires the RDBMS table to have a PK and a last-update date column, and deletes to be recorded in some other table by pk and date.
>
> On Day 1, the entire table is put into Hadoop, with the addition of the two extra columns: dml_date (bigint) and dml_action.
>
> On Day 2, we first find the max dml_date in the Hive table. Then we query the RDBMS for the inserts/updates/deletes since that date/time and write them into a file with the correct dml_date/dml_action. The file goes into the same folder that our Hive table is in.
>
> Right now, if on Day 1 we had 100 rows and on Day 2 we add 10 rows, a regular Hive table would show 110 rows. But since this is a special table (UPDATABLE TABLE), every time it is queried, Hive first runs a map-reduce job that finds the most recent row per pk (max(dml_date), grouped by col_pk), keeps it only if it is not deleted (dml_action != 0), and uses that output in the user's query. That is the big idea!
>
> Hive could have Insert/Update/Delete commands that do nothing but create a file with the manipulated rows carrying the correct date and action.
>
> There could also be a special "flush" kind of command that runs the MR job and replaces all files in the table directory with a single file. That could run weekly, monthly, or maybe each time dml data is received from the RDBMS.
>
> Sqoop could have a Hive interface that saves certain table attributes (pk column, RDBMS connection info, ...) so that with one command from Hive, the Hive table gets updated from the RDBMS.
>
> What do you think?
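For illustration, a minimal HiveQL sketch of the reconstruction step this proposal implies, i.e. the query Hive would effectively run over the raw files before applying the user's query. The table name orders_updatable is made up; col_pk, dml_date and dml_action are the special columns described above.

    -- Illustrative only: keep the latest version of each row, then drop rows
    -- whose latest action is a delete. "orders_updatable" is a made-up name.
    SELECT t.*
    FROM orders_updatable t
    JOIN (
      SELECT col_pk, MAX(dml_date) AS max_dml_date
      FROM orders_updatable
      GROUP BY col_pk
    ) latest
      ON (t.col_pk = latest.col_pk AND t.dml_date = latest.max_dml_date)
    WHERE t.dml_action != 0;

Picking the row with the max dml_date first and only then filtering on dml_action matters: if a row was updated and later deleted, the delete wins and the row disappears from the result.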
> On Tue, Jun 8, 2010 at 3:58 PM, Aaron Kimball <[email protected]> wrote:
>
> > I think that this might be the way to go. In general, folding updates and deletes into datasets is a difficult problem because of the append-only nature of the datasets.
> >
> > Something that might help you here is to partition your tables in Hive based on some well-distributed key. Then, if a relatively small number of partitions is affected by an incremental import (perhaps more recently imported records are more likely to be updated? In that case, partition the tables by the month/week you imported them), you can perform the fold-in of the new deltas on the affected partitions only. This should be much faster than a full table scan.
> >
> > Have you seen the Sqoop tool? It handles imports and exports between HDFS (and Hive) and RDBMS systems -- but it can currently only import new records (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is available at http://github.com/cloudera/sqoop -- it doesn't run on Apache Hadoop 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop 0.21/trunk.
> >
> > This sort of capability is something I'm really interested in adding to Sqoop. If you've got a well-run process for doing this, I'd really appreciate your help adding this feature :) Send me an email off-list if you're interested. At the very least, I'd urge you to try out the tool.
> >
> > Cheers,
> > - Aaron Kimball
> >
> > On Tue, Jun 8, 2010 at 8:54 PM, atreju <[email protected]> wrote:
> >
> >> To generate smart output from base data, we need to copy some base tables from a relational database into Hadoop. Some of them are big. Dumping the entire table into Hadoop every day is not an option, since there are 30+ tables and each would take several hours.
> >>
> >> The methodology we took is to get the entire table dump first, then each day (or every 4-6 hours) pull only the inserts/updates/deletes since the last copy from the RDBMS (based on a date field in the table). Using Hive, we outer join + union the new data with the existing data and write the result into a new file. For example, if there are 100 rows in Hadoop, and in the RDBMS 3 records were inserted, 2 updated and 1 deleted since the last Hadoop copy, then the Hive query takes the 97 unchanged rows + 3 inserts + 2 updates and writes them into a new file. The other applications, like Pig or Hive, pick the most recent file when selecting/loading data from those base-table data files.
> >>
> >> This logic is working fine in lower environments for small tables. With production data, for a table of about 30GB, the incremental regeneration of the file in Hadoop still takes several hours. I tried using a zipped version and it took even longer. I am not convinced this is the best we can do to handle updates and deletes, since we had to rewrite the 29GB of unchanged data in the 30GB file into a new file... and this is not the biggest table.
> >>
> >> I am thinking that this must be a problem for many companies. What are the other approaches to applying updates and deletes on base tables to the Hadoop data files?
> >>
> >> We have 4 data nodes and are using version 0.20.3.
> >>
> >> Thanks!
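For illustration, a minimal sketch of the fold-in described above, done here as a single full outer join rather than an outer join + union. All table and column names (customers_base, customers_delta, id, name, updated_at, dml_action) are made up, and the delta is assumed to mark rows deleted in the RDBMS with dml_action = 'D'.

    -- Illustrative only: fold one batch of deltas into the previous base copy
    -- and write the result out as the new "current" file.
    INSERT OVERWRITE TABLE customers_new
    SELECT
      COALESCE(d.id,         b.id)         AS id,
      COALESCE(d.name,       b.name)       AS name,
      COALESCE(d.updated_at, b.updated_at) AS updated_at
    FROM customers_base b
    FULL OUTER JOIN customers_delta d ON (b.id = d.id)
    WHERE d.dml_action IS NULL      -- row untouched since the last copy
       OR d.dml_action <> 'D';      -- inserts and updates win over the old row

This still rewrites every unchanged row, which is exactly the cost described above; the partition-scoped variant sketched below is one way to shrink it.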

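To make Aaron's partitioning suggestion concrete, a rough sketch using the same made-up names. It assumes the merged copy is partitioned by an import_month column (the period the row was first imported), and additionally that the delta rows carry, or can be joined to obtain, the import_month of the row they modify, so only the affected partitions need to be rebuilt.

    -- Illustrative only: keep the merged copy partitioned by the period the
    -- row was first imported, and rebuild only the partitions a delta touches.
    CREATE TABLE customers_merged (
      id BIGINT,
      name STRING,
      updated_at BIGINT
    )
    PARTITIONED BY (import_month STRING);

    -- Rebuild just the affected partition (June 2010 here) instead of
    -- rewriting the whole 30 GB table.
    INSERT OVERWRITE TABLE customers_merged PARTITION (import_month = '2010-06')
    SELECT
      COALESCE(d.id,         b.id),
      COALESCE(d.name,       b.name),
      COALESCE(d.updated_at, b.updated_at)
    FROM (SELECT * FROM customers_base  WHERE import_month = '2010-06') b
    FULL OUTER JOIN
         (SELECT * FROM customers_delta WHERE import_month = '2010-06') d
      ON (b.id = d.id)
    WHERE d.dml_action IS NULL OR d.dml_action <> 'D';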