To generate smart output from base data, we need to copy some base tables from a relational database into Hadoop. Some of them are big. Dumping each entire table into Hadoop every day is not an option: there are 30+ tables, and a full dump of each takes several hours.
Our approach is to take a full table dump once, then every day (or every 4-6 hours) pull from the RDBMS only the rows inserted, updated, or deleted since the last copy, based on a date field in the table. Using Hive, we outer-join and union the new data with the existing data and write the result into a new file. For example, if there are 100 rows in Hadoop and, since the last copy, 3 records were inserted, 2 updated, and 1 deleted in the RDBMS, the Hive query keeps the 97 unchanged rows, adds the 3 inserts and 2 updates, and writes all 102 into a new file. Downstream applications (Pig, Hive) pick the most recent file when selecting/loading from those base-table data files.

This logic works fine in lower environments with small tables. With production data, though, the incremental regeneration of the file for a ~30 GB table still takes several hours. I tried a compressed version and it took even longer. I'm not convinced this is the best way to handle updates and deletes, since we have to rewrite the 29 GB of unchanged data in the 30 GB file just to produce a new one... and this is not even the biggest table.

I imagine this is a problem for many companies. What other approaches are there for applying updates and deletes on base tables to the Hadoop data files? We have 4 data nodes and are using version 20.3. Thanks!
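For clarity, here is the reconciliation logic described above as a small in-memory Python sketch (the names `merge_snapshot`, `base`, and `delta`, and the `I`/`U`/`D` op codes are hypothetical; in our setup this join actually runs as a distributed Hive query over HDFS files, not in memory):

```python
# Hypothetical in-memory stand-in for the Hive outer-join + union merge:
# "base" is the previous full snapshot, "delta" is the change set pulled
# from the RDBMS since the last copy.

def merge_snapshot(base, delta):
    """base: dict {key: row}; delta: list of (op, key, row), op in I/U/D.
    Returns the next full snapshot (analogous to writing a new file)."""
    result = dict(base)            # start from all previous rows
    for op, key, row in delta:
        if op == "D":
            result.pop(key, None)  # delete: drop the row from the snapshot
        else:
            result[key] = row      # insert or update: take the delta row
    return result

# Mirrors the example: 100 rows, then 3 inserts, 2 updates, 1 delete.
base = {i: ("old", i) for i in range(100)}
delta = [("I", 100, ("new", 100)), ("I", 101, ("new", 101)),
         ("I", 102, ("new", 102)),
         ("U", 5, ("upd", 5)), ("U", 7, ("upd", 7)),
         ("D", 9, None)]
snapshot = merge_snapshot(base, delta)
assert len(snapshot) == 102        # 97 unchanged + 3 inserts + 2 updates
assert snapshot[5] == ("upd", 5) and 9 not in snapshot
```

The pain point is exactly what the sketch makes visible: even though only 6 rows changed, the entire new snapshot (all 102 rows, i.e. the other 29 GB in production) gets rewritten every cycle.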