To generate smart output from the base data we need to copy some base tables
from a relational database into Hadoop. Some of them are big. Dumping each
entire table into Hadoop every day is not an option, since there are 30+
tables and each full dump would take several hours.

The approach we took is to take a full dump of each table once. Then, each
day or every 4-6 hours, we pull from the RDBMS only the rows that were
inserted/updated/deleted since the last copy (based on a date field in the
table). Using Hive, we outer join + union the new data with the existing data
and write the result into a new file. For example, if there are 100 rows in
Hadoop and, since the last Hadoop copy, 3 records were inserted, 2 updated
and 1 deleted in the RDBMS, then the Hive query keeps the 97 unchanged rows,
adds the 3 inserts and the 2 updated rows, drops the deleted row, and writes
the result into a new file. Downstream applications (Pig or Hive) pick the
most recent file when selecting/loading data from those base table data
files.
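
For reference, the merge is essentially the pattern sketched below. This is a
rough illustration, not our exact query: base_table, delta_table,
base_table_new, the columns (id, name, last_modified) and the op flag marking
deletes are all placeholder names, and flagging deletes in the delta is an
assumption on top of the date-field extraction described above.

    -- Assumed (hypothetical) schemas:
    --   base_table(id INT, name STRING, last_modified STRING)   -- current full copy in Hadoop
    --   delta_table(id INT, name STRING, last_modified STRING, op STRING)
    --     rows changed since the last copy; op = 'I'/'U'/'D'
    INSERT OVERWRITE TABLE base_table_new
    SELECT t.id, t.name, t.last_modified
    FROM (
        -- base rows whose key was not touched by the delta (the "97 unchanged" rows)
        SELECT b.id, b.name, b.last_modified
        FROM base_table b
        LEFT OUTER JOIN delta_table d ON (b.id = d.id)
        WHERE d.id IS NULL
        UNION ALL
        -- inserted and updated rows from the delta; deleted rows are simply not selected
        SELECT d.id, d.name, d.last_modified
        FROM delta_table d
        WHERE d.op <> 'D'
    ) t;

The left outer join plus the IS NULL filter is what forces all of the
unchanged data to be read and rewritten on every run, which is exactly the
cost described in the next paragraph.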

This logic works fine in lower environments for small tables. With
production data, for a table of about 30 GB, the incremental regeneration of
the file in Hadoop still takes several hours. I tried a compressed (zipped)
version and it took even longer. I am not convinced this is the best we can
do to handle updates and deletes, since we have to rewrite the roughly 29 GB
of unchanged data in the 30 GB file into a new file every time ...and this is
not the biggest table.

I imagine this is a problem for many companies. What other approaches are
there for applying updates and deletes from base tables to the Hadoop data
files?

We have 4 data nodes and are running version 20.3.

Thanks!
