We have a 10-million-row table that is exported from an AS400 mainframe every
day. The table is exported as a CSV text file of about 30 GB, which is then
imported into an RDBMS table that is dropped and recreated daily. We now want
to find how many rows change during each export-import interval. The table has
a primary key, so deletes and inserts can be found quickly with RDBMS joins,
but to find differences between rows that share the same primary key (about
90% of the rows), we must compare them column by column. Our goal is a
comparison process that takes no more than 10 minutes on a 4-node cluster,
where each server has four 4-core 3.0 GHz CPUs, 8 GB of memory, and a 300 GB
local RAID5 array.
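
For concreteness, the per-row test we mean is just a field-by-field
comparison like the Java sketch below (a hypothetical helper; it assumes
naive CSV splitting with no quoted commas, and the primary key in column 0):

    // Returns true if any non-key column differs between the old and new rows.
    static boolean rowChanged(String oldCsv, String newCsv) {
        String[] a = oldCsv.split(",", -1);  // -1 keeps trailing empty fields
        String[] b = newCsv.split(",", -1);
        if (a.length != b.length) return true;
        for (int i = 1; i < a.length; i++) { // skip the primary key in column 0
            if (!a[i].equals(b[i])) return true;
        }
        return false;
    }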

Below is our current solution:
    The old data is kept in the RDBMS with an index created on the primary
key, and the new data is imported into HDFS as the input of our Map-Reduce
job. Every map task connects to the RDBMS and, for each input row, selects the
corresponding old row by primary key; a map task emits output only when a
difference is found, and there are no reduce tasks.
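
To make the current approach concrete, here is a minimal sketch of such a map
task (new Hadoop API; the JDBC URL, credentials, and table layout are
hypothetical, and the primary key is assumed to be the first CSV field):

    import java.io.IOException;
    import java.sql.*;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One JDBC connection per map task; one SELECT per input row.
    public class DiffMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Connection conn;
        private PreparedStatement stmt;

        @Override
        protected void setup(Context ctx) throws IOException {
            try {
                conn = DriverManager.getConnection(
                        "jdbc:db2://dbhost/olddb", "user", "pass");
                stmt = conn.prepareStatement(
                        "SELECT * FROM old_table WHERE pk = ?");
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", -1);
            try {
                stmt.setString(1, fields[0]);  // primary key in column 0
                ResultSet rs = stmt.executeQuery();
                if (rs.next()) {
                    // compare the new row column by column against the old one
                    for (int i = 1; i < fields.length; i++) {
                        String old = rs.getString(i + 1);
                        if (old == null || !old.equals(fields[i])) {
                            ctx.write(new Text(fields[0]), line); // updated row
                            break;
                        }
                    }
                }
                rs.close();
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException {
            try {
                stmt.close();
                conn.close();
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }
    }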

As you can see, as the number of concurrent map tasks increases, the RDBMS
becomes the bottleneck, so we want to take the RDBMS out of the process
entirely. However, we have no idea how to quickly retrieve the old row for a
given key from files in HDFS. Any suggestions are welcome.
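
To show the shape of the lookup we need, here is a hypothetical sketch using
Hadoop's MapFile (a sorted data file plus an index, supporting get-by-key);
the path and key are made up, and we do not know whether rewriting the old
data into such a format and probing it row by row could fit our 10-minute
budget:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    // Probe an indexed, sorted copy of the old data on HDFS by primary key.
    public class OldRowLookup {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            MapFile.Reader reader =
                    new MapFile.Reader(fs, "/data/old_table.map", conf);

            Text key = new Text("12345"); // primary key of the row we want
            Text oldRow = new Text();
            if (reader.get(key, oldRow) != null) {
                System.out.println("old row: " + oldRow);
            }
            reader.close();
        }
    }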
