We are planning a 50-100TB Kudu installation (about 200 tables or so).

One of the requirements that we are working on is to have a secondary copy of 
our data in a Disaster Recovery data center in a different location. 

Since we are going to have inserts, updates, and deletes (the latter, for
instance, when a primary key changes), we are trying to devise a process that
will keep the secondary instance in sync with the primary one. The two
instances do not have to be identical in real time (i.e. we are not looking for
synchronous writes to Kudu), but we would like reasonable confidence that the
secondary instance contains all the changes the primary had up to, say, an hour
earlier.

So far we have considered a couple of options: 
- refreshing the secondary instance with a full copy of the primary one every
so often; but that would mean transferring, say, 50TB of data between the two
locations every time, and our network bandwidth constraints would prevent us
from doing that even on a daily basis 
- keeping a column that contains the most recent time a row was updated;
however, this column couldn't be part of the primary key (because the primary
key in Kudu is immutable), so finding which rows have changed would require a
full scan of the table being sync'd each time (see the sketch after this
list). It would also rely on the "last update timestamp" column always being
updated by the application (an assumption we would like to avoid), and would
need some other process to account for deleted rows. 
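
To make the second option concrete, here is a minimal sketch of such an
incremental scan with the Kudu Java client. The master address, table name,
"last_updated" column, and checkpoint value are all hypothetical, and it
assumes "last_updated" is a UNIXTIME_MICROS column:

    import org.apache.kudu.client.*;

    public class IncrementalScan {
        public static void main(String[] args) throws KuduException {
            // Hypothetical master address and table name.
            KuduClient client =
                new KuduClient.KuduClientBuilder("primary-master:7051").build();
            try {
                KuduTable table = client.openTable("my_table");
                // Checkpoint from the previous sync run (microseconds since
                // epoch); in practice this would come from a durable store.
                long lastSyncMicros = 0L;
                KuduPredicate changedSince = KuduPredicate.newComparisonPredicate(
                        table.getSchema().getColumn("last_updated"),
                        KuduPredicate.ComparisonOp.GREATER_EQUAL,
                        lastSyncMicros);
                KuduScanner scanner = client.newScannerBuilder(table)
                        .addPredicate(changedSince)
                        .build();
                while (scanner.hasMoreRows()) {
                    RowResultIterator rows = scanner.nextRows();
                    for (RowResult row : rows) {
                        // Ship this row to the DR cluster, e.g. as an upsert.
                        System.out.println(row.rowToString());
                    }
                }
                scanner.close();
            } finally {
                client.close();
            }
        }
    }

Note that the predicate is pushed down to the tablet servers, so only the
matching rows travel over the network; but since "last_updated" is not part of
the primary key, each tablet still has to read all of its rows, and deleted
rows never appear in the scan at all.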

Since many of today's RDBMSs (Oracle, MySQL, etc.) offer some sort of 'Change
Data Capture' mechanism, where only the 'deltas' are captured and applied to
the secondary instance, we were wondering if there's any way in Kudu to achieve
something like that (possibly by mining the WALs, since my understanding is
that each change is written to the WAL first). 
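
Whichever mechanism captured the deltas, replaying them on the secondary
cluster looks straightforward with the regular client API. A rough sketch with
the Kudu Java client follows; the ChangeRecord shape and the table/column names
are invented for illustration, since as far as we know Kudu exposes no public
CDC stream today:

    import java.util.List;
    import org.apache.kudu.client.*;

    public class ApplyDeltas {
        // Hypothetical delta record; Kudu has no public CDC format,
        // so this shape is made up.
        static class ChangeRecord {
            boolean isDelete;
            String key;     // primary key value
            String payload; // remaining column value(s)
        }

        static void applyDeltas(KuduClient client, List<ChangeRecord> deltas)
                throws KuduException {
            KuduTable table = client.openTable("my_table"); // hypothetical name
            KuduSession session = client.newSession();
            // Batch the operations and flush them together for throughput.
            session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
            for (ChangeRecord change : deltas) {
                // Upserts cover both inserts and updates; deletes only need
                // the key columns to be set.
                Operation op = change.isDelete ? table.newDelete()
                                               : table.newUpsert();
                op.getRow().addString("id", change.key);
                if (!change.isDelete) {
                    op.getRow().addString("payload", change.payload);
                }
                session.apply(op);
            }
            session.flush();
            session.close();
        }
    }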

Thanks, 
Franco Venturi 
