We are planning for a 50-100TB Kudu installation (about 200 tables or so).
One of the requirements that we are working on is to have a secondary copy of our data in a Disaster Recovery data center in a different location. Since we are going to have inserts, updates, and deletes (for instance in the case the primary key is changed), we are trying to devise a process that will keep the secondary instance in sync with the primary one. The two instances do not have to be identical in real-time (i.e. we are not looking for synchronous writes to Kudu), but we would like to have some pretty good confidence that the secondary instance contains all the changes that the primary has up to say an hour before (or something like that). So far we considered a couple of options: - refreshing the seconday instance with a full copy of the primary one every so often, but that would mean having to transfer say 50TB of data between the two locations every time, and our network bandwidth constraints would prevent to do that even on a daily basis - having a column that contains the most recent time a row was updated, however this column couldn't be part of the primary key (because the primary key in Kudu is immutable), and therefore finding which rows have been changed every time would require a full scan of the table to be sync'd. It would also rely on the "last update timestamp" column to be always updated by the application (an assumption that we would like to avoid), and would need some other process to take into accounts the rows that are deleted. Since many of today's RDBMS (Oracle, MySQL, etc) allow for some sort of 'Change Data Capture' mechanism where only the 'deltas' are captured and applied to the secondary instance, we were wondering if there's any way in Kudu to achieve something like that (possibly mining the WALs, since my understanding is that each change gets applied to the WALs first). Thanks, Franco Venturi