CDC is something I would like to see in Kudu, but we aren't there yet
with the underlying support in the Raft Consensus implementation. Once we
have higher-availability re-replication support (KUDU-1097) we will be a
bit closer to a solution involving traditional WAL streaming to an
external consumer, because we will have support for non-voting replicas.
Even then there would still be plenty of work to do to support CDC, both
from an API perspective and from a WAL management perspective (how long
to keep old log files around).

That said, what you are really asking for is a streaming backup solution,
which may or may not end up using the same mechanism (unfortunately, that
hasn't been designed or implemented yet).

As an alternative to Adar's suggestions, a reasonable option for you at
this time may be an incremental backup. It takes a little schema design to
do it, though. You could consider doing something like the following:

   1. Add a last_updated column to all your tables and update it whenever
   you change a row. Ideally this timestamp would be monotonic across the
   cluster, but you could also use local time and build in a "fudge
   factor" when reading in step 2.
   2. Periodically scan each table for rows whose last_updated value is
   newer than the previous scan. This type of scan is more efficient in
   Kudu than in many other systems. With Impala you could run a query like:
   select * from table1 where last_updated > $prev_updated;
   3. Dump the results of this query to Parquet files.
   4. Use distcp to periodically copy the Parquet files over to the other
   cluster (you may want to throttle this to avoid saturating the pipe).
   5. Upsert the Parquet data into Kudu on the remote end (a rough Impala
   sketch of steps 2-5 follows this list).
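
To make steps 2-5 more concrete, here is a minimal Impala sketch. The
table names (table1_incr, kudu_table1), the $prev_updated placeholder,
and the distcp paths are all hypothetical, so treat this as an outline
under those assumptions rather than a tested recipe:

   -- Steps 2-3, on the primary cluster: materialize one increment's
   -- worth of changes as a Parquet table (substitute the timestamp
   -- recorded at the end of the previous run for $prev_updated).
   CREATE TABLE table1_incr STORED AS PARQUET AS
   SELECT * FROM table1 WHERE last_updated > $prev_updated;

   -- Step 4, outside Impala: copy the Parquet files to the remote
   -- cluster, e.g. with something like
   --   hadoop distcp hdfs://primary/.../table1_incr hdfs://dr/.../table1_incr
   -- and expose them there as an external Parquet table.

   -- Step 5, on the remote cluster: apply the increment to the Kudu
   -- table. UPSERT inserts new rows and updates existing ones by
   -- primary key, so re-applying an increment is harmless.
   UPSERT INTO kudu_table1 SELECT * FROM table1_incr;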

Hopefully some workaround like this would work for you until Kudu has a
reliable streaming backup solution.

Like Adar said, as an Apache project we are always open to contributions
and it would be great to get some in this area. Please reach out if you're
interested in collaborating on a design.

Mike

On Fri, Sep 22, 2017 at 10:43 AM, Adar Lieber-Dembo <a...@cloudera.com>
wrote:

> Franco,
>
> Thanks for the detailed description of your problem.
>
> I'm afraid there's no such mechanism in Kudu today. Mining the WALs seems
> like a path fraught with land mines. Kudu GCs WAL segments aggressively so
> I'd be worried about a listening mechanism missing out on some row
> operations. Plus the WAL is Raft-specific as it includes both REPLICATE
> messages (reflecting a Write RPC from a client) and COMMIT messages
> (written out when a majority of replicas have written a REPLICATE); parsing
> and making sense of this would be challenging. Perhaps you could build
> something using Linux's inotify system for receiving file change
> notifications, but again I'd be worried about missing certain updates.
>
> Another option is to replicate the data at the OS level. For example, you
> could periodically rsync the entire cluster onto a standby cluster. There's
> bound to be data loss in the event of a failover, but I don't think you'll
> run into any corruption (though Kudu does take advantage of sparse files
> and hole punching, so you should verify that any tool you use supports
> that).
>
> Disaster Recovery is an oft-requested feature, but one that Kudu
> developers have been unable to prioritize yet. Would you or someone on
> your team be interested in working on this?
>
> On Thu, Sep 21, 2017 at 7:12 PM Franco Venturi <fvent...@comcast.net>
> wrote:
>
>> We are planning for a 50-100TB Kudu installation (about 200 tables or so).
>>
>> One of the requirements that we are working on is to have a secondary
>> copy of our data in a Disaster Recovery data center in a different location.
>>
>>
>> Since we are going to have inserts, updates, and deletes (for instance in
>> the case the primary key is changed), we are trying to devise a process
>> that will keep the secondary instance in sync with the primary one. The two
>> instances do not have to be identical in real-time (i.e. we are not looking
>> for synchronous writes to Kudu), but we would like to have some pretty good
>> confidence that the secondary instance contains all the changes that the
>> primary has up to say an hour before (or something like that).
>>
>>
>> So far we considered a couple of options:
>> - refreshing the secondary instance with a full copy of the primary one
>> every so often, but that would mean having to transfer say 50TB of data
>> between the two locations every time, and our network bandwidth constraints
>> would prevent us from doing that even on a daily basis
>> - having a column that contains the most recent time a row was updated,
>> however this column couldn't be part of the primary key (because the
>> primary key in Kudu is immutable), and therefore finding which rows have
>> been changed every time would require a full scan of the table to be
>> sync'd. It would also rely on the "last update timestamp" column to be
>> always updated by the application (an assumption that we would like to
>> avoid), and would need some other process to take into account the rows
>> that are deleted.
>>
>>
>> Since many of today's RDBMS (Oracle, MySQL, etc) allow for some sort of
>> 'Change Data Capture' mechanism where only the 'deltas' are captured and
>> applied to the secondary instance, we were wondering if there's any way in
>> Kudu to achieve something like that (possibly mining the WALs, since my
>> understanding is that each change gets applied to the WALs first).
>>
>>
>> Thanks,
>> Franco Venturi
>>
>
