[ https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331768#comment-15331768 ]
Joshua McKenzie commented on CASSANDRA-8844: -------------------------------------------- bq. That's what is in the code That's a pretty nasty idiom as it leaves a lot of inter-version cruft around. Would be really nice to have versioned schema for these things cleanly abstracted so we could just deserialize a version and construct w/the appropriate method from there. But that's a problem for another day. > Change Data Capture (CDC) > ------------------------- > > Key: CASSANDRA-8844 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8844 > Project: Cassandra > Issue Type: New Feature > Components: Coordination, Local Write-Read Paths > Reporter: Tupshin Harper > Assignee: Joshua McKenzie > Priority: Critical > Fix For: 3.x > > > "In databases, change data capture (CDC) is a set of software design patterns > used to determine (and track) the data that has changed so that action can be > taken using the changed data. Also, Change data capture (CDC) is an approach > to data integration that is based on the identification, capture and delivery > of the changes made to enterprise data sources." > -Wikipedia > As Cassandra is increasingly being used as the Source of Record (SoR) for > mission critical data in large enterprises, it is increasingly being called > upon to act as the central hub of traffic and data flow to other systems. In > order to try to address the general need, we (cc [~brianmhess]), propose > implementing a simple data logging mechanism to enable per-table CDC patterns. > h2. The goals: > # Use CQL as the primary ingestion mechanism, in order to leverage its > Consistency Level semantics, and in order to treat it as the single > reliable/durable SoR for the data. > # To provide a mechanism for implementing good and reliable > (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) > continuous semi-realtime feeds of mutations going into a Cassandra cluster. > # To eliminate the developmental and operational burden of users so that they > don't have to do dual writes to other systems. > # For users that are currently doing batch export from a Cassandra system, > give them the opportunity to make that realtime with a minimum of coding. > h2. The mechanism: > We propose a durable logging mechanism that functions similar to a commitlog, > with the following nuances: > - Takes place on every node, not just the coordinator, so RF number of copies > are logged. > - Separate log per table. > - Per-table configuration. Only tables that are specified as CDC_LOG would do > any logging. > - Per DC. We are trying to keep the complexity to a minimum to make this an > easy enhancement, but most likely use cases would prefer to only implement > CDC logging in one (or a subset) of the DCs that are being replicated to > - In the critical path of ConsistencyLevel acknowledgment. Just as with the > commitlog, failure to write to the CDC log should fail that node's write. If > that means the requested consistency level was not met, then clients *should* > experience UnavailableExceptions. > - Be written in a Row-centric manner such that it is easy for consumers to > reconstitute rows atomically. > - Written in a simple format designed to be consumed *directly* by daemons > written in non JVM languages > h2. Nice-to-haves > I strongly suspect that the following features will be asked for, but I also > believe that they can be deferred for a subsequent release, and to guage > actual interest. > - Multiple logs per table. This would make it easy to have multiple > "subscribers" to a single table's changes. A workaround would be to create a > forking daemon listener, but that's not a great answer. > - Log filtering. Being able to apply filters, including UDF-based filters > would make Casandra a much more versatile feeder into other systems, and > again, reduce complexity that would otherwise need to be built into the > daemons. > h2. Format and Consumption > - Cassandra would only write to the CDC log, and never delete from it. > - Cleaning up consumed logfiles would be the client daemon's responibility > - Logfile size should probably be configurable. > - Logfiles should be named with a predictable naming schema, making it > triivial to process them in order. > - Daemons should be able to checkpoint their work, and resume from where they > left off. This means they would have to leave some file artifact in the CDC > log's directory. > - A sophisticated daemon should be able to be written that could > -- Catch up, in written-order, even when it is multiple logfiles behind in > processing > -- Be able to continuously "tail" the most recent logfile and get > low-latency(ms?) access to the data as it is written. > h2. Alternate approach > In order to make consuming a change log easy and efficient to do with low > latency, the following could supplement the approach outlined above > - Instead of writing to a logfile, by default, Cassandra could expose a > socket for a daemon to connect to, and from which it could pull each row. > - Cassandra would have a limited buffer for storing rows, should the listener > become backlogged, but it would immediately spill to disk in that case, never > incurring large in-memory costs. > h2. Additional consumption possibility > With all of the above, still relevant: > - instead (or in addition to) using the other logging mechanisms, use CQL > transport itself as a logger. > - Extend the CQL protoocol slightly so that rows of data can be return to a > listener that didn't explicit make a query, but instead registered itself > with Cassandra as a listener for a particular event type, and in this case, > the event type would be anything that would otherwise go to a CDC log. > - If there is no listener for the event type associated with that log, or if > that listener gets backlogged, the rows will again spill to the persistent > storage. > h2. Possible Syntax > {code:sql} > CREATE TABLE ... WITH CDC LOG > {code} > Pros: No syntax extesions > Cons: doesn't make it easy to capture the various permutations (i'm happy to > be proven wrong) of per-dc logging. also, the hypothetical multiple logs per > table would break this > {code:sql} > CREATE CDC_LOG mylog ON mytable WHERE MyUdf(mycol1, mycol2) = 5 with > DCs={'dc1','dc3'} > {code} > Pros: Expressive and allows for easy DDL management of all aspects of CDC > Cons: Syntax additions. Added complexity, partly for features that might not > be implemented -- This message was sent by Atlassian JIRA (v6.3.4#6332)