I'll be honest, I'm having a hard time wrapping my head around an
architecture where you use CDC to push data into Kafka.  I've worked on
plenty of systems that use Kafka as a means of communication, and one of
the consumers is a process that stores data in Cassandra.  That's pretty
normal.  Sending Cassandra mutations to Kafka, on the other hand, feels
backwards and, for 99% of teams, like more work than it's worth.

There may be some use cases for it, but I'm not sure what they are.  It
might help if you shared the use cases where the extra complexity is
required.  When is writing to Cassandra first, then deduping and forwarding
to Kafka, a preferable design to writing to Kafka and simply having a
consumer write to Cassandra?

If the answer is "because it's fun to solve hard problems," that's OK too!

Jon


On Thu, Sep 6, 2018 at 1:53 PM Joy Gao <j...@wepay.com.invalid> wrote:

> Hi all,
>
> We are fairly new to Cassandra. We began looking into the CDC feature
> introduced in 3.0. As we spent more time looking into it, the complexity
> began to add up (e.g. duplicated mutations based on RF, out-of-order
> mutations, mutations that do not contain the full row of data, etc.).
> These limitations have already been mentioned in the discussion thread in
> CASSANDRA-8844, so we understand the design decisions around this.
> However, we do not want to push solving this complexity to every
> downstream consumer, where each one has to handle
> deduping/ordering/read-before-write to get the full row; instead we want
> to solve it earlier in the pipeline, so the change messages are
> deduped/ordered/complete by the time they arrive in Kafka. Dedupe can be
> solved with a cache, and ordering can be solved since mutations have
> timestamps, but the one we have the most trouble with is not having the
> full row of data.
>
> We had a couple of discussions with folks at other companies who are
> applying the CDC feature to their real-time data pipelines. At a high
> level, the common feedback we gathered is to use a stateful processing
> approach: maintain a separate db to which the mutations are applied, which
> then allows them to construct the "before" and "after" data without having
> to query the original Cassandra db on each mutation. The downside of this
> is the operational overhead of having to maintain this intermediary db for
> CDC.
>
> We have an unconventional idea (inspired by DSE Advanced Replication) that
> eliminates some of the operational overhead, but with the tradeoff of
> increased code complexity and memory pressure. The high-level idea is a
> stateless processing approach where we have a process on each C* node that
> parses mutations from the CDC logs and queries the local node to get the
> "after" data, which avoids network hops and thus makes reading the full
> row of data more efficient. We essentially treat the mutations in the CDC
> log as change notifications. To solve dedupe/ordering, only the primary
> node for each token range will send the data to Kafka, but data is
> reconciled with peer nodes to prevent data loss.
>
> We have a WIP design doc
> <https://wepayinc.box.com/s/fmdtw0idajyfa23hosf7x4ustdhb0ura> that goes
> over this idea in detail.
>
> We haven't sorted out all the edge cases yet, but we would love to get
> some feedback from the community on the general feasibility of this
> approach. Any ideas/concerns/questions would be helpful to us. Thanks!
>
> Joy
>
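
For concreteness, here is a rough sketch of the "dedupe with a cache, order
by write timestamp" piece described above.  This is my own illustration, not
something from the design doc, and all the names are made up: it keeps a
bounded map of the latest write timestamp forwarded per partition key and
drops anything older or already seen.

import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch of "dedupe with a cache": remember the latest write timestamp
// forwarded per partition key and drop anything older or already seen.
// Duplicates show up because each of the RF replicas logs the same mutation.
// All names here are hypothetical.
public class ChangeEventDeduper {

    private final Map<String, Long> lastForwardedTimestamp;

    public ChangeEventDeduper(final int maxEntries) {
        // Bounded, access-ordered LRU map so memory stays flat.
        this.lastForwardedTimestamp =
                new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns true if the event should be forwarded to Kafka, false if it is
    // a duplicate from another replica or an out-of-order/stale mutation.
    public synchronized boolean shouldForward(String partitionKey,
                                              long writeTimestampMicros) {
        Long previous = lastForwardedTimestamp.get(partitionKey);
        if (previous != null && previous >= writeTimestampMicros) {
            return false;
        }
        lastForwardedTimestamp.put(partitionKey, writeTimestampMicros);
        return true;
    }
}

A bounded cache obviously trades correctness at the margins (a late duplicate
for an evicted key would slip through), which I assume is part of the memory
pressure tradeoff mentioned above.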
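
And a similarly rough sketch of the "treat the CDC mutation as a change
notification and read the 'after' image back from the local node" part,
using the DataStax Java driver.  Keyspace, table, and datacenter names are
hypothetical, and actually pinning the read to the local node (the bit that
avoids the network hop) would also need a load balancing policy restricted
to that node.

import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

// Sketch: once a mutation for a partition key appears in the local CDC log,
// read the current ("after") row back and publish that to Kafka instead of
// the raw mutation. Keyspace, table, and datacenter names are hypothetical;
// keeping the read strictly node-local would additionally need a load
// balancing policy pinned to this node.
public class AfterImageReader {

    private final CqlSession session = CqlSession.builder()
            .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
            .withLocalDatacenter("dc1")
            .build();

    public Row readAfterImage(Object partitionKey) {
        return session.execute(
                "SELECT * FROM my_keyspace.my_table WHERE pk = ?", partitionKey)
                .one();
    }
}

In the proposed design that read stays on the node, but it is still a read
per change event.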


-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
