We have a similar use case: Streamific, the Ingestion Service for Hadoop
Big Data at Uber Engineering <https://eng.uber.com/streamific/>. We had
this data ingestion pipeline built on MySQL/Schemaless
<https://eng.uber.com/schemaless-part-one/> before using Cassandra. For
Cassandra, we used to dual-write to Cassandra and Kafka, and we are moving
to CDC (since dual writes have their own issues). Here is one of the use
cases we open-sourced: Introducing AthenaX, Uber Engineering’s Open Source
Streaming Analytics Platform <https://eng.uber.com/athenax/>. For most of
our use cases, we cannot put Kafka in front of Cassandra and still meet our
consistency requirements. We face the same challenges
<https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf> with
CDC, and here is what we currently do for deduplication and full-row updates
(not perfect; we're still working on improving it):

Deduplication: currently we de-dup the data in the Kafka consumer instead
of the producer, which means there are 3 copies (one per replica, i.e. the
RF) of the data in Kafka. We're working on dedup with a cache as mentioned
before (also in the slides
<https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf>),
but we also want to make sure the downstream consumer is able to handle
duplicated data, as the cache won't de-dup 100% of the data (and in our
case, the cache layer has a lower SLA).
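
For illustration, a minimal sketch of the consumer-side dedup (not our
production code): it assumes each CDC message carries a stable key, e.g. a
digest of the partition key plus writetime set by the per-replica producers,
and it uses a bounded in-memory map as a stand-in for the external cache layer.

import java.time.Duration;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DedupingCdcConsumer {

    // Stand-in for the external cache layer: a bounded LRU map of keys we have
    // already processed. A real deployment would use a shared cache instead.
    private static final int MAX_TRACKED_KEYS = 1_000_000;
    private static final Map<String, Boolean> seen =
        new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > MAX_TRACKED_KEYS;
            }
        };

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "cassandra-cdc-dedup");        // placeholder group
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("cassandra-cdc"));  // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> record :
                         consumer.poll(Duration.ofMillis(500))) {
                    // Message key = digest of partition key + writetime, so the
                    // RF copies of the same mutation share a key.
                    if (seen.putIfAbsent(record.key(), Boolean.TRUE) == null) {
                        process(record.value());   // first copy wins; later copies dropped
                    }
                }
            }
        }
    }

    private static void process(String mutation) {
        // Downstream handling should still be idempotent, since no cache
        // guarantees 100% dedup.
        System.out.println(mutation);
    }
}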

Full-row update: MySQL provides the full row in the binlog, whereas the
Cassandra commitlog only has the updated fields. However, the downstream
consumer has all the historical data, so the partial update can be merged
there: Hudi: Uber Engineering’s Incremental Processing Framework on Hadoop
<https://eng.uber.com/hoodie/>, which is also open-sourced here
<https://uber.github.io/hudi/index.html>.
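
Conceptually, the downstream merge is just per-column last-write-wins
against the last materialized version of the row. A toy sketch with plain
maps (this is not Hudi's API, just an illustration of the idea):

import java.util.HashMap;
import java.util.Map;

public class FullRowMerge {

    /**
     * Merge the fields present in a partial CDC update into the previously
     * materialized full row, keeping the newer value per column.
     * columnWritetimes tracks the writetime of the value currently held
     * for each column.
     */
    static Map<String, Object> merge(Map<String, Object> lastFullRow,
                                     Map<String, Object> partialUpdate,
                                     long updateWritetime,
                                     Map<String, Long> columnWritetimes) {
        Map<String, Object> merged = new HashMap<>(lastFullRow);
        for (Map.Entry<String, Object> column : partialUpdate.entrySet()) {
            long current = columnWritetimes.getOrDefault(column.getKey(), Long.MIN_VALUE);
            if (updateWritetime >= current) {          // last write wins, per column
                merged.put(column.getKey(), column.getValue());
                columnWritetimes.put(column.getKey(), updateWritetime);
            }
        }
        return merged;
    }
}

Because the comparison is per column, events can be applied in any order and
the consumer still converges on the full row, even though each Cassandra CDC
event only carries the changed columns.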

Just FYI, Elasticsearch is also another consumer of the Kafka topic: Databook:
Turning Big Data into Knowledge with Metadata at Uber
<https://eng.uber.com/databook/>. And we open-sourced the data auditing
system for the pipeline: Introducing Chaperone: How Uber Engineering Audits
Kafka End-to-End <https://eng.uber.com/chaperone/>.
We're also exploring cache invalidation with CDC; currently the update lag
(10 seconds) is the blocking issue for that.

On Wed, Sep 12, 2018 at 2:18 AM DuyHai Doan <doanduy...@gmail.com> wrote:

> The biggest problem of having CDC working correctly in C* is the
> deduplication issue.
>
> Having a process to read incoming mutations from the commitlog is not that
> hard; having to dedup them across N replicas is much harder.
>
> The idea is: why don't we generate the CDC event directly on the
> coordinator side? Indeed, the coordinator is the single source of truth for
> each mutation request. As soon as the coordinator receives 1
> acknowledgement from any replica, the mutation can be considered "durable"
> and safely sent downstream to the CDC processor. This approach would
> require changing the write path on the coordinator side and may have an
> impact on performance (if writing to the CDC downstream is blocking or too slow)
>
> My 2 cents
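
A rough sketch of the coordinator-side idea above, purely for illustration:
the ReplicaClient and CdcSink interfaces below are hypothetical stand-ins,
not Cassandra internals. The point is only that the event is published once
per mutation, after the first replica ack, so no per-replica dedup is needed.

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for the coordinator's replica RPC and a CDC publisher.
interface ReplicaClient { CompletableFuture<Void> apply(String mutation); }
interface CdcSink { void publish(String mutation); }   // e.g. an async Kafka producer

class CoordinatorCdcHook {
    private final List<ReplicaClient> replicas;
    private final CdcSink cdc;

    CoordinatorCdcHook(List<ReplicaClient> replicas, CdcSink cdc) {
        this.replicas = replicas;
        this.cdc = cdc;
    }

    void write(String mutation) {
        CompletableFuture<?>[] acks = replicas.stream()
                .map(r -> r.apply(mutation))
                .toArray(CompletableFuture[]::new);
        // As soon as ANY replica acks, the mutation is considered durable and is
        // published exactly once to the CDC stream; publishing must stay
        // non-blocking so it does not slow down the write path.
        CompletableFuture.anyOf(acks).thenRun(() -> cdc.publish(mutation));
    }
}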
>
> On Wed, Sep 12, 2018 at 5:56 AM, Joy Gao <j...@wepay.com.invalid> wrote:
>
>> Re Rahul:  "Although DSE advanced replication does one way, those are use
>> cases with limited value to me because ultimately it’s still a master slave
>> design."
>> Completely agree. I'm not familiar with the Calvin protocol, but that sounds
>> interesting (reading time...).
>>
>> On Tue, Sep 11, 2018 at 8:38 PM Joy Gao <j...@wepay.com> wrote:
>>
>>> Thank you all for the feedback so far.
>>>
>>> The immediate use case for us is setting up a real-time streaming data
>>> pipeline from C* to our Data Warehouse (BigQuery), where other teams can
>>> access the data for reporting/analytics/ad-hoc query. We already do
>>> this with MySQL
>>> <https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka>,
>>> where we stream the MySQL Binlog via Debezium <https://debezium.io>'s
>>> MySQL Connector to Kafka, and then use a BigQuery Sink Connector to stream
>>> data to BigQuery.
>>>
>>> Re Jon's comment about why not write to Kafka first: in some cases that
>>> may be ideal, but one potential concern we have with writing to Kafka first
>>> is not having "read-after-write" consistency. The data could be written to
>>> Kafka but not yet consumed by C*. If the web service issues a (quorum)
>>> read immediately after the (quorum) write, the data being returned
>>> could still be outdated if the consumer has not caught up. Having the web
>>> service interact with C* directly solves this problem for us (we could add
>>> a cache before writing to Kafka, but that adds operational
>>> complexity to the architecture; alternatively, we could write to Kafka and
>>> C* transactionally, but distributed transactions are slow).
>>>
>>> Having the ability to stream its data to other systems could make C*
>>> more flexible and more easily integrated into a larger data ecosystem. As
>>> Dinesh has mentioned, implementing this in the database layer means there
>>> is a standard approach to getting a change notification stream (unlike
>>> triggers, which are ad hoc and customized). Aside from replication, the change
>>> events could be used for updating Elasticsearch, generating derived views
>>> (e.g. for reporting), sending to an audit service, sending to a
>>> notification service, and, in our case, streaming to our data warehouse for
>>> analytics. (One article that goes over database streaming is Martin
>>> Kleppmann's Turning the Database Inside Out with Apache Samza
>>> <https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/>,
>>> which seems relevant here.) For reference, turning the database into a
>>> stream of change events is pretty common in SQL databases (e.g. MySQL
>>> binlog, Postgres WAL) and NoSQL databases that have a primary-replica setup
>>> (e.g. MongoDB oplog). Recently CockroachDB introduced a CDC feature as well
>>> (and they have masterless replication too).
>>>
>>> Hope that answers the question. That said, dedup/ordering/getting the full
>>> row of data via C* CDC is a hard problem, but it may be worth solving for the
>>> reasons mentioned above. Our proposal is a user-side approach to solving these
>>> problems. Maybe the more sensible thing to do is to build it as part of C*
>>> itself, but that's a much bigger discussion. If anyone is building a
>>> streaming pipeline for C*, we'd be interested in hearing their approaches
>>> as well.
>>>
>>>
>>> On Tue, Sep 11, 2018 at 7:01 AM Rahul Singh <
>>> rahul.xavier.si...@gmail.com> wrote:
>>>
>>>> You know what they say: Go big or go home.
>>>>
>>>> Right now the candidates are Cassandra itself (but embedded or on the side,
>>>> not on the actual data clusters), ZooKeeper (yuck), Kafka (which needs
>>>> ZooKeeper, yuck), and S3 (outside service dependency, so no-go).
>>>>
>>>> Jeff, those are great patterns, especially the second one. I have used it
>>>> several times. Cassandra is a great place to store data in transport.
>>>>
>>>>
>>>> Rahul
>>>> On Sep 10, 2018, 5:21 PM -0400, DuyHai Doan <doanduy...@gmail.com>,
>>>> wrote:
>>>>
>>>> Also, using Calvin means having to implement a distributed monotonic
>>>> sequence as a primitive, which is not trivial at all ...
>>>>
>>>> On Mon, Sep 10, 2018 at 3:08 PM, Rahul Singh <
>>>> rahul.xavier.si...@gmail.com> wrote:
>>>>
>>>>> In response to mimicking Advanced Replication in DSE: I understand the
>>>>> goal. Although DSE Advanced Replication does one-way replication, those
>>>>> are use cases with limited value to me because ultimately it’s still a
>>>>> master-slave design.
>>>>>
>>>>> I’m working on a prototype for two-way replication between clusters or
>>>>> databases regardless of DB tech, and every variation I can get to comes
>>>>> down to some implementation of the Calvin protocol, which basically
>>>>> verifies the change in either cluster, sequences it according to its
>>>>> impact on the underlying data, and then schedules the mutation in a
>>>>> predictable manner on both clusters/DBs.
>>>>>
>>>>> All that means is that I need to sequence the change before it happens,
>>>>> so I can predictably ensure it’s scheduled for write/mutation. So I’m
>>>>> back to square one: having a definitive queue/ledger separate from
>>>>> the individual commit log of the cluster.
>>>>>
>>>>>
>>>>> Rahul Singh
>>>>> Chief Executive Officer
>>>>> m 202.905.2818
>>>>>
>>>>> Anant Corporation
>>>>> 1010 Wisconsin Ave NW, Suite 250
>>>>> Washington, D.C. 20007
>>>>>
>>>>> We build and manage digital business technology platforms.
>>>>> On Sep 10, 2018, 3:58 AM -0400, Dinesh Joshi 
>>>>> <dinesh.jo...@yahoo.com.invalid>,
>>>>> wrote:
>>>>>
>>>>> On Sep 9, 2018, at 6:08 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>
>>>>> There may be some use cases for it... but I'm not sure what they are.
>>>>> It might help if you shared the use cases where the extra complexity is
>>>>> required. When is writing to Cassandra, which then dedupes and writes to
>>>>> Kafka, a preferred design over using Kafka and simply writing to Cassandra?
>>>>>
>>>>>
>>>>> From my reading of the proposal, it seems to bring functionality similar
>>>>> to a MySQL binlog-to-Kafka connector. This is useful for many applications
>>>>> that want to be notified when certain (or any) rows change in the database,
>>>>> primarily for an event-driven application architecture.
>>>>>
>>>>> Implementing this in the database layer means there is a standard
>>>>> approach to getting a change notification stream. Downstream subscribers
>>>>> can then decide which notifications to act on.
>>>>>
>>>>> LinkedIn's Databus is similar in functionality:
>>>>> https://github.com/linkedin/databus
>>>>> However, it is for heterogeneous datastores.
>>>>>
>>>>> On Thu, Sep 6, 2018 at 1:53 PM Joy Gao <j...@wepay.com.invalid> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> We have a WIP design doc
>>>>>> <https://wepayinc.box.com/s/fmdtw0idajyfa23hosf7x4ustdhb0ura> that
>>>>>> goes over this idea in detail.
>>>>>>
>>>>>> We haven't sorted out all the edge cases yet, but we would love to get
>>>>>> some feedback from the community on the general feasibility of this
>>>>>> approach. Any ideas/concerns/questions would be helpful to us. Thanks!
>>>>>>
>>>>>>
>>>>> Interesting idea. I did go over the proposal briefly. I concur with
>>>>> Jon about adding more use cases to clarify this feature's potential.
>>>>>
>>>>> Dinesh
>>>>>
>>>>>
>>>>
>
