Not sure if this will help anything, but just throwing it out there.

The Maxwell and mypipe projects both do CDC from MySQL and support 
bootstrapping. The way they do it is kind of "eventually consistent".

1) At time T1, record coordinates of the end of the binlog as of T1.
2) At time T2, do a full dump of the database into Kafka.
3) Connect back to the binlog at the coordinates recorded in step #1, and emit 
all those records into Kafka.

As Jay mentioned, MySQL supports full row images. At the start of step #3, the 
Kafka topic contains all rows as of time T2. During step #3 you may re-emit rows 
that changed between T1 and T2, so from the point of view of a consumer of the 
Kafka topic, those rows appear to go "back in time". However, as step #3 
progresses and the consumer keeps reading, those rows eventually converge to 
their final state.
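
A rough sketch of that flow in Java, just to make it concrete -- the table, 
column, and topic names are made up, and a real connector (like Maxwell) would 
use a binlog client library for step #3 rather than a comment:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BootstrapSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/shop", "user", "pass")) {
            // Step 1 (time T1): record the current binlog coordinates.
            ResultSet status = conn.createStatement().executeQuery("SHOW MASTER STATUS");
            status.next();
            String binlogFile = status.getString("File");
            long binlogPos = status.getLong("Position");

            // Step 2 (time T2): full dump into Kafka, keyed by primary key.
            ResultSet rows = conn.createStatement().executeQuery("SELECT id, doc FROM orders");
            while (rows.next()) {
                producer.send(new ProducerRecord<>("orders", rows.getString("id"), rows.getString("doc")));
            }
            producer.flush();

            // Step 3: reconnect to the binlog at (binlogFile, binlogPos) and emit every
            // full row image from there on. Rows that changed between T1 and T2 get
            // re-emitted, so the topic converges to the final state.
            System.out.println("Resume binlog from " + binlogFile + ":" + binlogPos);
        }
        producer.close();
    }
}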

Maxwell: https://github.com/zendesk/maxwell
mypipe: https://github.com/mardambey/mypipe

Does that idea help in any way? Btw, a reason it is done this way is that it is 
"difficult" to do #1 and #2 above in a coordinated way without locking the 
database or adding outside dependencies (LVM snapshots being a specific one).

Btw, I glanced at some docs about the MongoDB oplog. It seems that each oplog 
entry contains:
1) A way to identify the document that the change applies to.
2) A series of MongoDB commands (set, unset) to alter the document in #1 so it 
becomes the new document.

Thoughts:
For #1, does it identify a particular "version" of a document? (I don't know 
much about MongoDB.) If so, you might be able to use it to determine whether the 
change should even be applied to the object.
For #2, doesn't that mean you'll need to "understand" MongoDB's syntax and 
commands? Although maybe it is simply sets/unsets/deletes, in which case it 
could be pretty simple.
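
If it really is just sets/unsets, applying a delta to an in-memory copy of the 
document could be as simple as something like this (document and field names are 
made up; a real connector would work with the driver's BSON types rather than 
plain maps):

import java.util.HashMap;
import java.util.Map;

public class OplogApplySketch {
    // Apply a MongoDB-style update delta ($set / $unset) to an in-memory copy
    // of the document identified by the oplog entry.
    static void applyDelta(Map<String, Object> doc, Map<String, Map<String, Object>> delta) {
        Map<String, Object> sets = delta.get("$set");
        if (sets != null) {
            doc.putAll(sets);
        }
        Map<String, Object> unsets = delta.get("$unset");
        if (unsets != null) {
            for (String field : unsets.keySet()) {
                doc.remove(field);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("_id", "order-1");
        doc.put("status", "NEW");
        doc.put("notes", "call first");

        Map<String, Object> sets = new HashMap<>();
        sets.put("status", "SHIPPED");
        Map<String, Object> unsets = new HashMap<>();
        unsets.put("notes", 1);

        Map<String, Map<String, Object>> delta = new HashMap<>();
        delta.put("$set", sets);
        delta.put("$unset", unsets);

        applyDelta(doc, delta);
        System.out.println(doc); // e.g. {status=SHIPPED, _id=order-1}
    }
}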

-James

> On Jan 29, 2016, at 9:39 AM, Jay Kreps <j...@confluent.io> wrote:
>
> Also, most databases provide a "full logging" option that lets you capture
> the whole row in the log (I know Oracle and MySQL have this), but it sounds
> like Mongo doesn't yet. That would be the ideal solution.
>
> -Jay
>
> On Fri, Jan 29, 2016 at 9:38 AM, Jay Kreps <j...@confluent.io> wrote:
>
>> Ah, agreed. This approach is actually quite common in change capture,
>> though. For many use cases getting the final value is actually preferable
>> to getting intermediates. The exception is usually if you want to do
>> analytics on something like number of changes.
>>
>> On Fri, Jan 29, 2016 at 9:35 AM, Ewen Cheslack-Postava <e...@confluent.io>
>> wrote:
>>
>>> Jay,
>>>
>>> You can query after the fact, but you're not necessarily going to get the
>>> same value back. There could easily be dozens of changes to the document in
>>> the oplog, so the delta you see may not even make sense given the current
>>> state of the document. Even if you can apply the delta, you'd still be
>>> seeing data that is newer than the update. You can of course take this
>>> shortcut, but it won't give correct results. And if the data has been
>>> deleted since then, you won't even be able to write the full record... As
>>> far as I know, the way the op log is exposed won't let you do something
>>> like pin a query to the state of the db at a specific point in the op log,
>>> and you may be reading from the beginning of the op log, so I don't think
>>> there's a way to get correct results by just querying the DB for the full
>>> documents.
>>>
>>> Strictly speaking you don't need to get all the data in memory; you just
>>> need a record of the current set of values somewhere. This is what I was
>>> describing following those two options -- if you do an initial dump to
>>> Kafka, you could track only offsets in memory and read back full values as
>>> needed to apply deltas, but this of course requires random reads into your
>>> Kafka topic (but may perform fine in practice depending on the workload).
>>>
>>> -Ewen
>>>
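
(A bare-bones sketch of the offset-tracking idea Ewen describes above -- keep 
only topic/partition/offset per key in memory and seek back into the Kafka topic 
for the last full value. Topic name, partition, and offset here are made up.)

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RandomReadSketch {
    // Look up the last full value for a key, given the (partition, offset)
    // remembered in memory.
    static String readFullValue(KafkaConsumer<String, String> consumer,
                                String topic, int partition, long offset, String key) {
        TopicPartition tp = new TopicPartition(topic, partition);
        consumer.assign(Collections.singletonList(tp));
        consumer.seek(tp, offset);
        for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
            if (record.offset() == offset && key.equals(record.key())) {
                return record.value();
            }
        }
        return null; // not returned in this poll; a real implementation would retry
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // e.g. key "order-1" last had a full value at partition 0, offset 42
        System.out.println(readFullValue(consumer, "mongo.orders.full", 0, 42L, "order-1"));
        consumer.close();
    }
}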
>>> On Fri, Jan 29, 2016 at 9:12 AM, Jay Kreps <j...@confluent.io> wrote:
>>>
>>>> Hey Ewen, how come you need to get it all in memory for approach (1)? I
>>>> guess the obvious thing to do would just be to query for the record
>>>> after-image when you get the diff--e.g. just read a batch of changes and
>>>> multi-get the final values. I don't know how bad the overhead of this
>>>> would be... batching might reduce it a fair amount. The guarantees for this
>>>> are slightly different than the pure oplog too (you get the current value,
>>>> not necessarily every intermediate), but that should be okay for most uses.
>>>>
>>>> -Jay
>>>>
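
(Jay's multi-get of the after-images might look roughly like this with the 
Mongo Java driver -- database, collection, and _id values are made up. Note the 
caveats Ewen raises: you get the current value, and deleted documents simply 
won't come back.)

import java.util.ArrayList;
import java.util.List;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

public class MultiGetSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

        // Pretend these are the _ids from a batch of oplog deltas just read.
        List<Object> changedIds = new ArrayList<>();
        changedIds.add("order-1");
        changedIds.add("order-2");

        // Multi-get the current after-images in one query.
        for (Document doc : orders.find(Filters.in("_id", changedIds))) {
            System.out.println(doc.toJson()); // emit this as the "current value" record
        }
        client.close();
    }
}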
>>>> On Fri, Jan 29, 2016 at 8:54 AM, Ewen Cheslack-Postava <e...@confluent.io>
>>>> wrote:
>>>>
>>>>> Sunny,
>>>>>
>>>>> As I said on Twitter, I'm stoked to hear you're working on a Mongo
>>>>> connector! It struck me as a pretty natural source to tackle since it
>>>>> does such a nice job of cleanly exposing the op log.
>>>>>
>>>>> Regarding the problem of only getting deltas, unfortunately there is not
>>>>> a trivial solution here -- if you want to generate the full updated
>>>>> record, you're going to have to have a way to recover the original
>>>>> document.
>>>>>
>>>>> In fact, I'm curious how you were thinking of even bootstrapping. Are you
>>>>> going to do a full dump and then start reading the op log? Is there a good
>>>>> way to do the dump and figure out the exact location in the op log at
>>>>> which the query generating the dump was initially performed? I know that
>>>>> internally mongo effectively does these two steps, but I'm not sure if the
>>>>> necessary info is exposed via normal queries.
>>>>>
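
(One rough answer to the bootstrapping question, not verified against Mongo's 
internals: read the newest oplog entry's timestamp before the dump, then tail 
the oplog from that timestamp afterwards -- the same converge-over-time idea as 
the MySQL approach above. Host and database names are made up.)

import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;

public class OplogPositionSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);

        // The oplog lives in the "local" database on a replica set member.
        MongoCollection<Document> oplog =
                client.getDatabase("local").getCollection("oplog.rs");
        Document latest = oplog.find().sort(new Document("$natural", -1)).first();
        Object startTs = latest.get("ts");

        // ...do the full dump into Kafka here...

        // Then tail the oplog for entries with ts >= startTs; anything that
        // changed during the dump gets re-emitted and eventually converges.
        System.out.println("Tail oplog starting from " + startTs);
        client.close();
    }
}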
>>>>> If you want to reconstitute the data, I can think of a couple of options:
>>>>>
>>>>> 1. Try to reconstitute inline in the connector. This seems difficult to
>>>>> make work in practice. At some point you basically have to query for the
>>>>> entire data set to bring it into memory; the connector then effectively
>>>>> applies the deltas to its in-memory copy and generates one output record
>>>>> containing the full document each time it applies an update.
>>>>> 2. Make the connector send just the updates and have a separate stream
>>>>> processing job perform the reconstitution and send to another topic. In
>>>>> this case, the first topic should not be compacted, but the second one
>>>>> could be.
>>>>>
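
(A bare-bones sketch of option 2: a standalone job that consumes the delta 
topic, keeps the current document per key, and writes full documents to a 
compacted topic. Topic names and the toy "field=value;field=value" delta format 
are made up; a real job would parse the oplog's $set/$unset structure and 
persist its state.)

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReconstituteJob {
    public static void main(String[] args) {
        Properties cprops = new Properties();
        cprops.put("bootstrap.servers", "localhost:9092");
        cprops.put("group.id", "mongo-reconstitute");
        cprops.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cprops.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
        consumer.subscribe(Collections.singletonList("mongo.orders.deltas"));

        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "localhost:9092");
        pprops.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pprops.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(pprops);

        // Current full document per _id, kept in memory for the sketch.
        Map<String, Map<String, String>> state = new HashMap<>();

        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                Map<String, String> doc =
                        state.computeIfAbsent(record.key(), k -> new HashMap<>());
                for (String pair : record.value().split(";")) {
                    String[] kv = pair.split("=", 2);
                    doc.put(kv[0], kv[1]);
                }
                // Emit the reconstituted document to the compacted topic.
                producer.send(new ProducerRecord<>("mongo.orders.full", record.key(), doc.toString()));
            }
        }
    }
}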
>>>>> Unfortunately, without additional hooks into the database, there's not
>>>>> much you can do besides this pretty heavyweight process. There may be some
>>>>> tricks you can use to reduce the amount of memory used during the process
>>>>> (e.g. keep a small cache of actual records and for the rest only store
>>>>> Kafka offsets for the last full value, performing a (possibly expensive)
>>>>> random read as necessary to get the full document value back), but to get
>>>>> full correctness you will need to perform this process.
>>>>>
>>>>> In terms of Kafka Connect supporting something like this, I'm not sure
>>>>> how general it could be made, or that you even want to perform the process
>>>>> inline with the Kafka Connect job. If it's an issue that repeatedly arises
>>>>> across a variety of systems, then we should consider how to address it
>>>>> more generally.
>>>>>
>>>>> -Ewen
>>>>>
>>>>> On Tue, Jan 26, 2016 at 8:43 PM, Sunny Shah <su...@tinyowl.co.in> wrote:
>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We are trying to write a Kafka Connect connector for MongoDB. The issue
>>>>>> is, MongoDB does not provide the entire changed document for update
>>>>>> operations; it just provides the modified fields.
>>>>>>
>>>>>> If Kafka allowed custom log compaction, it would be possible to
>>>>>> eventually merge an entire document and its subsequent updates to create
>>>>>> an entire record again.
>>>>>>
>>>>>> As Ewen pointed out to me on Twitter, this is not possible, so what is
>>>>>> the Kafka Connect way of solving this issue?
>>>>>>
>>>>>> @Ewen, thanks a lot for a really quick answer on Twitter.
>>>>>>
>>>>>> --
>>>>>> Thanks and Regards,
>>>>>> Sunny
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Ewen
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Ewen
>>>
>>
>>

