On 17/10/2018 18.17, Ryanne Dolan wrote:
>> this does not guarantee that the
>> offsets of R have been written/flushed at the next commit() call
> True, but does it matter? So long as you can guarantee the records are
> delivered to the downstream Kafka cluster, it shouldn't matter if they
> have been committed or not.
>
> The worst that can happen is that the worker gets bounced and asks for
> the same records a second time. Even if those records have since been
> dropped from the upstream data source, it doesn't matter cuz you know
> they were previously delivered successfully.
You are kind of arguing that offsets are not usable at all. I think they
are. Below I will describe a fairly simple source-connector, show how it
would be misled by the way the source-connector framework currently
works, and how my fix would help it not be. The source-connector is made
up, but not far from what I have had to deal with in real life.
Let's assume I write a fairly simple source-connector that picks up data
from files in a given folder. For simplicity, let's assume that each
file fits in a single Kafka message. My source-connector sorts the files
by timestamp and sends out the data in the files, oldest file first. The
receiving side of the data my source-connector sends out may get the
same data twice, for one of the following reasons:
* There were actually two input files that contained exactly the same
data (in that case the receiving side should handle it twice)
* The data from the same file was sent twice, in two Kafka messages,
because global atomicity is impossible (in that case the receiving side
should only handle the data once)
In order to allow the receiving side to know when two consecutive
messages are essentially the same, so that it only handles one of them,
I introduce a simple sequence-numbering scheme in my source-connector. I
write a sequence-number into each Kafka message, and I use Kafka Connect
offsets to keep track of the next sequence-number to be used, so that I
can resume with the correct sequence-number after a crash/restart. If
there are no offsets when the source-connector starts (first start), it
just starts with sequence-number 1.
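
As a rough sketch of what I mean (the class name, config keys and exact
schema are made up for illustration, and error handling is minimal), the
task puts the sequence-number into the record value and the next
sequence-number into the Connect source offset:

import java.io.File;
import java.nio.file.Files;
import java.util.*;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class FolderSourceTask extends SourceTask {

    private static final Schema VALUE_SCHEMA = SchemaBuilder.struct()
            .field("seq", Schema.INT64_SCHEMA)
            .field("data", Schema.BYTES_SCHEMA)
            .build();

    private File inputFolder;  // from the task config
    private String topic;      // from the task config
    private long nextSeq = 1;  // 1 on first start, otherwise restored from the flushed offsets

    @Override
    public void start(Map<String, String> props) {
        inputFolder = new File(props.get("input.folder"));
        topic = props.get("topic");
        // Restore the sequence-number from the last flushed offsets, if there are any.
        Map<String, Object> offset = context.offsetStorageReader()
                .offset(Collections.singletonMap("folder", inputFolder.getPath()));
        if (offset != null && offset.get("nextSeq") != null) {
            nextSeq = ((Number) offset.get("nextSeq")).longValue();
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        File[] files = inputFolder.listFiles();  // tracking of already-polled files omitted here
        if (files == null || files.length == 0) return null;
        Arrays.sort(files, Comparator.comparing(File::getName)); // timestamp prefix => oldest first

        List<SourceRecord> records = new ArrayList<>();
        for (File file : files) {
            Map<String, ?> partition = Collections.singletonMap("folder", inputFolder.getPath());
            // The offset carries the sequence-number to resume with after a crash/restart.
            Map<String, ?> offset = Collections.singletonMap("nextSeq", nextSeq + 1);
            try {
                Struct value = new Struct(VALUE_SCHEMA)
                        .put("seq", nextSeq)
                        .put("data", Files.readAllBytes(file.toPath()));
                records.add(new SourceRecord(partition, offset, topic, VALUE_SCHEMA, value));
            } catch (java.io.IOException e) {
                throw new ConnectException("Failed to read " + file, e);
            }
            nextSeq++;
        }
        return records;
    }

    @Override
    public String version() { return "sketch"; }

    @Override
    public void stop() { }
}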
*Assume the following files are in the input-folder:*
* 2018-01-01_10_00_00-<GUID1>.data
* 2018-01-01_10_00_00-<GUID2>.data
* 2018-01-01_10_00_01-<GUID3>.data
* 2018-01-01_10_00_02-<GUID4>.data
…
*Now this sequence of events is possible:*
* mySourceConnector.poll() -> [
    R1 = record({ seq=1, data=<data from 2018-01-01_10_00_00-<GUID1>.data> }, { nextSeq=2 }),
    R2 = record({ seq=2, data=<data from 2018-01-01_10_00_00-<GUID2>.data> }, { nextSeq=3 })
  ]
* data of R1 was sent and acknowledged
* mySourceConnector.commitRecord(R1)
* data of R2 was sent and acknowledged
* mySourceConnector.commitRecord(R2)
* the offsets-committer kicks in around here and picks up the offsets
from R1 and R2, so the merged offsets to be written and flushed are
{ nextSeq=3 }
* mySourceConnector.poll() -> [
    R3 = record({ seq=3, data=<data from 2018-01-01_10_00_01-<GUID3>.data> }, { nextSeq=4 })
  ]
* data of R3 was sent and acknowledged
* mySourceConnector.commitRecord(R3)
* offsets-committer finishes writing and flushing offsets { nextSeq=3 }
* mySourceConnector.commit()
In my mySourceConnector.commit() implementation I believe that the data
and offsets for R1, R2 and R3 have been sent/written/flushed/acknowledged,
and therefore I delete the following files (a sketch of that commit()
follows the list):
* 2018-01-01_10_00_00-<GUID1>.data
* 2018-01-01_10_00_00-<GUID2>.data
* 2018-01-01_10_00_01-<GUID3>.data
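
To make the failure mode concrete, my commit()/commitRecord() pair on
that same task looks roughly like this; fileFor() is an illustrative
helper that maps a record back to the file it was read from. The whole
point is the (currently wrong) assumption that everything passed to
commitRecord() has had its offsets flushed by the time commit() runs:

    // On the same task class as the sketch above; fileFor() is illustrative only.
    private final Queue<File> acked = new java.util.concurrent.ConcurrentLinkedQueue<>();

    @Override
    public void commitRecord(SourceRecord record) throws InterruptedException {
        // Called by the framework once the record is acknowledged by the downstream cluster.
        acked.add(fileFor(record));
    }

    @Override
    public void commit() throws InterruptedException {
        // Assumption (wrong today): offsets for every acknowledged record have already
        // been written and flushed, so the corresponding input files can be deleted.
        // In the scenario above this also deletes GUID3's file even though { nextSeq=4 }
        // was not part of the flush.
        File file;
        while ((file = acked.poll()) != null) {
            if (!file.delete()) {
                // deletion failure handling omitted in this sketch
            }
        }
    }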
But the truth is that the data for R1, R2 and R3 has been sent with
sequence-numbers 1, 2 and 3 respectively, while the flushed offsets say
{ nextSeq=3 } and not { nextSeq=4 }, which I would indirectly expect.
If the system crashes here, upon restart I will get { nextSeq=3 }, but
the file containing the data that was supposed to get sequence-number 3
has already been deleted. Therefore I will end up with this next poll:
* poll() -> [
    R4 = record({ seq=3, data=<data from 2018-01-01_10_00_02-<GUID4>.data> }, { nextSeq=4 })
  ]
If my system had worked, I would have ended up with this next poll:
* poll() -> [
    R4 = record({ seq=4, data=<data from 2018-01-01_10_00_02-<GUID4>.data> }, { nextSeq=5 })
  ]
The receiving side of my data will get two messages containing the same
sequence-number 3. It will therefore incorrectly ignore the second
message. Even if it double-checks by looking at the actual data of the
two messages, then in the case where the content of <data from
2018-01-01_10_00_01-<GUID3>.data> and <data from
2018-01-01_10_00_02-<GUID4>.data> happened to be identical, it has no
way of figuring out that the right thing to do is to actually handle
both messages.
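
For reference, the dedup on the receiving side is nothing more than a
"skip any sequence-number we have already seen" check, roughly like the
sketch below (the consumer plumbing and deserialization are assumed).
That is exactly why it cannot recover once two genuinely different
messages arrive carrying the same sequence-number:

public class DedupHandler {

    private long lastSeq = 0;  // highest sequence-number handled so far

    public void onMessage(long seq, byte[] data) {
        if (seq <= lastSeq) {
            // Looks like a redelivery of an already-handled message, so it is skipped.
            // If the connector re-used seq 3 for genuinely new data, that data is lost here.
            return;
        }
        handle(data);
        lastSeq = seq;
    }

    private void handle(byte[] data) {
        // actual business logic goes here
    }
}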
*With my fix to the problem*, the call to commit() would have been
mySourceConnector.commit([R1, R2]), and I would have known only to
delete the following files:
* 2018-01-01_10_00_00-<GUID1>.data
* 2018-01-01_10_00_00-<GUID2>.data
And after a crash/restart I would end up sending the correct next message:
mySourceConnector.poll() -> [
    R3 = record({ seq=3, data=<data from 2018-01-01_10_00_01-<GUID3>.data> }, { nextSeq=4 })
  ]
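
For completeness, this is roughly how I would implement the proposed
callback on the task; the List-taking commit() signature is the change I
am suggesting, not an existing Connect API, and fileFor() is the same
illustrative helper as in the earlier sketch:

    // Proposed callback, not part of the current Connect API: the framework would pass
    // exactly the records whose offsets were included in the flush that just completed.
    public void commit(List<SourceRecord> flushedRecords) throws InterruptedException {
        for (SourceRecord record : flushedRecords) {
            // In the scenario above this is only ever called with [R1, R2], so only
            // GUID1's and GUID2's files get deleted; GUID3's file survives the crash.
            File file = fileFor(record);
            if (!file.delete()) {
                // deletion failure handling omitted
            }
        }
    }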