[ https://issues.apache.org/jira/browse/KAFKA-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323726#comment-15323726 ]
Randall Hauch commented on KAFKA-3821:
--------------------------------------

[~ewencp], I'm not sure a {{data}} field is that much different than storing information in the offsets. I'd also worry that if the data isn't stored in the same place as the offsets, there's a chance of failing after offsets are flushed but before data is flushed (or vice versa).

However, I have been thinking of something that may very well be, well, out in left field. Maybe connectors need some kind of _history events_ that they can produce and easily consume. We talked some time ago about possibly adding persisted state (e.g., a key-value store), but I think that's going in the wrong direction; an _event stream_ is far better and more powerful because it gives you an ordered history. Naturally, these history events would be persisted to a Kafka topic, but it'd be one that's primarily for the connector (or any monitors). The semantics are very specific, though (see the interface sketch below):

* The connector registers an _event handler_ that is called for every event, with one method that handles the events and a few other methods that are called at various lifecycle state transitions.
* Upon startup, the framework reads the topic from the very beginning, replaying all events and forwarding them to the _event handler_. When complete, the event handler's _end replay_ method is called to signal that all events have been replayed.
* The connector can write events at any time, and all events are synchronously written to Kafka (this is thus a lower-volume topic). The connector's event handler sees these events after they've been read back from Kafka. Much of the time it can ignore them (since the connector wrote them itself), but occasionally there may be an event the connector didn't create and to which it wants or needs to respond.

I could see a couple of uses for this:

# Debezium's MySQL connector records all DDL statements (with offsets) it reads from the MySQL binlog so that, when the connector starts back up and is given the offset at which it is to start, it can replay the DDL statements until it reaches (but does not pass) the designated offset, reconstructing an in-memory model of the database schemas (see the replay sketch below). Currently we do this with a separate thread, but we've had to sidestep Kafka Connect and handle this ourselves because we need full control over producing, flushing, and consuming. Our history is flushed before the offsets for normal records are flushed, so we know that the history may contain extra events (which we have to ignore) should something go wrong and DDL get recorded without the affiliated offsets.
# Record occasional state changes, such as "starting a snapshot", "completing a snapshot", etc.
# Enable a management tool to submit events (or rather, "requests") to be written to the topic. The connector would receive them and could react accordingly, all without having to stop the connector, change the configuration, restart the connector, wait for something to complete, and change the configuration back. An example might be to perform an ad hoc snapshot of 3 specific tables, or to "flush a snapshot of the current DDL statements to history" (allowing the topic to be log compacted).
# Task coordination, where a connector with multiple tasks needs one "leader" task to do work while the others wait, and when it completes all the tasks can continue. (This might be contrived.)
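To make the handler semantics above concrete, here is a minimal sketch of what the API might look like. Every name in it ({{HistoryEventHandler}}, {{ConnectorHistory}}, {{endReplay}}, etc.) is hypothetical; nothing like this exists in Kafka Connect today.

{code:java}
import java.util.Map;

/**
 * Hypothetical handler a connector would register for its history events.
 * The framework invokes it for every event, both while replaying the topic
 * on startup and for events that arrive afterward.
 */
public interface HistoryEventHandler {

    /** Called once per event, in the order the events were written. */
    void onEvent(Map<String, ?> sourceOffset, String event);

    /** Called after the framework has replayed all previously persisted events. */
    void endReplay();

    /** Optional lifecycle callbacks. */
    default void beginReplay() {}
    default void stopped() {}
}

/**
 * Hypothetical handle the framework would give the connector for writing
 * history events. Writes are synchronous: the event is durable in the
 * backing Kafka topic before this method returns.
 */
interface ConnectorHistory {
    void record(Map<String, ?> sourceOffset, String event);
}
{code}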
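And a sketch of how use case 1 might look on top of that handler. This is illustrative only: the single {{pos}} offset field and the trivial DDL "parsing" are drastic simplifications of what the real connector tracks.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only: replays recorded DDL statements up to (but not past)
 * the offset the connector was told to resume from, rebuilding an
 * in-memory schema model before normal binlog reading begins.
 */
public class DdlReplayHandler implements HistoryEventHandler {

    private final long resumePosition;
    private final Map<String, String> schemaByTable = new HashMap<>();

    public DdlReplayHandler(long resumePosition) {
        this.resumePosition = resumePosition;
    }

    @Override
    public void onEvent(Map<String, ?> sourceOffset, String ddl) {
        long position = ((Number) sourceOffset.get("pos")).longValue();
        // Skip events past the resume point: because history is flushed
        // before record offsets, the topic may contain extra trailing events.
        if (position <= resumePosition) {
            applyDdl(ddl);
        }
    }

    @Override
    public void endReplay() {
        // schemaByTable now reflects the database schema as of resumePosition.
    }

    private void applyDdl(String ddl) {
        // Drastically simplified; a real connector parses the DDL properly.
        String[] tokens = ddl.split("\\s+");
        if (tokens.length > 2 && "CREATE".equalsIgnoreCase(tokens[0])
                && "TABLE".equalsIgnoreCase(tokens[1])) {
            schemaByTable.put(tokens[2], ddl);
        }
    }
}
{code}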
Now, I've not fleshed this out very far and welcome any thoughts or criticisms. My connector-writing experience is limited: I've written one complicated connector and one relatively simple prototype connector, and I'm not sure whether this is a general need or something far too specific to my limited context. This isn't too different from offset storage, except that you want to be able to replay the history upon startup. Perhaps there's a way of incorporating these ideas into offset storage, but I suspect that's not really a good idea.

> Allow Kafka Connect source tasks to produce offset without writing to topics
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-3821
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3821
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 0.9.0.1
>            Reporter: Randall Hauch
>            Assignee: Ewen Cheslack-Postava
>
> Provide a way for a {{SourceTask}} implementation to record a new offset for
> a given partition without necessarily writing a source record to a topic.
> Consider a connector task that uses the same offset when producing an unknown
> number of {{SourceRecord}} objects (e.g., it is taking a snapshot of a
> database). Once the task completes those records, the connector wants to
> update the offsets (e.g., the snapshot is complete) but has no more records
> to be written to a topic. With this change, the task could simply supply an
> updated offset.
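For reference, the capability the issue requests might be shaped something like the sketch below. The interface and the {{recordOffset}} method are invented names for illustration, not an existing Kafka Connect API.

{code:java}
import java.util.Map;

/**
 * Hypothetical extension to the source task context illustrating the
 * request: persist an updated offset for a source partition without
 * emitting a SourceRecord. Neither this interface nor recordOffset(...)
 * exists in Kafka Connect today.
 */
public interface OffsetRecordingContext {

    /**
     * Record that the given source partition has reached the given offset,
     * e.g. to mark a completed snapshot, with no accompanying record.
     */
    void recordOffset(Map<String, ?> sourcePartition, Map<String, ?> sourceOffset);
}
{code}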