Hi Konstantine, b) I thought hard but briefly about this. Although there's potential for > ambiguity if someone inspects the key on its own, I believe that if we > combine this information with what we save in the value, the ambiguity is > removed. With respect to whether a connector can override another > connector's key, I don't think that's possible for any combination of > unique topic name and connector name. I hope I'm not missing anything. >
I think it is possible for connectors to have conflicting keys: * topic-name="foo-connector" and connector-name="bar" * topic-name="foo" and connector-name="connector-bar" would both have the key "topic-foo-connector-connector-bar" > c) That means that worker tasks will produce topic status records once they > detect that the worker they are running on does not include a topic that is > used by the tasks in the list of active topics for this connector. But once > the worker adds this topic to the set of active topics for this connector, > then the worker tasks stop producing those messages. They'll produce one > again if the worker stops including that topic in the active topics set for > some reason. I could improve the wording on the KIP, but I'd also like to > know we are on the same page re: the design here. > OK, I understand now. I think it would be helpful to include the example you gave in your previous reply. Many thanks, Tom > > Let me know if the above address your questions. > Thanks, > Konstantine > > > > On Wed, Jan 15, 2020 at 12:05 PM Randall Hauch <rha...@gmail.com> wrote: > > > Almog, > > > > You raise some interesting questions. Comments inline below. > > > > On Wed, Jan 15, 2020 at 11:19 AM Almog Gavra <al...@confluent.io> wrote: > > > > > Hi Konstantine, > > > > > > Thanks for the KIP! This is going to make automatic integration with > > > Connect much more powerful. > > > > > > My thoughts are mostly around freshness of the data and being able to > > > expose that to users. Riffing on Randall's timestamp question - have we > > > considered adding some interval at which point a connector will > republish > > > any topics that it encounters and update the timestamp? That way we > have > > > some refreshing mechanism that isn't as powerful as the complete reset > > > (which may not be practical in many scenarios). > > > > > > > My question about recording the timestamp at which each active topic > record > > were (infrequently) written was more about making a bit more information > > available given the current design, and whether recording a bit more > > information (for very little additional storage cost and no extra runtime > > cost) may be worth it if in the future we figure out how to use this > > information. > > > > I think it's more complicated to try to record the history of when topics > > were most recently used, since that requires recording a lot more active > > topic records than the current proposal. Besides, it's not unexpected > that > > source and sink connectors sometimes don't use topics for periods of > time. > > A sink connector only consumes from a topic when there are additional > > records to consume, and a source connector only needs to write to a topic > > when there is information in the upstream system targeted to that topic. > An > > example of the latter is that most database connectors will write to a > > topic for a particular table only when rows in that table have changed. > > > > > > > I also agree with Randall's other point (Would it be better to not > > > automatically reset connector's active topics when a sink connector is > > > restarted?). I think keeping the behavior as symmetrical between sink > and > > > source connectors is a good idea. > > > > > > Lastly, with regards to the API, I can imagine it is also pretty useful > > to > > > answer the inverse question: "which connectors write to topic X". > Perhaps > > > we can achieve this by letting the users compute it and just expose an > > API > > > that returns the entire mapping at once (instead of needing to call the > > > /connectors/{name}/topics endpoint for each connector). > > > > > > It may be worth considering a method such as /topics that would produce a > > result that is an aggregate of the potentially many > > /connectors/{name}/topic responses, especially if the result of the > /topics > > is an array of the individual responses from /connectors/{name}/topic. > The > > benefit of a single request is that the answer to "which connectors use > > topic X" can be computed using simple tools such as 'jq'. And, if tooling > > frequently fetches the active topic information for all connectors, > > providing the aggregate method would reduce the load on the tooling and > the > > server. It may also be relatively easy to implement. > > > > > > > Otherwise, looks good to me! Hits the requirements that I had in mind > on > > > the nose. > > > - Almog > > > > > > On Wed, Jan 15, 2020 at 1:14 AM Tom Bentley <tbent...@redhat.com> > wrote: > > > > > > > Hi Konstantine, > > > > > > > > Thanks for the KIP, I can see how it could be useful. > > > > > > > > a) Did you consider using a metric for this? I don't think it would > > > satisfy > > > > all the use cases you have in mind, but you could mention it in the > > > > rejected alternatives. > > > > > > > > b) If the topic name contains the string "-connector" then the key > > format > > > > is ambiguous. This isn't necessarily fatal because the value will > > > > disambiguate, but it could be misleading. Any reason not to just use > a > > > JSON > > > > key, and simplify the value? > > > > > > > > c) I didn't understand this part: "As soon as a worker detects the > > > addition > > > > of a topic to a connector's set of active topics, the worker will > cease > > > to > > > > post update messages to the status.storage.topic for that connector. > ". > > > I'm > > > > sure I've overlooking something but why is this necessary? Is this > were > > > the > > > > task id in the value is used? > > > > > > > > Thanks again, > > > > > > > > Tom > > > > > > > > On Wed, Jan 15, 2020 at 12:15 AM Randall Hauch <rha...@gmail.com> > > wrote: > > > > > > > > > Oh, one more thing: > > > > > > > > > > 9. There's no mention of how the status topic is partitioned, or > how > > > > > partitioning will be used by the new topic records. The KIP should > > > > probably > > > > > outline this for clarity and completeness. > > > > > > > > > > Best regards, > > > > > > > > > > Randall > > > > > > > > > > On Tue, Jan 14, 2020 at 5:25 PM Randall Hauch <rha...@gmail.com> > > > wrote: > > > > > > > > > > > Thanks, Konstantine. Overall, this KIP looks interesting and > really > > > > > > useful, and for the most part is spot on. I do have a number of > > > > > > questions/comments about specifics: > > > > > > > > > > > > 1. The topic records have a value that includes the connector > > > name, > > > > > > task number that last reported the topic is used, and the > topic > > > > name. > > > > > > There's no mention of record timestamps, but I wonder if it'd > be > > > > > useful to > > > > > > record this. One challenge might be that a connector does not > > > write > > > > > to a > > > > > > topic for a while or the task remains running for long periods > > of > > > > > time and > > > > > > therefore the worker doesn't record that this topic has been > > newly > > > > > written > > > > > > to since it the task was restarted. IOW, the semantics of the > > > > > timestamp may > > > > > > be a bit murky. Have you thought about recording the > timestamp, > > > and > > > > > if so > > > > > > what are the pros and cons? > > > > > > - The "Recording active topics" section says the following: > > > > > > "As soon as a worker detects the addition of a topic to a > > > > > > connector's set of active topics, all the connector's tasks > > > that > > > > > inspect > > > > > > source or sink records will cease to post update messages > to > > > the > > > > > > status.storage.topic." > > > > > > This probably means the timestamp won't be very useful. > > > > > > 2. The KIP says "the Kafka record value stores the ID of the > > task > > > > that > > > > > > succeeded to store a topic status record last." However, this > > is a > > > > bit > > > > > > unclear: is it really storing the last task that successfully > > > wrote > > > > > to that > > > > > > topic (as this would require very frequent writes to this > > topic), > > > or > > > > > is it > > > > > > more that this is the task that was last *recorded* as having > > > > written > > > > > > to the topic? (Here, "recorded" could be a bit of a gray area, > > > since > > > > > this > > > > > > would depend on the how the worker periodically records this > > > > > information.) > > > > > > Any kind of clarity here might be helpful. > > > > > > 3. In the "Recording active topics" section (and the > surrounding > > > > > > sections), the "task" is used ambiguously. For example, "when > > its > > > > > tasks > > > > > > start processing their first records ... these tasks will > start > > > > > inspecting > > > > > > which is the Kafka topic of each of these records". IIUC, the > > > first > > > > > "task" > > > > > > mentioned is the connector's task, and the second is the > > worker's > > > > > task. Do > > > > > > we need to distinguish this more clearly? > > > > > > 4. Maybe I missed it, but does this KIP explicitly say that > the > > > > > > Connector API is unchanged? It's probably worth pointing out > to > > > help > > > > > > assuage any concerns that connector implementations have to > > change > > > > to > > > > > make > > > > > > use of this feature. > > > > > > 5. In the "Resetting a connector's set of active topics" > section > > > the > > > > > > behavior is not exactly clear. Consider a user running > connector > > > > "A", > > > > > the > > > > > > connector has been fully started and is processing records, > and > > > the > > > > > worker > > > > > > has recorded topic usage records. Then the user resets the > > active > > > > > topics > > > > > > for connector A while the connector is still running? If the > > > > connector > > > > > > writes to no new topics, before the tasks are rebalanced then > is > > > it > > > > > correct > > > > > > that Connect would report no active topics? And after the > tasks > > > are > > > > > > rebalance, will the worker record any topics used by connector > > A? > > > > > > 6. In the "Restaring" (misspelled) section: "Reconfiguring a > > > source > > > > > > connector has also no altering effect for a source connector. > > > > > However, when > > > > > > reconfiguring a sink connector if the new configuration no > > longer > > > > > includes > > > > > > any of the previously tracked topics, these topics will be > > removed > > > > > from the > > > > > > set of active topics for this sink connector by appending > > > tombstone > > > > > > messages appropriately after the reconfiguration of the > > > connector." > > > > > Would > > > > > > it be better to not automatically reset connector's active > > topics > > > > > when a > > > > > > sink connector is restarted? Isn't that more consistent with > the > > > > > > "Resetting" behavior and the goals at the top of the KIP: > "it'd > > be > > > > > useful > > > > > > for users, operators and applications to know which are the > > topics > > > > > that a > > > > > > connector has used since it was first created"? > > > > > > 7. The `PUT /connectors/{name}/topics/reset` endpoint "this > > > request > > > > > > can be reapplied after the deletion of the connector". IOW, > even > > > > > though > > > > > > connector with that name doesn't exist, we can still make this > > > > > request? How > > > > > > does this compare with other methods such as "status"? > > > > > > 8. What are the security implications of this proposal? > > > > > > > > > > > > As you can see, most of these can probably be addressed without > > much > > > > > work. > > > > > > > > > > > > Best regards, > > > > > > > > > > > > Randall > > > > > > > > > > > > On Mon, Jan 13, 2020 at 11:05 PM Konstantine Karantasis < > > > > > > konstant...@confluent.io> wrote: > > > > > > > > > > > >> Hi all. > > > > > >> > > > > > >> I just posted KIP-558: Track the set of actively used topics by > > > > > connectors > > > > > >> in Kafka Connect > > > > > >> > > > > > >> Wiki link here: > > > > > >> > > > > > >> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-558%3A+Track+the+set+of+actively+used+topics+by+connectors+in+Kafka+Connect > > > > > >> > > > > > >> I think it's a nice extension to follow up on KIP-158 and a > useful > > > > > feature > > > > > >> to the ever increasing number of applications that are built > > around > > > > > Kafka > > > > > >> Connect. > > > > > >> Would love to hear what you think. > > > > > >> > > > > > >> Best, > > > > > >> Konstantine > > > > > >> > > > > > > > > > > > > > > > > > > > > >