Re: [DISCUSS] KIP-558: Track the set of actively used topics by connectors in Kafka Connect

Randall Hauch Wed, 15 Jan 2020 14:41:48 -0800

On Wed, Jan 15, 2020 at 4:36 PM Randall Hauch <rha...@gmail.com> wrote:


> On Wed, Jan 15, 2020 at 2:05 PM Konstantine Karantasis <
> konstant...@confluent.io> wrote:
>
>>
>> 9. I assumed that partitioning is implied by default, because there's no
>> requirement for complete ordering of topic status records. But I'll add
>> this fact as a separate bullet. The status.storage.topic is already a
>> partitioned topic.
>>
>
> Agreed. I think it'd be sufficient to simply mention that partition will
> be chosen based upon the active topic records' keys, ensuring that all
> active topic records for the same connector will be written to the same
> partition and will be totally ordered.
>

Well, my previous statement is not quite right. All topic records for the
same *connector and topic* will be written to the same partition and will
be totally ordered. But as you pointed out, it doesn't really matter, other
than that this feature will work with any # of partitions. The new bullet
you described would be sufficient. :-D


>
>>
>> I'm following up with the rest of the comments, shortly.
>> Thanks,
>> Konstantine
>>
>>
>> On Wed, Jan 15, 2020 at 9:19 AM Almog Gavra <al...@confluent.io> wrote:
>>
>> > Hi Konstantine,
>> >
>> > Thanks for the KIP! This is going to make automatic integration with
>> > Connect much more powerful.
>> >
>> > My thoughts are mostly around freshness of the data and being able to
>> > expose that to users. Riffing on Randall's timestamp question - have we
>> > considered adding some interval at which point a connector will
>> republish
>> > any topics that it encounters and update the timestamp? That way we have
>> > some refreshing mechanism that isn't as powerful as the complete reset
>> > (which may not be practical in many scenarios).
>> >
>> > I also agree with Randall's other point (Would it be better to not
>> > automatically reset connector's active topics when a sink connector is
>> > restarted?). I think keeping the behavior as symmetrical between sink
>> and
>> > source connectors is a good idea.
>> >
>> > Lastly, with regards to the API, I can imagine it is also pretty useful
>> to
>> > answer the inverse question: "which connectors write to topic X".
>> Perhaps
>> > we can achieve this by letting the users compute it and just expose an
>> API
>> > that returns the entire mapping at once (instead of needing to call the
>> > /connectors/{name}/topics endpoint for each connector).
>> >
>> > Otherwise, looks good to me! Hits the requirements that I had in mind on
>> > the nose.
>> > - Almog
>> >
>> > On Wed, Jan 15, 2020 at 1:14 AM Tom Bentley <tbent...@redhat.com>
>> wrote:
>> >
>> > > Hi Konstantine,
>> > >
>> > > Thanks for the KIP, I can see how it could be useful.
>> > >
>> > > a) Did you consider using a metric for this? I don't think it would
>> > satisfy
>> > > all the use cases you have in mind, but you could mention it in the
>> > > rejected alternatives.
>> > >
>> > > b) If the topic name contains the string "-connector" then the key
>> format
>> > > is ambiguous. This isn't necessarily fatal because the value will
>> > > disambiguate, but it could be misleading. Any reason not to just use a
>> > JSON
>> > > key, and simplify the value?
>> > >
>> > > c) I didn't understand this part: "As soon as a worker detects the
>> > addition
>> > > of a topic to a connector's set of active topics, the worker will
>> cease
>> > to
>> > > post update messages to the status.storage.topic for that connector.
>> ".
>> > I'm
>> > > sure I've overlooking something but why is this necessary? Is this
>> were
>> > the
>> > > task id in the value is used?
>> > >
>> > > Thanks again,
>> > >
>> > > Tom
>> > >
>> > > On Wed, Jan 15, 2020 at 12:15 AM Randall Hauch <rha...@gmail.com>
>> wrote:
>> > >
>> > > > Oh, one more thing:
>> > > >
>> > > > 9. There's no mention of how the status topic is partitioned, or how
>> > > > partitioning will be used by the new topic records. The KIP should
>> > > probably
>> > > > outline this for clarity and completeness.
>> > > >
>> > > > Best regards,
>> > > >
>> > > > Randall
>> > > >
>> > > > On Tue, Jan 14, 2020 at 5:25 PM Randall Hauch <rha...@gmail.com>
>> > wrote:
>> > > >
>> > > > > Thanks, Konstantine. Overall, this KIP looks interesting and
>> really
>> > > > > useful, and for the most part is spot on. I do have a number of
>> > > > > questions/comments about specifics:
>> > > > >
>> > > > >    1. The topic records have a value that includes the connector
>> > name,
>> > > > >    task number that last reported the topic is used, and the topic
>> > > name.
>> > > > >    There's no mention of record timestamps, but I wonder if it'd
>> be
>> > > > useful to
>> > > > >    record this. One challenge might be that a connector does not
>> > write
>> > > > to a
>> > > > >    topic for a while or the task remains running for long periods
>> of
>> > > > time and
>> > > > >    therefore the worker doesn't record that this topic has been
>> newly
>> > > > written
>> > > > >    to since it the task was restarted. IOW, the semantics of the
>> > > > timestamp may
>> > > > >    be a bit murky. Have you thought about recording the timestamp,
>> > and
>> > > > if so
>> > > > >    what are the pros and cons?
>> > > > >    - The "Recording active topics" section says the following:
>> > > > >       "As soon as a worker detects the addition of a topic to a
>> > > > >       connector's set of active topics, all the connector's tasks
>> > that
>> > > > inspect
>> > > > >       source or sink records will cease to post update messages to
>> > the
>> > > > >       status.storage.topic."
>> > > > >       This probably means the timestamp won't be very useful.
>> > > > >    2. The KIP says "the Kafka record value stores the ID of the
>> task
>> > > that
>> > > > >    succeeded to store a topic status record last." However, this
>> is a
>> > > bit
>> > > > >    unclear: is it really storing the last task that successfully
>> > wrote
>> > > > to that
>> > > > >    topic (as this would require very frequent writes to this
>> topic),
>> > or
>> > > > is it
>> > > > >    more that this is the task that was last *recorded* as having
>> > > written
>> > > > >    to the topic? (Here, "recorded" could be a bit of a gray area,
>> > since
>> > > > this
>> > > > >    would depend on the how the worker periodically records this
>> > > > information.)
>> > > > >    Any kind of clarity here might be helpful.
>> > > > >    3. In the "Recording active topics" section (and the
>> surrounding
>> > > > >    sections), the "task" is used ambiguously. For example, "when
>> its
>> > > > tasks
>> > > > >    start processing their first records ... these tasks will start
>> > > > inspecting
>> > > > >    which is the Kafka topic of each of these records". IIUC, the
>> > first
>> > > > "task"
>> > > > >    mentioned is the connector's task, and the second is the
>> worker's
>> > > > task. Do
>> > > > >    we need to distinguish this more clearly?
>> > > > >    4. Maybe I missed it, but does this KIP explicitly say that the
>> > > > >    Connector API is unchanged? It's probably worth pointing out to
>> > help
>> > > > >    assuage any concerns that connector implementations have to
>> change
>> > > to
>> > > > make
>> > > > >    use of this feature.
>> > > > >    5. In the "Resetting a connector's set of active topics"
>> section
>> > the
>> > > > >    behavior is not exactly clear. Consider a user running
>> connector
>> > > "A",
>> > > > the
>> > > > >    connector has been fully started and is processing records, and
>> > the
>> > > > worker
>> > > > >    has recorded topic usage records. Then the user resets the
>> active
>> > > > topics
>> > > > >    for connector A while the connector is still running? If the
>> > > connector
>> > > > >    writes to no new topics, before the tasks are rebalanced then
>> is
>> > it
>> > > > correct
>> > > > >    that Connect would report no active topics? And after the tasks
>> > are
>> > > > >    rebalance, will the worker record any topics used by connector
>> A?
>> > > > >    6. In the "Restaring" (misspelled) section: "Reconfiguring a
>> > source
>> > > > >    connector has also no altering effect for a source connector.
>> > > > However, when
>> > > > >    reconfiguring a sink connector if the new configuration no
>> longer
>> > > > includes
>> > > > >    any of the previously tracked topics, these topics will be
>> removed
>> > > > from the
>> > > > >    set of active topics for this sink connector by appending
>> > tombstone
>> > > > >    messages appropriately after the reconfiguration of the
>> > connector."
>> > > > Would
>> > > > >    it be better to not automatically reset connector's active
>> topics
>> > > > when a
>> > > > >    sink connector is restarted? Isn't that more consistent with
>> the
>> > > > >    "Resetting" behavior and the goals at the top of the KIP:
>> "it'd be
>> > > > useful
>> > > > >    for users, operators and applications to know which are the
>> topics
>> > > > that a
>> > > > >    connector has used since it was first created"?
>> > > > >    7. The `PUT /connectors/{name}/topics/reset` endpoint "this
>> > request
>> > > > >    can be reapplied after the deletion of the connector". IOW,
>> even
>> > > > though
>> > > > >    connector with that name doesn't exist, we can still make this
>> > > > request? How
>> > > > >    does this compare with other methods such as "status"?
>> > > > >    8. What are the security implications of this proposal?
>> > > > >
>> > > > > As you can see, most of these can probably be addressed without
>> much
>> > > > work.
>> > > > >
>> > > > > Best regards,
>> > > > >
>> > > > > Randall
>> > > > >
>> > > > > On Mon, Jan 13, 2020 at 11:05 PM Konstantine Karantasis <
>> > > > > konstant...@confluent.io> wrote:
>> > > > >
>> > > > >> Hi all.
>> > > > >>
>> > > > >> I just posted KIP-558: Track the set of actively used topics by
>> > > > connectors
>> > > > >> in Kafka Connect
>> > > > >>
>> > > > >> Wiki link here:
>> > > > >>
>> > > > >>
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-558%3A+Track+the+set+of+actively+used+topics+by+connectors+in+Kafka+Connect
>> > > > >>
>> > > > >> I think it's a nice extension to follow up on KIP-158 and a
>> useful
>> > > > feature
>> > > > >> to the ever increasing number of applications that are built
>> around
>> > > > Kafka
>> > > > >> Connect.
>> > > > >> Would love to hear what you think.
>> > > > >>
>> > > > >> Best,
>> > > > >> Konstantine
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-558: Track the set of actively used topics by connectors in Kafka Connect

Reply via email to