Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-15 Thread Chris Egerton


Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-14 Thread Edoardo Comar

Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-14 Thread Viktor Somogyi-Vass
Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-14 Thread Chris Egerton

Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-11 Thread Chris Egerton
> > On Tue, Jun 11, 2024 at 3:37 AM Adrian Preston wrote:
> >
> > > Hi Chris,
> > >
> > > Good KIP – I think it will be very helpful in alerting and automating
> > > the resolution of common Connect problems.
> > >
> > > I have a couple of questions / suggestions:
> > >
> > > 1. What are you planning on documenting as guidance for using this new
> > > endpoint? My guess would be that if Connect doesn’t return a status of
> > > 200 after some period I would either page someone, or restart the
> > > process? But I’m missing the nuance of distinguishing between readiness
> > > and liveness, is this for maintaining overall availability when rolling
> > > out updates to several Connect processes?
> > >
> > > 2. Would you consider providing a way to discover details about exactly
> > > which condition (or conditions) is/are failing? Perhaps by providing a
> > > richer JSON response? Something like this would help us adopt the
> > > health check, as we could start by paging someone for all failures,
> > > then over time (as we gained confidence) transition more of the
> > > conditions over to being handled by automation.
> > >
> > > Regards,
> > > - Adrian
> > >
> > >
> > > From: Chris Egerton
> > > Date: Monday, 10 June 2024 at 15:26
> > > To: dev@kafka.apache.org
> > > Subject: [EXTERNAL] Re: [DISCUSS] KIP-1017: A health check endpoint for
> > > Kafka Connect
> > > Hi all,
> > >
> > > Thanks for the positive feedback!
> > >
> > > I've made one small addition to the KIP since there's been a change to
> > > our REST timeout error messages that's worth incorporating here. Quoting
> > > the added section directly:
> > >
> > > > Note that the HTTP status codes and "status" fields in the JSON
> > > > response will match the exact examples above. However, the "message"
> > > > field may be augmented to include, among other things, more
> > > > information about the operation(s) the worker could be blocked on
> > > > (such as was added in REST timeout error messages in KAFKA-15563 [1]).
> > >
> > > Assuming this still looks okay to everyone, I'll kick off a vote thread
> > > sometime this week or the next.
> > >
> > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Fri, Jun 7, 2024 at 12:01 PM Andrew Schofield <andrew_schofi...@live.com> wrote:
> > >
> > > > Hi Chris,
> > > > This KIP looks good to me. I particularly like the explanation of how
> > > > the result will specifically check the worker health in ways that have
> > > > previously caused trouble.
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > > > > On 7 Jun 2024, at 16:18, Mickael Maison wrote:
> > > > >
> > > > > Hi Chris,
> > > > >
> > > > > Happy Friday! The KIP looks good to me. +1
> > > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Fri, Jan 26, 2024 at 8:41 PM Chris Egerton wrote:
> > > > >>
> > > > >> Hi all,
> > > > >>
> > > > >> Happy Friday! I'd like to kick off discussion for KIP-1017, which
> > > > >> (as the title suggests) proposes adding a health check endpoint for
> > > > >> Kafka Connect:
> > > > >>
> > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect
> > > > >>
> > > > >> This is one of the longest-standing issues with Kafka Connect and
> > > > >> I'm hoping we can finally put it in the ground soon. Looking forward
> > > > >> to hearing people's thoughts!
> > > > >>
> > > > >> Cheers,
> > > > >>
> > > > >> Chris
> > > >
> > > >
> > >
> > > Unless otherwise stated above:
> > >
> > > IBM United Kingdom Limited
> > > Registered in England and Wales with number 741598
> > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
> > >
> >
>


Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-11 Thread Viktor Somogyi-Vass
Hi Chris,

I also have 2 other comments:

1. One more thing I came across: should we provide the Retry-After header
in the response in the case of a 503? I'm not sure how many clients honor
it, but we could add it in case some do, if you also find it useful. (We
could default it to retry.backoff.ms.)

2. Adding to Adrian's comments, storing timestamped worker statuses in an
internal topic and then reading them from there would add valuable info
about what the worker is doing. For instance, GET /health?startTime=45345323346
could fetch events from the given timestamp in addition to your proposed
behavior. Also, if the REST server isn't available, the topic would serve
by itself as a log of the workers' behavior. I understand if you think it's
such a scope change that it should be an improvement KIP, but would like to
gauge what you think about this.
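
To make point 1 a bit more concrete, here is a rough client-side sketch of what honoring Retry-After could look like. It is only an illustration: the header is a suggestion in this thread, not something the KIP is stated to return, and the function name and the default_seconds fallback are hypothetical. HTTP allows the header value to be either a number of seconds or an HTTP-date, so the sketch handles both.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(retry_after, default_seconds, now=None):
    """Return how many seconds to wait before probing the worker again.

    retry_after is the raw Retry-After header value, or None if the
    response did not carry one. default_seconds is a client-side
    fallback (hypothetical; pick whatever suits your tooling).
    """
    if retry_after is None:
        return default_seconds
    value = retry_after.strip()
    if value.isdigit():
        # Form 1: a non-negative number of seconds.
        return float(value)
    try:
        # Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: ignore it and use the fallback.
        return default_seconds
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now".
    return max(0.0, (when - now).total_seconds())
```

A probe loop would call this after each 503 and sleep for the returned duration instead of a fixed interval.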

Regards,
Viktor


RE: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-11 Thread Adrian Preston
Hi Chris,

Thanks for taking the time to provide more details about the intended usage. I 
hadn’t appreciated how nuanced (and perhaps in some cases not fully explored) 
the causes of an unhealthy Connect could be. With that in mind, I can see why 
you want to nail down a straightforward and robust implementation before
considering further enhancements.

Cheers,
- Adrian.


From: Chris Egerton 
Date: Tuesday, 11 June 2024 at 15:34
To: dev@kafka.apache.org 
Subject: [EXTERNAL] Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka 
Connect
Hi Adrian,

Thanks for your comments/questions! The responses to them are related so
I'll try to address both at once.

The most recent update I made to the KIP should help provide insight into
what's going wrong if a non-200 response is returned. I don't plan on
adding any structured data such as error codes or something like a "phase"
field with values like READING_CONFIG_TOPIC quite yet, but there is room
for us to add human-readable information on the causes of failure in the
"message" field (see KAFKA-15563 [1] and its PR [2] for an example of what
kind of information we might provide to users). Part of the problem is that
while I've heard plenty of (justified!) complaints about the Kafka Connect
REST API becoming unavailable and the difficulties users face with
debugging their workers when that happens, I still don't feel we have a
strong-enough grasp on the common causes for this scenario to commit to a
response format that could be more machine-readable, and it can be
surprisingly difficult to get to a root cause in some cases.

I'm anticipating that users will rely on the endpoint primarily for two
things:
1) Ensuring that a worker has completed startup successfully during a
rolling upgrade (if you don't get a 200 after long enough, abort the
upgrade, check the error message, and start investigating)
2) Ensuring that a worker remains healthy after it has joined the cluster
(if you don't get a 200 for a sustained period of time, check the error
message, and then decide whether to restart the process or issue a page)
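
The two use cases above both boil down to "poll until you see a 200, or give up after a while". A minimal sketch of that loop follows, under stated assumptions: the /health path is taken from its mention elsewhere in this thread, the host and port in the usage note are placeholders, and the function names, interval and deadline are made up for illustration. The clock and sleep functions are injectable so the loop can be exercised without a running worker.

```python
import time
import urllib.error
import urllib.request


def probe_status(url, timeout_seconds=5.0):
    """Return the HTTP status code of a GET to url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Non-2xx responses (for example a 503 while starting) land here.
        return error.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return None


def wait_until_healthy(probe, deadline_seconds, interval_seconds=2.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns 200 or the deadline passes.

    Returns True on the first 200, False if deadline_seconds elapse first.
    """
    start = clock()
    while True:
        if probe() == 200:
            return True
        if clock() - start >= deadline_seconds:
            return False
        sleep(interval_seconds)
```

During a rolling upgrade this might be called as `wait_until_healthy(lambda: probe_status("http://localhost:8083/health"), deadline_seconds=120)`, aborting the upgrade on False; for ongoing monitoring the same loop can run with a shorter deadline and feed a restart-or-page decision.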

It's primarily designed to be easy to incorporate with automated tooling
that has support for REST-based process health monitoring, while also
providing some human-readable information (when possible) if the worker
isn't healthy. This human-readable information should hopefully help people
gauge how to respond to non-200 responses, and we can try to improve
wording and granularity over time based on user feedback. You and other
users may develop automated responses based on the content of the error
messages, but beware that the wording may change across releases.
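
One way to act on that caveat is to keep automation keyed on the HTTP status code and the "status" field, and to treat "message" purely as text for logs and pages. The sketch below assumes only what this thread states, namely that the JSON body carries "status" and "message" fields; the function name is hypothetical, and no concrete "status" values are hardcoded because none are listed here.

```python
import json


def summarize_health(http_status, body_text):
    """Split a health response into a machine-usable part and a display part."""
    try:
        body = json.loads(body_text)
    except (TypeError, ValueError):
        # Missing or non-JSON body: fall back to the status code alone.
        body = {}
    if not isinstance(body, dict):
        body = {}
    return {
        # Safe to automate on: the HTTP status code.
        "healthy": http_status == 200,
        # Described in the KIP note as matching its documented examples.
        "status": body.get("status"),
        # Free text whose wording may change across releases; log it,
        # show it to a human, but avoid matching on it.
        "message": body.get("message", ""),
    }
```

An alerting rule can then branch on "healthy" and "status" while attaching "message" verbatim to the page, so a wording change in a later release does not silently break the automation.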

Does that seem reasonable for V1 of this feature? I can definitely see room
for expansion of the response format in the future, but would like to hold
off on that for now.

[1] - https://issues.apache.org/jira/browse/KAFKA-15563
[2] - https://github.com/apache/kafka/pull/14562

Cheers,

Chris

On Tue, Jun 11, 2024 at 3:37 AM Adrian Preston  wrote:

> Hi Chris,
>
> Good KIP – I think it will be very helpful in alerting and automating the
> resolution of common Connect problems.
>
> I have a couple of questions / suggestions:
>
> 1. What are you planning on documenting as guidance for using this new
> endpoint? My guess would be that if Connect doesn’t return a status of 200
> after some period I would either page someone, or restart the process? But
> I’m missing the nuance of distinguishing between readiness and liveness, is
> this for maintaining overall availability when rolling out updates to
> several Connect processes?
>
> 2. Would you consider providing a way to discover details about exactly
> which condition (or conditions) is/are failing? Perhaps by providing a
> richer JSON response? Something like this would help us adopt the health
> check, as we could start by paging someone for all failures, then over time
> (as we gained confidence) transition more of the conditions over to being
> handled by automation.
>
> Regards,
> - Adrian
>
>
> From: Chris Egerton 
> Date: Monday, 10 June 2024 at 15:26
> To: dev@kafka.apache.org 
> Subject: [EXTERNAL] Re: [DISCUSS] KIP-1017: A health check endpoint for
> Kafka Connect
> Hi all,
>
> Thanks for the positive feedback!
>
> I've made one small addition to the KIP since there's been a change to our
> REST timeout error messages that's worth incorporating here. Quoting the
> added section directly:
>
> > Note that the HTTP status codes and "status" fields in the JSON response
> > will match the exact examples above. However, the "message" field may be
> > augmented to include, among other things, more information about the
> > operation(s) the worker could be blocked on (such as was added in REST
> > timeout error messages in KAFKA-15563 [1]).
>
> Assuming this still 

Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-11 Thread Chris Egerton
Hi Adrian,

Thanks for your comments/questions! The responses to them are related so
I'll try to address both at once.

The most recent update I made to the KIP should help provide insight into
what's going wrong if a non-200 response is returned. I don't plan on
adding any structured data such as error codes or something like a "phase"
field with values like READING_CONFIG_TOPIC quite yet, but there is room
for us to add human-readable information on the causes of failure in the
"message" field (see KAFKA-15563 [1] and its PR [2] for an example of what
kind of information we might provide to users). Part of the problem is that
while I've heard plenty of (justified!) complaints about the Kafka Connect
REST API becoming unavailable and the difficulties users face with
debugging their workers when that happens, I still don't feel we have a
strong-enough grasp on the common causes for this scenario to commit to a
response format that could be more machine-readable, and it can be
surprisingly difficult to get to a root cause in some cases.

I'm anticipating that users will rely on the endpoint primarily for two
things:
1) Ensuring that a worker has completed startup successfully during a
rolling upgrade (if you don't get a 200 after long enough, abort the
upgrade, check the error message, and start investigating)
2) Ensuring that a worker remains healthy after it has joined the cluster
(if you don't get a 200 for a sustained period of time, check the error
message, and then decide whether to restart the process or issue a page)
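
To make the two use cases above concrete, here is a minimal Python sketch of
how tooling might consume the endpoint. It is only an illustration under
stated assumptions, not something taken from the KIP: the function and class
names are hypothetical, and the URL in the usage note (port 8083, path
/health) is an assumption that should be checked against the KIP and your
worker's REST listener configuration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

# A probe returns (HTTP status code, response body text).
# This shape is a hypothetical convention for the sketch.
Probe = Callable[[], Tuple[int, str]]


def http_probe(url: str, timeout_s: float = 5.0) -> Probe:
    """Build a probe that issues a GET against the given (assumed) URL."""
    def probe() -> Tuple[int, str]:
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.status, resp.read().decode("utf-8", "replace")
        except urllib.error.HTTPError as e:
            # Non-2xx responses still carry a body worth showing to a human.
            return e.code, e.read().decode("utf-8", "replace")
        except OSError as e:
            # Connection refused, timeout, etc.; 0 means "no HTTP response".
            return 0, str(e)
    return probe


def wait_for_startup(probe: Probe, timeout_s: float, interval_s: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep
                     ) -> Tuple[bool, str]:
    """Use case 1: poll until a 200 arrives or the deadline passes.

    Returns (True, body) on success, or (False, last_body) so the caller can
    abort the rolling upgrade and surface the last error message.
    """
    deadline = clock() + timeout_s
    while True:
        status, body = probe()
        if status == 200:
            return True, body
        if clock() >= deadline:
            return False, body
        sleep(interval_s)


class SustainedFailureDetector:
    """Use case 2: flag a worker only after a sustained run of non-200s."""

    def __init__(self, threshold_s: float):
        self.threshold_s = threshold_s
        self._first_failure_at: Optional[float] = None

    def observe(self, status: int, now: float) -> bool:
        """Record one probe result; return True once failures are sustained."""
        if status == 200:
            self._first_failure_at = None
            return False
        if self._first_failure_at is None:
            self._first_failure_at = now
        return now - self._first_failure_at >= self.threshold_s
```

A deployment script might call something like
wait_for_startup(http_probe("http://localhost:8083/health"), timeout_s=120)
and abort the upgrade when the first element of the result is False, printing
the returned body for whoever investigates. The clock and sleep parameters
exist only so the loop can be exercised without real waiting.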

It's primarily designed to be easy to incorporate with automated tooling
that has support for REST-based process health monitoring, while also
providing some human-readable information (when possible) if the worker
isn't healthy. This human-readable information should hopefully help people
gauge how to respond to non-200 responses, and we can try to improve
wording and granularity over time based on user feedback. You and other
users may develop automated responses based on the content of the error
messages, but beware that the wording may change across releases.

Does that seem reasonable for V1 of this feature? I can definitely see room
for expansion of the response format in the future, but would like to hold
off on that for now.

[1] - https://issues.apache.org/jira/browse/KAFKA-15563
[2] - https://github.com/apache/kafka/pull/14562

Cheers,

Chris

On Tue, Jun 11, 2024 at 3:37 AM Adrian Preston  wrote:

> Hi Chris,
>
> Good KIP – I think it will be very helpful in alerting and automating the
> resolution of common Connect problems.
>
> I have a couple of questions / suggestions:
>
> 1. What are you planning on documenting as guidance for using this new
> endpoint? My guess would be that if Connect doesn’t return a status of 200
> after some period I would either page someone, or restart the process? But
> I’m missing the nuance of distinguishing between readiness and liveness, is
> this for maintaining overall availability when rolling out updates to
> several Connect processes?
>
> 2. Would you consider providing a way to discover details about exactly
> which condition (or conditions) is/are failing? Perhaps by providing a
> richer JSON response? Something like this would help us adopt the health
> check, as we could start by paging someone for all failures, then over time
> (as we gained confidence) transition more of the conditions over to being
> handled by automation.
>
> Regards,
> - Adrian
>
>
> From: Chris Egerton 
> Date: Monday, 10 June 2024 at 15:26
> To: dev@kafka.apache.org 
> Subject: [EXTERNAL] Re: [DISCUSS] KIP-1017: A health check endpoint for
> Kafka Connect
> Hi all,
>
> Thanks for the positive feedback!
>
> I've made one small addition to the KIP since there's been a change to our
> REST timeout error messages that's worth incorporating here. Quoting the
> added section directly:
>
> > Note that the HTTP status codes and "status" fields in the JSON response
> > will match the exact examples above. However, the "message" field may be
> > augmented to include, among other things, more information about the
> > operation(s) the worker could be blocked on (such as was added in REST
> > timeout error messages in KAFKA-15563 [1]).
>
> Assuming this still looks okay to everyone, I'll kick off a vote thread
> sometime this week or the next.
>
> [1] - https://issues.apache.org/jira/browse/KAFKA-15563
>
> Cheers,
>
> Chris
>
> On Fri, Jun 7, 2024 at 12:01 PM Andrew Schofield <
> andrew_schofi...@live.com>
> wrote:
>
> > Hi Chris,
> > This KIP looks good to me. I particularly like the explanation of how the
> > result will specifically check the worker health in ways that have
> > previously caused trouble.
> >
> > Thanks

RE: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-11 Thread Adrian Preston
Hi Chris,

Good KIP – I think it will be very helpful in alerting and automating the 
resolution of common Connect problems.

I have a couple of questions / suggestions:

1. What are you planning on documenting as guidance for using this new
endpoint? My guess would be that if Connect doesn’t return a status of 200
after some period I would either page someone, or restart the process? But I’m
missing the nuance of distinguishing between readiness and liveness. Is this
for maintaining overall availability when rolling out updates to several
Connect processes?

2. Would you consider providing a way to discover details about exactly which 
condition (or conditions) is/are failing? Perhaps by providing a richer JSON 
response? Something like this would help us adopt the health check, as we could 
start by paging someone for all failures, then over time (as we gained 
confidence) transition more of the conditions over to being handled by 
automation.

Regards,
- Adrian


From: Chris Egerton 
Date: Monday, 10 June 2024 at 15:26
To: dev@kafka.apache.org 
Subject: [EXTERNAL] Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka 
Connect
Hi all,

Thanks for the positive feedback!

I've made one small addition to the KIP since there's been a change to our
REST timeout error messages that's worth incorporating here. Quoting the
added section directly:

> Note that the HTTP status codes and "status" fields in the JSON response
> will match the exact examples above. However, the "message" field may be
> augmented to include, among other things, more information about the
> operation(s) the worker could be blocked on (such as was added in REST
> timeout error messages in KAFKA-15563 [1]).

Assuming this still looks okay to everyone, I'll kick off a vote thread
sometime this week or the next.

[1] - https://issues.apache.org/jira/browse/KAFKA-15563

Cheers,

Chris

On Fri, Jun 7, 2024 at 12:01 PM Andrew Schofield 
wrote:

> Hi Chris,
> This KIP looks good to me. I particularly like the explanation of how the
> result will specifically check the worker health in ways that have
> previously caused trouble.
>
> Thanks,
> Andrew
>
> > On 7 Jun 2024, at 16:18, Mickael Maison 
> wrote:
> >
> > Hi Chris,
> >
> > Happy Friday! The KIP looks good to me. +1
> >
> > Thanks,
> > Mickael
> >
> > On Fri, Jan 26, 2024 at 8:41 PM Chris Egerton 
> wrote:
> >>
> >> Hi all,
> >>
> >> Happy Friday! I'd like to kick off discussion for KIP-1017, which (as the
> >> title suggests) proposes adding a health check endpoint for Kafka Connect:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect
> >>
> >> This is one of the longest-standing issues with Kafka Connect and I'm
> >> hoping we can finally put it in the ground soon. Looking forward to hearing
> >> people's thoughts!
> >>
> >> Cheers,
> >>
> >> Chris
>
>



Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-10 Thread Chris Egerton
Hi all,

Thanks for the positive feedback!

I've made one small addition to the KIP since there's been a change to our
REST timeout error messages that's worth incorporating here. Quoting the
added section directly:

> Note that the HTTP status codes and "status" fields in the JSON response
> will match the exact examples above. However, the "message" field may be
> augmented to include, among other things, more information about the
> operation(s) the worker could be blocked on (such as was added in REST
> timeout error messages in KAFKA-15563 [1]).
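
One practical consequence of the quoted section is that clients can key their
logic on the HTTP status code and the "status" field, while treating "message"
as free text for humans. The Python sketch below illustrates that split. The
names HealthReport, parse_health and needs_attention are hypothetical, and the
sketch deliberately does not hard-code any "status" values, since the exact
examples live in the KIP rather than in this thread.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthReport:
    http_status: int
    status: str   # machine-facing field; safe to branch on per the KIP text
    message: str  # human-facing field; wording may be augmented over time


def parse_health(http_status: int, body: str) -> HealthReport:
    """Parse a health response, tolerating bodies that are not JSON objects."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        doc = {}
    if not isinstance(doc, dict):
        doc = {}
    return HealthReport(
        http_status=http_status,
        status=str(doc.get("status", "")),
        # Fall back to the raw body so a human still sees something useful.
        message=str(doc.get("message", body)),
    )


def needs_attention(report: HealthReport) -> bool:
    """Decide from the status code alone; never pattern-match on message."""
    return report.http_status != 200
```

An alerting hook could then attach report.message to the page or ticket
verbatim, without parsing it, so that later changes to the wording do not
break the automation.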

Assuming this still looks okay to everyone, I'll kick off a vote thread
sometime this week or the next.

[1] - https://issues.apache.org/jira/browse/KAFKA-15563

Cheers,

Chris

On Fri, Jun 7, 2024 at 12:01 PM Andrew Schofield 
wrote:

> Hi Chris,
> This KIP looks good to me. I particularly like the explanation of how the
> result will specifically check the worker health in ways that have
> previously caused trouble.
>
> Thanks,
> Andrew
>
> > On 7 Jun 2024, at 16:18, Mickael Maison 
> wrote:
> >
> > Hi Chris,
> >
> > Happy Friday! The KIP looks good to me. +1
> >
> > Thanks,
> > Mickael
> >
> > On Fri, Jan 26, 2024 at 8:41 PM Chris Egerton 
> wrote:
> >>
> >> Hi all,
> >>
> >> Happy Friday! I'd like to kick off discussion for KIP-1017, which (as the
> >> title suggests) proposes adding a health check endpoint for Kafka Connect:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect
> >>
> >> This is one of the longest-standing issues with Kafka Connect and I'm
> >> hoping we can finally put it in the ground soon. Looking forward to hearing
> >> people's thoughts!
> >>
> >> Cheers,
> >>
> >> Chris
>
>


Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-07 Thread Andrew Schofield
Hi Chris,
This KIP looks good to me. I particularly like the explanation of how the
result will specifically check the worker health in ways that have previously
caused trouble.

Thanks,
Andrew

> On 7 Jun 2024, at 16:18, Mickael Maison  wrote:
>
> Hi Chris,
>
> Happy Friday! The KIP looks good to me. +1
>
> Thanks,
> Mickael
>
> On Fri, Jan 26, 2024 at 8:41 PM Chris Egerton  wrote:
>>
>> Hi all,
>>
>> Happy Friday! I'd like to kick off discussion for KIP-1017, which (as the
>> title suggests) proposes adding a health check endpoint for Kafka Connect:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect
>>
>> This is one of the longest-standing issues with Kafka Connect and I'm
>> hoping we can finally put it in the ground soon. Looking forward to hearing
>> people's thoughts!
>>
>> Cheers,
>>
>> Chris



Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-06-07 Thread Mickael Maison
Hi Chris,

Happy Friday! The KIP looks good to me. +1

Thanks,
Mickael

On Fri, Jan 26, 2024 at 8:41 PM Chris Egerton  wrote:
>
> Hi all,
>
> Happy Friday! I'd like to kick off discussion for KIP-1017, which (as the
> title suggests) proposes adding a health check endpoint for Kafka Connect:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect
>
> This is one of the longest-standing issues with Kafka Connect and I'm
> hoping we can finally put it in the ground soon. Looking forward to hearing
> people's thoughts!
>
> Cheers,
>
> Chris


[DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

2024-01-26 Thread Chris Egerton
Hi all,

Happy Friday! I'd like to kick off discussion for KIP-1017, which (as the
title suggests) proposes adding a health check endpoint for Kafka Connect:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect

This is one of the longest-standing issues with Kafka Connect and I'm
hoping we can finally put it in the ground soon. Looking forward to hearing
people's thoughts!

Cheers,

Chris