Re: [DISCUSS] KIP-714: Client metrics and observability

Philip Nee Fri, 22 Sep 2023 12:03:41 -0700

Hi Andrew -

Question on top of your answers: Do you think the client should actively
search for a broker that supports this RPC? As previously mentioned, the
broker uses the leastLoadedNode to find its first connection (am
I correct?), and what if that broker doesn't support the metric push?


P

On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Kirk,
> Thanks for your question. You are correct that the presence or absence of
> the new RPCs in the
> ApiVersionsResponse tells the client whether to request the telemetry
> subscriptions and push
> metrics.
>
> This is of course tricky in practice. It would be conceivable, as a
> cluster is upgraded to AK 3.7
> or as a client metrics receiver plugin is deployed across the cluster,
> that a client connects to some
> brokers that support the new RPCs and some that do not.
>
> Here’s my suggestion:
> * If a client is not connected to any brokers that support in the new
> RPCs, it cannot push metrics.
> * If a client is only connected to brokers that support the new RPCs, it
> will use the new RPCs in
> accordance with the KIP.
> * If a client is connected to some brokers that support the new RPCs and
> some that do not, it will
> use the new RPCs with the supporting subset of brokers in accordance with
> the KIP.
>
> Comments?
>
> Thanks,
> Andrew
>
> > On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote:
> >
> > Hi Andrew/Jun,
> >
> > I want to make sure I understand question/comment #119… In the case
> where a cluster without a metrics client receiver is later reconfigured and
> restarted to include a metrics client receiver, do we want the client to
> thereafter begin pushing metrics to the cluster? From Andrew’s response to
> question #119, it sounds like we’re using the presence/absence of the
> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
> indicator. Do I have that correct?
> >
> > Thanks,
> > Kirk
> >
> >> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
> >>
> >> Hi Jun,
> >> Thanks for your comments. I’ve updated the KIP to clarify where
> necessary.
> >>
> >> 110. Yes, agree. The motivation section mentions this.
> >>
> >> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
> replacement of
> >> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
> think it’s a bit
> >> of a debatable point. OTLP makes a distinction between a namespace and a
> >> multi-word component. If it was “client.id” then “client” would be a
> namespace with
> >> an attribute key “id”. But “client_id” is just a key. So, it was
> intentional, but debatable.
> >>
> >> 112. Thanks. The link target moved. Fixed.
> >>
> >> 113. Thanks. Fixed.
> >>
> >> 114.1. If a standard metric makes sense for a client, it should use the
> exact same
> >> name. If a standard metric doesn’t make sense for a client, then it can
> omit that metric.
> >>
> >> For a required metric, the situation is stronger. All clients must
> implement these
> >> metrics with these names in order to implement the KIP. But the
> required metrics
> >> are essentially the number of connections and the request latency,
> which do not
> >> reference the underlying implementation of the client (which
> producer.record.queue.time.max
> >> of course does).
> >>
> >> I suppose someone might build a producer-only client that didn’t have
> consumer metrics.
> >> In this case, the consumer metrics would conceptually have the value 0
> and would not
> >> need to be sent to the broker.
> >>
> >> 114.2. If a client does not implement some metrics, they will not be
> available for
> >> analysis and troubleshooting. It just makes the ability to combine
> metrics from lots
> >> different clients less complete.
> >>
> >> 115. I think it was probably a mistake to be so specific about
> threading in this KIP.
> >> When the consumer threading refactor is complete, of course, it would
> do the appropriate
> >> equivalent. I’ve added a clarification and massively simplified this
> section.
> >>
> >> 116. I removed “client.terminating”.
> >>
> >> 117. Yes. Horrid. Fixed.
> >>
> >> 118. The Terminating flag just indicates that this is the final
> PushTelemetryRequest
> >> from this client. Any subsequent request will be rejected. I think this
> flag should remain.
> >>
> >> 119. Good catch. This was actually contradicting another part of the
> KIP. The current behaviour
> >> is indeed preserved. If the broker doesn’t have a client metrics
> receiver plugin, the new RPCs
> >> in this KIP are “turned off” and not reported in ApiVersionsResponse.
> The client will not
> >> attempt to push metrics.
> >>
> >> 120. The error handling table lists the error codes for
> PushTelemetryResponse. I’ve added one
> >> but it looked good to me. GetTelemetrySubscriptions doesn’t have any
> error codes, since the
> >> situation in which the client telemetry is not supported is handled by
> the RPCs not being offered
> >> by the broker.
> >>
> >> 121. Again, I think it’s probably a mistake to be specific about
> threading. Removed.
> >>
> >> 122. Good catch. For DescribeConfigs, the ACL operation should be
> >> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
> >> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
> >> resource.
> >>
> >> Thanks for the detailed review.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>>
> >>> 110. Another potential motivation is the multiple clients support.
> Some of
> >>> the places may not have good monitoring support for non-java clients.
> >>>
> >>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name and
> >>> replace '-' with '_' for attributes. Why is the inconsistency?
> >>>
> >>> 112. OTLP specification: Page is not found from the link.
> >>>
> >>> 113. "Defining standard and required metrics makes the monitoring and
> >>> troubleshooting of clients from various client types ": Incomplete
> sentence.
> >>>
> >>> 114. standard/required metrics
> >>> 114.1 Do other clients need to implement those metrics with the exact
> same
> >>> names?
> >>> 114.2 What happens if some of those metrics are missing from a client?
> >>>
> >>> 115. "KafkaConsumer: both the "heart beat" and application threads": We
> >>> have an ongoing effort to refactor the consumer threading model (
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
> ).
> >>> Once this is done, PRC requests will only be made from the background
> >>> thread. Should this KIP follow the new model only?
> >>>
> >>> 116. 'The metrics should contain the reason for the client termination
> by
> >>> including the client.terminating metric with the label “reason” ...'.
> Hmm,
> >>> are we introducing a new metric client.terminating? If so, that needs
> to be
> >>> explicitly listed.
> >>>
> >>> 117. "As the metrics plugin may need to add additional metrics on top
> of
> >>> this the generic metrics receiver in the broker will not add these
> labels
> >>> but rely on the plugins to do so," The sentence doesn't read well.
> >>>
> >>> 118. "it is possible for the client to send at most one accepted
> >>> out-of-profile per connection before the rate-limiter kicks in": If we
> do
> >>> this, do we still need the Terminating flag in PushTelemetryRequestV0?
> >>>
> >>> 119. "If there is no client metrics receiver plugin configured on the
> >>> broker, it will respond to GetTelemetrySubscriptionsRequest with
> >>> RequestedMetrics set to Null and a -1 SubscriptionId. The client should
> >>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs
> has
> >>> expired. This allows the metrics receiver to be enabled or disabled
> without
> >>> having to restart the broker or reset the client connection."
> >>> "no client metrics receiver plugin configured" is defined by no metric
> >>> reporter implementing the ClientTelemetry interface, right? In that
> case,
> >>> it would be useful to avoid the clients sending
> >>> GetTelemetrySubscriptionsRequest periodically to preserve the current
> >>> behavior.
> >>>
> >>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0:
> Could
> >>> we list error codes for each?
> >>>
> >>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may
> be
> >>> called from the request handling thread": Where else can this method be
> >>> called?
> >>>
> >>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the
> ACL?
> >>>
> >>> Thanks,
> >>>
> >>> Jun
> >>>
> >>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
> >>> andrew_schofield_j...@outlook.com> wrote:
> >>>
> >>>> Hi Milind,
> >>>> Thanks for your question.
> >>>>
> >>>> On reflection, I agree that INVALID_RECORD is most likely to be
> caused by a
> >>>> problem in the serialization in the client. I have changed the client
> >>>> action in this case
> >>>> to “Log an error and stop pushing metrics”.
> >>>>
> >>>> I have updated the KIP text accordingly.
> >>>>
> >>>> Thanks,
> >>>> Andrew
> >>>>
> >>>>> On 31 Jul 2023, at 12:09, Milind Luthra
> <milut...@confluent.io.INVALID>
> >>>> wrote:
> >>>>>
> >>>>> Hi Andrew,
> >>>>> Thanks for the clarifications.
> >>>>>
> >>>>> About 2b:
> >>>>> In case a client has a bug while serializing, it might be difficult
> for
> >>>> the
> >>>>> client to recover from that without code changes. In that, it might
> be
> >>>> good
> >>>>> to just log the INVALID_RECORD as an error, and treat the error as
> fatal
> >>>>> for the client (only fatal in terms of sending the metrics, the
> client
> >>>> can
> >>>>> keep functioning otherwise). What do you think?
> >>>>>
> >>>>> Thanks
> >>>>> Milind
> >>>>>
> >>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> >>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>
> >>>>>> Hi Milind,
> >>>>>> Thanks for your questions about the KIP.
> >>>>>>
> >>>>>> 1) I did some archaeology and looked at historical versions of the
> KIP.
> >>>> I
> >>>>>> think this is
> >>>>>> just a mistake. 5 minutes is the default metric push interval. 30
> >>>> minutes
> >>>>>> is a mystery
> >>>>>> to me. I’ve updated the KIP.
> >>>>>>
> >>>>>> 2) I think there are two situations in which INVALID_RECORD might
> occur.
> >>>>>> a) The client might perhaps be using a content-type that the broker
> does
> >>>>>> not support.
> >>>>>> The KIP mentions content-type as a future extension, but there’s
> only
> >>>> one
> >>>>>> supported
> >>>>>> to start with. Until we have multiple content-types, this seems out
> of
> >>>>>> scope. I think a
> >>>>>> future KIP would add another error code for this.
> >>>>>> b) The client might perhaps have a bug which means the metrics
> payload
> >>>> is
> >>>>>> malformed.
> >>>>>> Logging a warning and attempting the next metrics push on the push
> >>>>>> interval seems
> >>>>>> appropriate.
> >>>>>>
> >>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an
> immediate
> >>>>>> GetTelemetrySubscriptionsRequest.
> >>>>>>
> >>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or
> perhaps
> >>>>>> a situation in which a broker sends a compression type in a
> >>>>>> GetTelemetrySubscriptionsResponse
> >>>>>> which is subsequently not supported when its used with a
> >>>>>> PushTelemetryRequest.
> >>>>>> We do want the client to have the opportunity to get an up-to-date
> list
> >>>> of
> >>>>>> supported
> >>>>>> compression types. I think an immediate
> GetTelemetrySubscriptionsRequest
> >>>>>> is appropriate.
> >>>>>>
> >>>>>> 3) If a client attempts a subsequent handshake with a Null
> >>>>>> ClientInstanceId, the
> >>>>>> receiving broker may not already know the client's existing
> >>>>>> ClientInstanceId. If the
> >>>>>> receiving broker knows the existing ClientInstanceId, it simply
> responds
> >>>>>> the existing
> >>>>>> value back to the client. If it does not know the existing
> >>>>>> ClientInstanceId, it will create
> >>>>>> a new client instance ID and respond with that.
> >>>>>>
> >>>>>> I will update the KIP with these clarifications.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Andrew
> >>>>>>
> >>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
> <milut...@confluent.io.INVALID
> >>>>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi Andrew, thanks for this KIP.
> >>>>>>>
> >>>>>>> I had a few questions regarding the "Error handling" section.
> >>>>>>>
> >>>>>>> 1. It mentions that "The 5 and 30 minute retries are to eventually
> >>>>>> trigger
> >>>>>>> a retry and avoid having to restart clients if the cluster metrics
> >>>>>>> configuration is disabled temporarily, e.g., by operator error,
> rolling
> >>>>>>> upgrades, etc."
> >>>>>>> But this 30 min interval isn't mentioned anywhere else. What is it
> >>>>>>> referring to?
> >>>>>>>
> >>>>>>> 2. For the actual errors:
> >>>>>>> INVALID_RECORD : The action required is to "Log a warning to the
> >>>>>>> application and schedule the next GetTelemetrySubscriptionsRequest
> to 5
> >>>>>>> minutes". Why is this 5 minutes, and not something like
> PushIntervalMs?
> >>>>>> And
> >>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in
> this
> >>>>>>> case, if the serialization is broken?
> >>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to
> >>>> confirm,
> >>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled
> immediately
> >>>>>>> after the PushTelemetry response, correct?
> >>>>>>>
> >>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include
> the
> >>>>>>> ClientInstanceId returned in the first response, regardless of
> broker":
> >>>>>>> Will a broker error be returned in case some implementation of
> this KIP
> >>>>>>> violates this accidentally and sends a request with
> ClientInstanceId =
> >>>>>> Null
> >>>>>>> even when it's been obtained already? Or will a new
> ClientInstanceId be
> >>>>>>> returned without an error?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> >>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>> I would like to start a new discussion thread on KIP-714: Client
> >>>> metrics
> >>>>>>>> and observability.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>>>>>>>
> >>>>>>>> I have edited the proposal significantly to reduce the scope. The
> >>>>>> overall
> >>>>>>>> mechanism for client metric subscriptions is unchanged, but the
> >>>>>>>> KIP is now based on the existing client metrics, rather than
> >>>> introducing
> >>>>>>>> new metrics. The purpose remains helping cluster operators
> >>>>>>>> investigate performance problems experienced by clients without
> >>>>>> requiring
> >>>>>>>> changes to the client application code or configuration.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Andrew
>
>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to