Re: [DISCUSS] KIP-714: Client metrics and observability

David Jacot Fri, 29 Sep 2023 17:45:53 -0700

Hi Andrew,

Thanks for driving this one. I haven't read all the KIP yet but I already
have an initial question. In the Threading section, it is written
"KafkaConsumer: the "background" thread (based on the consumer threading
refactor which is underway)". If I understand this correctly, it means
that KIP-714 won't work if the "old consumer" is used. Am I correct?


Cheers,
David


On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
[email protected]> wrote:

> Hi Philip,
> No, I do not think it should actively search for a broker that supports
> the new
> RPCs. In general, either all of the brokers or none of the brokers will
> support it.
> In the window, where the cluster is being upgraded or client telemetry is
> being
> enabled, there might be a mixed situation. I wouldn’t put too much effort
> into
> this mixed scenario. As the client finds brokers which support the new
> RPCs,
> it can begin to follow the KIP-714 mechanism.
>
> Thanks,
> Andrew
>
> > On 22 Sep 2023, at 20:01, Philip Nee <[email protected]> wrote:
> >
> > Hi Andrew -
> >
> > Question on top of your answers: Do you think the client should actively
> > search for a broker that supports this RPC? As previously mentioned, the
> > broker uses the leastLoadedNode to find its first connection (am
> > I correct?), and what if that broker doesn't support the metric push?
> >
> > P
> >
> > On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
> > [email protected]> wrote:
> >
> >> Hi Kirk,
> >> Thanks for your question. You are correct that the presence or absence
> of
> >> the new RPCs in the
> >> ApiVersionsResponse tells the client whether to request the telemetry
> >> subscriptions and push
> >> metrics.
> >>
> >> This is of course tricky in practice. It would be conceivable, as a
> >> cluster is upgraded to AK 3.7
> >> or as a client metrics receiver plugin is deployed across the cluster,
> >> that a client connects to some
> >> brokers that support the new RPCs and some that do not.
> >>
> >> Here’s my suggestion:
> >> * If a client is not connected to any brokers that support in the new
> >> RPCs, it cannot push metrics.
> >> * If a client is only connected to brokers that support the new RPCs, it
> >> will use the new RPCs in
> >> accordance with the KIP.
> >> * If a client is connected to some brokers that support the new RPCs and
> >> some that do not, it will
> >> use the new RPCs with the supporting subset of brokers in accordance
> with
> >> the KIP.
> >>
> >> Comments?
> >>
> >> Thanks,
> >> Andrew
> >>
> >>> On 22 Sep 2023, at 16:01, Kirk True <[email protected]> wrote:
> >>>
> >>> Hi Andrew/Jun,
> >>>
> >>> I want to make sure I understand question/comment #119… In the case
> >> where a cluster without a metrics client receiver is later reconfigured
> and
> >> restarted to include a metrics client receiver, do we want the client to
> >> thereafter begin pushing metrics to the cluster? From Andrew’s response
> to
> >> question #119, it sounds like we’re using the presence/absence of the
> >> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
> >> indicator. Do I have that correct?
> >>>
> >>> Thanks,
> >>> Kirk
> >>>
> >>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
> >> [email protected]> wrote:
> >>>>
> >>>> Hi Jun,
> >>>> Thanks for your comments. I’ve updated the KIP to clarify where
> >> necessary.
> >>>>
> >>>> 110. Yes, agree. The motivation section mentions this.
> >>>>
> >>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
> >> replacement of
> >>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
> >> think it’s a bit
> >>>> of a debatable point. OTLP makes a distinction between a namespace
> and a
> >>>> multi-word component. If it was “client.id” then “client” would be a
> >> namespace with
> >>>> an attribute key “id”. But “client_id” is just a key. So, it was
> >> intentional, but debatable.
> >>>>
> >>>> 112. Thanks. The link target moved. Fixed.
> >>>>
> >>>> 113. Thanks. Fixed.
> >>>>
> >>>> 114.1. If a standard metric makes sense for a client, it should use
> the
> >> exact same
> >>>> name. If a standard metric doesn’t make sense for a client, then it
> can
> >> omit that metric.
> >>>>
> >>>> For a required metric, the situation is stronger. All clients must
> >> implement these
> >>>> metrics with these names in order to implement the KIP. But the
> >> required metrics
> >>>> are essentially the number of connections and the request latency,
> >> which do not
> >>>> reference the underlying implementation of the client (which
> >> producer.record.queue.time.max
> >>>> of course does).
> >>>>
> >>>> I suppose someone might build a producer-only client that didn’t have
> >> consumer metrics.
> >>>> In this case, the consumer metrics would conceptually have the value 0
> >> and would not
> >>>> need to be sent to the broker.
> >>>>
> >>>> 114.2. If a client does not implement some metrics, they will not be
> >> available for
> >>>> analysis and troubleshooting. It just makes the ability to combine
> >> metrics from lots
> >>>> different clients less complete.
> >>>>
> >>>> 115. I think it was probably a mistake to be so specific about
> >> threading in this KIP.
> >>>> When the consumer threading refactor is complete, of course, it would
> >> do the appropriate
> >>>> equivalent. I’ve added a clarification and massively simplified this
> >> section.
> >>>>
> >>>> 116. I removed “client.terminating”.
> >>>>
> >>>> 117. Yes. Horrid. Fixed.
> >>>>
> >>>> 118. The Terminating flag just indicates that this is the final
> >> PushTelemetryRequest
> >>>> from this client. Any subsequent request will be rejected. I think
> this
> >> flag should remain.
> >>>>
> >>>> 119. Good catch. This was actually contradicting another part of the
> >> KIP. The current behaviour
> >>>> is indeed preserved. If the broker doesn’t have a client metrics
> >> receiver plugin, the new RPCs
> >>>> in this KIP are “turned off” and not reported in ApiVersionsResponse.
> >> The client will not
> >>>> attempt to push metrics.
> >>>>
> >>>> 120. The error handling table lists the error codes for
> >> PushTelemetryResponse. I’ve added one
> >>>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any
> >> error codes, since the
> >>>> situation in which the client telemetry is not supported is handled by
> >> the RPCs not being offered
> >>>> by the broker.
> >>>>
> >>>> 121. Again, I think it’s probably a mistake to be specific about
> >> threading. Removed.
> >>>>
> >>>> 122. Good catch. For DescribeConfigs, the ACL operation should be
> >>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
> >>>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
> >>>> resource.
> >>>>
> >>>> Thanks for the detailed review.
> >>>>
> >>>> Thanks,
> >>>> Andrew
> >>>>
> >>>>>
> >>>>> 110. Another potential motivation is the multiple clients support.
> >> Some of
> >>>>> the places may not have good monitoring support for non-java clients.
> >>>>>
> >>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name
> and
> >>>>> replace '-' with '_' for attributes. Why is the inconsistency?
> >>>>>
> >>>>> 112. OTLP specification: Page is not found from the link.
> >>>>>
> >>>>> 113. "Defining standard and required metrics makes the monitoring and
> >>>>> troubleshooting of clients from various client types ": Incomplete
> >> sentence.
> >>>>>
> >>>>> 114. standard/required metrics
> >>>>> 114.1 Do other clients need to implement those metrics with the exact
> >> same
> >>>>> names?
> >>>>> 114.2 What happens if some of those metrics are missing from a
> client?
> >>>>>
> >>>>> 115. "KafkaConsumer: both the "heart beat" and application threads":
> We
> >>>>> have an ongoing effort to refactor the consumer threading model (
> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
> >> ).
> >>>>> Once this is done, PRC requests will only be made from the background
> >>>>> thread. Should this KIP follow the new model only?
> >>>>>
> >>>>> 116. 'The metrics should contain the reason for the client
> termination
> >> by
> >>>>> including the client.terminating metric with the label “reason” ...'.
> >> Hmm,
> >>>>> are we introducing a new metric client.terminating? If so, that needs
> >> to be
> >>>>> explicitly listed.
> >>>>>
> >>>>> 117. "As the metrics plugin may need to add additional metrics on top
> >> of
> >>>>> this the generic metrics receiver in the broker will not add these
> >> labels
> >>>>> but rely on the plugins to do so," The sentence doesn't read well.
> >>>>>
> >>>>> 118. "it is possible for the client to send at most one accepted
> >>>>> out-of-profile per connection before the rate-limiter kicks in": If
> we
> >> do
> >>>>> this, do we still need the Terminating flag in
> PushTelemetryRequestV0?
> >>>>>
> >>>>> 119. "If there is no client metrics receiver plugin configured on the
> >>>>> broker, it will respond to GetTelemetrySubscriptionsRequest with
> >>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The client
> should
> >>>>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs
> >> has
> >>>>> expired. This allows the metrics receiver to be enabled or disabled
> >> without
> >>>>> having to restart the broker or reset the client connection."
> >>>>> "no client metrics receiver plugin configured" is defined by no
> metric
> >>>>> reporter implementing the ClientTelemetry interface, right? In that
> >> case,
> >>>>> it would be useful to avoid the clients sending
> >>>>> GetTelemetrySubscriptionsRequest periodically to preserve the current
> >>>>> behavior.
> >>>>>
> >>>>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0:
> >> Could
> >>>>> we list error codes for each?
> >>>>>
> >>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may
> >> be
> >>>>> called from the request handling thread": Where else can this method
> be
> >>>>> called?
> >>>>>
> >>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the
> >> ACL?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jun
> >>>>>
> >>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Hi Milind,
> >>>>>> Thanks for your question.
> >>>>>>
> >>>>>> On reflection, I agree that INVALID_RECORD is most likely to be
> >> caused by a
> >>>>>> problem in the serialization in the client. I have changed the
> client
> >>>>>> action in this case
> >>>>>> to “Log an error and stop pushing metrics”.
> >>>>>>
> >>>>>> I have updated the KIP text accordingly.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Andrew
> >>>>>>
> >>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
> >> <[email protected]>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi Andrew,
> >>>>>>> Thanks for the clarifications.
> >>>>>>>
> >>>>>>> About 2b:
> >>>>>>> In case a client has a bug while serializing, it might be difficult
> >> for
> >>>>>> the
> >>>>>>> client to recover from that without code changes. In that, it might
> >> be
> >>>>>> good
> >>>>>>> to just log the INVALID_RECORD as an error, and treat the error as
> >> fatal
> >>>>>>> for the client (only fatal in terms of sending the metrics, the
> >> client
> >>>>>> can
> >>>>>>> keep functioning otherwise). What do you think?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Milind
> >>>>>>>
> >>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi Milind,
> >>>>>>>> Thanks for your questions about the KIP.
> >>>>>>>>
> >>>>>>>> 1) I did some archaeology and looked at historical versions of the
> >> KIP.
> >>>>>> I
> >>>>>>>> think this is
> >>>>>>>> just a mistake. 5 minutes is the default metric push interval. 30
> >>>>>> minutes
> >>>>>>>> is a mystery
> >>>>>>>> to me. I’ve updated the KIP.
> >>>>>>>>
> >>>>>>>> 2) I think there are two situations in which INVALID_RECORD might
> >> occur.
> >>>>>>>> a) The client might perhaps be using a content-type that the
> broker
> >> does
> >>>>>>>> not support.
> >>>>>>>> The KIP mentions content-type as a future extension, but there’s
> >> only
> >>>>>> one
> >>>>>>>> supported
> >>>>>>>> to start with. Until we have multiple content-types, this seems
> out
> >> of
> >>>>>>>> scope. I think a
> >>>>>>>> future KIP would add another error code for this.
> >>>>>>>> b) The client might perhaps have a bug which means the metrics
> >> payload
> >>>>>> is
> >>>>>>>> malformed.
> >>>>>>>> Logging a warning and attempting the next metrics push on the push
> >>>>>>>> interval seems
> >>>>>>>> appropriate.
> >>>>>>>>
> >>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an
> >> immediate
> >>>>>>>> GetTelemetrySubscriptionsRequest.
> >>>>>>>>
> >>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or
> >> perhaps
> >>>>>>>> a situation in which a broker sends a compression type in a
> >>>>>>>> GetTelemetrySubscriptionsResponse
> >>>>>>>> which is subsequently not supported when its used with a
> >>>>>>>> PushTelemetryRequest.
> >>>>>>>> We do want the client to have the opportunity to get an up-to-date
> >> list
> >>>>>> of
> >>>>>>>> supported
> >>>>>>>> compression types. I think an immediate
> >> GetTelemetrySubscriptionsRequest
> >>>>>>>> is appropriate.
> >>>>>>>>
> >>>>>>>> 3) If a client attempts a subsequent handshake with a Null
> >>>>>>>> ClientInstanceId, the
> >>>>>>>> receiving broker may not already know the client's existing
> >>>>>>>> ClientInstanceId. If the
> >>>>>>>> receiving broker knows the existing ClientInstanceId, it simply
> >> responds
> >>>>>>>> the existing
> >>>>>>>> value back to the client. If it does not know the existing
> >>>>>>>> ClientInstanceId, it will create
> >>>>>>>> a new client instance ID and respond with that.
> >>>>>>>>
> >>>>>>>> I will update the KIP with these clarifications.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Andrew
> >>>>>>>>
> >>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
> >> <[email protected]
> >>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Andrew, thanks for this KIP.
> >>>>>>>>>
> >>>>>>>>> I had a few questions regarding the "Error handling" section.
> >>>>>>>>>
> >>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to
> eventually
> >>>>>>>> trigger
> >>>>>>>>> a retry and avoid having to restart clients if the cluster
> metrics
> >>>>>>>>> configuration is disabled temporarily, e.g., by operator error,
> >> rolling
> >>>>>>>>> upgrades, etc."
> >>>>>>>>> But this 30 min interval isn't mentioned anywhere else. What is
> it
> >>>>>>>>> referring to?
> >>>>>>>>>
> >>>>>>>>> 2. For the actual errors:
> >>>>>>>>> INVALID_RECORD : The action required is to "Log a warning to the
> >>>>>>>>> application and schedule the next
> GetTelemetrySubscriptionsRequest
> >> to 5
> >>>>>>>>> minutes". Why is this 5 minutes, and not something like
> >> PushIntervalMs?
> >>>>>>>> And
> >>>>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in
> >> this
> >>>>>>>>> case, if the serialization is broken?
> >>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to
> >>>>>> confirm,
> >>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled
> >> immediately
> >>>>>>>>> after the PushTelemetry response, correct?
> >>>>>>>>>
> >>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include
> >> the
> >>>>>>>>> ClientInstanceId returned in the first response, regardless of
> >> broker":
> >>>>>>>>> Will a broker error be returned in case some implementation of
> >> this KIP
> >>>>>>>>> violates this accidentally and sends a request with
> >> ClientInstanceId =
> >>>>>>>> Null
> >>>>>>>>> even when it's been obtained already? Or will a new
> >> ClientInstanceId be
> >>>>>>>>> returned without an error?
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> >>>>>>>>> [email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>> I would like to start a new discussion thread on KIP-714: Client
> >>>>>> metrics
> >>>>>>>>>> and observability.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>>>>>>>>>
> >>>>>>>>>> I have edited the proposal significantly to reduce the scope.
> The
> >>>>>>>> overall
> >>>>>>>>>> mechanism for client metric subscriptions is unchanged, but the
> >>>>>>>>>> KIP is now based on the existing client metrics, rather than
> >>>>>> introducing
> >>>>>>>>>> new metrics. The purpose remains helping cluster operators
> >>>>>>>>>> investigate performance problems experienced by clients without
> >>>>>>>> requiring
> >>>>>>>>>> changes to the client application code or configuration.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Andrew
>
>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to