Re: [DISCUSS] KIP-714: Client metrics and observability

Andrew Schofield Fri, 22 Sep 2023 12:18:50 -0700

Hi Philip,
No, I do not think it should actively search for a broker that supports the new
RPCs. In general, either all of the brokers or none of the brokers will support 
it.
In the window, where the cluster is being upgraded or client telemetry is being
enabled, there might be a mixed situation. I wouldn’t put too much effort into
this mixed scenario. As the client finds brokers which support the new RPCs,
it can begin to follow the KIP-714 mechanism.


Thanks,
Andrew

> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> wrote:
>
> Hi Andrew -
>
> Question on top of your answers: Do you think the client should actively
> search for a broker that supports this RPC? As previously mentioned, the
> broker uses the leastLoadedNode to find its first connection (am
> I correct?), and what if that broker doesn't support the metric push?
>
> P
>
> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Kirk,
>> Thanks for your question. You are correct that the presence or absence of
>> the new RPCs in the
>> ApiVersionsResponse tells the client whether to request the telemetry
>> subscriptions and push
>> metrics.
>>
>> This is of course tricky in practice. It would be conceivable, as a
>> cluster is upgraded to AK 3.7
>> or as a client metrics receiver plugin is deployed across the cluster,
>> that a client connects to some
>> brokers that support the new RPCs and some that do not.
>>
>> Here’s my suggestion:
>> * If a client is not connected to any brokers that support in the new
>> RPCs, it cannot push metrics.
>> * If a client is only connected to brokers that support the new RPCs, it
>> will use the new RPCs in
>> accordance with the KIP.
>> * If a client is connected to some brokers that support the new RPCs and
>> some that do not, it will
>> use the new RPCs with the supporting subset of brokers in accordance with
>> the KIP.
>>
>> Comments?
>>
>> Thanks,
>> Andrew
>>
>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote:
>>>
>>> Hi Andrew/Jun,
>>>
>>> I want to make sure I understand question/comment #119… In the case
>> where a cluster without a metrics client receiver is later reconfigured and
>> restarted to include a metrics client receiver, do we want the client to
>> thereafter begin pushing metrics to the cluster? From Andrew’s response to
>> question #119, it sounds like we’re using the presence/absence of the
>> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
>> indicator. Do I have that correct?
>>>
>>> Thanks,
>>> Kirk
>>>
>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
>> andrew_schofield_j...@outlook.com> wrote:
>>>>
>>>> Hi Jun,
>>>> Thanks for your comments. I’ve updated the KIP to clarify where
>> necessary.
>>>>
>>>> 110. Yes, agree. The motivation section mentions this.
>>>>
>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
>> replacement of
>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
>> think it’s a bit
>>>> of a debatable point. OTLP makes a distinction between a namespace and a
>>>> multi-word component. If it was “client.id” then “client” would be a
>> namespace with
>>>> an attribute key “id”. But “client_id” is just a key. So, it was
>> intentional, but debatable.
>>>>
>>>> 112. Thanks. The link target moved. Fixed.
>>>>
>>>> 113. Thanks. Fixed.
>>>>
>>>> 114.1. If a standard metric makes sense for a client, it should use the
>> exact same
>>>> name. If a standard metric doesn’t make sense for a client, then it can
>> omit that metric.
>>>>
>>>> For a required metric, the situation is stronger. All clients must
>> implement these
>>>> metrics with these names in order to implement the KIP. But the
>> required metrics
>>>> are essentially the number of connections and the request latency,
>> which do not
>>>> reference the underlying implementation of the client (which
>> producer.record.queue.time.max
>>>> of course does).
>>>>
>>>> I suppose someone might build a producer-only client that didn’t have
>> consumer metrics.
>>>> In this case, the consumer metrics would conceptually have the value 0
>> and would not
>>>> need to be sent to the broker.
>>>>
>>>> 114.2. If a client does not implement some metrics, they will not be
>> available for
>>>> analysis and troubleshooting. It just makes the ability to combine
>> metrics from lots
>>>> different clients less complete.
>>>>
>>>> 115. I think it was probably a mistake to be so specific about
>> threading in this KIP.
>>>> When the consumer threading refactor is complete, of course, it would
>> do the appropriate
>>>> equivalent. I’ve added a clarification and massively simplified this
>> section.
>>>>
>>>> 116. I removed “client.terminating”.
>>>>
>>>> 117. Yes. Horrid. Fixed.
>>>>
>>>> 118. The Terminating flag just indicates that this is the final
>> PushTelemetryRequest
>>>> from this client. Any subsequent request will be rejected. I think this
>> flag should remain.
>>>>
>>>> 119. Good catch. This was actually contradicting another part of the
>> KIP. The current behaviour
>>>> is indeed preserved. If the broker doesn’t have a client metrics
>> receiver plugin, the new RPCs
>>>> in this KIP are “turned off” and not reported in ApiVersionsResponse.
>> The client will not
>>>> attempt to push metrics.
>>>>
>>>> 120. The error handling table lists the error codes for
>> PushTelemetryResponse. I’ve added one
>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any
>> error codes, since the
>>>> situation in which the client telemetry is not supported is handled by
>> the RPCs not being offered
>>>> by the broker.
>>>>
>>>> 121. Again, I think it’s probably a mistake to be specific about
>> threading. Removed.
>>>>
>>>> 122. Good catch. For DescribeConfigs, the ACL operation should be
>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
>>>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
>>>> resource.
>>>>
>>>> Thanks for the detailed review.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>>
>>>>> 110. Another potential motivation is the multiple clients support.
>> Some of
>>>>> the places may not have good monitoring support for non-java clients.
>>>>>
>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name and
>>>>> replace '-' with '_' for attributes. Why is the inconsistency?
>>>>>
>>>>> 112. OTLP specification: Page is not found from the link.
>>>>>
>>>>> 113. "Defining standard and required metrics makes the monitoring and
>>>>> troubleshooting of clients from various client types ": Incomplete
>> sentence.
>>>>>
>>>>> 114. standard/required metrics
>>>>> 114.1 Do other clients need to implement those metrics with the exact
>> same
>>>>> names?
>>>>> 114.2 What happens if some of those metrics are missing from a client?
>>>>>
>>>>> 115. "KafkaConsumer: both the "heart beat" and application threads": We
>>>>> have an ongoing effort to refactor the consumer threading model (
>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
>> ).
>>>>> Once this is done, PRC requests will only be made from the background
>>>>> thread. Should this KIP follow the new model only?
>>>>>
>>>>> 116. 'The metrics should contain the reason for the client termination
>> by
>>>>> including the client.terminating metric with the label “reason” ...'.
>> Hmm,
>>>>> are we introducing a new metric client.terminating? If so, that needs
>> to be
>>>>> explicitly listed.
>>>>>
>>>>> 117. "As the metrics plugin may need to add additional metrics on top
>> of
>>>>> this the generic metrics receiver in the broker will not add these
>> labels
>>>>> but rely on the plugins to do so," The sentence doesn't read well.
>>>>>
>>>>> 118. "it is possible for the client to send at most one accepted
>>>>> out-of-profile per connection before the rate-limiter kicks in": If we
>> do
>>>>> this, do we still need the Terminating flag in PushTelemetryRequestV0?
>>>>>
>>>>> 119. "If there is no client metrics receiver plugin configured on the
>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest with
>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The client should
>>>>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs
>> has
>>>>> expired. This allows the metrics receiver to be enabled or disabled
>> without
>>>>> having to restart the broker or reset the client connection."
>>>>> "no client metrics receiver plugin configured" is defined by no metric
>>>>> reporter implementing the ClientTelemetry interface, right? In that
>> case,
>>>>> it would be useful to avoid the clients sending
>>>>> GetTelemetrySubscriptionsRequest periodically to preserve the current
>>>>> behavior.
>>>>>
>>>>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0:
>> Could
>>>>> we list error codes for each?
>>>>>
>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may
>> be
>>>>> called from the request handling thread": Where else can this method be
>>>>> called?
>>>>>
>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the
>> ACL?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jun
>>>>>
>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>
>>>>>> Hi Milind,
>>>>>> Thanks for your question.
>>>>>>
>>>>>> On reflection, I agree that INVALID_RECORD is most likely to be
>> caused by a
>>>>>> problem in the serialization in the client. I have changed the client
>>>>>> action in this case
>>>>>> to “Log an error and stop pushing metrics”.
>>>>>>
>>>>>> I have updated the KIP text accordingly.
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
>> <milut...@confluent.io.INVALID>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Andrew,
>>>>>>> Thanks for the clarifications.
>>>>>>>
>>>>>>> About 2b:
>>>>>>> In case a client has a bug while serializing, it might be difficult
>> for
>>>>>> the
>>>>>>> client to recover from that without code changes. In that, it might
>> be
>>>>>> good
>>>>>>> to just log the INVALID_RECORD as an error, and treat the error as
>> fatal
>>>>>>> for the client (only fatal in terms of sending the metrics, the
>> client
>>>>>> can
>>>>>>> keep functioning otherwise). What do you think?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Milind
>>>>>>>
>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>
>>>>>>>> Hi Milind,
>>>>>>>> Thanks for your questions about the KIP.
>>>>>>>>
>>>>>>>> 1) I did some archaeology and looked at historical versions of the
>> KIP.
>>>>>> I
>>>>>>>> think this is
>>>>>>>> just a mistake. 5 minutes is the default metric push interval. 30
>>>>>> minutes
>>>>>>>> is a mystery
>>>>>>>> to me. I’ve updated the KIP.
>>>>>>>>
>>>>>>>> 2) I think there are two situations in which INVALID_RECORD might
>> occur.
>>>>>>>> a) The client might perhaps be using a content-type that the broker
>> does
>>>>>>>> not support.
>>>>>>>> The KIP mentions content-type as a future extension, but there’s
>> only
>>>>>> one
>>>>>>>> supported
>>>>>>>> to start with. Until we have multiple content-types, this seems out
>> of
>>>>>>>> scope. I think a
>>>>>>>> future KIP would add another error code for this.
>>>>>>>> b) The client might perhaps have a bug which means the metrics
>> payload
>>>>>> is
>>>>>>>> malformed.
>>>>>>>> Logging a warning and attempting the next metrics push on the push
>>>>>>>> interval seems
>>>>>>>> appropriate.
>>>>>>>>
>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an
>> immediate
>>>>>>>> GetTelemetrySubscriptionsRequest.
>>>>>>>>
>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or
>> perhaps
>>>>>>>> a situation in which a broker sends a compression type in a
>>>>>>>> GetTelemetrySubscriptionsResponse
>>>>>>>> which is subsequently not supported when its used with a
>>>>>>>> PushTelemetryRequest.
>>>>>>>> We do want the client to have the opportunity to get an up-to-date
>> list
>>>>>> of
>>>>>>>> supported
>>>>>>>> compression types. I think an immediate
>> GetTelemetrySubscriptionsRequest
>>>>>>>> is appropriate.
>>>>>>>>
>>>>>>>> 3) If a client attempts a subsequent handshake with a Null
>>>>>>>> ClientInstanceId, the
>>>>>>>> receiving broker may not already know the client's existing
>>>>>>>> ClientInstanceId. If the
>>>>>>>> receiving broker knows the existing ClientInstanceId, it simply
>> responds
>>>>>>>> the existing
>>>>>>>> value back to the client. If it does not know the existing
>>>>>>>> ClientInstanceId, it will create
>>>>>>>> a new client instance ID and respond with that.
>>>>>>>>
>>>>>>>> I will update the KIP with these clarifications.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
>> <milut...@confluent.io.INVALID
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Andrew, thanks for this KIP.
>>>>>>>>>
>>>>>>>>> I had a few questions regarding the "Error handling" section.
>>>>>>>>>
>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to eventually
>>>>>>>> trigger
>>>>>>>>> a retry and avoid having to restart clients if the cluster metrics
>>>>>>>>> configuration is disabled temporarily, e.g., by operator error,
>> rolling
>>>>>>>>> upgrades, etc."
>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. What is it
>>>>>>>>> referring to?
>>>>>>>>>
>>>>>>>>> 2. For the actual errors:
>>>>>>>>> INVALID_RECORD : The action required is to "Log a warning to the
>>>>>>>>> application and schedule the next GetTelemetrySubscriptionsRequest
>> to 5
>>>>>>>>> minutes". Why is this 5 minutes, and not something like
>> PushIntervalMs?
>>>>>>>> And
>>>>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in
>> this
>>>>>>>>> case, if the serialization is broken?
>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to
>>>>>> confirm,
>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled
>> immediately
>>>>>>>>> after the PushTelemetry response, correct?
>>>>>>>>>
>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include
>> the
>>>>>>>>> ClientInstanceId returned in the first response, regardless of
>> broker":
>>>>>>>>> Will a broker error be returned in case some implementation of
>> this KIP
>>>>>>>>> violates this accidentally and sends a request with
>> ClientInstanceId =
>>>>>>>> Null
>>>>>>>>> even when it's been obtained already? Or will a new
>> ClientInstanceId be
>>>>>>>>> returned without an error?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I would like to start a new discussion thread on KIP-714: Client
>>>>>> metrics
>>>>>>>>>> and observability.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>>>>>>
>>>>>>>>>> I have edited the proposal significantly to reduce the scope. The
>>>>>>>> overall
>>>>>>>>>> mechanism for client metric subscriptions is unchanged, but the
>>>>>>>>>> KIP is now based on the existing client metrics, rather than
>>>>>> introducing
>>>>>>>>>> new metrics. The purpose remains helping cluster operators
>>>>>>>>>> investigate performance problems experienced by clients without
>>>>>>>> requiring
>>>>>>>>>> changes to the client application code or configuration.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to