Re: [DISCUSS] KIP-714: Client metrics and observability

Kirk True Fri, 22 Sep 2023 08:03:39 -0700

Hi Andrew/Jun,

I want to make sure I understand question/comment #119… In the case where a 
cluster without a metrics client receiver is later reconfigured and restarted 
to include a metrics client receiver, do we want the client to thereafter begin 
pushing metrics to the cluster? From Andrew’s response to question #119, it 
sounds like we’re using the presence/absence of the relevant RPCs in 
ApiVersionsResponse as the to-push-or-not-to-push indicator. Do I have that 
correct?


Thanks,
Kirk

> On Sep 21, 2023, at 7:42 AM, Andrew Schofield 
> <andrew_schofield_j...@outlook.com> wrote:
> 
> Hi Jun,
> Thanks for your comments. I’ve updated the KIP to clarify where necessary.
> 
> 110. Yes, agree. The motivation section mentions this.
> 
> 111. The replacement of ‘-‘ with ‘.’ for metric names and the replacement of
> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I think 
> it’s a bit
> of a debatable point. OTLP makes a distinction between a namespace and a
> multi-word component. If it was “client.id” then “client” would be a 
> namespace with
> an attribute key “id”. But “client_id” is just a key. So, it was intentional, 
> but debatable.
> 
> 112. Thanks. The link target moved. Fixed.
> 
> 113. Thanks. Fixed.
> 
> 114.1. If a standard metric makes sense for a client, it should use the exact 
> same
> name. If a standard metric doesn’t make sense for a client, then it can omit 
> that metric.
> 
> For a required metric, the situation is stronger. All clients must implement 
> these
> metrics with these names in order to implement the KIP. But the required 
> metrics
> are essentially the number of connections and the request latency, which do 
> not
> reference the underlying implementation of the client (which 
> producer.record.queue.time.max
> of course does).
> 
> I suppose someone might build a producer-only client that didn’t have 
> consumer metrics.
> In this case, the consumer metrics would conceptually have the value 0 and 
> would not
> need to be sent to the broker.
> 
> 114.2. If a client does not implement some metrics, they will not be 
> available for
> analysis and troubleshooting. It just makes the ability to combine metrics 
> from lots
> different clients less complete.
> 
> 115. I think it was probably a mistake to be so specific about threading in 
> this KIP.
> When the consumer threading refactor is complete, of course, it would do the 
> appropriate
> equivalent. I’ve added a clarification and massively simplified this section.
> 
> 116. I removed “client.terminating”.
> 
> 117. Yes. Horrid. Fixed.
> 
> 118. The Terminating flag just indicates that this is the final 
> PushTelemetryRequest
> from this client. Any subsequent request will be rejected. I think this flag 
> should remain.
> 
> 119. Good catch. This was actually contradicting another part of the KIP. The 
> current behaviour
> is indeed preserved. If the broker doesn’t have a client metrics receiver 
> plugin, the new RPCs
> in this KIP are “turned off” and not reported in ApiVersionsResponse. The 
> client will not
> attempt to push metrics.
> 
> 120. The error handling table lists the error codes for 
> PushTelemetryResponse. I’ve added one
> but it looked good to me. GetTelemetrySubscriptions doesn’t have any error 
> codes, since the
> situation in which the client telemetry is not supported is handled by the 
> RPCs not being offered
> by the broker.
> 
> 121. Again, I think it’s probably a mistake to be specific about threading. 
> Removed.
> 
> 122. Good catch. For DescribeConfigs, the ACL operation should be
> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
> resource.
> 
> Thanks for the detailed review.
> 
> Thanks,
> Andrew
> 
>> 
>> 110. Another potential motivation is the multiple clients support. Some of
>> the places may not have good monitoring support for non-java clients.
>> 
>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name and
>> replace '-' with '_' for attributes. Why is the inconsistency?
>> 
>> 112. OTLP specification: Page is not found from the link.
>> 
>> 113. "Defining standard and required metrics makes the monitoring and
>> troubleshooting of clients from various client types ": Incomplete sentence.
>> 
>> 114. standard/required metrics
>> 114.1 Do other clients need to implement those metrics with the exact same
>> names?
>> 114.2 What happens if some of those metrics are missing from a client?
>> 
>> 115. "KafkaConsumer: both the "heart beat" and application threads": We
>> have an ongoing effort to refactor the consumer threading model (
>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design).
>> Once this is done, PRC requests will only be made from the background
>> thread. Should this KIP follow the new model only?
>> 
>> 116. 'The metrics should contain the reason for the client termination by
>> including the client.terminating metric with the label “reason” ...'. Hmm,
>> are we introducing a new metric client.terminating? If so, that needs to be
>> explicitly listed.
>> 
>> 117. "As the metrics plugin may need to add additional metrics on top of
>> this the generic metrics receiver in the broker will not add these labels
>> but rely on the plugins to do so," The sentence doesn't read well.
>> 
>> 118. "it is possible for the client to send at most one accepted
>> out-of-profile per connection before the rate-limiter kicks in": If we do
>> this, do we still need the Terminating flag in PushTelemetryRequestV0?
>> 
>> 119. "If there is no client metrics receiver plugin configured on the
>> broker, it will respond to GetTelemetrySubscriptionsRequest with
>> RequestedMetrics set to Null and a -1 SubscriptionId. The client should
>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs has
>> expired. This allows the metrics receiver to be enabled or disabled without
>> having to restart the broker or reset the client connection."
>> "no client metrics receiver plugin configured" is defined by no metric
>> reporter implementing the ClientTelemetry interface, right? In that case,
>> it would be useful to avoid the clients sending
>> GetTelemetrySubscriptionsRequest periodically to preserve the current
>> behavior.
>> 
>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0: Could
>> we list error codes for each?
>> 
>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may be
>> called from the request handling thread": Where else can this method be
>> called?
>> 
>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the ACL?
>> 
>> Thanks,
>> 
>> Jun
>> 
>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
>> andrew_schofield_j...@outlook.com> wrote:
>> 
>>> Hi Milind,
>>> Thanks for your question.
>>> 
>>> On reflection, I agree that INVALID_RECORD is most likely to be caused by a
>>> problem in the serialization in the client. I have changed the client
>>> action in this case
>>> to “Log an error and stop pushing metrics”.
>>> 
>>> I have updated the KIP text accordingly.
>>> 
>>> Thanks,
>>> Andrew
>>> 
>>>> On 31 Jul 2023, at 12:09, Milind Luthra <milut...@confluent.io.INVALID>
>>> wrote:
>>>> 
>>>> Hi Andrew,
>>>> Thanks for the clarifications.
>>>> 
>>>> About 2b:
>>>> In case a client has a bug while serializing, it might be difficult for
>>> the
>>>> client to recover from that without code changes. In that, it might be
>>> good
>>>> to just log the INVALID_RECORD as an error, and treat the error as fatal
>>>> for the client (only fatal in terms of sending the metrics, the client
>>> can
>>>> keep functioning otherwise). What do you think?
>>>> 
>>>> Thanks
>>>> Milind
>>>> 
>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
>>>> andrew_schofield_j...@outlook.com> wrote:
>>>> 
>>>>> Hi Milind,
>>>>> Thanks for your questions about the KIP.
>>>>> 
>>>>> 1) I did some archaeology and looked at historical versions of the KIP.
>>> I
>>>>> think this is
>>>>> just a mistake. 5 minutes is the default metric push interval. 30
>>> minutes
>>>>> is a mystery
>>>>> to me. I’ve updated the KIP.
>>>>> 
>>>>> 2) I think there are two situations in which INVALID_RECORD might occur.
>>>>> a) The client might perhaps be using a content-type that the broker does
>>>>> not support.
>>>>> The KIP mentions content-type as a future extension, but there’s only
>>> one
>>>>> supported
>>>>> to start with. Until we have multiple content-types, this seems out of
>>>>> scope. I think a
>>>>> future KIP would add another error code for this.
>>>>> b) The client might perhaps have a bug which means the metrics payload
>>> is
>>>>> malformed.
>>>>> Logging a warning and attempting the next metrics push on the push
>>>>> interval seems
>>>>> appropriate.
>>>>> 
>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
>>>>> GetTelemetrySubscriptionsRequest.
>>>>> 
>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps
>>>>> a situation in which a broker sends a compression type in a
>>>>> GetTelemetrySubscriptionsResponse
>>>>> which is subsequently not supported when its used with a
>>>>> PushTelemetryRequest.
>>>>> We do want the client to have the opportunity to get an up-to-date list
>>> of
>>>>> supported
>>>>> compression types. I think an immediate GetTelemetrySubscriptionsRequest
>>>>> is appropriate.
>>>>> 
>>>>> 3) If a client attempts a subsequent handshake with a Null
>>>>> ClientInstanceId, the
>>>>> receiving broker may not already know the client's existing
>>>>> ClientInstanceId. If the
>>>>> receiving broker knows the existing ClientInstanceId, it simply responds
>>>>> the existing
>>>>> value back to the client. If it does not know the existing
>>>>> ClientInstanceId, it will create
>>>>> a new client instance ID and respond with that.
>>>>> 
>>>>> I will update the KIP with these clarifications.
>>>>> 
>>>>> Thanks,
>>>>> Andrew
>>>>> 
>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra <milut...@confluent.io.INVALID
>>>> 
>>>>> wrote:
>>>>>> 
>>>>>> Hi Andrew, thanks for this KIP.
>>>>>> 
>>>>>> I had a few questions regarding the "Error handling" section.
>>>>>> 
>>>>>> 1. It mentions that "The 5 and 30 minute retries are to eventually
>>>>> trigger
>>>>>> a retry and avoid having to restart clients if the cluster metrics
>>>>>> configuration is disabled temporarily, e.g., by operator error, rolling
>>>>>> upgrades, etc."
>>>>>> But this 30 min interval isn't mentioned anywhere else. What is it
>>>>>> referring to?
>>>>>> 
>>>>>> 2. For the actual errors:
>>>>>> INVALID_RECORD : The action required is to "Log a warning to the
>>>>>> application and schedule the next GetTelemetrySubscriptionsRequest to 5
>>>>>> minutes". Why is this 5 minutes, and not something like PushIntervalMs?
>>>>> And
>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in this
>>>>>> case, if the serialization is broken?
>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to
>>> confirm,
>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled immediately
>>>>>> after the PushTelemetry response, correct?
>>>>>> 
>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
>>>>>> ClientInstanceId returned in the first response, regardless of broker":
>>>>>> Will a broker error be returned in case some implementation of this KIP
>>>>>> violates this accidentally and sends a request with ClientInstanceId =
>>>>> Null
>>>>>> even when it's been obtained already? Or will a new ClientInstanceId be
>>>>>> returned without an error?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> I would like to start a new discussion thread on KIP-714: Client
>>> metrics
>>>>>>> and observability.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>>> 
>>>>>>> I have edited the proposal significantly to reduce the scope. The
>>>>> overall
>>>>>>> mechanism for client metric subscriptions is unchanged, but the
>>>>>>> KIP is now based on the existing client metrics, rather than
>>> introducing
>>>>>>> new metrics. The purpose remains helping cluster operators
>>>>>>> investigate performance problems experienced by clients without
>>>>> requiring
>>>>>>> changes to the client application code or configuration.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Andrew
> 
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to