Hi Matthias,
Good point. Makes sense to me.

Is this something that can also be included in the proposed Kafka Streams 
follow-on KIP, or would you prefer that I add it to KIP-714?
I have a slight preference for the former to put all of the KS enhancements 
into a separate KIP.

Thanks,
Andrew

> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote:
>
> Thanks Andrew. SGTM.
>
> One point you did not address is the idea to add a method to `KafkaStreams` 
> similar to the proposed `clientInstanceId()` that will be added to 
> consumer/producer/admin clients.
>
> Without addressing this, Kafka Streams users won't have a way to get the 
> assigned `instanceId` of the internally created clients, and thus it would be 
> very difficult for them to know which of the metrics the broker receives belong 
> to a Kafka Streams app. It seems they would only find the `instanceIds` in 
> the log4j output, if they enable client logging?
>
> Of course, because there are multiple clients inside Kafka Streams, the return 
> type cannot be a single "String", but must be some more complex data 
> structure -- we could either add a new class, or return a Map<String,String> 
> using a client key that maps to the `instanceId`.
>
> For example we could use the following key:
>
>   [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]
>
> (Of course, only the valid combinations.)
>
> Or maybe even better, we might want to return a `Future`, because collecting 
> all the `instanceId`s might be a blocking call on each client? I already have a 
> few ideas how it could be implemented, but I don't think it must be discussed 
> in the KIP, as it's an implementation detail.
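>
> Just to sketch the shape I have in mind (all names and signatures here are 
> hypothetical and purely illustrative, not a concrete proposal):
>
>     import java.util.Map;
>     import java.util.concurrent.Future;
>
>     // Hypothetical sketch only -- not part of the KIP.
>     public interface StreamsClientInstanceIds {
>         /**
>          * The instanceIds of the internally created clients, keyed by something
>          * like "StreamThread-1-consumer", "StreamThread-1-restore-consumer",
>          * "StreamThread-1-producer" or "GlobalStreamThread-consumer".
>          * Returned as a Future because collecting the ids may block per client.
>          */
>         Future<Map<String, String>> clientInstanceIds();
>     }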
>
> Thoughts?
>
>
> -Matthias
>
> On 10/6/23 4:21 AM, Andrew Schofield wrote:
>> Hi Matthias,
>> Thanks for your comments. I agree that a follow-up KIP for Kafka Streams 
>> makes sense. This KIP currently has made a bit
>> of an effort to embrace KS, but it’s not enough by a long way.
>> I have removed `application.id`. This should be 
>> done properly in the follow-up KIP. I don’t believe there’s a downside to 
>> removing it from this KIP.
>> I have reworded the statement about temporality. In practice, the 
>> implementation of this KIP that’s underway while the voting progresses 
>> happens to use delta temporality, but that’s an implementation detail. 
>> Supporting clients must support both temporalities.
>> I thought about exposing the client instance ID as a metric, but non-numeric 
>> metrics are not usual practice and tools
>> do not universally support them. I don’t think the KIP is improved by adding 
>> one now.
>> I have also added ENABLE_METRICS_PUSH_CONFIG constants to the various 
>> Config classes, including StreamsConfig. It’s best to be explicit about this.
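>>
>> For illustration, the constant would look something like this in each Config 
>> class (the doc text here is just a sketch):
>>
>>     // Sketch only -- as it might appear in StreamsConfig and the other Config classes.
>>     public static final String ENABLE_METRICS_PUSH_CONFIG = "enable.metrics.push";
>>     public static final String ENABLE_METRICS_PUSH_DOC =
>>         "Whether to enable pushing of client metrics to the cluster, if the cluster "
>>             + "has a client metrics subscription which matches this client.";
>>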
>> Thanks,
>> Andrew
>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> wrote:
>>>
>>> Hi,
>>>
>>> I did not pay attention to this KIP in the past; seems it was on-hold for a 
>>> while.
>>>
>>> Overall it sounds very useful, and I think we should extend this with a 
>>> follow-up KIP for Kafka Streams. What is unclear to me at this point is the 
>>> statement:
>>>
>>>> Kafka Streams applications have an application.id configured and this 
>>>> identifier should be included as the application_id metrics label.
>>>
>>> The `application.id` is currently only used as the (main) consumer's 
>>> `group.id` (and is part of an auto-generated `client.id` if the user does 
>>> not set one).
>>>
>>> This comment relates to:
>>>
>>>> The following labels should be added by the client as appropriate before 
>>>> metrics are pushed.
>>>
>>> Given that Kafka Streams uses the consumer/producer/admin clients as "black 
>>> boxes", a client does not currently know that it's part of a Kafka 
>>> Streams application, and thus, it won't be able to attach any such label to 
>>> the metrics it sends. (Also, the producer and admin client don't even know the 
>>> value of `application.id` -- only the (main) consumer does, indirectly via 
>>> `group.id`; the restore and global consumers don't know it either, because 
>>> they don't have `group.id` set.)
>>>
>>> While I am totally in favor of the proposal, I am wondering how we intend 
>>> to implement it in a clean way? Or would it be ok to have some internal 
>>> client APIs that KS can use to "register" itself with the client?
>>>
>>>
>>>
>>>> While clients must support both temporalities, the broker will initially 
>>>> only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True
>>>
>>> Not sure if I can follow. How is the decision made about DELTA or CUMULATIVE 
>>> metrics? Should the broker-side plugin not decide what metrics it wants to 
>>> receive in which form? So what does "initially" mean -- the broker won't 
>>> ship with a default plugin implementation?
>>>
>>>
>>>
>>>> The following method is added to the Producer, Consumer, and Admin client 
>>>> interfaces:
>>>
>>> Should we add anything to Kafka Streams to expose the underlying clients' 
>>> assigned client-instance-ids programmatically? I am also wondering if 
>>> clients should report their assigned client-instance-ids as metrics themselves 
>>> (in this case, Kafka Streams won't need to do anything, because we already 
>>> expose all client metrics).
>>>
>>> If we add anything programmatic, we need to make it simple, given that 
>>> Kafka Streams has many clients per `StreamThread` and may have multiple 
>>> threads.
>>>
>>>
>>>
>>>> enable.metrics.push
>>> It might be worth adding this to `StreamsConfig`, too? If set via 
>>> StreamsConfig, we would forward it to all clients automatically.
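>>>
>>> E.g. something like this (illustrative only; using the plain config string 
>>> since the StreamsConfig constant doesn't exist yet):
>>>
>>>     Properties props = new Properties();
>>>     props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
>>>     props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>>>     // set once here; would be forwarded to the internal
>>>     // consumer/producer/admin clients automatically
>>>     props.put("enable.metrics.push", "false");
>>>     KafkaStreams streams = new KafkaStreams(topology, props); // topology built elsewhere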
>>>
>>>
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 9/29/23 5:45 PM, David Jacot wrote:
>>>> Hi Andrew,
>>>> Thanks for driving this one. I haven't read all the KIP yet but I already
>>>> have an initial question. In the Threading section, it is written
>>>> "KafkaConsumer: the "background" thread (based on the consumer threading
>>>> refactor which is underway)". If I understand this correctly, it means
>>>> that KIP-714 won't work if the "old consumer" is used. Am I correct?
>>>> Cheers,
>>>> David
>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>> Hi Philip,
>>>>> No, I do not think it should actively search for a broker that supports
>>>>> the new
>>>>> RPCs. In general, either all of the brokers or none of the brokers will
>>>>> support it.
>>>>> In the window where the cluster is being upgraded or client telemetry is
>>>>> being
>>>>> enabled, there might be a mixed situation. I wouldn’t put too much effort
>>>>> into
>>>>> this mixed scenario. As the client finds brokers which support the new
>>>>> RPCs,
>>>>> it can begin to follow the KIP-714 mechanism.
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Andrew -
>>>>>>
>>>>>> Question on top of your answers: Do you think the client should actively
>>>>>> search for a broker that supports this RPC? As previously mentioned, the
>>>>>> client uses the leastLoadedNode to find its first connection (am
>>>>>> I correct?), and what if that broker doesn't support the metric push?
>>>>>>
>>>>>> P
>>>>>>
>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>
>>>>>>> Hi Kirk,
>>>>>>> Thanks for your question. You are correct that the presence or absence of
>>>>>>> the new RPCs in the ApiVersionsResponse tells the client whether to request
>>>>>>> the telemetry subscriptions and push metrics.
>>>>>>>
>>>>>>> This is of course tricky in practice. It would be conceivable, as a
>>>>>>> cluster is upgraded to AK 3.7
>>>>>>> or as a client metrics receiver plugin is deployed across the cluster,
>>>>>>> that a client connects to some
>>>>>>> brokers that support the new RPCs and some that do not.
>>>>>>>
>>>>>>> Here’s my suggestion:
>>>>>>> * If a client is not connected to any brokers that support the new
>>>>>>> RPCs, it cannot push metrics.
>>>>>>> * If a client is only connected to brokers that support the new RPCs, it
>>>>>>> will use the new RPCs in
>>>>>>> accordance with the KIP.
>>>>>>> * If a client is connected to some brokers that support the new RPCs and
>>>>>>> some that do not, it will use the new RPCs with the supporting subset of
>>>>>>> brokers in accordance with the KIP.
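>>>>>>>
>>>>>>> In rough pseudo-code, the per-broker decision might look like this (purely
>>>>>>> illustrative; not actual client internals):
>>>>>>>
>>>>>>>     // Hypothetical helper: gate the KIP-714 RPCs per broker connection,
>>>>>>>     // based on the ApiVersionsResponse received from that broker.
>>>>>>>     boolean telemetrySupported(Set<Integer> apiKeysAdvertisedByBroker,
>>>>>>>                                int getTelemetrySubscriptionsApiKey,
>>>>>>>                                int pushTelemetryApiKey) {
>>>>>>>         return apiKeysAdvertisedByBroker.contains(getTelemetrySubscriptionsApiKey)
>>>>>>>             && apiKeysAdvertisedByBroker.contains(pushTelemetryApiKey);
>>>>>>>     }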
>>>>>>>
>>>>>>> Comments?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote:
>>>>>>>>
>>>>>>>> Hi Andrew/Jun,
>>>>>>>>
>>>>>>>> I want to make sure I understand question/comment #119… In the case
>>>>>>> where a cluster without a metrics client receiver is later reconfigured
>>>>> and
>>>>>>> restarted to include a metrics client receiver, do we want the client to
>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s response
>>>>> to
>>>>>>> question #119, it sounds like we’re using the presence/absence of the
>>>>>>> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
>>>>>>> indicator. Do I have that correct?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kirk
>>>>>>>>
>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Jun,
>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify where
>>>>>>> necessary.
>>>>>>>>>
>>>>>>>>> 110. Yes, agree. The motivation section mentions this.
>>>>>>>>>
>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
>>>>>>> replacement of
>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
>>>>>>> think it’s a bit
>>>>>>>>> of a debatable point. OTLP makes a distinction between a namespace
>>>>> and a
>>>>>>>>> multi-word component. If it was “client.id” then “client” would be a
>>>>>>> namespace with
>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it was
>>>>>>> intentional, but debatable.
>>>>>>>>>
>>>>>>>>> 112. Thanks. The link target moved. Fixed.
>>>>>>>>>
>>>>>>>>> 113. Thanks. Fixed.
>>>>>>>>>
>>>>>>>>> 114.1. If a standard metric makes sense for a client, it should use
>>>>> the
>>>>>>> exact same
>>>>>>>>> name. If a standard metric doesn’t make sense for a client, then it
>>>>> can
>>>>>>> omit that metric.
>>>>>>>>>
>>>>>>>>> For a required metric, the situation is stronger. All clients must
>>>>>>> implement these
>>>>>>>>> metrics with these names in order to implement the KIP. But the
>>>>>>> required metrics
>>>>>>>>> are essentially the number of connections and the request latency,
>>>>>>> which do not
>>>>>>>>> reference the underlying implementation of the client (which
>>>>>>> producer.record.queue.time.max
>>>>>>>>> of course does).
>>>>>>>>>
>>>>>>>>> I suppose someone might build a producer-only client that didn’t have
>>>>>>> consumer metrics.
>>>>>>>>> In this case, the consumer metrics would conceptually have the value 0
>>>>>>> and would not
>>>>>>>>> need to be sent to the broker.
>>>>>>>>>
>>>>>>>>> 114.2. If a client does not implement some metrics, they will not be
>>>>>>>>> available for analysis and troubleshooting. It just makes the ability to
>>>>>>>>> combine metrics from lots of different clients less complete.
>>>>>>>>>
>>>>>>>>> 115. I think it was probably a mistake to be so specific about
>>>>>>> threading in this KIP.
>>>>>>>>> When the consumer threading refactor is complete, of course, it would
>>>>>>> do the appropriate
>>>>>>>>> equivalent. I’ve added a clarification and massively simplified this
>>>>>>> section.
>>>>>>>>>
>>>>>>>>> 116. I removed “client.terminating”.
>>>>>>>>>
>>>>>>>>> 117. Yes. Horrid. Fixed.
>>>>>>>>>
>>>>>>>>> 118. The Terminating flag just indicates that this is the final
>>>>>>> PushTelemetryRequest
>>>>>>>>> from this client. Any subsequent request will be rejected. I think
>>>>> this
>>>>>>> flag should remain.
>>>>>>>>>
>>>>>>>>> 119. Good catch. This was actually contradicting another part of the
>>>>>>> KIP. The current behaviour
>>>>>>>>> is indeed preserved. If the broker doesn’t have a client metrics
>>>>>>> receiver plugin, the new RPCs
>>>>>>>>> in this KIP are “turned off” and not reported in ApiVersionsResponse.
>>>>>>> The client will not
>>>>>>>>> attempt to push metrics.
>>>>>>>>>
>>>>>>>>> 120. The error handling table lists the error codes for
>>>>>>> PushTelemetryResponse. I’ve added one
>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any
>>>>>>> error codes, since the
>>>>>>>>> situation in which the client telemetry is not supported is handled by
>>>>>>> the RPCs not being offered
>>>>>>>>> by the broker.
>>>>>>>>>
>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific about
>>>>>>> threading. Removed.
>>>>>>>>>
>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation should be
>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
>>>>>>>>> resource.
>>>>>>>>>
>>>>>>>>> Thanks for the detailed review.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 110. Another potential motivation is the multiple clients support.
>>>>>>> Some of
>>>>>>>>>> the places may not have good monitoring support for non-java clients.
>>>>>>>>>>
>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric names and
>>>>>>>>>> replace '-' with '_' for attributes. Why the inconsistency?
>>>>>>>>>>
>>>>>>>>>> 112. OTLP specification: Page is not found from the link.
>>>>>>>>>>
>>>>>>>>>> 113. "Defining standard and required metrics makes the monitoring and
>>>>>>>>>> troubleshooting of clients from various client types ": Incomplete
>>>>>>> sentence.
>>>>>>>>>>
>>>>>>>>>> 114. standard/required metrics
>>>>>>>>>> 114.1 Do other clients need to implement those metrics with the exact
>>>>>>> same
>>>>>>>>>> names?
>>>>>>>>>> 114.2 What happens if some of those metrics are missing from a
>>>>> client?
>>>>>>>>>>
>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application threads":
>>>>> We
>>>>>>>>>> have an ongoing effort to refactor the consumer threading model (
>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
>>>>>>> ).
>>>>>>>>>> Once this is done, PRC requests will only be made from the background
>>>>>>>>>> thread. Should this KIP follow the new model only?
>>>>>>>>>>
>>>>>>>>>> 116. 'The metrics should contain the reason for the client
>>>>> termination
>>>>>>> by
>>>>>>>>>> including the client.terminating metric with the label “reason” ...'.
>>>>>>> Hmm,
>>>>>>>>>> are we introducing a new metric client.terminating? If so, that needs
>>>>>>> to be
>>>>>>>>>> explicitly listed.
>>>>>>>>>>
>>>>>>>>>> 117. "As the metrics plugin may need to add additional metrics on top
>>>>>>> of
>>>>>>>>>> this the generic metrics receiver in the broker will not add these
>>>>>>> labels
>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't read well.
>>>>>>>>>>
>>>>>>>>>> 118. "it is possible for the client to send at most one accepted
>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks in": If
>>>>> we
>>>>>>> do
>>>>>>>>>> this, do we still need the Terminating flag in
>>>>> PushTelemetryRequestV0?
>>>>>>>>>>
>>>>>>>>>> 119. "If there is no client metrics receiver plugin configured on the
>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest with
>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The client
>>>>> should
>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs
>>>>>>> has
>>>>>>>>>> expired. This allows the metrics receiver to be enabled or disabled
>>>>>>> without
>>>>>>>>>> having to restart the broker or reset the client connection."
>>>>>>>>>> "no client metrics receiver plugin configured" is defined by no
>>>>> metric
>>>>>>>>>> reporter implementing the ClientTelemetry interface, right? In that
>>>>>>> case,
>>>>>>>>>> it would be useful to avoid the clients sending
>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve the current
>>>>>>>>>> behavior.
>>>>>>>>>>
>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0:
>>>>>>> Could
>>>>>>>>>> we list error codes for each?
>>>>>>>>>>
>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may
>>>>>>> be
>>>>>>>>>> called from the request handling thread": Where else can this method
>>>>> be
>>>>>>>>>> called?
>>>>>>>>>>
>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the
>>>>>>> ACL?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jun
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Milind,
>>>>>>>>>>> Thanks for your question.
>>>>>>>>>>>
>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely to be
>>>>>>> caused by a
>>>>>>>>>>> problem in the serialization in the client. I have changed the
>>>>> client
>>>>>>>>>>> action in this case
>>>>>>>>>>> to “Log an error and stop pushing metrics”.
>>>>>>>>>>>
>>>>>>>>>>> I have updated the KIP text accordingly.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
>>>>>>> <milut...@confluent.io.INVALID>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>> Thanks for the clarifications.
>>>>>>>>>>>>
>>>>>>>>>>>> About 2b:
>>>>>>>>>>>> In case a client has a bug while serializing, it might be difficult for
>>>>>>>>>>>> the client to recover from that without code changes. In that case, it
>>>>>>>>>>>> might be good to just log the INVALID_RECORD as an error, and treat the
>>>>>>>>>>>> error as fatal for the client (only fatal in terms of sending the metrics,
>>>>>>>>>>>> the client can keep functioning otherwise). What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Milind
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>> Thanks for your questions about the KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) I did some archaeology and looked at historical versions of the
>>>>>>> KIP.
>>>>>>>>>>> I
>>>>>>>>>>>>> think this is
>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push interval. 30
>>>>>>>>>>> minutes
>>>>>>>>>>>>> is a mystery
>>>>>>>>>>>>> to me. I’ve updated the KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) I think there are two situations in which INVALID_RECORD might
>>>>>>> occur.
>>>>>>>>>>>>> a) The client might perhaps be using a content-type that the
>>>>> broker
>>>>>>> does
>>>>>>>>>>>>> not support.
>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but there’s
>>>>>>> only
>>>>>>>>>>> one
>>>>>>>>>>>>> supported
>>>>>>>>>>>>> to start with. Until we have multiple content-types, this seems
>>>>> out
>>>>>>> of
>>>>>>>>>>>>> scope. I think a
>>>>>>>>>>>>> future KIP would add another error code for this.
>>>>>>>>>>>>> b) The client might perhaps have a bug which means the metrics
>>>>>>> payload
>>>>>>>>>>> is
>>>>>>>>>>>>> malformed.
>>>>>>>>>>>>> Logging a warning and attempting the next metrics push on the push
>>>>>>>>>>>>> interval seems
>>>>>>>>>>>>> appropriate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an
>>>>>>> immediate
>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest.
>>>>>>>>>>>>>
>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps
>>>>>>>>>>>>> a situation in which a broker sends a compression type in a
>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse which is subsequently not supported
>>>>>>>>>>>>> when it’s used with a PushTelemetryRequest.
>>>>>>>>>>>>> We do want the client to have the opportunity to get an up-to-date
>>>>>>> list
>>>>>>>>>>> of
>>>>>>>>>>>>> supported
>>>>>>>>>>>>> compression types. I think an immediate
>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>>>>> is appropriate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null
>>>>>>>>>>>>> ClientInstanceId, the receiving broker may not already know the client's
>>>>>>>>>>>>> existing ClientInstanceId. If the receiving broker knows the existing
>>>>>>>>>>>>> ClientInstanceId, it simply responds with the existing value to the
>>>>>>>>>>>>> client. If it does not know the existing ClientInstanceId, it will create
>>>>>>>>>>>>> a new client instance ID and respond with that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I will update the KIP with these clarifications.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
>>>>>>> <milut...@confluent.io.INVALID
>>>>>>>>>>>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I had a few questions regarding the "Error handling" section.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to
>>>>> eventually
>>>>>>>>>>>>> trigger
>>>>>>>>>>>>>> a retry and avoid having to restart clients if the cluster
>>>>> metrics
>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by operator error,
>>>>>>> rolling
>>>>>>>>>>>>>> upgrades, etc."
>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. What is
>>>>> it
>>>>>>>>>>>>>> referring to?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. For the actual errors:
>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a warning to the
>>>>>>>>>>>>>> application and schedule the next
>>>>> GetTelemetrySubscriptionsRequest
>>>>>>> to 5
>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like
>>>>>>> PushIntervalMs?
>>>>>>>>>>>>> And
>>>>>>>>>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in
>>>>>>> this
>>>>>>>>>>>>>> case, if the serialization is broken?
>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to
>>>>>>>>>>> confirm,
>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled
>>>>>>> immediately
>>>>>>>>>>>>>> after the PushTelemetry response, correct?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include
>>>>>>> the
>>>>>>>>>>>>>> ClientInstanceId returned in the first response, regardless of
>>>>>>> broker":
>>>>>>>>>>>>>> Will a broker error be returned in case some implementation of
>>>>>>> this KIP
>>>>>>>>>>>>>> violates this accidentally and sends a request with
>>>>>>> ClientInstanceId =
>>>>>>>>>>>>> Null
>>>>>>>>>>>>>> even when it's been obtained already? Or will a new
>>>>>>> ClientInstanceId be
>>>>>>>>>>>>>> returned without an error?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> I would like to start a new discussion thread on KIP-714: Client
>>>>>>>>>>> metrics
>>>>>>>>>>>>>>> and observability.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the scope.
>>>>> The
>>>>>>>>>>>>> overall
>>>>>>>>>>>>>>> mechanism for client metric subscriptions is unchanged, but the
>>>>>>>>>>>>>>> KIP is now based on the existing client metrics, rather than
>>>>>>>>>>> introducing
>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster operators
>>>>>>>>>>>>>>> investigate performance problems experienced by clients without
>>>>>>>>>>>>> requiring
>>>>>>>>>>>>>>> changes to the client application code or configuration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Andrew
