Re: [DISCUSS] KIP-714: Client metrics and observability

Andrew Schofield Tue, 10 Oct 2023 23:33:19 -0700

Matthias,
Yes, I think that’s a sensible way forward and the interface you propose looks 
good. I’ll update the KIP accordingly.


Thanks,
Andrew

> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote:
>
> Andrew,
>
> yes I would like to get this change into KIP-714 right way. Seems to be 
> important, as we don't know if/when a follow-up KIP for Kafka Streams would 
> land.
>
> I was also thinking (and discussed with a few others) how to expose it, and 
> we would propose the following:
>
> We add a new method to `KafkaStreams` class:
>
>    public ClientsInstanceIds clientsInstanceIds(Duration timeout);
>
> The returned object is like below:
>
>  public class ClientsInstanceIds {
>    // we only have a single admin client per KS instance
>    String adminInstanceId();
>
>    // we only have a single global consumer per KS instance (if any)
>    // Optional<> because we might not have global-thread
>    Optional<String> globalConsumerInstanceId();
>
>    // return a <threadKey -> ClientInstanceId> mapping
>    // for the underlying (restore-)consumers/producers
>    Map<String, String> mainConsumerInstanceIds();
>    Map<String, String> restoreConsumerInstanceIds();
>    Map<String, String> producerInstanceIds();
> }
>
> For the `threadKey`, we would use some pattern like this:
>
>  [Stream|StateUpdater]Thread-<threadIdx>
>
>
> Would this work from your POV?
>
>
>
> -Matthias
>
>
> On 10/9/23 2:15 AM, Andrew Schofield wrote:
>> Hi Matthias,
>> Good point. Makes sense to me.
>> Is this something that can also be included in the proposed Kafka Streams 
>> follow-on KIP, or would you prefer that I add it to KIP-714?
>> I have a slight preference for the former to put all of the KS enhancements 
>> into a separate KIP.
>> Thanks,
>> Andrew
>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote:
>>>
>>> Thanks Andrew. SGTM.
>>>
>>> One point you did not address is the idea to add a method to `KafkaStreams` 
>>> similar to the proposed `clientInstanceId()` that will be added to 
>>> consumer/producer/admin clients.
>>>
>>> Without addressing this, Kafka Streams users won't have a way to get the 
>>> assigned `instanceId` of the internally created clients, and thus it would 
>>> be very difficult for them to know which metrics that the broker receives 
>>> belong to a Kafka Streams app. It seems they would only find the 
>>> `instanceIds` in the log4j output if they enable client logging?
>>>
>>> Of course, because there is multiple clients inside Kafka Streams, the 
>>> return type cannot be an single "String", but must be some some complex 
>>> data structure -- we could either add a new class, or return a 
>>> Map<String,String> using a client key that maps to the `instanceId`.
>>>
>>> For example we could use the following key:
>>>
>>>   [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]
>>>
>>> (Of course, only the valid combination.)
>>>
>>> Or maybe even better, we might want to return a `Future` because collection 
>>> all the `instanceId` might be a blocking all on each client? I have already 
>>> a few idea how it could be implemented but I don't think it must be 
>>> discussed on the KIP, as it's an implementation detail.
>>>
>>> Thoughts?
>>>
>>>
>>> -Matthias
>>>
>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
>>>> Hi Matthias,
>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka Streams 
>>>> makes sense. This KIP currently has made a bit
>>>> of an effort to embrace KS, but it’s not enough by a long way.
>>>> I have removed `application.id <http://application.id/>`. This should be 
>>>> done properly in the follow-up KIP. I don’t believe there’s a downside to
>>>> removing it from this KIP.
>>>> I have reworded the statement about temporarily. In practice, the 
>>>> implementation of this KIP that’s going on while the voting
>>>> progresses happens to use delta temporality, but that’s an implementation 
>>>> detail. Supporting clients must support both
>>>> temporalities.
>>>> I thought about exposing the client instance ID as a metric, but 
>>>> non-numeric metrics are not usual practice and tools
>>>> do not universally support them. I don’t think the KIP is improved by 
>>>> adding one now.
>>>> I have also added constants for the various Config classes for 
>>>> ENABLE_METRICS_PUSH_CONFIG, including to
>>>> StreamsConfig. It’s best to be explicit about this.
>>>> Thanks,
>>>> Andrew
>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I did not pay attention to this KIP in the past; seems it was on-hold for 
>>>>> a while.
>>>>>
>>>>> Overall it sounds very useful, and I think we should extend this with a 
>>>>> follow up KIP for Kafka Streams. What is unclear to me at this point is 
>>>>> the statement:
>>>>>
>>>>>> Kafka Streams applications have an application.id configured and this 
>>>>>> identifier should be included as the application_id metrics label.
>>>>>
>>>>> The `application.id` is currently only used as the (main) consumer's 
>>>>> `group.id` (and is part of an auto-generated `client.id` if the user does 
>>>>> not set one).
>>>>>
>>>>> This comment related to:
>>>>>
>>>>>> The following labels should be added by the client as appropriate before 
>>>>>> metrics are pushed.
>>>>>
>>>>> Given that Kafka Streams uses the consumer/producer/admin client as 
>>>>> "black boxes", a client does at this point not know that it's part of a 
>>>>> Kafka Streams application, and thus, it won't be able to attach any such 
>>>>> label to the metrics it sends. (Also producer and admin don't even know 
>>>>> the value of `application.id` -- only the (main) consumer, indirectly via 
>>>>> `group.id`, but also restore and global consumer don't know it, because 
>>>>> they don't have `group.id` set).
>>>>>
>>>>> While I am totally in favor of the proposal, I am wondering how we intent 
>>>>> to implement it in clean way? Or would we do ok to have some internal 
>>>>> client APIs that KS can use to "register" itself with the client?
>>>>>
>>>>>
>>>>>
>>>>>> While clients must support both temporalities, the broker will initially 
>>>>>> only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True
>>>>>
>>>>> Not sure if I can follow. How make the decision about DELTA or CUMULATIVE 
>>>>> metrics? Should the broker side plugin not decide what metrics it what to 
>>>>> receive in which form? So what does "initially" mean -- the broker won't 
>>>>> ship with a default plugin implementation?
>>>>>
>>>>>
>>>>>
>>>>>> The following method is added to the Producer, Consumer, and Admin 
>>>>>> client interfaces:
>>>>>
>>>>> Should we add anything to Kafka Streams to expose the underlying clients' 
>>>>> assigned client-instance-ids programmatically? I am also wondering if 
>>>>> clients should report their assigned client-instance-ids as metrics 
>>>>> itself (for this case, Kafka Streams won't need to do anything, because 
>>>>> we already expose all client metrics).
>>>>>
>>>>> If we add anything programmatic, we need to make it simple, given that 
>>>>> Kafka Streams has many clients per `StreamThread` and may have multiple 
>>>>> threads.
>>>>>
>>>>>
>>>>>
>>>>>> enable.metrics.push
>>>>> It might be worth to add this to `StreamsConfig`, too? It set via 
>>>>> StreamsConfig, we would forward it to all clients automatically.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>> On 9/29/23 5:45 PM, David Jacot wrote:
>>>>>> Hi Andrew,
>>>>>> Thanks for driving this one. I haven't read all the KIP yet but I already
>>>>>> have an initial question. In the Threading section, it is written
>>>>>> "KafkaConsumer: the "background" thread (based on the consumer threading
>>>>>> refactor which is underway)". If I understand this correctly, it means
>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I correct?
>>>>>> Cheers,
>>>>>> David
>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>> Hi Philip,
>>>>>>> No, I do not think it should actively search for a broker that supports
>>>>>>> the new
>>>>>>> RPCs. In general, either all of the brokers or none of the brokers will
>>>>>>> support it.
>>>>>>> In the window, where the cluster is being upgraded or client telemetry 
>>>>>>> is
>>>>>>> being
>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too much 
>>>>>>> effort
>>>>>>> into
>>>>>>> this mixed scenario. As the client finds brokers which support the new
>>>>>>> RPCs,
>>>>>>> it can begin to follow the KIP-714 mechanism.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Andrew -
>>>>>>>>
>>>>>>>> Question on top of your answers: Do you think the client should 
>>>>>>>> actively
>>>>>>>> search for a broker that supports this RPC? As previously mentioned, 
>>>>>>>> the
>>>>>>>> broker uses the leastLoadedNode to find its first connection (am
>>>>>>>> I correct?), and what if that broker doesn't support the metric push?
>>>>>>>>
>>>>>>>> P
>>>>>>>>
>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Kirk,
>>>>>>>>> Thanks for your question. You are correct that the presence or absence
>>>>>>> of
>>>>>>>>> the new RPCs in the
>>>>>>>>> ApiVersionsResponse tells the client whether to request the telemetry
>>>>>>>>> subscriptions and push
>>>>>>>>> metrics.
>>>>>>>>>
>>>>>>>>> This is of course tricky in practice. It would be conceivable, as a
>>>>>>>>> cluster is upgraded to AK 3.7
>>>>>>>>> or as a client metrics receiver plugin is deployed across the cluster,
>>>>>>>>> that a client connects to some
>>>>>>>>> brokers that support the new RPCs and some that do not.
>>>>>>>>>
>>>>>>>>> Here’s my suggestion:
>>>>>>>>> * If a client is not connected to any brokers that support in the new
>>>>>>>>> RPCs, it cannot push metrics.
>>>>>>>>> * If a client is only connected to brokers that support the new RPCs, 
>>>>>>>>> it
>>>>>>>>> will use the new RPCs in
>>>>>>>>> accordance with the KIP.
>>>>>>>>> * If a client is connected to some brokers that support the new RPCs 
>>>>>>>>> and
>>>>>>>>> some that do not, it will
>>>>>>>>> use the new RPCs with the supporting subset of brokers in accordance
>>>>>>> with
>>>>>>>>> the KIP.
>>>>>>>>>
>>>>>>>>> Comments?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Andrew/Jun,
>>>>>>>>>>
>>>>>>>>>> I want to make sure I understand question/comment #119… In the case
>>>>>>>>> where a cluster without a metrics client receiver is later 
>>>>>>>>> reconfigured
>>>>>>> and
>>>>>>>>> restarted to include a metrics client receiver, do we want the client 
>>>>>>>>> to
>>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s 
>>>>>>>>> response
>>>>>>> to
>>>>>>>>> question #119, it sounds like we’re using the presence/absence of the
>>>>>>>>> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
>>>>>>>>> indicator. Do I have that correct?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Kirk
>>>>>>>>>>
>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Jun,
>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify where
>>>>>>>>> necessary.
>>>>>>>>>>>
>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this.
>>>>>>>>>>>
>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
>>>>>>>>> replacement of
>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
>>>>>>>>> think it’s a bit
>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a namespace
>>>>>>> and a
>>>>>>>>>>> multi-word component. If it was “client.id” then “client” would be a
>>>>>>>>> namespace with
>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it was
>>>>>>>>> intentional, but debatable.
>>>>>>>>>>>
>>>>>>>>>>> 112. Thanks. The link target moved. Fixed.
>>>>>>>>>>>
>>>>>>>>>>> 113. Thanks. Fixed.
>>>>>>>>>>>
>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it should use
>>>>>>> the
>>>>>>>>> exact same
>>>>>>>>>>> name. If a standard metric doesn’t make sense for a client, then it
>>>>>>> can
>>>>>>>>> omit that metric.
>>>>>>>>>>>
>>>>>>>>>>> For a required metric, the situation is stronger. All clients must
>>>>>>>>> implement these
>>>>>>>>>>> metrics with these names in order to implement the KIP. But the
>>>>>>>>> required metrics
>>>>>>>>>>> are essentially the number of connections and the request latency,
>>>>>>>>> which do not
>>>>>>>>>>> reference the underlying implementation of the client (which
>>>>>>>>> producer.record.queue.time.max
>>>>>>>>>>> of course does).
>>>>>>>>>>>
>>>>>>>>>>> I suppose someone might build a producer-only client that didn’t 
>>>>>>>>>>> have
>>>>>>>>> consumer metrics.
>>>>>>>>>>> In this case, the consumer metrics would conceptually have the 
>>>>>>>>>>> value 0
>>>>>>>>> and would not
>>>>>>>>>>> need to be sent to the broker.
>>>>>>>>>>>
>>>>>>>>>>> 114.2. If a client does not implement some metrics, they will not be
>>>>>>>>> available for
>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to combine
>>>>>>>>> metrics from lots
>>>>>>>>>>> different clients less complete.
>>>>>>>>>>>
>>>>>>>>>>> 115. I think it was probably a mistake to be so specific about
>>>>>>>>> threading in this KIP.
>>>>>>>>>>> When the consumer threading refactor is complete, of course, it 
>>>>>>>>>>> would
>>>>>>>>> do the appropriate
>>>>>>>>>>> equivalent. I’ve added a clarification and massively simplified this
>>>>>>>>> section.
>>>>>>>>>>>
>>>>>>>>>>> 116. I removed “client.terminating”.
>>>>>>>>>>>
>>>>>>>>>>> 117. Yes. Horrid. Fixed.
>>>>>>>>>>>
>>>>>>>>>>> 118. The Terminating flag just indicates that this is the final
>>>>>>>>> PushTelemetryRequest
>>>>>>>>>>> from this client. Any subsequent request will be rejected. I think
>>>>>>> this
>>>>>>>>> flag should remain.
>>>>>>>>>>>
>>>>>>>>>>> 119. Good catch. This was actually contradicting another part of the
>>>>>>>>> KIP. The current behaviour
>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client metrics
>>>>>>>>> receiver plugin, the new RPCs
>>>>>>>>>>> in this KIP are “turned off” and not reported in 
>>>>>>>>>>> ApiVersionsResponse.
>>>>>>>>> The client will not
>>>>>>>>>>> attempt to push metrics.
>>>>>>>>>>>
>>>>>>>>>>> 120. The error handling table lists the error codes for
>>>>>>>>> PushTelemetryResponse. I’ve added one
>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any
>>>>>>>>> error codes, since the
>>>>>>>>>>> situation in which the client telemetry is not supported is handled 
>>>>>>>>>>> by
>>>>>>>>> the RPCs not being offered
>>>>>>>>>>> by the broker.
>>>>>>>>>>>
>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific about
>>>>>>>>> threading. Removed.
>>>>>>>>>>>
>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation should be
>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
>>>>>>>>>>> resource.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the detailed review.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 110. Another potential motivation is the multiple clients support.
>>>>>>>>> Some of
>>>>>>>>>>>> the places may not have good monitoring support for non-java 
>>>>>>>>>>>> clients.
>>>>>>>>>>>>
>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name
>>>>>>> and
>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the inconsistency?
>>>>>>>>>>>>
>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link.
>>>>>>>>>>>>
>>>>>>>>>>>> 113. "Defining standard and required metrics makes the monitoring 
>>>>>>>>>>>> and
>>>>>>>>>>>> troubleshooting of clients from various client types ": Incomplete
>>>>>>>>> sentence.
>>>>>>>>>>>>
>>>>>>>>>>>> 114. standard/required metrics
>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with the 
>>>>>>>>>>>> exact
>>>>>>>>> same
>>>>>>>>>>>> names?
>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing from a
>>>>>>> client?
>>>>>>>>>>>>
>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application 
>>>>>>>>>>>> threads":
>>>>>>> We
>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading model (
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
>>>>>>>>> ).
>>>>>>>>>>>> Once this is done, PRC requests will only be made from the 
>>>>>>>>>>>> background
>>>>>>>>>>>> thread. Should this KIP follow the new model only?
>>>>>>>>>>>>
>>>>>>>>>>>> 116. 'The metrics should contain the reason for the client
>>>>>>> termination
>>>>>>>>> by
>>>>>>>>>>>> including the client.terminating metric with the label “reason” 
>>>>>>>>>>>> ...'.
>>>>>>>>> Hmm,
>>>>>>>>>>>> are we introducing a new metric client.terminating? If so, that 
>>>>>>>>>>>> needs
>>>>>>>>> to be
>>>>>>>>>>>> explicitly listed.
>>>>>>>>>>>>
>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional metrics on 
>>>>>>>>>>>> top
>>>>>>>>> of
>>>>>>>>>>>> this the generic metrics receiver in the broker will not add these
>>>>>>>>> labels
>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't read well.
>>>>>>>>>>>>
>>>>>>>>>>>> 118. "it is possible for the client to send at most one accepted
>>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks in": If
>>>>>>> we
>>>>>>>>> do
>>>>>>>>>>>> this, do we still need the Terminating flag in
>>>>>>> PushTelemetryRequestV0?
>>>>>>>>>>>>
>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin configured on 
>>>>>>>>>>>> the
>>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest with
>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The client
>>>>>>> should
>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the 
>>>>>>>>>>>> PushIntervalMs
>>>>>>>>> has
>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or disabled
>>>>>>>>> without
>>>>>>>>>>>> having to restart the broker or reset the client connection."
>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined by no
>>>>>>> metric
>>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right? In that
>>>>>>>>> case,
>>>>>>>>>>>> it would be useful to avoid the clients sending
>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve the 
>>>>>>>>>>>> current
>>>>>>>>>>>> behavior.
>>>>>>>>>>>>
>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and 
>>>>>>>>>>>> PushTelemetryRequestV0:
>>>>>>>>> Could
>>>>>>>>>>>> we list error codes for each?
>>>>>>>>>>>>
>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method 
>>>>>>>>>>>> may
>>>>>>>>> be
>>>>>>>>>>>> called from the request handling thread": Where else can this 
>>>>>>>>>>>> method
>>>>>>> be
>>>>>>>>>>>> called?
>>>>>>>>>>>>
>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing 
>>>>>>>>>>>> the
>>>>>>>>> ACL?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Jun
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>> Thanks for your question.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely to be
>>>>>>>>> caused by a
>>>>>>>>>>>>> problem in the serialization in the client. I have changed the
>>>>>>> client
>>>>>>>>>>>>> action in this case
>>>>>>>>>>>>> to “Log an error and stop pushing metrics”.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have updated the KIP text accordingly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
>>>>>>>>> <milut...@confluent.io.INVALID>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>> Thanks for the clarifications.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> About 2b:
>>>>>>>>>>>>>> In case a client has a bug while serializing, it might be 
>>>>>>>>>>>>>> difficult
>>>>>>>>> for
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> client to recover from that without code changes. In that, it 
>>>>>>>>>>>>>> might
>>>>>>>>> be
>>>>>>>>>>>>> good
>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the error 
>>>>>>>>>>>>>> as
>>>>>>>>> fatal
>>>>>>>>>>>>>> for the client (only fatal in terms of sending the metrics, the
>>>>>>>>> client
>>>>>>>>>>>>> can
>>>>>>>>>>>>>> keep functioning otherwise). What do you think?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Milind
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>>>> Thanks for your questions about the KIP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical versions of 
>>>>>>>>>>>>>>> the
>>>>>>>>> KIP.
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> think this is
>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push interval. 
>>>>>>>>>>>>>>> 30
>>>>>>>>>>>>> minutes
>>>>>>>>>>>>>>> is a mystery
>>>>>>>>>>>>>>> to me. I’ve updated the KIP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) I think there are two situations in which INVALID_RECORD 
>>>>>>>>>>>>>>> might
>>>>>>>>> occur.
>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that the
>>>>>>> broker
>>>>>>>>> does
>>>>>>>>>>>>>>> not support.
>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but there’s
>>>>>>>>> only
>>>>>>>>>>>>> one
>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>> to start with. Until we have multiple content-types, this seems
>>>>>>> out
>>>>>>>>> of
>>>>>>>>>>>>>>> scope. I think a
>>>>>>>>>>>>>>> future KIP would add another error code for this.
>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the metrics
>>>>>>>>> payload
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> malformed.
>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push on the 
>>>>>>>>>>>>>>> push
>>>>>>>>>>>>>>> interval seems
>>>>>>>>>>>>>>> appropriate.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an
>>>>>>>>> immediate
>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or
>>>>>>>>> perhaps
>>>>>>>>>>>>>>> a situation in which a broker sends a compression type in a
>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse
>>>>>>>>>>>>>>> which is subsequently not supported when its used with a
>>>>>>>>>>>>>>> PushTelemetryRequest.
>>>>>>>>>>>>>>> We do want the client to have the opportunity to get an 
>>>>>>>>>>>>>>> up-to-date
>>>>>>>>> list
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>> compression types. I think an immediate
>>>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>>>>>>> is appropriate.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null
>>>>>>>>>>>>>>> ClientInstanceId, the
>>>>>>>>>>>>>>> receiving broker may not already know the client's existing
>>>>>>>>>>>>>>> ClientInstanceId. If the
>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it simply
>>>>>>>>> responds
>>>>>>>>>>>>>>> the existing
>>>>>>>>>>>>>>> value back to the client. If it does not know the existing
>>>>>>>>>>>>>>> ClientInstanceId, it will create
>>>>>>>>>>>>>>> a new client instance ID and respond with that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I will update the KIP with these clarifications.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
>>>>>>>>> <milut...@confluent.io.INVALID
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling" section.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to
>>>>>>> eventually
>>>>>>>>>>>>>>> trigger
>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the cluster
>>>>>>> metrics
>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by operator error,
>>>>>>>>> rolling
>>>>>>>>>>>>>>>> upgrades, etc."
>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. What is
>>>>>>> it
>>>>>>>>>>>>>>>> referring to?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. For the actual errors:
>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a warning to 
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> application and schedule the next
>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>> to 5
>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like
>>>>>>>>> PushIntervalMs?
>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest 
>>>>>>>>>>>>>>>> in
>>>>>>>>> this
>>>>>>>>>>>>>>>> case, if the serialization is broken?
>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just 
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>> confirm,
>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled
>>>>>>>>> immediately
>>>>>>>>>>>>>>>> after the PushTelemetry response, correct?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must 
>>>>>>>>>>>>>>>> include
>>>>>>>>> the
>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response, regardless of
>>>>>>>>> broker":
>>>>>>>>>>>>>>>> Will a broker error be returned in case some implementation of
>>>>>>>>> this KIP
>>>>>>>>>>>>>>>> violates this accidentally and sends a request with
>>>>>>>>> ClientInstanceId =
>>>>>>>>>>>>>>> Null
>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new
>>>>>>>>> ClientInstanceId be
>>>>>>>>>>>>>>>> returned without an error?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on KIP-714: 
>>>>>>>>>>>>>>>>> Client
>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>> and observability.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the scope.
>>>>>>> The
>>>>>>>>>>>>>>> overall
>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is unchanged, but 
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics, rather than
>>>>>>>>>>>>> introducing
>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster operators
>>>>>>>>>>>>>>>>> investigate performance problems experienced by clients 
>>>>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>> requiring
>>>>>>>>>>>>>>>>> changes to the client application code or configuration.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Andrew

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to