Re: [DISCUSS] KIP-714: Client metrics and observability

Jun Rao Wed, 11 Oct 2023 17:07:41 -0700

Hi, Andrew,

Thanks for the updated KIP. Just a few more minor comments.


130. KafkaStreams.clientsInstanceIds(Duration timeout): Does it wait for all
consumer/producer/adminClient instances to be initialized? Are all those
instances created during KafkaStreams initialization?

131. Why does globalConsumerInstanceId() return Optional<String> while
other consumer instances don't return Optional?
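
For context on 131, here is a minimal sketch of the shape in question. It assumes the accessor names from the ClientsInstanceIds proposal quoted further down; the SimpleClientsInstanceIds holder, the describe helper, and the sample values are hypothetical and not part of the KIP.

```java
import java.util.Map;
import java.util.Optional;

// Accessors as listed in the proposal quoted below; declared as an interface
// here purely for illustration.
interface ClientsInstanceIds {
    String adminInstanceId();
    Optional<String> globalConsumerInstanceId();
    Map<String, String> mainConsumerInstanceIds();
}

// Hypothetical immutable holder, not the Kafka Streams implementation.
final class SimpleClientsInstanceIds implements ClientsInstanceIds {
    private final String admin;
    private final String globalConsumer; // null when there is no global thread
    private final Map<String, String> mainConsumers;

    SimpleClientsInstanceIds(String admin, String globalConsumer,
                             Map<String, String> mainConsumers) {
        this.admin = admin;
        this.globalConsumer = globalConsumer;
        this.mainConsumers = Map.copyOf(mainConsumers);
    }

    @Override public String adminInstanceId() { return admin; }

    @Override public Optional<String> globalConsumerInstanceId() {
        return Optional.ofNullable(globalConsumer);
    }

    @Override public Map<String, String> mainConsumerInstanceIds() { return mainConsumers; }
}

final class InstanceIdReport {
    // The caller has to handle a missing global consumer explicitly, whereas a
    // missing per-thread client is simply an absent map entry.
    static String describe(ClientsInstanceIds ids) {
        String global = ids.globalConsumerInstanceId().orElse("none");
        return "admin=" + ids.adminInstanceId()
            + " global=" + global
            + " mainConsumers=" + ids.mainConsumerInstanceIds().size();
    }
}
```

With this shape, the global consumer is the only accessor where "no such client" needs its own representation; the per-thread maps can express absence by omitting the key.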

132. ClientMetricsSubscriptionRequestCount: Do we need this since we have a
set of generic metrics
(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
report the request rate for every request type?
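
To illustrate 132, the following sketch uses the standard javax.management.ObjectName pattern matching to check whether an MBean name falls under the generic pattern above. The RequestMetricNames class and the request name GetTelemetrySubscriptions in the usage note are assumptions for illustration; real broker MBean names may carry additional key properties, in which case the pattern would need a trailing ",*".

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

final class RequestMetricNames {
    // MBean name pattern quoted in comment 132; "request=*" matches any request type.
    static final String REQUESTS_PER_SEC_PATTERN =
        "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*";

    // Returns true if the given MBean name is matched by the generic pattern.
    static boolean matches(String mbeanName) {
        try {
            ObjectName pattern = new ObjectName(REQUESTS_PER_SEC_PATTERN);
            return pattern.apply(new ObjectName(mbeanName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Not a valid MBean name: " + mbeanName, e);
        }
    }
}
```

For example, RequestMetricNames.matches("kafka.network:type=RequestMetrics,name=RequestsPerSec,request=GetTelemetrySubscriptions") would be matched by the pattern, while a name with name=TotalTimeMs would not.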

Thanks,

Jun

On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org> wrote:

> Thanks!
>
> On 10/10/23 11:31 PM, Andrew Schofield wrote:
> > Matthias,
> > Yes, I think that’s a sensible way forward and the interface you propose
> looks good. I’ll update the KIP accordingly.
> >
> > Thanks,
> > Andrew
> >
> >> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote:
> >>
> >> Andrew,
> >>
> >> yes I would like to get this change into KIP-714 right away. Seems to be
> important, as we don't know if/when a follow-up KIP for Kafka Streams would
> land.
> >>
> >> I was also thinking (and discussed with a few others) how to expose it,
> and we would propose the following:
> >>
> >> We add a new method to `KafkaStreams` class:
> >>
> >>     public ClientsInstanceIds clientsInstanceIds(Duration timeout);
> >>
> >> The returned object is like below:
> >>
> >>   public class ClientsInstanceIds {
> >>     // we only have a single admin client per KS instance
> >>     String adminInstanceId();
> >>
> >>     // we only have a single global consumer per KS instance (if any)
> >>     // Optional<> because we might not have global-thread
> >>     Optional<String> globalConsumerInstanceId();
> >>
> >>     // return a <threadKey -> ClientInstanceId> mapping
> >>     // for the underlying (restore-)consumers/producers
> >>     Map<String, String> mainConsumerInstanceIds();
> >>     Map<String, String> restoreConsumerInstanceIds();
> >>     Map<String, String> producerInstanceIds();
> >> }
> >>
> >> For the `threadKey`, we would use some pattern like this:
> >>
> >>   [Stream|StateUpdater]Thread-<threadIdx>
> >>
> >>
> >> Would this work from your POV?
> >>
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 10/9/23 2:15 AM, Andrew Schofield wrote:
> >>> Hi Matthias,
> >>> Good point. Makes sense to me.
> >>> Is this something that can also be included in the proposed Kafka
> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
> >>> I have a slight preference for the former to put all of the KS
> enhancements into a separate KIP.
> >>> Thanks,
> >>> Andrew
> >>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote:
> >>>>
> >>>> Thanks Andrew. SGTM.
> >>>>
> >>>> One point you did not address is the idea to add a method to
> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be
> added to consumer/producer/admin clients.
> >>>>
> >>>> Without addressing this, Kafka Streams users won't have a way to get
> the assigned `instanceId` of the internally created clients, and thus it
> would be very difficult for them to know which metrics that the broker
> receives belong to a Kafka Streams app. It seems they would only find the
> `instanceIds` in the log4j output if they enable client logging?
> >>>>
> >>>> Of course, because there are multiple clients inside Kafka Streams,
> the return type cannot be a single "String", but must be some more complex
> data structure -- we could either add a new class, or return a
> Map<String,String> using a client key that maps to the `instanceId`.
> >>>>
> >>>> For example we could use the following key:
> >>>>
> >>>>    [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]
> >>>>
> >>>> (Of course, only the valid combination.)
> >>>>
> >>>> Or maybe even better, we might want to return a `Future` because
> collecting all the `instanceId`s might be a blocking call on each client? I
> already have a few ideas for how it could be implemented, but I don't think it
> must be discussed on the KIP, as it's an implementation detail.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
> >>>>> Hi Matthias,
> >>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka
> Streams makes sense. This KIP currently has made a bit
> >>>>> of an effort to embrace KS, but it’s not enough by a long way.
> >>>>> I have removed `application.id`. This
> should be done properly in the follow-up KIP. I don’t believe there’s a
> downside to
> >>>>> removing it from this KIP.
> >>>>> I have reworded the statement about temporality. In practice, the
> implementation of this KIP that’s going on while the voting
> >>>>> progresses happens to use delta temporality, but that’s an
> implementation detail. Supporting clients must support both
> >>>>> temporalities.
> >>>>> I thought about exposing the client instance ID as a metric, but
> non-numeric metrics are not usual practice and tools
> >>>>> do not universally support them. I don’t think the KIP is improved
> by adding one now.
> >>>>> I have also added constants for the various Config classes for
> ENABLE_METRICS_PUSH_CONFIG, including to
> >>>>> StreamsConfig. It’s best to be explicit about this.
> >>>>> Thanks,
> >>>>> Andrew
> >>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I did not pay attention to this KIP in the past; seems it was
> on-hold for a while.
> >>>>>>
> >>>>>> Overall it sounds very useful, and I think we should extend this
> with a follow up KIP for Kafka Streams. What is unclear to me at this point
> is the statement:
> >>>>>>
> >>>>>>> Kafka Streams applications have an application.id configured and
> this identifier should be included as the application_id metrics label.
> >>>>>>
> >>>>>> The `application.id` is currently only used as the (main)
> consumer's `group.id` (and is part of an auto-generated `client.id` if
> the user does not set one).
> >>>>>>
> >>>>>> This comment related to:
> >>>>>>
> >>>>>>> The following labels should be added by the client as appropriate
> before metrics are pushed.
> >>>>>>
> >>>>>> Given that Kafka Streams uses the consumer/producer/admin client as
> "black boxes", a client does at this point not know that it's part of a
> Kafka Streams application, and thus, it won't be able to attach any such
> label to the metrics it sends. (Also producer and admin don't even know the
> value of `application.id` -- only the (main) consumer, indirectly via `
> group.id`, but also restore and global consumer don't know it, because
> they don't have `group.id` set).
> >>>>>>
> >>>>>> While I am totally in favor of the proposal, I am wondering how we
> intend to implement it in a clean way? Or would it be OK to have some
> internal client APIs that KS can use to "register" itself with the client?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> While clients must support both temporalities, the broker will
> initially only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True
> >>>>>>
> >>>>>> Not sure if I can follow. How do we make the decision about DELTA or
> CUMULATIVE metrics? Should the broker-side plugin not decide what metrics
> it wants to receive in which form? So what does "initially" mean -- the
> broker won't ship with a default plugin implementation?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> The following method is added to the Producer, Consumer, and Admin
> client interfaces:
> >>>>>>
> >>>>>> Should we add anything to Kafka Streams to expose the underlying
> clients' assigned client-instance-ids programmatically? I am also wondering
> if clients should report their assigned client-instance-ids as a metric
> themselves (in this case, Kafka Streams won't need to do anything, because we
> already expose all client metrics).
> >>>>>>
> >>>>>> If we add anything programmatic, we need to make it simple, given
> that Kafka Streams has many clients per `StreamThread` and may have
> multiple threads.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> enable.metrics.push
> >>>>>> It might be worth adding this to `StreamsConfig`, too? If set via
> StreamsConfig, we would forward it to all clients automatically.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>>
> >>>>>> On 9/29/23 5:45 PM, David Jacot wrote:
> >>>>>>> Hi Andrew,
> >>>>>>> Thanks for driving this one. I haven't read all the KIP yet but I
> already
> >>>>>>> have an initial question. In the Threading section, it is written
> >>>>>>> "KafkaConsumer: the "background" thread (based on the consumer
> threading
> >>>>>>> refactor which is underway)". If I understand this correctly, it
> means
> >>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I
> correct?
> >>>>>>> Cheers,
> >>>>>>> David
> >>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
> >>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>> Hi Philip,
> >>>>>>>> No, I do not think it should actively search for a broker that
> supports
> >>>>>>>> the new
> >>>>>>>> RPCs. In general, either all of the brokers or none of the
> brokers will
> >>>>>>>> support it.
> >>>>>>>> In the window, where the cluster is being upgraded or client
> telemetry is
> >>>>>>>> being
> >>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too
> much effort
> >>>>>>>> into
> >>>>>>>> this mixed scenario. As the client finds brokers which support
> the new
> >>>>>>>> RPCs,
> >>>>>>>> it can begin to follow the KIP-714 mechanism.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Andrew
> >>>>>>>>
> >>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com>
> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Andrew -
> >>>>>>>>>
> >>>>>>>>> Question on top of your answers: Do you think the client should
> actively
> >>>>>>>>> search for a broker that supports this RPC? As previously
> mentioned, the
> >>>>>>>>> client uses the leastLoadedNode to find its first connection (am
> >>>>>>>>> I correct?), and what if that broker doesn't support the metric
> push?
> >>>>>>>>>
> >>>>>>>>> P
> >>>>>>>>>
> >>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
> >>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Kirk,
> >>>>>>>>>> Thanks for your question. You are correct that the presence or
> absence
> >>>>>>>> of
> >>>>>>>>>> the new RPCs in the
> >>>>>>>>>> ApiVersionsResponse tells the client whether to request the
> telemetry
> >>>>>>>>>> subscriptions and push
> >>>>>>>>>> metrics.
> >>>>>>>>>>
> >>>>>>>>>> This is of course tricky in practice. It would be conceivable,
> as a
> >>>>>>>>>> cluster is upgraded to AK 3.7
> >>>>>>>>>> or as a client metrics receiver plugin is deployed across the
> cluster,
> >>>>>>>>>> that a client connects to some
> >>>>>>>>>> brokers that support the new RPCs and some that do not.
> >>>>>>>>>>
> >>>>>>>>>> Here’s my suggestion:
> >>>>>>>>>> * If a client is not connected to any brokers that support
> the new
> >>>>>>>>>> RPCs, it cannot push metrics.
> >>>>>>>>>> * If a client is only connected to brokers that support the new
> RPCs, it
> >>>>>>>>>> will use the new RPCs in
> >>>>>>>>>> accordance with the KIP.
> >>>>>>>>>> * If a client is connected to some brokers that support the new
> RPCs and
> >>>>>>>>>> some that do not, it will
> >>>>>>>>>> use the new RPCs with the supporting subset of brokers in
> accordance
> >>>>>>>> with
> >>>>>>>>>> the KIP.
> >>>>>>>>>>
> >>>>>>>>>> Comments?
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Andrew
> >>>>>>>>>>
> >>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Andrew/Jun,
> >>>>>>>>>>>
> >>>>>>>>>>> I want to make sure I understand question/comment #119… In the
> case
> >>>>>>>>>> where a cluster without a metrics client receiver is later
> reconfigured
> >>>>>>>> and
> >>>>>>>>>> restarted to include a metrics client receiver, do we want the
> client to
> >>>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s
> response
> >>>>>>>> to
> >>>>>>>>>> question #119, it sounds like we’re using the presence/absence
> of the
> >>>>>>>>>> relevant RPCs in ApiVersionsResponse as the
> to-push-or-not-to-push
> >>>>>>>>>> indicator. Do I have that correct?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Kirk
> >>>>>>>>>>>
> >>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
> >>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Jun,
> >>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify
> where
> >>>>>>>>>> necessary.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
> >>>>>>>>>> replacement of
> >>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP
> guidelines. I
> >>>>>>>>>> think it’s a bit
> >>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a
> namespace
> >>>>>>>> and a
> >>>>>>>>>>>> multi-word component. If it was “client.id” then “client”
> would be a
> >>>>>>>>>> namespace with
> >>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it
> was
> >>>>>>>>>> intentional, but debatable.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 112. Thanks. The link target moved. Fixed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 113. Thanks. Fixed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it
> should use
> >>>>>>>> the
> >>>>>>>>>> exact same
> >>>>>>>>>>>> name. If a standard metric doesn’t make sense for a client,
> then it
> >>>>>>>> can
> >>>>>>>>>> omit that metric.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For a required metric, the situation is stronger. All clients
> must
> >>>>>>>>>> implement these
> >>>>>>>>>>>> metrics with these names in order to implement the KIP. But
> the
> >>>>>>>>>> required metrics
> >>>>>>>>>>>> are essentially the number of connections and the request
> latency,
> >>>>>>>>>> which do not
> >>>>>>>>>>>> reference the underlying implementation of the client (which
> >>>>>>>>>> producer.record.queue.time.max
> >>>>>>>>>>>> of course does).
> >>>>>>>>>>>>
> >>>>>>>>>>>> I suppose someone might build a producer-only client that
> didn’t have
> >>>>>>>>>> consumer metrics.
> >>>>>>>>>>>> In this case, the consumer metrics would conceptually have
> the value 0
> >>>>>>>>>> and would not
> >>>>>>>>>>>> need to be sent to the broker.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 114.2. If a client does not implement some metrics, they will
> not be
> >>>>>>>>>> available for
> >>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to
> combine
> >>>>>>>>>> metrics from lots
> >>>>>>>>>>>> of different clients less complete.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 115. I think it was probably a mistake to be so specific about
> >>>>>>>>>> threading in this KIP.
> >>>>>>>>>>>> When the consumer threading refactor is complete, of course,
> it would
> >>>>>>>>>> do the appropriate
> >>>>>>>>>>>> equivalent. I’ve added a clarification and massively
> simplified this
> >>>>>>>>>> section.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 116. I removed “client.terminating”.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 117. Yes. Horrid. Fixed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 118. The Terminating flag just indicates that this is the
> final
> >>>>>>>>>> PushTelemetryRequest
> >>>>>>>>>>>> from this client. Any subsequent request will be rejected. I
> think
> >>>>>>>> this
> >>>>>>>>>> flag should remain.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 119. Good catch. This was actually contradicting another part
> of the
> >>>>>>>>>> KIP. The current behaviour
> >>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client
> metrics
> >>>>>>>>>> receiver plugin, the new RPCs
> >>>>>>>>>>>> in this KIP are “turned off” and not reported in
> ApiVersionsResponse.
> >>>>>>>>>> The client will not
> >>>>>>>>>>>> attempt to push metrics.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 120. The error handling table lists the error codes for
> >>>>>>>>>> PushTelemetryResponse. I’ve added one
> >>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t
> have any
> >>>>>>>>>> error codes, since the
> >>>>>>>>>>>> situation in which the client telemetry is not supported is
> handled by
> >>>>>>>>>> the RPCs not being offered
> >>>>>>>>>>>> by the broker.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific
> about
> >>>>>>>>>> threading. Removed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation
> should be
> >>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation
> should be
> >>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the
> CLUSTER
> >>>>>>>>>>>> resource.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the detailed review.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Andrew
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 110. Another potential motivation is the multiple clients
> support.
> >>>>>>>>>> Some of
> >>>>>>>>>>>>> the places may not have good monitoring support for non-java
> clients.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for
> metric name
> >>>>>>>> and
> >>>>>>>>>>>>> replace '-' with '_' for attributes. Why the
> inconsistency?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 113. "Defining standard and required metrics makes the
> monitoring and
> >>>>>>>>>>>>> troubleshooting of clients from various client types ":
> Incomplete
> >>>>>>>>>> sentence.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 114. standard/required metrics
> >>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with
> the exact
> >>>>>>>>>> same
> >>>>>>>>>>>>> names?
> >>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing from
> a
> >>>>>>>> client?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application
> threads":
> >>>>>>>> We
> >>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading
> model (
> >>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
> >>>>>>>>>> ).
> >>>>>>>>>>>>> Once this is done, RPC requests will only be made from the
> background
> >>>>>>>>>>>>> thread. Should this KIP follow the new model only?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 116. 'The metrics should contain the reason for the client
> >>>>>>>> termination
> >>>>>>>>>> by
> >>>>>>>>>>>>> including the client.terminating metric with the label
> “reason” ...'.
> >>>>>>>>>> Hmm,
> >>>>>>>>>>>>> are we introducing a new metric client.terminating? If so,
> that needs
> >>>>>>>>>> to be
> >>>>>>>>>>>>> explicitly listed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional
> metrics on top
> >>>>>>>>>> of
> >>>>>>>>>>>>> this the generic metrics receiver in the broker will not add
> these
> >>>>>>>>>> labels
> >>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't read
> well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 118. "it is possible for the client to send at most one
> accepted
> >>>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks
> in": If
> >>>>>>>> we
> >>>>>>>>>> do
> >>>>>>>>>>>>> this, do we still need the Terminating flag in
> >>>>>>>> PushTelemetryRequestV0?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin
> configured on the
> >>>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest
> with
> >>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The
> client
> >>>>>>>> should
> >>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the
> PushIntervalMs
> >>>>>>>>>> has
> >>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or
> disabled
> >>>>>>>>>> without
> >>>>>>>>>>>>> having to restart the broker or reset the client connection."
> >>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined by
> no
> >>>>>>>> metric
> >>>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right?
> In that
> >>>>>>>>>> case,
> >>>>>>>>>>>>> it would be useful to avoid the clients sending
> >>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve
> the current
> >>>>>>>>>>>>> behavior.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and
> PushTelemetryRequestV0:
> >>>>>>>>>> Could
> >>>>>>>>>>>>> we list error codes for each?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This
> method may
> >>>>>>>>>> be
> >>>>>>>>>>>>> called from the request handling thread": Where else can
> this method
> >>>>>>>> be
> >>>>>>>>>>>>> called?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we
> changing the
> >>>>>>>>>> ACL?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jun
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
> >>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Milind,
> >>>>>>>>>>>>>> Thanks for your question.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely
> to be
> >>>>>>>>>> caused by a
> >>>>>>>>>>>>>> problem in the serialization in the client. I have changed
> the
> >>>>>>>> client
> >>>>>>>>>>>>>> action in this case
> >>>>>>>>>>>>>> to “Log an error and stop pushing metrics”.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I have updated the KIP text accordingly.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
> >>>>>>>>>> <milut...@confluent.io.INVALID>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Andrew,
> >>>>>>>>>>>>>>> Thanks for the clarifications.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> About 2b:
> >>>>>>>>>>>>>>> In case a client has a bug while serializing, it might be
> difficult
> >>>>>>>>>> for
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> client to recover from that without code changes. In that case,
> it might
> >>>>>>>>>> be
> >>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the
> error as
> >>>>>>>>>> fatal
> >>>>>>>>>>>>>>> for the client (only fatal in terms of sending the
> metrics, the
> >>>>>>>>>> client
> >>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>> keep functioning otherwise). What do you think?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>> Milind
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> >>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Milind,
> >>>>>>>>>>>>>>>> Thanks for your questions about the KIP.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical
> versions of the
> >>>>>>>>>> KIP.
> >>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>> think this is
> >>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push
> interval. 30
> >>>>>>>>>>>>>> minutes
> >>>>>>>>>>>>>>>> is a mystery
> >>>>>>>>>>>>>>>> to me. I’ve updated the KIP.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2) I think there are two situations in which
> INVALID_RECORD might
> >>>>>>>>>> occur.
> >>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that
> the
> >>>>>>>> broker
> >>>>>>>>>> does
> >>>>>>>>>>>>>>>> not support.
> >>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but
> there’s
> >>>>>>>>>> only
> >>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>> supported
> >>>>>>>>>>>>>>>> to start with. Until we have multiple content-types, this
> seems
> >>>>>>>> out
> >>>>>>>>>> of
> >>>>>>>>>>>>>>>> scope. I think a
> >>>>>>>>>>>>>>>> future KIP would add another error code for this.
> >>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the
> metrics
> >>>>>>>>>> payload
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> malformed.
> >>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push on
> the push
> >>>>>>>>>>>>>>>> interval seems
> >>>>>>>>>>>>>>>> appropriate.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making
> an
> >>>>>>>>>> immediate
> >>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client
> bug or
> >>>>>>>>>> perhaps
> >>>>>>>>>>>>>>>> a situation in which a broker sends a compression type in
> a
> >>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse
> >>>>>>>>>>>>>>>> which is subsequently not supported when it's used with a
> >>>>>>>>>>>>>>>> PushTelemetryRequest.
> >>>>>>>>>>>>>>>> We do want the client to have the opportunity to get an
> up-to-date
> >>>>>>>>>> list
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>> supported
> >>>>>>>>>>>>>>>> compression types. I think an immediate
> >>>>>>>>>> GetTelemetrySubscriptionsRequest
> >>>>>>>>>>>>>>>> is appropriate.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null
> >>>>>>>>>>>>>>>> ClientInstanceId, the
> >>>>>>>>>>>>>>>> receiving broker may not already know the client's
> existing
> >>>>>>>>>>>>>>>> ClientInstanceId. If the
> >>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it
> simply
> >>>>>>>>>> returns
> >>>>>>>>>>>>>>>> the existing
> >>>>>>>>>>>>>>>> value back to the client. If it does not know the existing
> >>>>>>>>>>>>>>>> ClientInstanceId, it will create
> >>>>>>>>>>>>>>>> a new client instance ID and respond with that.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I will update the KIP with these clarifications.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
> >>>>>>>>>> <milut...@confluent.io.INVALID
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling"
> section.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to
> >>>>>>>> eventually
> >>>>>>>>>>>>>>>> trigger
> >>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the
> cluster
> >>>>>>>> metrics
> >>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by operator
> error,
> >>>>>>>>>> rolling
> >>>>>>>>>>>>>>>>> upgrades, etc."
> >>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else.
> What is
> >>>>>>>> it
> >>>>>>>>>>>>>>>>> referring to?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2. For the actual errors:
> >>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a
> warning to the
> >>>>>>>>>>>>>>>>> application and schedule the next
> >>>>>>>> GetTelemetrySubscriptionsRequest
> >>>>>>>>>> to 5
> >>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like
> >>>>>>>>>> PushIntervalMs?
> >>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>> also, why are we scheduling a
> GetTelemetrySubscriptionsRequest in
> >>>>>>>>>> this
> >>>>>>>>>>>>>>>>> case, if the serialization is broken?
> >>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE :
> just to
> >>>>>>>>>>>>>> confirm,
> >>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be
> scheduled
> >>>>>>>>>> immediately
> >>>>>>>>>>>>>>>>> after the PushTelemetry response, correct?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests
> must include
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response,
> regardless of
> >>>>>>>>>> broker":
> >>>>>>>>>>>>>>>>> Will a broker error be returned in case some
> implementation of
> >>>>>>>>>> this KIP
> >>>>>>>>>>>>>>>>> violates this accidentally and sends a request with
> >>>>>>>>>> ClientInstanceId =
> >>>>>>>>>>>>>>>> Null
> >>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new
> >>>>>>>>>> ClientInstanceId be
> >>>>>>>>>>>>>>>>> returned without an error?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> >>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on
> KIP-714: Client
> >>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>> and observability.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the
> scope.
> >>>>>>>> The
> >>>>>>>>>>>>>>>> overall
> >>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is unchanged,
> but the
> >>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics, rather
> than
> >>>>>>>>>>>>>> introducing
> >>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster
> operators
> >>>>>>>>>>>>>>>>>> investigate performance problems experienced by clients
> without
> >>>>>>>>>>>>>>>> requiring
> >>>>>>>>>>>>>>>>>> changes to the client application code or configuration.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> Andrew
> >
> >
>
