Re: [DISCUSS] KIP-714: Client metrics and observability

Sophie Blee-Goldman Thu, 12 Oct 2023 00:00:33 -0700

Regarding the naming, I personally think `clientInstanceId` makes sense for
the plain clients
 -- especially if we might later introduce the notion of an
`applicationInstanceId`.


I'm not a huge fan of `clientsInstanceIds` for the Kafka Streams API,
though, can we use
`clientInstanceIds` instead? (The difference being the placement of the
plural 's')
I would similarly rename the class to just ClientInstanceIds

we can also not have a timeout-less overload,  because `KafkaStreams` does
> not have a `default.api.timeout.ms` config either

With respect to the timeout for the Kafka Streams API, I'm a bit confused
by the
doubletriple-negative of Matthias' comment here, but I was thinking about
this
earlier and this was my take: with the current proposal, we would allow
users to pass
in an absolute timeout as a parameter that would apply to the method as a
whole.
Meanwhile within the method we would issue separate calls to each of the
clients using
the default or user-configured value of their  `default.api.timeout.ms` as
the timeout
parameter.

So the API as proposed makes sense to me.


On Wed, Oct 11, 2023 at 6:48 PM Matthias J. Sax <mj...@apache.org> wrote:

> In can answer 130 and 131.
>
> 130) We cannot guarantee that all clients are already initialized due to
> race conditions. We plan to not allow calling
> `KafkaStreams#clientsInstanceIds()` when the state is not RUNNING (or
> REBALANCING) though -- guess this slipped on the KIP and should be
> added? But because StreamThreads can be added dynamically (and producer
> might be created dynamically at runtime; cf below), we still cannot
> guarantee that all clients are already initialized when the method is
> called. Of course, we assume that all clients are most likely initialize
> on the happy path, and blocking calls to `client.clientInstanceId()`
> should be rare.
>
> To address the worst case, we won't do a naive implementation and just
> loop over all clients, but fan-out the call to the different
> StreamThreads (and GlobalStreamThread if it exists), and use Futures to
> gather the results.
>
> Currently, `StreamThreads` has 3 clients (if ALOS or EOSv2 is used), so
> we might do 3 blocking calls in the worst case (for EOSv1 we get a
> producer per tasks, and we might end up doing more blocking calls if the
> producers are not initialized yet). Note that EOSv1 is already
> deprecated, and we are also working on thread refactoring that will
> reduce the number of client on StreamThread to 2 -- and we have more
> refactoring planned to reduce the number of clients even further.
>
> Inside `KafakStreams#clientsInstanceIds()` we might only do single
> blocking call for the admin client (ie, `admin.clientInstanceId()`).
>
> I agree that we need to do some clever timeout management, but it seems
> to be more of an implementation detail?
>
> Do you have any particular concerns, or does the proposed implementation
> as sketched above address your question?
>
>
> 130) If the Topology does not have a global-state-store, there won't be
> a GlobalThread and thus not global consumer. Thus, we return an Optional.
>
>
>
> On three related question for Andrew.
>
> (1) Why is the method called `clientInstanceId()` and not just plain
> `instanceId()`?
>
> (2) Why so we return a `String` while but not a UUID type? The added
> protocol request/response classes use UUIDs.
>
> (3) Would it make sense to have an overloaded `clientInstanceId()`
> method that does not take any parameter but uses `default.api.timeout`
> config (this config does no exist on the producer though, so we could
> only have it for consumer and admin at this point). We could of course
> also add overloads like this later if user request them (and/or add
> `default.api.timeout.ms` to the producer, too).
>
> Btw: For KafkaStreams, I think `clientsInstanceIds` still makes sense as
> a method name though, as `KafkaStreams` itself does not have an
> `instanceId` -- we can also not have a timeout-less overload, because
> `KafkaStreams` does not have a `default.api.timeout.ms` config either
> (and I don't think it make sense to add).
>
>
>
> -Matthias
>
> On 10/11/23 5:07 PM, Jun Rao wrote:
> > Hi, Andrew,
> >
> > Thanks for the updated KIP. Just a few more minor comments.
> >
> > 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for
> all
> > consumer/producer/adminClient instances to be initialized? Are all those
> > instances created during KafkaStreams initialization?
> >
> > 131. Why does globalConsumerInstanceId() return Optional<String> while
> > other consumer instances don't return Optional?
> >
> > 132. ClientMetricsSubscriptionRequestCount: Do we need this since we
> have a
> > set of generic metrics
> > (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
> > report Request rate for every request type?
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org>
> wrote:
> >
> >> Thanks!
> >>
> >> On 10/10/23 11:31 PM, Andrew Schofield wrote:
> >>> Matthias,
> >>> Yes, I think that’s a sensible way forward and the interface you
> propose
> >> looks good. I’ll update the KIP accordingly.
> >>>
> >>> Thanks,
> >>> Andrew
> >>>
> >>>> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote:
> >>>>
> >>>> Andrew,
> >>>>
> >>>> yes I would like to get this change into KIP-714 right way. Seems to
> be
> >> important, as we don't know if/when a follow-up KIP for Kafka Streams
> would
> >> land.
> >>>>
> >>>> I was also thinking (and discussed with a few others) how to expose
> it,
> >> and we would propose the following:
> >>>>
> >>>> We add a new method to `KafkaStreams` class:
> >>>>
> >>>>      public ClientsInstanceIds clientsInstanceIds(Duration timeout);
> >>>>
> >>>> The returned object is like below:
> >>>>
> >>>>    public class ClientsInstanceIds {
> >>>>      // we only have a single admin client per KS instance
> >>>>      String adminInstanceId();
> >>>>
> >>>>      // we only have a single global consumer per KS instance (if any)
> >>>>      // Optional<> because we might not have global-thread
> >>>>      Optional<String> globalConsumerInstanceId();
> >>>>
> >>>>      // return a <threadKey -> ClientInstanceId> mapping
> >>>>      // for the underlying (restore-)consumers/producers
> >>>>      Map<String, String> mainConsumerInstanceIds();
> >>>>      Map<String, String> restoreConsumerInstanceIds();
> >>>>      Map<String, String> producerInstanceIds();
> >>>> }
> >>>>
> >>>> For the `threadKey`, we would use some pattern like this:
> >>>>
> >>>>    [Stream|StateUpdater]Thread-<threadIdx>
> >>>>
> >>>>
> >>>> Would this work from your POV?
> >>>>
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>>
> >>>> On 10/9/23 2:15 AM, Andrew Schofield wrote:
> >>>>> Hi Matthias,
> >>>>> Good point. Makes sense to me.
> >>>>> Is this something that can also be included in the proposed Kafka
> >> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
> >>>>> I have a slight preference for the former to put all of the KS
> >> enhancements into a separate KIP.
> >>>>> Thanks,
> >>>>> Andrew
> >>>>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote:
> >>>>>>
> >>>>>> Thanks Andrew. SGTM.
> >>>>>>
> >>>>>> One point you did not address is the idea to add a method to
> >> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be
> >> added to consumer/producer/admin clients.
> >>>>>>
> >>>>>> Without addressing this, Kafka Streams users won't have a way to get
> >> the assigned `instanceId` of the internally created clients, and thus it
> >> would be very difficult for them to know which metrics that the broker
> >> receives belong to a Kafka Streams app. It seems they would only find
> the
> >> `instanceIds` in the log4j output if they enable client logging?
> >>>>>>
> >>>>>> Of course, because there is multiple clients inside Kafka Streams,
> >> the return type cannot be an single "String", but must be some some
> complex
> >> data structure -- we could either add a new class, or return a
> >> Map<String,String> using a client key that maps to the `instanceId`.
> >>>>>>
> >>>>>> For example we could use the following key:
> >>>>>>
> >>>>>>
>  [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]
> >>>>>>
> >>>>>> (Of course, only the valid combination.)
> >>>>>>
> >>>>>> Or maybe even better, we might want to return a `Future` because
> >> collection all the `instanceId` might be a blocking all on each client?
> I
> >> have already a few idea how it could be implemented but I don't think it
> >> must be discussed on the KIP, as it's an implementation detail.
> >>>>>>
> >>>>>> Thoughts?
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
> >>>>>>> Hi Matthias,
> >>>>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka
> >> Streams makes sense. This KIP currently has made a bit
> >>>>>>> of an effort to embrace KS, but it’s not enough by a long way.
> >>>>>>> I have removed `application.id <http://application.id/>`. This
> >> should be done properly in the follow-up KIP. I don’t believe there’s a
> >> downside to
> >>>>>>> removing it from this KIP.
> >>>>>>> I have reworded the statement about temporarily. In practice, the
> >> implementation of this KIP that’s going on while the voting
> >>>>>>> progresses happens to use delta temporality, but that’s an
> >> implementation detail. Supporting clients must support both
> >>>>>>> temporalities.
> >>>>>>> I thought about exposing the client instance ID as a metric, but
> >> non-numeric metrics are not usual practice and tools
> >>>>>>> do not universally support them. I don’t think the KIP is improved
> >> by adding one now.
> >>>>>>> I have also added constants for the various Config classes for
> >> ENABLE_METRICS_PUSH_CONFIG, including to
> >>>>>>> StreamsConfig. It’s best to be explicit about this.
> >>>>>>> Thanks,
> >>>>>>> Andrew
> >>>>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org>
> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I did not pay attention to this KIP in the past; seems it was
> >> on-hold for a while.
> >>>>>>>>
> >>>>>>>> Overall it sounds very useful, and I think we should extend this
> >> with a follow up KIP for Kafka Streams. What is unclear to me at this
> point
> >> is the statement:
> >>>>>>>>
> >>>>>>>>> Kafka Streams applications have an application.id configured and
> >> this identifier should be included as the application_id metrics label.
> >>>>>>>>
> >>>>>>>> The `application.id` is currently only used as the (main)
> >> consumer's `group.id` (and is part of an auto-generated `client.id` if
> >> the user does not set one).
> >>>>>>>>
> >>>>>>>> This comment related to:
> >>>>>>>>
> >>>>>>>>> The following labels should be added by the client as appropriate
> >> before metrics are pushed.
> >>>>>>>>
> >>>>>>>> Given that Kafka Streams uses the consumer/producer/admin client
> as
> >> "black boxes", a client does at this point not know that it's part of a
> >> Kafka Streams application, and thus, it won't be able to attach any such
> >> label to the metrics it sends. (Also producer and admin don't even know
> the
> >> value of `application.id` -- only the (main) consumer, indirectly via `
> >> group.id`, but also restore and global consumer don't know it, because
> >> they don't have `group.id` set).
> >>>>>>>>
> >>>>>>>> While I am totally in favor of the proposal, I am wondering how we
> >> intent to implement it in clean way? Or would we do ok to have some
> >> internal client APIs that KS can use to "register" itself with the
> client?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> While clients must support both temporalities, the broker will
> >> initially only send
> GetTelemetrySubscriptionsResponse.DeltaTemporality=True
> >>>>>>>>
> >>>>>>>> Not sure if I can follow. How make the decision about DELTA or
> >> CUMULATIVE metrics? Should the broker side plugin not decide what
> metrics
> >> it what to receive in which form? So what does "initially" mean -- the
> >> broker won't ship with a default plugin implementation?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> The following method is added to the Producer, Consumer, and
> Admin
> >> client interfaces:
> >>>>>>>>
> >>>>>>>> Should we add anything to Kafka Streams to expose the underlying
> >> clients' assigned client-instance-ids programmatically? I am also
> wondering
> >> if clients should report their assigned client-instance-ids as metrics
> >> itself (for this case, Kafka Streams won't need to do anything, because
> we
> >> already expose all client metrics).
> >>>>>>>>
> >>>>>>>> If we add anything programmatic, we need to make it simple, given
> >> that Kafka Streams has many clients per `StreamThread` and may have
> >> multiple threads.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> enable.metrics.push
> >>>>>>>> It might be worth to add this to `StreamsConfig`, too? It set via
> >> StreamsConfig, we would forward it to all clients automatically.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 9/29/23 5:45 PM, David Jacot wrote:
> >>>>>>>>> Hi Andrew,
> >>>>>>>>> Thanks for driving this one. I haven't read all the KIP yet but I
> >> already
> >>>>>>>>> have an initial question. In the Threading section, it is written
> >>>>>>>>> "KafkaConsumer: the "background" thread (based on the consumer
> >> threading
> >>>>>>>>> refactor which is underway)". If I understand this correctly, it
> >> means
> >>>>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I
> >> correct?
> >>>>>>>>> Cheers,
> >>>>>>>>> David
> >>>>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
> >>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>> Hi Philip,
> >>>>>>>>>> No, I do not think it should actively search for a broker that
> >> supports
> >>>>>>>>>> the new
> >>>>>>>>>> RPCs. In general, either all of the brokers or none of the
> >> brokers will
> >>>>>>>>>> support it.
> >>>>>>>>>> In the window, where the cluster is being upgraded or client
> >> telemetry is
> >>>>>>>>>> being
> >>>>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too
> >> much effort
> >>>>>>>>>> into
> >>>>>>>>>> this mixed scenario. As the client finds brokers which support
> >> the new
> >>>>>>>>>> RPCs,
> >>>>>>>>>> it can begin to follow the KIP-714 mechanism.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Andrew
> >>>>>>>>>>
> >>>>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com>
> >> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Andrew -
> >>>>>>>>>>>
> >>>>>>>>>>> Question on top of your answers: Do you think the client should
> >> actively
> >>>>>>>>>>> search for a broker that supports this RPC? As previously
> >> mentioned, the
> >>>>>>>>>>> broker uses the leastLoadedNode to find its first connection
> (am
> >>>>>>>>>>> I correct?), and what if that broker doesn't support the metric
> >> push?
> >>>>>>>>>>>
> >>>>>>>>>>> P
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
> >>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Kirk,
> >>>>>>>>>>>> Thanks for your question. You are correct that the presence or
> >> absence
> >>>>>>>>>> of
> >>>>>>>>>>>> the new RPCs in the
> >>>>>>>>>>>> ApiVersionsResponse tells the client whether to request the
> >> telemetry
> >>>>>>>>>>>> subscriptions and push
> >>>>>>>>>>>> metrics.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is of course tricky in practice. It would be conceivable,
> >> as a
> >>>>>>>>>>>> cluster is upgraded to AK 3.7
> >>>>>>>>>>>> or as a client metrics receiver plugin is deployed across the
> >> cluster,
> >>>>>>>>>>>> that a client connects to some
> >>>>>>>>>>>> brokers that support the new RPCs and some that do not.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here’s my suggestion:
> >>>>>>>>>>>> * If a client is not connected to any brokers that support in
> >> the new
> >>>>>>>>>>>> RPCs, it cannot push metrics.
> >>>>>>>>>>>> * If a client is only connected to brokers that support the
> new
> >> RPCs, it
> >>>>>>>>>>>> will use the new RPCs in
> >>>>>>>>>>>> accordance with the KIP.
> >>>>>>>>>>>> * If a client is connected to some brokers that support the
> new
> >> RPCs and
> >>>>>>>>>>>> some that do not, it will
> >>>>>>>>>>>> use the new RPCs with the supporting subset of brokers in
> >> accordance
> >>>>>>>>>> with
> >>>>>>>>>>>> the KIP.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Comments?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Andrew
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro>
> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Andrew/Jun,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I want to make sure I understand question/comment #119… In
> the
> >> case
> >>>>>>>>>>>> where a cluster without a metrics client receiver is later
> >> reconfigured
> >>>>>>>>>> and
> >>>>>>>>>>>> restarted to include a metrics client receiver, do we want the
> >> client to
> >>>>>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s
> >> response
> >>>>>>>>>> to
> >>>>>>>>>>>> question #119, it sounds like we’re using the presence/absence
> >> of the
> >>>>>>>>>>>> relevant RPCs in ApiVersionsResponse as the
> >> to-push-or-not-to-push
> >>>>>>>>>>>> indicator. Do I have that correct?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Kirk
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
> >>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Jun,
> >>>>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify
> >> where
> >>>>>>>>>>>> necessary.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and
> the
> >>>>>>>>>>>> replacement of
> >>>>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP
> >> guidelines. I
> >>>>>>>>>>>> think it’s a bit
> >>>>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a
> >> namespace
> >>>>>>>>>> and a
> >>>>>>>>>>>>>> multi-word component. If it was “client.id” then “client”
> >> would be a
> >>>>>>>>>>>> namespace with
> >>>>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it
> >> was
> >>>>>>>>>>>> intentional, but debatable.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 112. Thanks. The link target moved. Fixed.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 113. Thanks. Fixed.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it
> >> should use
> >>>>>>>>>> the
> >>>>>>>>>>>> exact same
> >>>>>>>>>>>>>> name. If a standard metric doesn’t make sense for a client,
> >> then it
> >>>>>>>>>> can
> >>>>>>>>>>>> omit that metric.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For a required metric, the situation is stronger. All
> clients
> >> must
> >>>>>>>>>>>> implement these
> >>>>>>>>>>>>>> metrics with these names in order to implement the KIP. But
> >> the
> >>>>>>>>>>>> required metrics
> >>>>>>>>>>>>>> are essentially the number of connections and the request
> >> latency,
> >>>>>>>>>>>> which do not
> >>>>>>>>>>>>>> reference the underlying implementation of the client (which
> >>>>>>>>>>>> producer.record.queue.time.max
> >>>>>>>>>>>>>> of course does).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I suppose someone might build a producer-only client that
> >> didn’t have
> >>>>>>>>>>>> consumer metrics.
> >>>>>>>>>>>>>> In this case, the consumer metrics would conceptually have
> >> the value 0
> >>>>>>>>>>>> and would not
> >>>>>>>>>>>>>> need to be sent to the broker.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 114.2. If a client does not implement some metrics, they
> will
> >> not be
> >>>>>>>>>>>> available for
> >>>>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to
> >> combine
> >>>>>>>>>>>> metrics from lots
> >>>>>>>>>>>>>> different clients less complete.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 115. I think it was probably a mistake to be so specific
> about
> >>>>>>>>>>>> threading in this KIP.
> >>>>>>>>>>>>>> When the consumer threading refactor is complete, of course,
> >> it would
> >>>>>>>>>>>> do the appropriate
> >>>>>>>>>>>>>> equivalent. I’ve added a clarification and massively
> >> simplified this
> >>>>>>>>>>>> section.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 116. I removed “client.terminating”.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 117. Yes. Horrid. Fixed.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 118. The Terminating flag just indicates that this is the
> >> final
> >>>>>>>>>>>> PushTelemetryRequest
> >>>>>>>>>>>>>> from this client. Any subsequent request will be rejected. I
> >> think
> >>>>>>>>>> this
> >>>>>>>>>>>> flag should remain.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 119. Good catch. This was actually contradicting another
> part
> >> of the
> >>>>>>>>>>>> KIP. The current behaviour
> >>>>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client
> >> metrics
> >>>>>>>>>>>> receiver plugin, the new RPCs
> >>>>>>>>>>>>>> in this KIP are “turned off” and not reported in
> >> ApiVersionsResponse.
> >>>>>>>>>>>> The client will not
> >>>>>>>>>>>>>> attempt to push metrics.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 120. The error handling table lists the error codes for
> >>>>>>>>>>>> PushTelemetryResponse. I’ve added one
> >>>>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t
> >> have any
> >>>>>>>>>>>> error codes, since the
> >>>>>>>>>>>>>> situation in which the client telemetry is not supported is
> >> handled by
> >>>>>>>>>>>> the RPCs not being offered
> >>>>>>>>>>>>>> by the broker.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific
> >> about
> >>>>>>>>>>>> threading. Removed.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation
> >> should be
> >>>>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation
> >> should be
> >>>>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the
> >> CLUSTER
> >>>>>>>>>>>>>> resource.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the detailed review.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 110. Another potential motivation is the multiple clients
> >> support.
> >>>>>>>>>>>> Some of
> >>>>>>>>>>>>>>> the places may not have good monitoring support for
> non-java
> >> clients.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for
> >> metric name
> >>>>>>>>>> and
> >>>>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the
> >> inconsistency?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 113. "Defining standard and required metrics makes the
> >> monitoring and
> >>>>>>>>>>>>>>> troubleshooting of clients from various client types ":
> >> Incomplete
> >>>>>>>>>>>> sentence.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 114. standard/required metrics
> >>>>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with
> >> the exact
> >>>>>>>>>>>> same
> >>>>>>>>>>>>>>> names?
> >>>>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing
> from
> >> a
> >>>>>>>>>> client?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application
> >> threads":
> >>>>>>>>>> We
> >>>>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading
> >> model (
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
> >>>>>>>>>>>> ).
> >>>>>>>>>>>>>>> Once this is done, PRC requests will only be made from the
> >> background
> >>>>>>>>>>>>>>> thread. Should this KIP follow the new model only?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 116. 'The metrics should contain the reason for the client
> >>>>>>>>>> termination
> >>>>>>>>>>>> by
> >>>>>>>>>>>>>>> including the client.terminating metric with the label
> >> “reason” ...'.
> >>>>>>>>>>>> Hmm,
> >>>>>>>>>>>>>>> are we introducing a new metric client.terminating? If so,
> >> that needs
> >>>>>>>>>>>> to be
> >>>>>>>>>>>>>>> explicitly listed.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional
> >> metrics on top
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>> this the generic metrics receiver in the broker will not
> add
> >> these
> >>>>>>>>>>>> labels
> >>>>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't
> read
> >> well.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 118. "it is possible for the client to send at most one
> >> accepted
> >>>>>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks
> >> in": If
> >>>>>>>>>> we
> >>>>>>>>>>>> do
> >>>>>>>>>>>>>>> this, do we still need the Terminating flag in
> >>>>>>>>>> PushTelemetryRequestV0?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin
> >> configured on the
> >>>>>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest
> >> with
> >>>>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The
> >> client
> >>>>>>>>>> should
> >>>>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the
> >> PushIntervalMs
> >>>>>>>>>>>> has
> >>>>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or
> >> disabled
> >>>>>>>>>>>> without
> >>>>>>>>>>>>>>> having to restart the broker or reset the client
> connection."
> >>>>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined
> by
> >> no
> >>>>>>>>>> metric
> >>>>>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right?
> >> In that
> >>>>>>>>>>>> case,
> >>>>>>>>>>>>>>> it would be useful to avoid the clients sending
> >>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve
> >> the current
> >>>>>>>>>>>>>>> behavior.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and
> >> PushTelemetryRequestV0:
> >>>>>>>>>>>> Could
> >>>>>>>>>>>>>>> we list error codes for each?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This
> >> method may
> >>>>>>>>>>>> be
> >>>>>>>>>>>>>>> called from the request handling thread": Where else can
> >> this method
> >>>>>>>>>> be
> >>>>>>>>>>>>>>> called?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we
> >> changing the
> >>>>>>>>>>>> ACL?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Jun
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
> >>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Milind,
> >>>>>>>>>>>>>>>> Thanks for your question.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely
> >> to be
> >>>>>>>>>>>> caused by a
> >>>>>>>>>>>>>>>> problem in the serialization in the client. I have changed
> >> the
> >>>>>>>>>> client
> >>>>>>>>>>>>>>>> action in this case
> >>>>>>>>>>>>>>>> to “Log an error and stop pushing metrics”.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have updated the KIP text accordingly.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
> >>>>>>>>>>>> <milut...@confluent.io.INVALID>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Andrew,
> >>>>>>>>>>>>>>>>> Thanks for the clarifications.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> About 2b:
> >>>>>>>>>>>>>>>>> In case a client has a bug while serializing, it might be
> >> difficult
> >>>>>>>>>>>> for
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> client to recover from that without code changes. In
> that,
> >> it might
> >>>>>>>>>>>> be
> >>>>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the
> >> error as
> >>>>>>>>>>>> fatal
> >>>>>>>>>>>>>>>>> for the client (only fatal in terms of sending the
> >> metrics, the
> >>>>>>>>>>>> client
> >>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> keep functioning otherwise). What do you think?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>> Milind
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> >>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Milind,
> >>>>>>>>>>>>>>>>>> Thanks for your questions about the KIP.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical
> >> versions of the
> >>>>>>>>>>>> KIP.
> >>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>> think this is
> >>>>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push
> >> interval. 30
> >>>>>>>>>>>>>>>> minutes
> >>>>>>>>>>>>>>>>>> is a mystery
> >>>>>>>>>>>>>>>>>> to me. I’ve updated the KIP.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 2) I think there are two situations in which
> >> INVALID_RECORD might
> >>>>>>>>>>>> occur.
> >>>>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that
> >> the
> >>>>>>>>>> broker
> >>>>>>>>>>>> does
> >>>>>>>>>>>>>>>>>> not support.
> >>>>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but
> >> there’s
> >>>>>>>>>>>> only
> >>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>> supported
> >>>>>>>>>>>>>>>>>> to start with. Until we have multiple content-types,
> this
> >> seems
> >>>>>>>>>> out
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> scope. I think a
> >>>>>>>>>>>>>>>>>> future KIP would add another error code for this.
> >>>>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the
> >> metrics
> >>>>>>>>>>>> payload
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> malformed.
> >>>>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push
> on
> >> the push
> >>>>>>>>>>>>>>>>>> interval seems
> >>>>>>>>>>>>>>>>>> appropriate.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by
> making
> >> an
> >>>>>>>>>>>> immediate
> >>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client
> >> bug or
> >>>>>>>>>>>> perhaps
> >>>>>>>>>>>>>>>>>> a situation in which a broker sends a compression type
> in
> >> a
> >>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse
> >>>>>>>>>>>>>>>>>> which is subsequently not supported when its used with a
> >>>>>>>>>>>>>>>>>> PushTelemetryRequest.
> >>>>>>>>>>>>>>>>>> We do want the client to have the opportunity to get an
> >> up-to-date
> >>>>>>>>>>>> list
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> supported
> >>>>>>>>>>>>>>>>>> compression types. I think an immediate
> >>>>>>>>>>>> GetTelemetrySubscriptionsRequest
> >>>>>>>>>>>>>>>>>> is appropriate.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a
> Null
> >>>>>>>>>>>>>>>>>> ClientInstanceId, the
> >>>>>>>>>>>>>>>>>> receiving broker may not already know the client's
> >> existing
> >>>>>>>>>>>>>>>>>> ClientInstanceId. If the
> >>>>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it
> >> simply
> >>>>>>>>>>>> responds
> >>>>>>>>>>>>>>>>>> the existing
> >>>>>>>>>>>>>>>>>> value back to the client. If it does not know the
> existing
> >>>>>>>>>>>>>>>>>> ClientInstanceId, it will create
> >>>>>>>>>>>>>>>>>> a new client instance ID and respond with that.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I will update the KIP with these clarifications.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
> >>>>>>>>>>>> <milut...@confluent.io.INVALID
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling"
> >> section.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to
> >>>>>>>>>> eventually
> >>>>>>>>>>>>>>>>>> trigger
> >>>>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the
> >> cluster
> >>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by
> operator
> >> error,
> >>>>>>>>>>>> rolling
> >>>>>>>>>>>>>>>>>>> upgrades, etc."
> >>>>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else.
> >> What is
> >>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>> referring to?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2. For the actual errors:
> >>>>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a
> >> warning to the
> >>>>>>>>>>>>>>>>>>> application and schedule the next
> >>>>>>>>>> GetTelemetrySubscriptionsRequest
> >>>>>>>>>>>> to 5
> >>>>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like
> >>>>>>>>>>>> PushIntervalMs?
> >>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>> also, why are we scheduling a
> >> GetTelemetrySubscriptionsRequest in
> >>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>> case, if the serialization is broken?
> >>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE
> :
> >> just to
> >>>>>>>>>>>>>>>> confirm,
> >>>>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be
> >> scheduled
> >>>>>>>>>>>> immediately
> >>>>>>>>>>>>>>>>>>> after the PushTelemetry response, correct?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests
> >> must include
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response,
> >> regardless of
> >>>>>>>>>>>> broker":
> >>>>>>>>>>>>>>>>>>> Will a broker error be returned in case some
> >> implementation of
> >>>>>>>>>>>> this KIP
> >>>>>>>>>>>>>>>>>>> violates this accidentally and sends a request with
> >>>>>>>>>>>> ClientInstanceId =
> >>>>>>>>>>>>>>>>>> Null
> >>>>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new
> >>>>>>>>>>>> ClientInstanceId be
> >>>>>>>>>>>>>>>>>>> returned without an error?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> >>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on
> >> KIP-714: Client
> >>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>>> and observability.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the
> >> scope.
> >>>>>>>>>> The
> >>>>>>>>>>>>>>>>>> overall
> >>>>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is
> unchanged,
> >> but the
> >>>>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics,
> rather
> >> than
> >>>>>>>>>>>>>>>> introducing
> >>>>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster
> >> operators
> >>>>>>>>>>>>>>>>>>>> investigate performance problems experienced by
> clients
> >> without
> >>>>>>>>>>>>>>>>>> requiring
> >>>>>>>>>>>>>>>>>>>> changes to the client application code or
> configuration.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>> Andrew
> >>>
> >>>
> >>
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to