Re: [DISCUSS] KIP-714: Client metrics and observability

Andrew Schofield Thu, 12 Oct 2023 09:42:32 -0700

Hi Matthias,
131) I also think that separating the main and restore consumers is excessively 
specific about the current implementation.
So, I think I’d prefer:


public class ClientInstanceIds {
  String adminInstanceId();

  Optional<String> globalConsumerInstanceId();

  Map<String, String> consumerInstanceIds();

  Map<String, String> producerInstanceIds();
}

I’m not sure whether it makes sense to combine the global consumer instance id 
too.

I’ve got rid of the double plural :)

My 2 cents.

Thanks,
Andrew

> On 12 Oct 2023, at 17:29, Matthias J. Sax <mj...@apache.org> wrote:
>
> Thanks Sophie and Jun.
>
> `clientInstanceIds()` is fine with me -- was not sure about the double plural 
> myself.
>
> Sorry if my comments was confusing. I was trying to say, that adding a 
> overload to `KafkaStreams` that does not take a timeout parameter does not 
> make sense, because there is no `default.api.timeout.ms` config for Kafka 
> Streams, so users always need to pass in a timeout. (Same for producer.)
>
> For the implementation, I think KS would always call 
> `client.clientInstanceId(timeout)` and never rely on `default.api.timeout.ms` 
> though, so we can stay in control -- if a timeout is passed by the user, it 
> would always overwrite `default.api.timeout.ms` on the consumer/admin and 
> thus we should follow the same semantics in Kafka Streams, and overwrite it 
> explicitly when calling `client.clientInstanceId()`.
>
> The proposed API also makes sense to me. I was just wondering if we want to 
> extend it for client users -- for KS we won't need/use the timeout-less 
> overloads.
>
>
>
> 130) My intent was to throw a TimeoutException if we cannot get all 
> instanceIds, because it's the standard contract for timeouts. It would also 
> be hard to tell for a user, if a full or partial result was returned (or we 
> add a method `boolean isPartialResult()` to make it easier for users).
>
> If there is concerns/objections, I am also ok to return a partial result -- 
> it would require a change to the newly added `ClientInstanceIds` return type 
> -- for `adminInstanceId` we only return a `String` right now -- we might need 
> to change this to `Optional<String>` so we are able to return a partial 
> result?
>
>
> 131) Of course we could, but I am not sure what we would gain? In the end, 
> implementation details would always leak because if we change the number of 
> consumer we use, we would return different keys in the `Map`. Atm, the 
> proposal implies that the same key might be used for the "main" and "restore" 
> consumer of the same thread -- but we can make keys unique by adding a 
> `-restore` suffix to the restore-consumer key if we merge both maps. -- 
> Curious to hear what others think. I am very open to do it differently than 
> currently proposed.
>
>
> -Matthias
>
>
> On 10/12/23 8:39 AM, Jun Rao wrote:
>> Hi, Matthias,
>> Thanks for the reply.
>> 130. What would be the semantic? If the timeout has expired and only some
>> of the client instances' id have been retrieved, does the call return the
>> partial result or throw an exception?
>> 131. Could we group all consumer instances in a single method since we are
>> returning the key for each instance already? This probably also avoids
>> exposing implementation details that could change over time.
>> Thanks,
>> Jun
>> On Thu, Oct 12, 2023 at 12:00 AM Sophie Blee-Goldman <sop...@responsive.dev>
>> wrote:
>>> Regarding the naming, I personally think `clientInstanceId` makes sense for
>>> the plain clients
>>>  -- especially if we might later introduce the notion of an
>>> `applicationInstanceId`.
>>>
>>> I'm not a huge fan of `clientsInstanceIds` for the Kafka Streams API,
>>> though, can we use
>>> `clientInstanceIds` instead? (The difference being the placement of the
>>> plural 's')
>>> I would similarly rename the class to just ClientInstanceIds
>>>
>>> we can also not have a timeout-less overload,  because `KafkaStreams` does
>>>> not have a `default.api.timeout.ms` config either
>>>
>>> With respect to the timeout for the Kafka Streams API, I'm a bit confused
>>> by the
>>> doubletriple-negative of Matthias' comment here, but I was thinking about
>>> this
>>> earlier and this was my take: with the current proposal, we would allow
>>> users to pass
>>> in an absolute timeout as a parameter that would apply to the method as a
>>> whole.
>>> Meanwhile within the method we would issue separate calls to each of the
>>> clients using
>>> the default or user-configured value of their  `default.api.timeout.ms` as
>>> the timeout
>>> parameter.
>>>
>>> So the API as proposed makes sense to me.
>>>
>>>
>>> On Wed, Oct 11, 2023 at 6:48 PM Matthias J. Sax <mj...@apache.org> wrote:
>>>
>>>> In can answer 130 and 131.
>>>>
>>>> 130) We cannot guarantee that all clients are already initialized due to
>>>> race conditions. We plan to not allow calling
>>>> `KafkaStreams#clientsInstanceIds()` when the state is not RUNNING (or
>>>> REBALANCING) though -- guess this slipped on the KIP and should be
>>>> added? But because StreamThreads can be added dynamically (and producer
>>>> might be created dynamically at runtime; cf below), we still cannot
>>>> guarantee that all clients are already initialized when the method is
>>>> called. Of course, we assume that all clients are most likely initialize
>>>> on the happy path, and blocking calls to `client.clientInstanceId()`
>>>> should be rare.
>>>>
>>>> To address the worst case, we won't do a naive implementation and just
>>>> loop over all clients, but fan-out the call to the different
>>>> StreamThreads (and GlobalStreamThread if it exists), and use Futures to
>>>> gather the results.
>>>>
>>>> Currently, `StreamThreads` has 3 clients (if ALOS or EOSv2 is used), so
>>>> we might do 3 blocking calls in the worst case (for EOSv1 we get a
>>>> producer per tasks, and we might end up doing more blocking calls if the
>>>> producers are not initialized yet). Note that EOSv1 is already
>>>> deprecated, and we are also working on thread refactoring that will
>>>> reduce the number of client on StreamThread to 2 -- and we have more
>>>> refactoring planned to reduce the number of clients even further.
>>>>
>>>> Inside `KafakStreams#clientsInstanceIds()` we might only do single
>>>> blocking call for the admin client (ie, `admin.clientInstanceId()`).
>>>>
>>>> I agree that we need to do some clever timeout management, but it seems
>>>> to be more of an implementation detail?
>>>>
>>>> Do you have any particular concerns, or does the proposed implementation
>>>> as sketched above address your question?
>>>>
>>>>
>>>> 130) If the Topology does not have a global-state-store, there won't be
>>>> a GlobalThread and thus not global consumer. Thus, we return an Optional.
>>>>
>>>>
>>>>
>>>> On three related question for Andrew.
>>>>
>>>> (1) Why is the method called `clientInstanceId()` and not just plain
>>>> `instanceId()`?
>>>>
>>>> (2) Why so we return a `String` while but not a UUID type? The added
>>>> protocol request/response classes use UUIDs.
>>>>
>>>> (3) Would it make sense to have an overloaded `clientInstanceId()`
>>>> method that does not take any parameter but uses `default.api.timeout`
>>>> config (this config does no exist on the producer though, so we could
>>>> only have it for consumer and admin at this point). We could of course
>>>> also add overloads like this later if user request them (and/or add
>>>> `default.api.timeout.ms` to the producer, too).
>>>>
>>>> Btw: For KafkaStreams, I think `clientsInstanceIds` still makes sense as
>>>> a method name though, as `KafkaStreams` itself does not have an
>>>> `instanceId` -- we can also not have a timeout-less overload, because
>>>> `KafkaStreams` does not have a `default.api.timeout.ms` config either
>>>> (and I don't think it make sense to add).
>>>>
>>>>
>>>>
>>>> -Matthias
>>>>
>>>> On 10/11/23 5:07 PM, Jun Rao wrote:
>>>>> Hi, Andrew,
>>>>>
>>>>> Thanks for the updated KIP. Just a few more minor comments.
>>>>>
>>>>> 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for
>>>> all
>>>>> consumer/producer/adminClient instances to be initialized? Are all
>>> those
>>>>> instances created during KafkaStreams initialization?
>>>>>
>>>>> 131. Why does globalConsumerInstanceId() return Optional<String> while
>>>>> other consumer instances don't return Optional?
>>>>>
>>>>> 132. ClientMetricsSubscriptionRequestCount: Do we need this since we
>>>> have a
>>>>> set of generic metrics
>>>>> (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
>>>>> report Request rate for every request type?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jun
>>>>>
>>>>> On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org>
>>>> wrote:
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On 10/10/23 11:31 PM, Andrew Schofield wrote:
>>>>>>> Matthias,
>>>>>>> Yes, I think that’s a sensible way forward and the interface you
>>>> propose
>>>>>> looks good. I’ll update the KIP accordingly.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>>> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Andrew,
>>>>>>>>
>>>>>>>> yes I would like to get this change into KIP-714 right way. Seems to
>>>> be
>>>>>> important, as we don't know if/when a follow-up KIP for Kafka Streams
>>>> would
>>>>>> land.
>>>>>>>>
>>>>>>>> I was also thinking (and discussed with a few others) how to expose
>>>> it,
>>>>>> and we would propose the following:
>>>>>>>>
>>>>>>>> We add a new method to `KafkaStreams` class:
>>>>>>>>
>>>>>>>>      public ClientsInstanceIds clientsInstanceIds(Duration timeout);
>>>>>>>>
>>>>>>>> The returned object is like below:
>>>>>>>>
>>>>>>>>    public class ClientsInstanceIds {
>>>>>>>>      // we only have a single admin client per KS instance
>>>>>>>>      String adminInstanceId();
>>>>>>>>
>>>>>>>>      // we only have a single global consumer per KS instance (if
>>> any)
>>>>>>>>      // Optional<> because we might not have global-thread
>>>>>>>>      Optional<String> globalConsumerInstanceId();
>>>>>>>>
>>>>>>>>      // return a <threadKey -> ClientInstanceId> mapping
>>>>>>>>      // for the underlying (restore-)consumers/producers
>>>>>>>>      Map<String, String> mainConsumerInstanceIds();
>>>>>>>>      Map<String, String> restoreConsumerInstanceIds();
>>>>>>>>      Map<String, String> producerInstanceIds();
>>>>>>>> }
>>>>>>>>
>>>>>>>> For the `threadKey`, we would use some pattern like this:
>>>>>>>>
>>>>>>>>    [Stream|StateUpdater]Thread-<threadIdx>
>>>>>>>>
>>>>>>>>
>>>>>>>> Would this work from your POV?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/9/23 2:15 AM, Andrew Schofield wrote:
>>>>>>>>> Hi Matthias,
>>>>>>>>> Good point. Makes sense to me.
>>>>>>>>> Is this something that can also be included in the proposed Kafka
>>>>>> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
>>>>>>>>> I have a slight preference for the former to put all of the KS
>>>>>> enhancements into a separate KIP.
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org>
>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks Andrew. SGTM.
>>>>>>>>>>
>>>>>>>>>> One point you did not address is the idea to add a method to
>>>>>> `KafkaStreams` similar to the proposed `clientInstanceId()` that will
>>> be
>>>>>> added to consumer/producer/admin clients.
>>>>>>>>>>
>>>>>>>>>> Without addressing this, Kafka Streams users won't have a way to
>>> get
>>>>>> the assigned `instanceId` of the internally created clients, and thus
>>> it
>>>>>> would be very difficult for them to know which metrics that the broker
>>>>>> receives belong to a Kafka Streams app. It seems they would only find
>>>> the
>>>>>> `instanceIds` in the log4j output if they enable client logging?
>>>>>>>>>>
>>>>>>>>>> Of course, because there is multiple clients inside Kafka Streams,
>>>>>> the return type cannot be an single "String", but must be some some
>>>> complex
>>>>>> data structure -- we could either add a new class, or return a
>>>>>> Map<String,String> using a client key that maps to the `instanceId`.
>>>>>>>>>>
>>>>>>>>>> For example we could use the following key:
>>>>>>>>>>
>>>>>>>>>>
>>>>  [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]
>>>>>>>>>>
>>>>>>>>>> (Of course, only the valid combination.)
>>>>>>>>>>
>>>>>>>>>> Or maybe even better, we might want to return a `Future` because
>>>>>> collection all the `instanceId` might be a blocking all on each
>>> client?
>>>> I
>>>>>> have already a few idea how it could be implemented but I don't think
>>> it
>>>>>> must be discussed on the KIP, as it's an implementation detail.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka
>>>>>> Streams makes sense. This KIP currently has made a bit
>>>>>>>>>>> of an effort to embrace KS, but it’s not enough by a long way.
>>>>>>>>>>> I have removed `application.id <http://application.id/>`. This
>>>>>> should be done properly in the follow-up KIP. I don’t believe there’s
>>> a
>>>>>> downside to
>>>>>>>>>>> removing it from this KIP.
>>>>>>>>>>> I have reworded the statement about temporarily. In practice, the
>>>>>> implementation of this KIP that’s going on while the voting
>>>>>>>>>>> progresses happens to use delta temporality, but that’s an
>>>>>> implementation detail. Supporting clients must support both
>>>>>>>>>>> temporalities.
>>>>>>>>>>> I thought about exposing the client instance ID as a metric, but
>>>>>> non-numeric metrics are not usual practice and tools
>>>>>>>>>>> do not universally support them. I don’t think the KIP is
>>> improved
>>>>>> by adding one now.
>>>>>>>>>>> I have also added constants for the various Config classes for
>>>>>> ENABLE_METRICS_PUSH_CONFIG, including to
>>>>>>>>>>> StreamsConfig. It’s best to be explicit about this.
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Andrew
>>>>>>>>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org>
>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I did not pay attention to this KIP in the past; seems it was
>>>>>> on-hold for a while.
>>>>>>>>>>>>
>>>>>>>>>>>> Overall it sounds very useful, and I think we should extend this
>>>>>> with a follow up KIP for Kafka Streams. What is unclear to me at this
>>>> point
>>>>>> is the statement:
>>>>>>>>>>>>
>>>>>>>>>>>>> Kafka Streams applications have an application.id configured
>>> and
>>>>>> this identifier should be included as the application_id metrics
>>> label.
>>>>>>>>>>>>
>>>>>>>>>>>> The `application.id` is currently only used as the (main)
>>>>>> consumer's `group.id` (and is part of an auto-generated `client.id`
>>> if
>>>>>> the user does not set one).
>>>>>>>>>>>>
>>>>>>>>>>>> This comment related to:
>>>>>>>>>>>>
>>>>>>>>>>>>> The following labels should be added by the client as
>>> appropriate
>>>>>> before metrics are pushed.
>>>>>>>>>>>>
>>>>>>>>>>>> Given that Kafka Streams uses the consumer/producer/admin client
>>>> as
>>>>>> "black boxes", a client does at this point not know that it's part of
>>> a
>>>>>> Kafka Streams application, and thus, it won't be able to attach any
>>> such
>>>>>> label to the metrics it sends. (Also producer and admin don't even
>>> know
>>>> the
>>>>>> value of `application.id` -- only the (main) consumer, indirectly
>>> via `
>>>>>> group.id`, but also restore and global consumer don't know it,
>>> because
>>>>>> they don't have `group.id` set).
>>>>>>>>>>>>
>>>>>>>>>>>> While I am totally in favor of the proposal, I am wondering how
>>> we
>>>>>> intent to implement it in clean way? Or would we do ok to have some
>>>>>> internal client APIs that KS can use to "register" itself with the
>>>> client?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> While clients must support both temporalities, the broker will
>>>>>> initially only send
>>>> GetTelemetrySubscriptionsResponse.DeltaTemporality=True
>>>>>>>>>>>>
>>>>>>>>>>>> Not sure if I can follow. How make the decision about DELTA or
>>>>>> CUMULATIVE metrics? Should the broker side plugin not decide what
>>>> metrics
>>>>>> it what to receive in which form? So what does "initially" mean -- the
>>>>>> broker won't ship with a default plugin implementation?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> The following method is added to the Producer, Consumer, and
>>>> Admin
>>>>>> client interfaces:
>>>>>>>>>>>>
>>>>>>>>>>>> Should we add anything to Kafka Streams to expose the underlying
>>>>>> clients' assigned client-instance-ids programmatically? I am also
>>>> wondering
>>>>>> if clients should report their assigned client-instance-ids as metrics
>>>>>> itself (for this case, Kafka Streams won't need to do anything,
>>> because
>>>> we
>>>>>> already expose all client metrics).
>>>>>>>>>>>>
>>>>>>>>>>>> If we add anything programmatic, we need to make it simple,
>>> given
>>>>>> that Kafka Streams has many clients per `StreamThread` and may have
>>>>>> multiple threads.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> enable.metrics.push
>>>>>>>>>>>> It might be worth to add this to `StreamsConfig`, too? It set
>>> via
>>>>>> StreamsConfig, we would forward it to all clients automatically.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 9/29/23 5:45 PM, David Jacot wrote:
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>> Thanks for driving this one. I haven't read all the KIP yet
>>> but I
>>>>>> already
>>>>>>>>>>>>> have an initial question. In the Threading section, it is
>>> written
>>>>>>>>>>>>> "KafkaConsumer: the "background" thread (based on the consumer
>>>>>> threading
>>>>>>>>>>>>> refactor which is underway)". If I understand this correctly,
>>> it
>>>>>> means
>>>>>>>>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I
>>>>>> correct?
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> David
>>>>>>>>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>> Hi Philip,
>>>>>>>>>>>>>> No, I do not think it should actively search for a broker that
>>>>>> supports
>>>>>>>>>>>>>> the new
>>>>>>>>>>>>>> RPCs. In general, either all of the brokers or none of the
>>>>>> brokers will
>>>>>>>>>>>>>> support it.
>>>>>>>>>>>>>> In the window, where the cluster is being upgraded or client
>>>>>> telemetry is
>>>>>>>>>>>>>> being
>>>>>>>>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too
>>>>>> much effort
>>>>>>>>>>>>>> into
>>>>>>>>>>>>>> this mixed scenario. As the client finds brokers which support
>>>>>> the new
>>>>>>>>>>>>>> RPCs,
>>>>>>>>>>>>>> it can begin to follow the KIP-714 mechanism.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com>
>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Question on top of your answers: Do you think the client
>>> should
>>>>>> actively
>>>>>>>>>>>>>>> search for a broker that supports this RPC? As previously
>>>>>> mentioned, the
>>>>>>>>>>>>>>> broker uses the leastLoadedNode to find its first connection
>>>> (am
>>>>>>>>>>>>>>> I correct?), and what if that broker doesn't support the
>>> metric
>>>>>> push?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> P
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Kirk,
>>>>>>>>>>>>>>>> Thanks for your question. You are correct that the presence
>>> or
>>>>>> absence
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> the new RPCs in the
>>>>>>>>>>>>>>>> ApiVersionsResponse tells the client whether to request the
>>>>>> telemetry
>>>>>>>>>>>>>>>> subscriptions and push
>>>>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is of course tricky in practice. It would be
>>> conceivable,
>>>>>> as a
>>>>>>>>>>>>>>>> cluster is upgraded to AK 3.7
>>>>>>>>>>>>>>>> or as a client metrics receiver plugin is deployed across
>>> the
>>>>>> cluster,
>>>>>>>>>>>>>>>> that a client connects to some
>>>>>>>>>>>>>>>> brokers that support the new RPCs and some that do not.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here’s my suggestion:
>>>>>>>>>>>>>>>> * If a client is not connected to any brokers that support
>>> in
>>>>>> the new
>>>>>>>>>>>>>>>> RPCs, it cannot push metrics.
>>>>>>>>>>>>>>>> * If a client is only connected to brokers that support the
>>>> new
>>>>>> RPCs, it
>>>>>>>>>>>>>>>> will use the new RPCs in
>>>>>>>>>>>>>>>> accordance with the KIP.
>>>>>>>>>>>>>>>> * If a client is connected to some brokers that support the
>>>> new
>>>>>> RPCs and
>>>>>>>>>>>>>>>> some that do not, it will
>>>>>>>>>>>>>>>> use the new RPCs with the supporting subset of brokers in
>>>>>> accordance
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>> the KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Comments?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro>
>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew/Jun,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I want to make sure I understand question/comment #119… In
>>>> the
>>>>>> case
>>>>>>>>>>>>>>>> where a cluster without a metrics client receiver is later
>>>>>> reconfigured
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> restarted to include a metrics client receiver, do we want
>>> the
>>>>>> client to
>>>>>>>>>>>>>>>> thereafter begin pushing metrics to the cluster? From
>>> Andrew’s
>>>>>> response
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> question #119, it sounds like we’re using the
>>> presence/absence
>>>>>> of the
>>>>>>>>>>>>>>>> relevant RPCs in ApiVersionsResponse as the
>>>>>> to-push-or-not-to-push
>>>>>>>>>>>>>>>> indicator. Do I have that correct?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Kirk
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Jun,
>>>>>>>>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify
>>>>>> where
>>>>>>>>>>>>>>>> necessary.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and
>>>> the
>>>>>>>>>>>>>>>> replacement of
>>>>>>>>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP
>>>>>> guidelines. I
>>>>>>>>>>>>>>>> think it’s a bit
>>>>>>>>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a
>>>>>> namespace
>>>>>>>>>>>>>> and a
>>>>>>>>>>>>>>>>>> multi-word component. If it was “client.id” then “client”
>>>>>> would be a
>>>>>>>>>>>>>>>> namespace with
>>>>>>>>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So,
>>> it
>>>>>> was
>>>>>>>>>>>>>>>> intentional, but debatable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 112. Thanks. The link target moved. Fixed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 113. Thanks. Fixed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it
>>>>>> should use
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> exact same
>>>>>>>>>>>>>>>>>> name. If a standard metric doesn’t make sense for a
>>> client,
>>>>>> then it
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> omit that metric.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For a required metric, the situation is stronger. All
>>>> clients
>>>>>> must
>>>>>>>>>>>>>>>> implement these
>>>>>>>>>>>>>>>>>> metrics with these names in order to implement the KIP.
>>> But
>>>>>> the
>>>>>>>>>>>>>>>> required metrics
>>>>>>>>>>>>>>>>>> are essentially the number of connections and the request
>>>>>> latency,
>>>>>>>>>>>>>>>> which do not
>>>>>>>>>>>>>>>>>> reference the underlying implementation of the client
>>> (which
>>>>>>>>>>>>>>>> producer.record.queue.time.max
>>>>>>>>>>>>>>>>>> of course does).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I suppose someone might build a producer-only client that
>>>>>> didn’t have
>>>>>>>>>>>>>>>> consumer metrics.
>>>>>>>>>>>>>>>>>> In this case, the consumer metrics would conceptually have
>>>>>> the value 0
>>>>>>>>>>>>>>>> and would not
>>>>>>>>>>>>>>>>>> need to be sent to the broker.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 114.2. If a client does not implement some metrics, they
>>>> will
>>>>>> not be
>>>>>>>>>>>>>>>> available for
>>>>>>>>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to
>>>>>> combine
>>>>>>>>>>>>>>>> metrics from lots
>>>>>>>>>>>>>>>>>> different clients less complete.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 115. I think it was probably a mistake to be so specific
>>>> about
>>>>>>>>>>>>>>>> threading in this KIP.
>>>>>>>>>>>>>>>>>> When the consumer threading refactor is complete, of
>>> course,
>>>>>> it would
>>>>>>>>>>>>>>>> do the appropriate
>>>>>>>>>>>>>>>>>> equivalent. I’ve added a clarification and massively
>>>>>> simplified this
>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 116. I removed “client.terminating”.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 117. Yes. Horrid. Fixed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 118. The Terminating flag just indicates that this is the
>>>>>> final
>>>>>>>>>>>>>>>> PushTelemetryRequest
>>>>>>>>>>>>>>>>>> from this client. Any subsequent request will be
>>> rejected. I
>>>>>> think
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>> flag should remain.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 119. Good catch. This was actually contradicting another
>>>> part
>>>>>> of the
>>>>>>>>>>>>>>>> KIP. The current behaviour
>>>>>>>>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client
>>>>>> metrics
>>>>>>>>>>>>>>>> receiver plugin, the new RPCs
>>>>>>>>>>>>>>>>>> in this KIP are “turned off” and not reported in
>>>>>> ApiVersionsResponse.
>>>>>>>>>>>>>>>> The client will not
>>>>>>>>>>>>>>>>>> attempt to push metrics.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 120. The error handling table lists the error codes for
>>>>>>>>>>>>>>>> PushTelemetryResponse. I’ve added one
>>>>>>>>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions
>>> doesn’t
>>>>>> have any
>>>>>>>>>>>>>>>> error codes, since the
>>>>>>>>>>>>>>>>>> situation in which the client telemetry is not supported
>>> is
>>>>>> handled by
>>>>>>>>>>>>>>>> the RPCs not being offered
>>>>>>>>>>>>>>>>>> by the broker.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific
>>>>>> about
>>>>>>>>>>>>>>>> threading. Removed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation
>>>>>> should be
>>>>>>>>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation
>>>>>> should be
>>>>>>>>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on
>>> the
>>>>>> CLUSTER
>>>>>>>>>>>>>>>>>> resource.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the detailed review.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 110. Another potential motivation is the multiple clients
>>>>>> support.
>>>>>>>>>>>>>>>> Some of
>>>>>>>>>>>>>>>>>>> the places may not have good monitoring support for
>>>> non-java
>>>>>> clients.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for
>>>>>> metric name
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the
>>>>>> inconsistency?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 113. "Defining standard and required metrics makes the
>>>>>> monitoring and
>>>>>>>>>>>>>>>>>>> troubleshooting of clients from various client types ":
>>>>>> Incomplete
>>>>>>>>>>>>>>>> sentence.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 114. standard/required metrics
>>>>>>>>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics
>>> with
>>>>>> the exact
>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>> names?
>>>>>>>>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing
>>>> from
>>>>>> a
>>>>>>>>>>>>>> client?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and
>>> application
>>>>>> threads":
>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading
>>>>>> model (
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>
>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>> Once this is done, PRC requests will only be made from
>>> the
>>>>>> background
>>>>>>>>>>>>>>>>>>> thread. Should this KIP follow the new model only?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 116. 'The metrics should contain the reason for the
>>> client
>>>>>>>>>>>>>> termination
>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>> including the client.terminating metric with the label
>>>>>> “reason” ...'.
>>>>>>>>>>>>>>>> Hmm,
>>>>>>>>>>>>>>>>>>> are we introducing a new metric client.terminating? If
>>> so,
>>>>>> that needs
>>>>>>>>>>>>>>>> to be
>>>>>>>>>>>>>>>>>>> explicitly listed.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional
>>>>>> metrics on top
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> this the generic metrics receiver in the broker will not
>>>> add
>>>>>> these
>>>>>>>>>>>>>>>> labels
>>>>>>>>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't
>>>> read
>>>>>> well.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 118. "it is possible for the client to send at most one
>>>>>> accepted
>>>>>>>>>>>>>>>>>>> out-of-profile per connection before the rate-limiter
>>> kicks
>>>>>> in": If
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>> this, do we still need the Terminating flag in
>>>>>>>>>>>>>> PushTelemetryRequestV0?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin
>>>>>> configured on the
>>>>>>>>>>>>>>>>>>> broker, it will respond to
>>> GetTelemetrySubscriptionsRequest
>>>>>> with
>>>>>>>>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The
>>>>>> client
>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the
>>>>>> PushIntervalMs
>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled
>>> or
>>>>>> disabled
>>>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>>>>>> having to restart the broker or reset the client
>>>> connection."
>>>>>>>>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined
>>>> by
>>>>>> no
>>>>>>>>>>>>>> metric
>>>>>>>>>>>>>>>>>>> reporter implementing the ClientTelemetry interface,
>>> right?
>>>>>> In that
>>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>> it would be useful to avoid the clients sending
>>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve
>>>>>> the current
>>>>>>>>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and
>>>>>> PushTelemetryRequestV0:
>>>>>>>>>>>>>>>> Could
>>>>>>>>>>>>>>>>>>> we list error codes for each?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver
>>> This
>>>>>> method may
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> called from the request handling thread": Where else can
>>>>>> this method
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> called?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we
>>>>>> changing the
>>>>>>>>>>>>>>>> ACL?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Jun
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>>>>>>>>> Thanks for your question.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most
>>> likely
>>>>>> to be
>>>>>>>>>>>>>>>> caused by a
>>>>>>>>>>>>>>>>>>>> problem in the serialization in the client. I have
>>> changed
>>>>>> the
>>>>>>>>>>>>>> client
>>>>>>>>>>>>>>>>>>>> action in this case
>>>>>>>>>>>>>>>>>>>> to “Log an error and stop pushing metrics”.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have updated the KIP text accordingly.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
>>>>>>>>>>>>>>>> <milut...@confluent.io.INVALID>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>> Thanks for the clarifications.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> About 2b:
>>>>>>>>>>>>>>>>>>>>> In case a client has a bug while serializing, it might
>>> be
>>>>>> difficult
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> client to recover from that without code changes. In
>>>> that,
>>>>>> it might
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat
>>> the
>>>>>> error as
>>>>>>>>>>>>>>>> fatal
>>>>>>>>>>>>>>>>>>>>> for the client (only fatal in terms of sending the
>>>>>> metrics, the
>>>>>>>>>>>>>>>> client
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>> keep functioning otherwise). What do you think?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Milind
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
>>>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>>>>>>>>>>> Thanks for your questions about the KIP.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical
>>>>>> versions of the
>>>>>>>>>>>>>>>> KIP.
>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>> think this is
>>>>>>>>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push
>>>>>> interval. 30
>>>>>>>>>>>>>>>>>>>> minutes
>>>>>>>>>>>>>>>>>>>>>> is a mystery
>>>>>>>>>>>>>>>>>>>>>> to me. I’ve updated the KIP.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 2) I think there are two situations in which
>>>>>> INVALID_RECORD might
>>>>>>>>>>>>>>>> occur.
>>>>>>>>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type
>>> that
>>>>>> the
>>>>>>>>>>>>>> broker
>>>>>>>>>>>>>>>> does
>>>>>>>>>>>>>>>>>>>>>> not support.
>>>>>>>>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension,
>>> but
>>>>>> there’s
>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>>>>>>>>> to start with. Until we have multiple content-types,
>>>> this
>>>>>> seems
>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> scope. I think a
>>>>>>>>>>>>>>>>>>>>>> future KIP would add another error code for this.
>>>>>>>>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the
>>>>>> metrics
>>>>>>>>>>>>>>>> payload
>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> malformed.
>>>>>>>>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push
>>>> on
>>>>>> the push
>>>>>>>>>>>>>>>>>>>>>> interval seems
>>>>>>>>>>>>>>>>>>>>>> appropriate.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by
>>>> making
>>>>>> an
>>>>>>>>>>>>>>>> immediate
>>>>>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a
>>> client
>>>>>> bug or
>>>>>>>>>>>>>>>> perhaps
>>>>>>>>>>>>>>>>>>>>>> a situation in which a broker sends a compression type
>>>> in
>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse
>>>>>>>>>>>>>>>>>>>>>> which is subsequently not supported when its used
>>> with a
>>>>>>>>>>>>>>>>>>>>>> PushTelemetryRequest.
>>>>>>>>>>>>>>>>>>>>>> We do want the client to have the opportunity to get
>>> an
>>>>>> up-to-date
>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>>>>>>>>> compression types. I think an immediate
>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>>>>>>>>>>>>>> is appropriate.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a
>>>> Null
>>>>>>>>>>>>>>>>>>>>>> ClientInstanceId, the
>>>>>>>>>>>>>>>>>>>>>> receiving broker may not already know the client's
>>>>>> existing
>>>>>>>>>>>>>>>>>>>>>> ClientInstanceId. If the
>>>>>>>>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId,
>>> it
>>>>>> simply
>>>>>>>>>>>>>>>> responds
>>>>>>>>>>>>>>>>>>>>>> the existing
>>>>>>>>>>>>>>>>>>>>>> value back to the client. If it does not know the
>>>> existing
>>>>>>>>>>>>>>>>>>>>>> ClientInstanceId, it will create
>>>>>>>>>>>>>>>>>>>>>> a new client instance ID and respond with that.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I will update the KIP with these clarifications.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
>>>>>>>>>>>>>>>> <milut...@confluent.io.INVALID
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling"
>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are
>>> to
>>>>>>>>>>>>>> eventually
>>>>>>>>>>>>>>>>>>>>>> trigger
>>>>>>>>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the
>>>>>> cluster
>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by
>>>> operator
>>>>>> error,
>>>>>>>>>>>>>>>> rolling
>>>>>>>>>>>>>>>>>>>>>>> upgrades, etc."
>>>>>>>>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere
>>> else.
>>>>>> What is
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>> referring to?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 2. For the actual errors:
>>>>>>>>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a
>>>>>> warning to the
>>>>>>>>>>>>>>>>>>>>>>> application and schedule the next
>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>>>>>>>> to 5
>>>>>>>>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something
>>> like
>>>>>>>>>>>>>>>> PushIntervalMs?
>>>>>>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>>>> also, why are we scheduling a
>>>>>> GetTelemetrySubscriptionsRequest in
>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>> case, if the serialization is broken?
>>>>>>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID ,
>>> UNSUPPORTED_COMPRESSION_TYPE
>>>> :
>>>>>> just to
>>>>>>>>>>>>>>>>>>>> confirm,
>>>>>>>>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be
>>>>>> scheduled
>>>>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>>>>>>>>>>> after the PushTelemetry response, correct?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests
>>>>>> must include
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response,
>>>>>> regardless of
>>>>>>>>>>>>>>>> broker":
>>>>>>>>>>>>>>>>>>>>>>> Will a broker error be returned in case some
>>>>>> implementation of
>>>>>>>>>>>>>>>> this KIP
>>>>>>>>>>>>>>>>>>>>>>> violates this accidentally and sends a request with
>>>>>>>>>>>>>>>> ClientInstanceId =
>>>>>>>>>>>>>>>>>>>>>> Null
>>>>>>>>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new
>>>>>>>>>>>>>>>> ClientInstanceId be
>>>>>>>>>>>>>>>>>>>>>>> returned without an error?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>>>>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on
>>>>>> KIP-714: Client
>>>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>>>>>> and observability.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>
>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce
>>> the
>>>>>> scope.
>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>> overall
>>>>>>>>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is
>>>> unchanged,
>>>>>> but the
>>>>>>>>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics,
>>>> rather
>>>>>> than
>>>>>>>>>>>>>>>>>>>> introducing
>>>>>>>>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster
>>>>>> operators
>>>>>>>>>>>>>>>>>>>>>>>> investigate performance problems experienced by
>>>> clients
>>>>>> without
>>>>>>>>>>>>>>>>>>>>>> requiring
>>>>>>>>>>>>>>>>>>>>>>>> changes to the client application code or
>>>> configuration.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to