Hi Jun,
Thanks for the clarifications.

131. The client instance ids returned from 
KafkaStreams.clientInstanceIds(Duration) correspond to the
client_instance_id labels added by the broker to the metrics pushed from the 
clients. This should be
sufficient information to enable correlation between the metrics available in 
the client, and the metrics
pushed to the broker.

132. Yes, I see. I used JMX to look at the metrics on my broker and you’re 
entirely right. I will
remove the redundant metric from the KIP.

Thanks,
Andrew

> On 12 Oct 2023, at 20:12, Jun Rao <j...@confluent.io.INVALID> wrote:
>
> Hi, Andrew,
>
> Thanks for the reply.
>
> 131. Could we also document how one could correlate each client instance in
> KStreams with the labels for the metrics received by the brokers?
>
> 132. The documentation for RequestsPerSec is not complete. If you trace
> through how
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L71
> <https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L579>
> is
> implemented, it includes every API key tagged with the corresponding
> listener.
>
> Jun
>
> On Thu, Oct 12, 2023 at 11:42 AM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Jun,
>> Thanks for your comments.
>>
>> 130. As Matthias described, and I am adding to the KIP, the
>> `KafkaStreams#clientInstanceIds` method
>> is only permitted when the state is RUNNING or REBALANCING. Also, clients
>> can be added dynamically
>> so the maps might change over time. If it’s in a permitted state, the
>> method is prepared to wait up to the
>> supplied timeout to get the client instance ids. It does not return a
>> partial result - it returns a result or
>> fails.
>>
>> 131. I’ve refactored the `ClientsInstanceIds` object and the global
>> consumer is now part of the map
>> of consumers. There is no need for the Optional any longer. I’ve also
>> renamed it `ClientInstanceIds`.
>>
>> 132. My reading of
>> `(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*)` is that
>> It does not support every request type - it supports Produce,
>> FetchConsumer and FetchFollower.
>> Consequently, I think the ClientMetricsSubscriptionRequestCount is not
>> instantly obsolete.
>>
>> If I’ve misunderstood, please let me know.
>>
>> Thanks,
>> Andrew
>>
>>
>>> On 12 Oct 2023, at 01:07, Jun Rao <j...@confluent.io.INVALID> wrote:
>>>
>>> Hi, Andrew,
>>>
>>> Thanks for the updated KIP. Just a few more minor comments.
>>>
>>> 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for
>> all
>>> consumer/producer/adminClient instances to be initialized? Are all those
>>> instances created during KafkaStreams initialization?
>>>
>>> 131. Why does globalConsumerInstanceId() return Optional<String> while
>>> other consumer instances don't return Optional?
>>>
>>> 132. ClientMetricsSubscriptionRequestCount: Do we need this since we
>> have a
>>> set of generic metrics
>>> (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
>>> report Request rate for every request type?
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org>
>> wrote:
>>>
>>>> Thanks!
>>>>
>>>> On 10/10/23 11:31 PM, Andrew Schofield wrote:
>>>>> Matthias,
>>>>> Yes, I think that’s a sensible way forward and the interface you
>> propose
>>>> looks good. I’ll update the KIP accordingly.
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote:
>>>>>>
>>>>>> Andrew,
>>>>>>
>>>>>> yes I would like to get this change into KIP-714 right way. Seems to
>> be
>>>> important, as we don't know if/when a follow-up KIP for Kafka Streams
>> would
>>>> land.
>>>>>>
>>>>>> I was also thinking (and discussed with a few others) how to expose
>> it,
>>>> and we would propose the following:
>>>>>>
>>>>>> We add a new method to `KafkaStreams` class:
>>>>>>
>>>>>>   public ClientsInstanceIds clientsInstanceIds(Duration timeout);
>>>>>>
>>>>>> The returned object is like below:
>>>>>>
>>>>>> public class ClientsInstanceIds {
>>>>>>   // we only have a single admin client per KS instance
>>>>>>   String adminInstanceId();
>>>>>>
>>>>>>   // we only have a single global consumer per KS instance (if any)
>>>>>>   // Optional<> because we might not have global-thread
>>>>>>   Optional<String> globalConsumerInstanceId();
>>>>>>
>>>>>>   // return a <threadKey -> ClientInstanceId> mapping
>>>>>>   // for the underlying (restore-)consumers/producers
>>>>>>   Map<String, String> mainConsumerInstanceIds();
>>>>>>   Map<String, String> restoreConsumerInstanceIds();
>>>>>>   Map<String, String> producerInstanceIds();
>>>>>> }
>>>>>>
>>>>>> For the `threadKey`, we would use some pattern like this:
>>>>>>
>>>>>> [Stream|StateUpdater]Thread-<threadIdx>
>>>>>>
>>>>>>
>>>>>> Would this work from your POV?
>>>>>>
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>> On 10/9/23 2:15 AM, Andrew Schofield wrote:
>>>>>>> Hi Matthias,
>>>>>>> Good point. Makes sense to me.
>>>>>>> Is this something that can also be included in the proposed Kafka
>>>> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
>>>>>>> I have a slight preference for the former to put all of the KS
>>>> enhancements into a separate KIP.
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Thanks Andrew. SGTM.
>>>>>>>>
>>>>>>>> One point you did not address is the idea to add a method to
>>>> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be
>>>> added to consumer/producer/admin clients.
>>>>>>>>
>>>>>>>> Without addressing this, Kafka Streams users won't have a way to get
>>>> the assigned `instanceId` of the internally created clients, and thus it
>>>> would be very difficult for them to know which metrics that the broker
>>>> receives belong to a Kafka Streams app. It seems they would only find
>> the
>>>> `instanceIds` in the log4j output if they enable client logging?
>>>>>>>>
>>>>>>>> Of course, because there is multiple clients inside Kafka Streams,
>>>> the return type cannot be an single "String", but must be some some
>> complex
>>>> data structure -- we could either add a new class, or return a
>>>> Map<String,String> using a client key that maps to the `instanceId`.
>>>>>>>>
>>>>>>>> For example we could use the following key:
>>>>>>>>
>>>>>>>>  [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]
>>>>>>>>
>>>>>>>> (Of course, only the valid combination.)
>>>>>>>>
>>>>>>>> Or maybe even better, we might want to return a `Future` because
>>>> collection all the `instanceId` might be a blocking all on each client?
>> I
>>>> have already a few idea how it could be implemented but I don't think it
>>>> must be discussed on the KIP, as it's an implementation detail.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
>>>>>>>>> Hi Matthias,
>>>>>>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka
>>>> Streams makes sense. This KIP currently has made a bit
>>>>>>>>> of an effort to embrace KS, but it’s not enough by a long way.
>>>>>>>>> I have removed `application.id <http://application.id/>`. This
>>>> should be done properly in the follow-up KIP. I don’t believe there’s a
>>>> downside to
>>>>>>>>> removing it from this KIP.
>>>>>>>>> I have reworded the statement about temporarily. In practice, the
>>>> implementation of this KIP that’s going on while the voting
>>>>>>>>> progresses happens to use delta temporality, but that’s an
>>>> implementation detail. Supporting clients must support both
>>>>>>>>> temporalities.
>>>>>>>>> I thought about exposing the client instance ID as a metric, but
>>>> non-numeric metrics are not usual practice and tools
>>>>>>>>> do not universally support them. I don’t think the KIP is improved
>>>> by adding one now.
>>>>>>>>> I have also added constants for the various Config classes for
>>>> ENABLE_METRICS_PUSH_CONFIG, including to
>>>>>>>>> StreamsConfig. It’s best to be explicit about this.
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org>
>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I did not pay attention to this KIP in the past; seems it was
>>>> on-hold for a while.
>>>>>>>>>>
>>>>>>>>>> Overall it sounds very useful, and I think we should extend this
>>>> with a follow up KIP for Kafka Streams. What is unclear to me at this
>> point
>>>> is the statement:
>>>>>>>>>>
>>>>>>>>>>> Kafka Streams applications have an application.id configured and
>>>> this identifier should be included as the application_id metrics label.
>>>>>>>>>>
>>>>>>>>>> The `application.id` is currently only used as the (main)
>>>> consumer's `group.id` (and is part of an auto-generated `client.id` if
>>>> the user does not set one).
>>>>>>>>>>
>>>>>>>>>> This comment related to:
>>>>>>>>>>
>>>>>>>>>>> The following labels should be added by the client as appropriate
>>>> before metrics are pushed.
>>>>>>>>>>
>>>>>>>>>> Given that Kafka Streams uses the consumer/producer/admin client
>> as
>>>> "black boxes", a client does at this point not know that it's part of a
>>>> Kafka Streams application, and thus, it won't be able to attach any such
>>>> label to the metrics it sends. (Also producer and admin don't even know
>> the
>>>> value of `application.id` -- only the (main) consumer, indirectly via `
>>>> group.id`, but also restore and global consumer don't know it, because
>>>> they don't have `group.id` set).
>>>>>>>>>>
>>>>>>>>>> While I am totally in favor of the proposal, I am wondering how we
>>>> intent to implement it in clean way? Or would we do ok to have some
>>>> internal client APIs that KS can use to "register" itself with the
>> client?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> While clients must support both temporalities, the broker will
>>>> initially only send
>> GetTelemetrySubscriptionsResponse.DeltaTemporality=True
>>>>>>>>>>
>>>>>>>>>> Not sure if I can follow. How make the decision about DELTA or
>>>> CUMULATIVE metrics? Should the broker side plugin not decide what
>> metrics
>>>> it what to receive in which form? So what does "initially" mean -- the
>>>> broker won't ship with a default plugin implementation?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> The following method is added to the Producer, Consumer, and
>> Admin
>>>> client interfaces:
>>>>>>>>>>
>>>>>>>>>> Should we add anything to Kafka Streams to expose the underlying
>>>> clients' assigned client-instance-ids programmatically? I am also
>> wondering
>>>> if clients should report their assigned client-instance-ids as metrics
>>>> itself (for this case, Kafka Streams won't need to do anything, because
>> we
>>>> already expose all client metrics).
>>>>>>>>>>
>>>>>>>>>> If we add anything programmatic, we need to make it simple, given
>>>> that Kafka Streams has many clients per `StreamThread` and may have
>>>> multiple threads.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> enable.metrics.push
>>>>>>>>>> It might be worth to add this to `StreamsConfig`, too? It set via
>>>> StreamsConfig, we would forward it to all clients automatically.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 9/29/23 5:45 PM, David Jacot wrote:
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>> Thanks for driving this one. I haven't read all the KIP yet but I
>>>> already
>>>>>>>>>>> have an initial question. In the Threading section, it is written
>>>>>>>>>>> "KafkaConsumer: the "background" thread (based on the consumer
>>>> threading
>>>>>>>>>>> refactor which is underway)". If I understand this correctly, it
>>>> means
>>>>>>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I
>>>> correct?
>>>>>>>>>>> Cheers,
>>>>>>>>>>> David
>>>>>>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>> Hi Philip,
>>>>>>>>>>>> No, I do not think it should actively search for a broker that
>>>> supports
>>>>>>>>>>>> the new
>>>>>>>>>>>> RPCs. In general, either all of the brokers or none of the
>>>> brokers will
>>>>>>>>>>>> support it.
>>>>>>>>>>>> In the window, where the cluster is being upgraded or client
>>>> telemetry is
>>>>>>>>>>>> being
>>>>>>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too
>>>> much effort
>>>>>>>>>>>> into
>>>>>>>>>>>> this mixed scenario. As the client finds brokers which support
>>>> the new
>>>>>>>>>>>> RPCs,
>>>>>>>>>>>> it can begin to follow the KIP-714 mechanism.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Andrew
>>>>>>>>>>>>
>>>>>>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com>
>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew -
>>>>>>>>>>>>>
>>>>>>>>>>>>> Question on top of your answers: Do you think the client should
>>>> actively
>>>>>>>>>>>>> search for a broker that supports this RPC? As previously
>>>> mentioned, the
>>>>>>>>>>>>> broker uses the leastLoadedNode to find its first connection
>> (am
>>>>>>>>>>>>> I correct?), and what if that broker doesn't support the metric
>>>> push?
>>>>>>>>>>>>>
>>>>>>>>>>>>> P
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Kirk,
>>>>>>>>>>>>>> Thanks for your question. You are correct that the presence or
>>>> absence
>>>>>>>>>>>> of
>>>>>>>>>>>>>> the new RPCs in the
>>>>>>>>>>>>>> ApiVersionsResponse tells the client whether to request the
>>>> telemetry
>>>>>>>>>>>>>> subscriptions and push
>>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is of course tricky in practice. It would be conceivable,
>>>> as a
>>>>>>>>>>>>>> cluster is upgraded to AK 3.7
>>>>>>>>>>>>>> or as a client metrics receiver plugin is deployed across the
>>>> cluster,
>>>>>>>>>>>>>> that a client connects to some
>>>>>>>>>>>>>> brokers that support the new RPCs and some that do not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here’s my suggestion:
>>>>>>>>>>>>>> * If a client is not connected to any brokers that support in
>>>> the new
>>>>>>>>>>>>>> RPCs, it cannot push metrics.
>>>>>>>>>>>>>> * If a client is only connected to brokers that support the
>> new
>>>> RPCs, it
>>>>>>>>>>>>>> will use the new RPCs in
>>>>>>>>>>>>>> accordance with the KIP.
>>>>>>>>>>>>>> * If a client is connected to some brokers that support the
>> new
>>>> RPCs and
>>>>>>>>>>>>>> some that do not, it will
>>>>>>>>>>>>>> use the new RPCs with the supporting subset of brokers in
>>>> accordance
>>>>>>>>>>>> with
>>>>>>>>>>>>>> the KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Comments?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro>
>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew/Jun,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I want to make sure I understand question/comment #119… In
>> the
>>>> case
>>>>>>>>>>>>>> where a cluster without a metrics client receiver is later
>>>> reconfigured
>>>>>>>>>>>> and
>>>>>>>>>>>>>> restarted to include a metrics client receiver, do we want the
>>>> client to
>>>>>>>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s
>>>> response
>>>>>>>>>>>> to
>>>>>>>>>>>>>> question #119, it sounds like we’re using the presence/absence
>>>> of the
>>>>>>>>>>>>>> relevant RPCs in ApiVersionsResponse as the
>>>> to-push-or-not-to-push
>>>>>>>>>>>>>> indicator. Do I have that correct?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Kirk
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Jun,
>>>>>>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify
>>>> where
>>>>>>>>>>>>>> necessary.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and
>> the
>>>>>>>>>>>>>> replacement of
>>>>>>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP
>>>> guidelines. I
>>>>>>>>>>>>>> think it’s a bit
>>>>>>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a
>>>> namespace
>>>>>>>>>>>> and a
>>>>>>>>>>>>>>>> multi-word component. If it was “client.id” then “client”
>>>> would be a
>>>>>>>>>>>>>> namespace with
>>>>>>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it
>>>> was
>>>>>>>>>>>>>> intentional, but debatable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 112. Thanks. The link target moved. Fixed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 113. Thanks. Fixed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it
>>>> should use
>>>>>>>>>>>> the
>>>>>>>>>>>>>> exact same
>>>>>>>>>>>>>>>> name. If a standard metric doesn’t make sense for a client,
>>>> then it
>>>>>>>>>>>> can
>>>>>>>>>>>>>> omit that metric.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For a required metric, the situation is stronger. All
>> clients
>>>> must
>>>>>>>>>>>>>> implement these
>>>>>>>>>>>>>>>> metrics with these names in order to implement the KIP. But
>>>> the
>>>>>>>>>>>>>> required metrics
>>>>>>>>>>>>>>>> are essentially the number of connections and the request
>>>> latency,
>>>>>>>>>>>>>> which do not
>>>>>>>>>>>>>>>> reference the underlying implementation of the client (which
>>>>>>>>>>>>>> producer.record.queue.time.max
>>>>>>>>>>>>>>>> of course does).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I suppose someone might build a producer-only client that
>>>> didn’t have
>>>>>>>>>>>>>> consumer metrics.
>>>>>>>>>>>>>>>> In this case, the consumer metrics would conceptually have
>>>> the value 0
>>>>>>>>>>>>>> and would not
>>>>>>>>>>>>>>>> need to be sent to the broker.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 114.2. If a client does not implement some metrics, they
>> will
>>>> not be
>>>>>>>>>>>>>> available for
>>>>>>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to
>>>> combine
>>>>>>>>>>>>>> metrics from lots
>>>>>>>>>>>>>>>> different clients less complete.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 115. I think it was probably a mistake to be so specific
>> about
>>>>>>>>>>>>>> threading in this KIP.
>>>>>>>>>>>>>>>> When the consumer threading refactor is complete, of course,
>>>> it would
>>>>>>>>>>>>>> do the appropriate
>>>>>>>>>>>>>>>> equivalent. I’ve added a clarification and massively
>>>> simplified this
>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 116. I removed “client.terminating”.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 117. Yes. Horrid. Fixed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 118. The Terminating flag just indicates that this is the
>>>> final
>>>>>>>>>>>>>> PushTelemetryRequest
>>>>>>>>>>>>>>>> from this client. Any subsequent request will be rejected. I
>>>> think
>>>>>>>>>>>> this
>>>>>>>>>>>>>> flag should remain.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 119. Good catch. This was actually contradicting another
>> part
>>>> of the
>>>>>>>>>>>>>> KIP. The current behaviour
>>>>>>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client
>>>> metrics
>>>>>>>>>>>>>> receiver plugin, the new RPCs
>>>>>>>>>>>>>>>> in this KIP are “turned off” and not reported in
>>>> ApiVersionsResponse.
>>>>>>>>>>>>>> The client will not
>>>>>>>>>>>>>>>> attempt to push metrics.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 120. The error handling table lists the error codes for
>>>>>>>>>>>>>> PushTelemetryResponse. I’ve added one
>>>>>>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t
>>>> have any
>>>>>>>>>>>>>> error codes, since the
>>>>>>>>>>>>>>>> situation in which the client telemetry is not supported is
>>>> handled by
>>>>>>>>>>>>>> the RPCs not being offered
>>>>>>>>>>>>>>>> by the broker.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific
>>>> about
>>>>>>>>>>>>>> threading. Removed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation
>>>> should be
>>>>>>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation
>>>> should be
>>>>>>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the
>>>> CLUSTER
>>>>>>>>>>>>>>>> resource.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the detailed review.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 110. Another potential motivation is the multiple clients
>>>> support.
>>>>>>>>>>>>>> Some of
>>>>>>>>>>>>>>>>> the places may not have good monitoring support for
>> non-java
>>>> clients.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for
>>>> metric name
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the
>>>> inconsistency?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 113. "Defining standard and required metrics makes the
>>>> monitoring and
>>>>>>>>>>>>>>>>> troubleshooting of clients from various client types ":
>>>> Incomplete
>>>>>>>>>>>>>> sentence.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 114. standard/required metrics
>>>>>>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with
>>>> the exact
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>> names?
>>>>>>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing
>> from
>>>> a
>>>>>>>>>>>> client?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application
>>>> threads":
>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading
>>>> model (
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>> Once this is done, PRC requests will only be made from the
>>>> background
>>>>>>>>>>>>>>>>> thread. Should this KIP follow the new model only?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 116. 'The metrics should contain the reason for the client
>>>>>>>>>>>> termination
>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>> including the client.terminating metric with the label
>>>> “reason” ...'.
>>>>>>>>>>>>>> Hmm,
>>>>>>>>>>>>>>>>> are we introducing a new metric client.terminating? If so,
>>>> that needs
>>>>>>>>>>>>>> to be
>>>>>>>>>>>>>>>>> explicitly listed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional
>>>> metrics on top
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> this the generic metrics receiver in the broker will not
>> add
>>>> these
>>>>>>>>>>>>>> labels
>>>>>>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't
>> read
>>>> well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 118. "it is possible for the client to send at most one
>>>> accepted
>>>>>>>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks
>>>> in": If
>>>>>>>>>>>> we
>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>> this, do we still need the Terminating flag in
>>>>>>>>>>>> PushTelemetryRequestV0?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin
>>>> configured on the
>>>>>>>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest
>>>> with
>>>>>>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The
>>>> client
>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the
>>>> PushIntervalMs
>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or
>>>> disabled
>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>>>> having to restart the broker or reset the client
>> connection."
>>>>>>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined
>> by
>>>> no
>>>>>>>>>>>> metric
>>>>>>>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right?
>>>> In that
>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>> it would be useful to avoid the clients sending
>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve
>>>> the current
>>>>>>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and
>>>> PushTelemetryRequestV0:
>>>>>>>>>>>>>> Could
>>>>>>>>>>>>>>>>> we list error codes for each?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This
>>>> method may
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> called from the request handling thread": Where else can
>>>> this method
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> called?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we
>>>> changing the
>>>>>>>>>>>>>> ACL?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jun
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>>>>>>> Thanks for your question.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely
>>>> to be
>>>>>>>>>>>>>> caused by a
>>>>>>>>>>>>>>>>>> problem in the serialization in the client. I have changed
>>>> the
>>>>>>>>>>>> client
>>>>>>>>>>>>>>>>>> action in this case
>>>>>>>>>>>>>>>>>> to “Log an error and stop pushing metrics”.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have updated the KIP text accordingly.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
>>>>>>>>>>>>>> <milut...@confluent.io.INVALID>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>> Thanks for the clarifications.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> About 2b:
>>>>>>>>>>>>>>>>>>> In case a client has a bug while serializing, it might be
>>>> difficult
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> client to recover from that without code changes. In
>> that,
>>>> it might
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the
>>>> error as
>>>>>>>>>>>>>> fatal
>>>>>>>>>>>>>>>>>>> for the client (only fatal in terms of sending the
>>>> metrics, the
>>>>>>>>>>>>>> client
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> keep functioning otherwise). What do you think?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Milind
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>>>>>>>>> Thanks for your questions about the KIP.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical
>>>> versions of the
>>>>>>>>>>>>>> KIP.
>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>> think this is
>>>>>>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push
>>>> interval. 30
>>>>>>>>>>>>>>>>>> minutes
>>>>>>>>>>>>>>>>>>>> is a mystery
>>>>>>>>>>>>>>>>>>>> to me. I’ve updated the KIP.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2) I think there are two situations in which
>>>> INVALID_RECORD might
>>>>>>>>>>>>>> occur.
>>>>>>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that
>>>> the
>>>>>>>>>>>> broker
>>>>>>>>>>>>>> does
>>>>>>>>>>>>>>>>>>>> not support.
>>>>>>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but
>>>> there’s
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>>>>>>> to start with. Until we have multiple content-types,
>> this
>>>> seems
>>>>>>>>>>>> out
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> scope. I think a
>>>>>>>>>>>>>>>>>>>> future KIP would add another error code for this.
>>>>>>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the
>>>> metrics
>>>>>>>>>>>>>> payload
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> malformed.
>>>>>>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push
>> on
>>>> the push
>>>>>>>>>>>>>>>>>>>> interval seems
>>>>>>>>>>>>>>>>>>>> appropriate.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by
>> making
>>>> an
>>>>>>>>>>>>>> immediate
>>>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client
>>>> bug or
>>>>>>>>>>>>>> perhaps
>>>>>>>>>>>>>>>>>>>> a situation in which a broker sends a compression type
>> in
>>>> a
>>>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse
>>>>>>>>>>>>>>>>>>>> which is subsequently not supported when its used with a
>>>>>>>>>>>>>>>>>>>> PushTelemetryRequest.
>>>>>>>>>>>>>>>>>>>> We do want the client to have the opportunity to get an
>>>> up-to-date
>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>>>>>>> compression types. I think an immediate
>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>>>>>>>>>>>> is appropriate.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a
>> Null
>>>>>>>>>>>>>>>>>>>> ClientInstanceId, the
>>>>>>>>>>>>>>>>>>>> receiving broker may not already know the client's
>>>> existing
>>>>>>>>>>>>>>>>>>>> ClientInstanceId. If the
>>>>>>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it
>>>> simply
>>>>>>>>>>>>>> responds
>>>>>>>>>>>>>>>>>>>> the existing
>>>>>>>>>>>>>>>>>>>> value back to the client. If it does not know the
>> existing
>>>>>>>>>>>>>>>>>>>> ClientInstanceId, it will create
>>>>>>>>>>>>>>>>>>>> a new client instance ID and respond with that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I will update the KIP with these clarifications.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
>>>>>>>>>>>>>> <milut...@confluent.io.INVALID
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling"
>>>> section.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to
>>>>>>>>>>>> eventually
>>>>>>>>>>>>>>>>>>>> trigger
>>>>>>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the
>>>> cluster
>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by
>> operator
>>>> error,
>>>>>>>>>>>>>> rolling
>>>>>>>>>>>>>>>>>>>>> upgrades, etc."
>>>>>>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else.
>>>> What is
>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>> referring to?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2. For the actual errors:
>>>>>>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a
>>>> warning to the
>>>>>>>>>>>>>>>>>>>>> application and schedule the next
>>>>>>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>>>>>> to 5
>>>>>>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like
>>>>>>>>>>>>>> PushIntervalMs?
>>>>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>>>> also, why are we scheduling a
>>>> GetTelemetrySubscriptionsRequest in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> case, if the serialization is broken?
>>>>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE
>> :
>>>> just to
>>>>>>>>>>>>>>>>>> confirm,
>>>>>>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be
>>>> scheduled
>>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>>>>>>>>> after the PushTelemetry response, correct?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests
>>>> must include
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response,
>>>> regardless of
>>>>>>>>>>>>>> broker":
>>>>>>>>>>>>>>>>>>>>> Will a broker error be returned in case some
>>>> implementation of
>>>>>>>>>>>>>> this KIP
>>>>>>>>>>>>>>>>>>>>> violates this accidentally and sends a request with
>>>>>>>>>>>>>> ClientInstanceId =
>>>>>>>>>>>>>>>>>>>> Null
>>>>>>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new
>>>>>>>>>>>>>> ClientInstanceId be
>>>>>>>>>>>>>>>>>>>>> returned without an error?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on
>>>> KIP-714: Client
>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>>>> and observability.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the
>>>> scope.
>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>> overall
>>>>>>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is
>> unchanged,
>>>> but the
>>>>>>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics,
>> rather
>>>> than
>>>>>>>>>>>>>>>>>> introducing
>>>>>>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster
>>>> operators
>>>>>>>>>>>>>>>>>>>>>> investigate performance problems experienced by
>> clients
>>>> without
>>>>>>>>>>>>>>>>>>>> requiring
>>>>>>>>>>>>>>>>>>>>>> changes to the client application code or
>> configuration.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Andrew


Reply via email to