Hi Jun,
Thanks for your comments.

130. As Matthias described, and as I am adding to the KIP, the
`KafkaStreams#clientInstanceIds` method is only permitted when the state is
RUNNING or REBALANCING. Also, clients can be added dynamically, so the maps
might change over time. If the instance is in a permitted state, the method is
prepared to wait up to the supplied timeout to collect the client instance ids.
It does not return a partial result - it either returns a complete result or
fails.
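
For illustration, here is a rough sketch of how an application might use the
method once the KIP is implemented. The accessor names on the returned object
are my assumption of the refactored shape described in 131 below, not final API:

    import java.time.Duration;
    import org.apache.kafka.streams.KafkaStreams;
    // ClientInstanceIds is the new class proposed in this thread (package TBD).

    public class ClientInstanceIdsExample {
        // Sketch only - clientInstanceIds() and its accessors are proposed API.
        public static void logClientInstanceIds(KafkaStreams streams) {
            try {
                // Only permitted while RUNNING or REBALANCING; waits up to the
                // timeout and either returns a complete result or throws.
                ClientInstanceIds ids = streams.clientInstanceIds(Duration.ofSeconds(30));
                System.out.println("admin: " + ids.adminInstanceId());
                ids.consumerInstanceIds().forEach((threadKey, id) ->
                    System.out.println("consumer " + threadKey + " -> " + id));
                ids.producerInstanceIds().forEach((threadKey, id) ->
                    System.out.println("producer " + threadKey + " -> " + id));
            } catch (Exception e) {
                // Not in a permitted state, or the timeout expired before all
                // client instance ids could be collected.
                System.err.println("Could not collect client instance ids: " + e);
            }
        }
    }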

131. I’ve refactored the `ClientsInstanceIds` object so that the global consumer
is now part of the map of consumers; there is no need for the Optional any
longer. I’ve also renamed it `ClientInstanceIds`.
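
To make the new shape concrete, it is roughly along these lines. The names are
illustrative and the id type follows Matthias’s earlier sketch quoted below;
the KIP text is authoritative:

    import java.util.Map;

    // Sketch of the refactored ClientInstanceIds (illustrative, not final).
    public interface ClientInstanceIds {
        // the single admin client of the Kafka Streams instance
        String adminInstanceId();

        // all consumers keyed by a thread/client key, with the global
        // consumer included in the map, so no Optional is needed
        Map<String, String> consumerInstanceIds();

        // producers keyed by a thread/client key
        Map<String, String> producerInstanceIds();
    }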

132. My reading of
`kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*` is that it
does not support every request type - it supports Produce, FetchConsumer and
FetchFollower. Consequently, I think ClientMetricsSubscriptionRequestCount is
not instantly obsolete.

If I’ve misunderstood, please let me know.
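
In case it is useful, here is a quick way to double-check which request types a
broker actually reports under that metric. This is just a generic JMX query
sketch; the host and port are placeholders for the broker's JMX endpoint:

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ListRequestsPerSecTypes {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // The wildcard matches one MBean per request type the broker reports.
                Set<ObjectName> names = mbs.queryNames(new ObjectName(
                    "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*"),
                    null);
                names.forEach(n -> System.out.println(n.getKeyProperty("request")));
            }
        }
    }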

Thanks,
Andrew


> On 12 Oct 2023, at 01:07, Jun Rao <j...@confluent.io.INVALID> wrote:
>
> Hi, Andrew,
>
> Thanks for the updated KIP. Just a few more minor comments.
>
> 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for all
> consumer/producer/adminClient instances to be initialized? Are all those
> instances created during KafkaStreams initialization?
>
> 131. Why does globalConsumerInstanceId() return Optional<String> while
> other consumer instances don't return Optional?
>
> 132. ClientMetricsSubscriptionRequestCount: Do we need this since we have a
> set of generic metrics
> (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
> report Request rate for every request type?
>
> Thanks,
>
> Jun
>
> On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org> wrote:
>
>> Thanks!
>>
>> On 10/10/23 11:31 PM, Andrew Schofield wrote:
>>> Matthias,
>>> Yes, I think that’s a sensible way forward and the interface you propose
>> looks good. I’ll update the KIP accordingly.
>>>
>>> Thanks,
>>> Andrew
>>>
>>>> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote:
>>>>
>>>> Andrew,
>>>>
>>>> yes I would like to get this change into KIP-714 right away. Seems to be
>> important, as we don't know if/when a follow-up KIP for Kafka Streams would
>> land.
>>>>
>>>> I was also thinking (and discussed with a few others) how to expose it,
>> and we would propose the following:
>>>>
>>>> We add a new method to `KafkaStreams` class:
>>>>
>>>>    public ClientsInstanceIds clientsInstanceIds(Duration timeout);
>>>>
>>>> The returned object is like below:
>>>>
>>>>  public class ClientsInstanceIds {
>>>>    // we only have a single admin client per KS instance
>>>>    String adminInstanceId();
>>>>
>>>>    // we only have a single global consumer per KS instance (if any)
>>>>    // Optional<> because we might not have global-thread
>>>>    Optional<String> globalConsumerInstanceId();
>>>>
>>>>    // return a <threadKey -> ClientInstanceId> mapping
>>>>    // for the underlying (restore-)consumers/producers
>>>>    Map<String, String> mainConsumerInstanceIds();
>>>>    Map<String, String> restoreConsumerInstanceIds();
>>>>    Map<String, String> producerInstanceIds();
>>>> }
>>>>
>>>> For the `threadKey`, we would use some pattern like this:
>>>>
>>>>  [Stream|StateUpdater]Thread-<threadIdx>
>>>>
>>>>
>>>> Would this work from your POV?
>>>>
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 10/9/23 2:15 AM, Andrew Schofield wrote:
>>>>> Hi Matthias,
>>>>> Good point. Makes sense to me.
>>>>> Is this something that can also be included in the proposed Kafka
>> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
>>>>> I have a slight preference for the former to put all of the KS
>> enhancements into a separate KIP.
>>>>> Thanks,
>>>>> Andrew
>>>>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote:
>>>>>>
>>>>>> Thanks Andrew. SGTM.
>>>>>>
>>>>>> One point you did not address is the idea to add a method to
>> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be
>> added to consumer/producer/admin clients.
>>>>>>
>>>>>> Without addressing this, Kafka Streams users won't have a way to get
>> the assigned `instanceId` of the internally created clients, and thus it
>> would be very difficult for them to know which metrics that the broker
>> receives belong to a Kafka Streams app. It seems they would only find the
>> `instanceIds` in the log4j output if they enable client logging?
>>>>>>
>>>>>> Of course, because there are multiple clients inside Kafka Streams,
>> the return type cannot be a single "String", but must be some complex
>> data structure -- we could either add a new class, or return a
>> Map<String,String> using a client key that maps to the `instanceId`.
>>>>>>
>>>>>> For example we could use the following key:
>>>>>>
>>>>>>   [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]
>>>>>>
>>>>>> (Of course, only the valid combination.)
>>>>>>
>>>>>> Or maybe even better, we might want to return a `Future` because
>> collecting all the `instanceId`s might be a blocking call on each client? I
>> already have a few ideas how it could be implemented, but I don't think it
>> must be discussed in the KIP, as it's an implementation detail.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
>>>>>>> Hi Matthias,
>>>>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka
>> Streams makes sense. This KIP currently has made a bit
>>>>>>> of an effort to embrace KS, but it’s not enough by a long way.
>>>>>>> I have removed `application.id`. This
>> should be done properly in the follow-up KIP. I don’t believe there’s a
>> downside to
>>>>>>> removing it from this KIP.
>>>>>>> I have reworded the statement about temporality. In practice, the
>> implementation of this KIP that’s going on while the voting
>>>>>>> progresses happens to use delta temporality, but that’s an
>> implementation detail. Supporting clients must support both
>>>>>>> temporalities.
>>>>>>> I thought about exposing the client instance ID as a metric, but
>> non-numeric metrics are not usual practice and tools
>>>>>>> do not universally support them. I don’t think the KIP is improved
>> by adding one now.
>>>>>>> I have also added constants for the various Config classes for
>> ENABLE_METRICS_PUSH_CONFIG, including to
>>>>>>> StreamsConfig. It’s best to be explicit about this.
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I did not pay attention to this KIP in the past; seems it was
>> on-hold for a while.
>>>>>>>>
>>>>>>>> Overall it sounds very useful, and I think we should extend this
>> with a follow up KIP for Kafka Streams. What is unclear to me at this point
>> is the statement:
>>>>>>>>
>>>>>>>>> Kafka Streams applications have an application.id configured and
>> this identifier should be included as the application_id metrics label.
>>>>>>>>
>>>>>>>> The `application.id` is currently only used as the (main)
>> consumer's `group.id` (and is part of an auto-generated `client.id` if
>> the user does not set one).
>>>>>>>>
>>>>>>>> This comment is related to:
>>>>>>>>
>>>>>>>>> The following labels should be added by the client as appropriate
>> before metrics are pushed.
>>>>>>>>
>>>>>>>> Given that Kafka Streams uses the consumer/producer/admin client as
>> "black boxes", a client does at this point not know that it's part of a
>> Kafka Streams application, and thus, it won't be able to attach any such
>> label to the metrics it sends. (Also producer and admin don't even know the
>> value of `application.id` -- only the (main) consumer, indirectly via `
>> group.id`, but also restore and global consumer don't know it, because
>> they don't have `group.id` set).
>>>>>>>>
>>>>>>>> While I am totally in favor of the proposal, I am wondering how we
>> intend to implement it in a clean way? Or would it be ok to have some
>> internal client APIs that KS can use to "register" itself with the client?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> While clients must support both temporalities, the broker will
>> initially only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True
>>>>>>>>
>>>>>>>> Not sure if I can follow. How is the decision made about DELTA or
>> CUMULATIVE metrics? Should the broker-side plugin not decide which metrics
>> it wants to receive in which form? So what does "initially" mean -- the
>> broker won't ship with a default plugin implementation?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> The following method is added to the Producer, Consumer, and Admin
>> client interfaces:
>>>>>>>>
>>>>>>>> Should we add anything to Kafka Streams to expose the underlying
>> clients' assigned client-instance-ids programmatically? I am also wondering
>> if clients should report their assigned client-instance-ids as metrics
>> itself (for this case, Kafka Streams won't need to do anything, because we
>> already expose all client metrics).
>>>>>>>>
>>>>>>>> If we add anything programmatic, we need to make it simple, given
>> that Kafka Streams has many clients per `StreamThread` and may have
>> multiple threads.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> enable.metrics.push
>>>>>>>> It might be worth adding this to `StreamsConfig`, too? If set via
>> StreamsConfig, we would forward it to all clients automatically.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/29/23 5:45 PM, David Jacot wrote:
>>>>>>>>> Hi Andrew,
>>>>>>>>> Thanks for driving this one. I haven't read all the KIP yet but I
>> already
>>>>>>>>> have an initial question. In the Threading section, it is written
>>>>>>>>> "KafkaConsumer: the "background" thread (based on the consumer
>> threading
>>>>>>>>> refactor which is underway)". If I understand this correctly, it
>> means
>>>>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I
>> correct?
>>>>>>>>> Cheers,
>>>>>>>>> David
>>>>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>> Hi Philip,
>>>>>>>>>> No, I do not think it should actively search for a broker that
>> supports
>>>>>>>>>> the new
>>>>>>>>>> RPCs. In general, either all of the brokers or none of the
>> brokers will
>>>>>>>>>> support it.
>>>>>>>>>> In the window, where the cluster is being upgraded or client
>> telemetry is
>>>>>>>>>> being
>>>>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too
>> much effort
>>>>>>>>>> into
>>>>>>>>>> this mixed scenario. As the client finds brokers which support
>> the new
>>>>>>>>>> RPCs,
>>>>>>>>>> it can begin to follow the KIP-714 mechanism.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com>
>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Andrew -
>>>>>>>>>>>
>>>>>>>>>>> Question on top of your answers: Do you think the client should
>> actively
>>>>>>>>>>> search for a broker that supports this RPC? As previously
>> mentioned, the
>>>>>>>>>>> broker uses the leastLoadedNode to find its first connection (am
>>>>>>>>>>> I correct?), and what if that broker doesn't support the metric
>> push?
>>>>>>>>>>>
>>>>>>>>>>> P
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Kirk,
>>>>>>>>>>>> Thanks for your question. You are correct that the presence or
>> absence
>>>>>>>>>> of
>>>>>>>>>>>> the new RPCs in the
>>>>>>>>>>>> ApiVersionsResponse tells the client whether to request the
>> telemetry
>>>>>>>>>>>> subscriptions and push
>>>>>>>>>>>> metrics.
>>>>>>>>>>>>
>>>>>>>>>>>> This is of course tricky in practice. It would be conceivable,
>> as a
>>>>>>>>>>>> cluster is upgraded to AK 3.7
>>>>>>>>>>>> or as a client metrics receiver plugin is deployed across the
>> cluster,
>>>>>>>>>>>> that a client connects to some
>>>>>>>>>>>> brokers that support the new RPCs and some that do not.
>>>>>>>>>>>>
>>>>>>>>>>>> Here’s my suggestion:
>>>>>>>>>>>> * If a client is not connected to any brokers that support
>> the new
>>>>>>>>>>>> RPCs, it cannot push metrics.
>>>>>>>>>>>> * If a client is only connected to brokers that support the new
>> RPCs, it
>>>>>>>>>>>> will use the new RPCs in
>>>>>>>>>>>> accordance with the KIP.
>>>>>>>>>>>> * If a client is connected to some brokers that support the new
>> RPCs and
>>>>>>>>>>>> some that do not, it will
>>>>>>>>>>>> use the new RPCs with the supporting subset of brokers in
>> accordance
>>>>>>>>>> with
>>>>>>>>>>>> the KIP.
>>>>>>>>>>>>
>>>>>>>>>>>> Comments?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Andrew
>>>>>>>>>>>>
>>>>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew/Jun,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I want to make sure I understand question/comment #119… In the
>> case
>>>>>>>>>>>> where a cluster without a metrics client receiver is later
>> reconfigured
>>>>>>>>>> and
>>>>>>>>>>>> restarted to include a metrics client receiver, do we want the
>> client to
>>>>>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s
>> response
>>>>>>>>>> to
>>>>>>>>>>>> question #119, it sounds like we’re using the presence/absence
>> of the
>>>>>>>>>>>> relevant RPCs in ApiVersionsResponse as the
>> to-push-or-not-to-push
>>>>>>>>>>>> indicator. Do I have that correct?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Kirk
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Jun,
>>>>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify
>> where
>>>>>>>>>>>> necessary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
>>>>>>>>>>>> replacement of
>>>>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP
>> guidelines. I
>>>>>>>>>>>> think it’s a bit
>>>>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a
>> namespace
>>>>>>>>>> and a
>>>>>>>>>>>>>> multi-word component. If it was “client.id” then “client”
>> would be a
>>>>>>>>>>>> namespace with
>>>>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it
>> was
>>>>>>>>>>>> intentional, but debatable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 112. Thanks. The link target moved. Fixed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 113. Thanks. Fixed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it
>> should use
>>>>>>>>>> the
>>>>>>>>>>>> exact same
>>>>>>>>>>>>>> name. If a standard metric doesn’t make sense for a client,
>> then it
>>>>>>>>>> can
>>>>>>>>>>>> omit that metric.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For a required metric, the situation is stronger. All clients
>> must
>>>>>>>>>>>> implement these
>>>>>>>>>>>>>> metrics with these names in order to implement the KIP. But
>> the
>>>>>>>>>>>> required metrics
>>>>>>>>>>>>>> are essentially the number of connections and the request
>> latency,
>>>>>>>>>>>> which do not
>>>>>>>>>>>>>> reference the underlying implementation of the client (which
>>>>>>>>>>>> producer.record.queue.time.max
>>>>>>>>>>>>>> of course does).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suppose someone might build a producer-only client that
>> didn’t have
>>>>>>>>>>>> consumer metrics.
>>>>>>>>>>>>>> In this case, the consumer metrics would conceptually have
>> the value 0
>>>>>>>>>>>> and would not
>>>>>>>>>>>>>> need to be sent to the broker.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 114.2. If a client does not implement some metrics, they will
>> not be
>>>>>>>>>>>> available for
>>>>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to
>> combine
>>>>>>>>>>>>>> metrics from lots of
>>>>>>>>>>>>>> different clients less complete.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 115. I think it was probably a mistake to be so specific about
>>>>>>>>>>>> threading in this KIP.
>>>>>>>>>>>>>> When the consumer threading refactor is complete, of course,
>> it would
>>>>>>>>>>>> do the appropriate
>>>>>>>>>>>>>> equivalent. I’ve added a clarification and massively
>> simplified this
>>>>>>>>>>>> section.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 116. I removed “client.terminating”.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 117. Yes. Horrid. Fixed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 118. The Terminating flag just indicates that this is the
>> final
>>>>>>>>>>>> PushTelemetryRequest
>>>>>>>>>>>>>> from this client. Any subsequent request will be rejected. I
>> think
>>>>>>>>>> this
>>>>>>>>>>>> flag should remain.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 119. Good catch. This was actually contradicting another part
>> of the
>>>>>>>>>>>> KIP. The current behaviour
>>>>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client
>> metrics
>>>>>>>>>>>> receiver plugin, the new RPCs
>>>>>>>>>>>>>> in this KIP are “turned off” and not reported in
>> ApiVersionsResponse.
>>>>>>>>>>>> The client will not
>>>>>>>>>>>>>> attempt to push metrics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 120. The error handling table lists the error codes for
>>>>>>>>>>>> PushTelemetryResponse. I’ve added one
>>>>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t
>> have any
>>>>>>>>>>>> error codes, since the
>>>>>>>>>>>>>> situation in which the client telemetry is not supported is
>> handled by
>>>>>>>>>>>> the RPCs not being offered
>>>>>>>>>>>>>> by the broker.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific
>> about
>>>>>>>>>>>> threading. Removed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation
>> should be
>>>>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation
>> should be
>>>>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the
>> CLUSTER
>>>>>>>>>>>>>> resource.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the detailed review.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 110. Another potential motivation is the multiple clients
>> support.
>>>>>>>>>>>> Some of
>>>>>>>>>>>>>>> the places may not have good monitoring support for non-java
>> clients.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for
>> metric name
>>>>>>>>>> and
>>>>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the
>> inconsistency?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 113. "Defining standard and required metrics makes the
>> monitoring and
>>>>>>>>>>>>>>> troubleshooting of clients from various client types ":
>> Incomplete
>>>>>>>>>>>> sentence.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 114. standard/required metrics
>>>>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with
>> the exact
>>>>>>>>>>>> same
>>>>>>>>>>>>>>> names?
>>>>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing from
>> a
>>>>>>>>>> client?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application
>> threads":
>>>>>>>>>> We
>>>>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading
>> model (
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
>>>>>>>>>>>> ).
>>>>>>>>>>>>>>> Once this is done, RPC requests will only be made from the
>> background
>>>>>>>>>>>>>>> thread. Should this KIP follow the new model only?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 116. 'The metrics should contain the reason for the client
>>>>>>>>>> termination
>>>>>>>>>>>> by
>>>>>>>>>>>>>>> including the client.terminating metric with the label
>> “reason” ...'.
>>>>>>>>>>>> Hmm,
>>>>>>>>>>>>>>> are we introducing a new metric client.terminating? If so,
>> that needs
>>>>>>>>>>>> to be
>>>>>>>>>>>>>>> explicitly listed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional
>> metrics on top
>>>>>>>>>>>> of
>>>>>>>>>>>>>>> this the generic metrics receiver in the broker will not add
>> these
>>>>>>>>>>>> labels
>>>>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't read
>> well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 118. "it is possible for the client to send at most one
>> accepted
>>>>>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks
>> in": If
>>>>>>>>>> we
>>>>>>>>>>>> do
>>>>>>>>>>>>>>> this, do we still need the Terminating flag in
>>>>>>>>>> PushTelemetryRequestV0?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin
>> configured on the
>>>>>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest
>> with
>>>>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The
>> client
>>>>>>>>>> should
>>>>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the
>> PushIntervalMs
>>>>>>>>>>>> has
>>>>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or
>> disabled
>>>>>>>>>>>> without
>>>>>>>>>>>>>>> having to restart the broker or reset the client connection."
>>>>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined by
>> no
>>>>>>>>>> metric
>>>>>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right?
>> In that
>>>>>>>>>>>> case,
>>>>>>>>>>>>>>> it would be useful to avoid the clients sending
>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve
>> the current
>>>>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and
>> PushTelemetryRequestV0:
>>>>>>>>>>>> Could
>>>>>>>>>>>>>>> we list error codes for each?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This
>> method may
>>>>>>>>>>>> be
>>>>>>>>>>>>>>> called from the request handling thread": Where else can
>> this method
>>>>>>>>>> be
>>>>>>>>>>>>>>> called?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we
>> changing the
>>>>>>>>>>>> ACL?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jun
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>>>>> Thanks for your question.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely
>> to be
>>>>>>>>>>>> caused by a
>>>>>>>>>>>>>>>> problem in the serialization in the client. I have changed
>> the
>>>>>>>>>> client
>>>>>>>>>>>>>>>> action in this case
>>>>>>>>>>>>>>>> to “Log an error and stop pushing metrics”.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have updated the KIP text accordingly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
>>>>>>>>>>>> <milut...@confluent.io.INVALID>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>> Thanks for the clarifications.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> About 2b:
>>>>>>>>>>>>>>>>> In case a client has a bug while serializing, it might be
>> difficult
>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> client to recover from that without code changes. In that case,
>> it might
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the
>> error as
>>>>>>>>>>>> fatal
>>>>>>>>>>>>>>>>> for the client (only fatal in terms of sending the
>> metrics, the
>>>>>>>>>>>> client
>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> keep functioning otherwise). What do you think?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Milind
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Milind,
>>>>>>>>>>>>>>>>>> Thanks for your questions about the KIP.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical
>> versions of the
>>>>>>>>>>>> KIP.
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>> think this is
>>>>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push
>> interval. 30
>>>>>>>>>>>>>>>> minutes
>>>>>>>>>>>>>>>>>> is a mystery
>>>>>>>>>>>>>>>>>> to me. I’ve updated the KIP.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2) I think there are two situations in which
>> INVALID_RECORD might
>>>>>>>>>>>> occur.
>>>>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that
>> the
>>>>>>>>>> broker
>>>>>>>>>>>> does
>>>>>>>>>>>>>>>>>> not support.
>>>>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but
>> there’s
>>>>>>>>>>>> only
>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>>>>> to start with. Until we have multiple content-types, this
>> seems
>>>>>>>>>> out
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> scope. I think a
>>>>>>>>>>>>>>>>>> future KIP would add another error code for this.
>>>>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the
>> metrics
>>>>>>>>>>>> payload
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> malformed.
>>>>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push on
>> the push
>>>>>>>>>>>>>>>>>> interval seems
>>>>>>>>>>>>>>>>>> appropriate.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making
>> an
>>>>>>>>>>>> immediate
>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client
>> bug or
>>>>>>>>>>>> perhaps
>>>>>>>>>>>>>>>>>> a situation in which a broker sends a compression type in
>> a
>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse
>>>>>>>>>>>>>>>>>> which is subsequently not supported when it's used with a
>>>>>>>>>>>>>>>>>> PushTelemetryRequest.
>>>>>>>>>>>>>>>>>> We do want the client to have the opportunity to get an
>> up-to-date
>>>>>>>>>>>> list
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> supported
>>>>>>>>>>>>>>>>>> compression types. I think an immediate
>>>>>>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>>>>>>>>>> is appropriate.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null
>>>>>>>>>>>>>>>>>> ClientInstanceId, the
>>>>>>>>>>>>>>>>>> receiving broker may not already know the client's
>> existing
>>>>>>>>>>>>>>>>>> ClientInstanceId. If the
>>>>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it
>> simply
>>>>>>>>>>>> responds
>>>>>>>>>>>>>>>>>> the existing
>>>>>>>>>>>>>>>>>> value back to the client. If it does not know the existing
>>>>>>>>>>>>>>>>>> ClientInstanceId, it will create
>>>>>>>>>>>>>>>>>> a new client instance ID and respond with that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I will update the KIP with these clarifications.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
>>>>>>>>>>>> <milut...@confluent.io.INVALID
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling"
>> section.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to
>>>>>>>>>> eventually
>>>>>>>>>>>>>>>>>> trigger
>>>>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the
>> cluster
>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by operator
>> error,
>>>>>>>>>>>> rolling
>>>>>>>>>>>>>>>>>>> upgrades, etc."
>>>>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else.
>> What is
>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> referring to?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. For the actual errors:
>>>>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a
>> warning to the
>>>>>>>>>>>>>>>>>>> application and schedule the next
>>>>>>>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>>>> to 5
>>>>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like
>>>>>>>>>>>> PushIntervalMs?
>>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>>> also, why are we scheduling a
>> GetTelemetrySubscriptionsRequest in
>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> case, if the serialization is broken?
>>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE :
>> just to
>>>>>>>>>>>>>>>> confirm,
>>>>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be
>> scheduled
>>>>>>>>>>>> immediately
>>>>>>>>>>>>>>>>>>> after the PushTelemetry response, correct?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests
>> must include
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response,
>> regardless of
>>>>>>>>>>>> broker":
>>>>>>>>>>>>>>>>>>> Will a broker error be returned in case some
>> implementation of
>>>>>>>>>>>> this KIP
>>>>>>>>>>>>>>>>>>> violates this accidentally and sends a request with
>>>>>>>>>>>> ClientInstanceId =
>>>>>>>>>>>>>>>>>> Null
>>>>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new
>>>>>>>>>>>> ClientInstanceId be
>>>>>>>>>>>>>>>>>>> returned without an error?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on
>> KIP-714: Client
>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>> and observability.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the
>> scope.
>>>>>>>>>> The
>>>>>>>>>>>>>>>>>> overall
>>>>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is unchanged,
>> but the
>>>>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics, rather
>> than
>>>>>>>>>>>>>>>> introducing
>>>>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster
>> operators
>>>>>>>>>>>>>>>>>>>> investigate performance problems experienced by clients
>> without
>>>>>>>>>>>>>>>>>> requiring
>>>>>>>>>>>>>>>>>>>> changes to the client application code or configuration.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Andrew
>>>
>>>
>>
