Hi Milind,
Thanks for your questions about the KIP.

1) I did some archaeology and looked at historical versions of the KIP. I think 
this is
just a mistake. 5 minutes is the default metric push interval. 30 minutes is a 
mystery
to me. I’ve updated the KIP.

2) I think there are two situations in which INVALID_RECORD might occur.
a) The client might perhaps be using a content-type that the broker does not 
support.
The KIP mentions content-type as a future extension, but there’s only one 
supported
to start with. Until we have multiple content-types, this seems out of scope. I 
think a
future KIP would add another error code for this.
b) The client might perhaps have a bug which means the metrics payload is 
malformed.
Logging a warning and attempting the next metrics push on the push interval 
seems
appropriate.

UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
GetTelemetrySubscriptionsRequest.

UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps
a situation in which a broker sends a compression type in a 
GetTelemetrySubscriptionsResponse
which is subsequently not supported when its used with a PushTelemetryRequest.
We do want the client to have the opportunity to get an up-to-date list of 
supported
compression types. I think an immediate GetTelemetrySubscriptionsRequest is 
appropriate.

3) If a client attempts a subsequent handshake with a Null ClientInstanceId, the
receiving broker may not already know the client's existing ClientInstanceId. 
If the
receiving broker knows the existing ClientInstanceId, it simply responds the 
existing
value back to the client. If it does not know the existing ClientInstanceId, it 
will create
a new client instance ID and respond with that.

I will update the KIP with these clarifications.

Thanks,
Andrew

> On 17 Jul 2023, at 14:21, Milind Luthra <milut...@confluent.io.INVALID> wrote:
>
> Hi Andrew, thanks for this KIP.
>
> I had a few questions regarding the "Error handling" section.
>
> 1. It mentions that "The 5 and 30 minute retries are to eventually trigger
> a retry and avoid having to restart clients if the cluster metrics
> configuration is disabled temporarily, e.g., by operator error, rolling
> upgrades, etc."
> But this 30 min interval isn't mentioned anywhere else. What is it
> referring to?
>
> 2. For the actual errors:
> INVALID_RECORD : The action required is to "Log a warning to the
> application and schedule the next GetTelemetrySubscriptionsRequest to 5
> minutes". Why is this 5 minutes, and not something like PushIntervalMs? And
> also, why are we scheduling a GetTelemetrySubscriptionsRequest in this
> case, if the serialization is broken?
> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to confirm,
> the GetTelemetrySubscriptionsRequest needs to be scheduled immediately
> after the PushTelemetry response, correct?
>
> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
> ClientInstanceId returned in the first response, regardless of broker":
> Will a broker error be returned in case some implementation of this KIP
> violates this accidentally and sends a request with ClientInstanceId = Null
> even when it's been obtained already? Or will a new ClientInstanceId be
> returned without an error?
>
> Thanks!
>
> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi,
>> I would like to start a new discussion thread on KIP-714: Client metrics
>> and observability.
>>
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>
>> I have edited the proposal significantly to reduce the scope. The overall
>> mechanism for client metric subscriptions is unchanged, but the
>> KIP is now based on the existing client metrics, rather than introducing
>> new metrics. The purpose remains helping cluster operators
>> investigate performance problems experienced by clients without requiring
>> changes to the client application code or configuration.
>>
>> Thanks,
>> Andrew

Reply via email to