Hi Andrew, thanks a lot for this KIP. I was thinking of something similar
so thanks for writing this down 😊



Couple of questions related to the design:



1. Can we investigate the option of sending metrics to the KRaft controllers
instead of the brokers? The disadvantage of sending these metrics directly
to the brokers is that it tightly couples metric observability to data plane
availability. If a broker is completely unhealthy, the root cause of an
incident is clear. With partial failures, however, this coupling makes it
hard to debug the incident from the broker's perspective.



2. Rate limiting will be disabled if the `PushTelemetryRequest.Terminating`
flag is set. However, this may cause unavailability on the broker if too
many clients terminate at once. In particular, network threads could
become busy and add latency to produce/consume requests on other,
non-terminating clients' connections. I think there is room for
improvement here. If the client is shutting down gracefully, it could wait
for the request to be handled when it is being rate limited; it doesn't
need to "force push" the metrics. For that reason, maybe we could define a
separate rate limit for telemetry data?



3. `PushIntervalMs` is set on the client side from the
`GetTelemetrySubscriptionsResponse`. If the broker sets this value too
low, e.g. 1 ms, telemetry pushes may dominate the client's activity and
impact the application. I think we should introduce configurations on both
the client and the broker side for the minimum and maximum of this value
to fence out misconfigurations.



4. One of the important things I face when debugging client-side
failures is understanding the client-side configuration. Can the client
send these configs in the GetTelemetrySubscriptions request as well?
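
On point 3, a rough sketch of the kind of bounds check meant there, in Python for brevity. The constant names and values are hypothetical and are not configurations defined by the KIP; the idea is only that the client clamps whatever `PushIntervalMs` it receives into a locally configured range.

```python
# Hypothetical client-side bounds. These names and values are illustrative
# only; KIP-714 does not define them.
MIN_PUSH_INTERVAL_MS = 1_000        # assumed floor: 1 second
MAX_PUSH_INTERVAL_MS = 3_600_000    # assumed ceiling: 1 hour


def effective_push_interval_ms(requested_ms: int,
                               min_ms: int = MIN_PUSH_INTERVAL_MS,
                               max_ms: int = MAX_PUSH_INTERVAL_MS) -> int:
    """Clamp the broker-supplied PushIntervalMs into [min_ms, max_ms]."""
    if min_ms > max_ms:
        raise ValueError("min_ms must not exceed max_ms")
    return max(min_ms, min(requested_ms, max_ms))
```

With these assumed bounds, a misconfigured value of 1 ms would be raised to 1 second, while the current 5 minute default would pass through unchanged. The broker could apply the same check before sending the value.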



Small comments:

5. The default PushIntervalMs is 5 minutes. Can we make it 1 minute
instead? I think data aggregated over 5 minutes is not very helpful in the
world of telemetry 😊

6. UnsupportedCompressionType: Shall we fall back to uncompressed mode in
that case? I think compression is nice to have, but uncompressed
telemetry data is valuable as well. Especially for low-throughput clients,
compressing telemetry data may cause more CPU load than the actual data
plane work.
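
On point 6, a minimal sketch of the fallback, again in Python. It assumes the client knows which compression types the broker accepts as a list of names; the function and the `CODECS` table are hypothetical and only illustrate "use a codec both sides support, otherwise send uncompressed".

```python
import gzip
from typing import Callable, Dict, Iterable, Tuple

# Codecs this hypothetical client can produce. Only gzip is included here
# because it is in the Python standard library.
CODECS: Dict[str, Callable[[bytes], bytes]] = {
    "gzip": gzip.compress,
}


def encode_telemetry(payload: bytes,
                     accepted: Iterable[str]) -> Tuple[str, bytes]:
    """Return (compression name, encoded bytes) for a telemetry payload.

    Uses the first accepted codec the client supports; if there is none,
    falls back to sending the payload uncompressed instead of failing.
    """
    for name in accepted:
        codec = CODECS.get(name)
        if codec is not None:
            return name, codec(payload)
    return "none", payload
```

A client without zstd or lz4 support would still push its metrics here, just without compression.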


Thanks again.

Doguscan



> On Jun 13, 2023, at 8:06 AM, Andrew Schofield
> <andrew_schofield_j...@outlook.com> wrote:
>
> Hi,
> I would like to start a new discussion thread on KIP-714: Client metrics
> and observability.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>
> I have edited the proposal significantly to reduce the scope. The overall
> mechanism for client metric subscriptions is unchanged, but the
> KIP is now based on the existing client metrics, rather than introducing
> new metrics. The purpose remains helping cluster operators
> investigate performance problems experienced by clients without requiring
> changes to the client application code or configuration.
>
> Thanks,
> Andrew
