Hi Andrew, thanks a lot for this KIP. I was thinking of something similar so thanks for writing this down 😊

A couple of questions related to the design:

1. Can we investigate the option of using the KRaft controllers instead of the brokers for receiving metrics? The disadvantage of sending these metrics directly to the brokers is that it tightly couples metric observability to data plane availability. If a broker is completely unhealthy, the root cause of an incident is clear. On partial failures, however, this coupling makes incidents hard to debug from the broker's perspective.

2. Rate limiting will be disabled if the `PushTelemetryRequest.Terminating` flag is set. However, this may cause unavailability on the broker if too many clients terminate at once. In particular, network threads could become busy and add latency to produce/consume requests on other, non-terminating clients' connections. I think there is room for improvement here. If the client is shutting down gracefully, it could wait for the request to be handled when it is being rate limited; it doesn't need to "force push" the metrics. For that reason, maybe we could define a separate rate limit for telemetry data?

3. `PushIntervalMs` is set on the client side from the `GetTelemetrySubscriptionsResponse`. If the broker sets this value too low, for example 1 ms, telemetry pushes may hog the client's activity and impact the client application. I think we should introduce a configuration on both the client and the broker side for the minimum and maximum of this value, to fence out misconfigurations.

4. One of the important things I need when debugging client-side failures is to understand the client-side configuration. Can the client send these configs in the GetTelemetrySubscriptions request as well?

Small comments:

5. The default PushIntervalMs is 5 minutes. Can we make it 1 minute instead? I think 5 minutes of aggregated data is not very helpful in the world of telemetry 😊

6. UnsupportedCompressionType: Shall we fall back to non-compressed mode in that case? I think compression is nice to have, but non-compressed telemetry data is valuable as well. Especially for low-throughput clients, compressing telemetry data may cause more CPU load than the actual data plane work.

Thanks again.
Doguscan

> On Jun 13, 2023, at 8:06 AM, Andrew Schofield <andrew_schofield_j...@outlook.com> wrote:
>
> Hi,
> I would like to start a new discussion thread on KIP-714: Client metrics and
> observability.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>
> I have edited the proposal significantly to reduce the scope. The overall
> mechanism for client metric subscriptions is unchanged, but the
> KIP is now based on the existing client metrics, rather than introducing new
> metrics. The purpose remains helping cluster operators
> investigate performance problems experienced by clients without requiring
> changes to the client application code or configuration.
>
> Thanks,
> Andrew
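
P.S. To make the min/max idea for `PushIntervalMs` a bit more concrete, here is a rough sketch of the clamping I have in mind. It is written in Python only for brevity. The function name and the two bound constants are hypothetical; they are not part of the KIP or of any existing Kafka configuration, and the bound values are placeholders.

```python
# Hypothetical bounds; names and values are illustrative, not from KIP-714.
MIN_PUSH_INTERVAL_MS = 1_000      # assumed lower bound: 1 second
MAX_PUSH_INTERVAL_MS = 300_000    # assumed upper bound: 5 minutes


def clamp_push_interval_ms(requested_ms, min_ms=MIN_PUSH_INTERVAL_MS,
                           max_ms=MAX_PUSH_INTERVAL_MS):
    """Limit a requested push interval to the inclusive range [min_ms, max_ms]."""
    if min_ms <= 0 or min_ms > max_ms:
        raise ValueError("expected 0 < min_ms <= max_ms")
    return max(min_ms, min(requested_ms, max_ms))


# A misconfigured 1 ms interval is raised to the lower bound.
too_low = clamp_push_interval_ms(1)            # 1000
# A value inside the range is left unchanged.
in_range = clamp_push_interval_ms(60_000)      # 60000
# An overly large interval is capped at the upper bound.
too_high = clamp_push_interval_ms(3_600_000)   # 300000
```

The same check could run on the broker before it sends the subscription response and on the client after it receives it, so that a misconfiguration on either side is fenced out.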