Hi Doguscan, Thanks for your comments. I’m glad to hear you’re interested in this KIP.
1) It is preferred that a client sends its metrics to the same broker connection but actually it is able to send them to any broker. As a result, if a broker becomes unhealthy, the client can push its metrics to any other broker. It seems to me that pushing to KRaft controllers instead just has the effect of increasing the load on the controllers, while still having the characteristic that an unhealthy controller would present inconvenience for collecting metrics. 2) When the `PushTelemetryRequest.Terminating` flag is set, the standard request throttling is not disabled. The metrics rate-limiting based on the push interval is not applied in this case for a single request for the combination of client instance ID and subscription ID. (I have corrected the KIP text because it erroneously said “client ID and subscription ID”. 3) While this is a theoretical problem, I’m not keen on adding yet more configurations to the broker or client. The `interval.ms` configuration on the CLIENT_METRICS resource could perhaps have a minimum and maximum value to prevent accidental misconfiguration. 4) One of the reasons that this KIP has taken so long to get to this stage is that it tried to do many things all at once. So, it’s greatly simplified compared with 6 months ago. I can see the value of collecting client configurations for problem determination, but I don’t want to make this KIP more complicated. I think the idea has merit as a separate follow-on KIP. I would be happy to collaborate with you on this. 5) The default is set to 5 minutes to minimise the load on the broker for situations in which the administrator didn’t set an interval on a metrics subscription. To use an interval of 1 minute, it is only necessary to set `interval.ms` on the metrics subscription to 60000ms. 6) Uncompressed data is always supported. The KIP says: "The CompressionType of NONE will not be "present in the response from the broker, though the broker does support uncompressed "client telemetry if none of the accepted compression codecs are supported by the client.” So in your example, the client need only use CompressionType=NONE. Thanks, Andrew > On 4 Aug 2023, at 14:04, Doğuşcan Namal <namal.dogus...@gmail.com> wrote: > > Hi Andrew, thanks a lot for this KIP. I was thinking of something similar > so thanks for writing this down 😊 > > > > Couple of questions related to the design: > > > > 1. Can we investigate the option for using the Kraft controllers instead of > the brokers for sending metrics? The disadvantage of sending these metrics > directly to the brokers tightly couples metric observability to data plane > availability. If the broker is unhealthy then the root cause of an incident > is clear however on partial failures it makes it hard to debug these > incidents from the brokers perspective. > > > > 2. Ratelimiting will be disable if the `PushTelemetryRequest.Terminating` > flag is set. However, this may cause unavailability on the broker if too > many clients are terminated at once, especially network threads could > become busy and introduce latency on the produce/consume on other > non-terminating clients connections. I think there is a room for > improvement here. If the client is gracefully shutting down, it could wait > for the request to be handled if it is being ratelimited, it doesn't need > to "force push" the metrics. For that reason, maybe we could define a > separate ratelimiting for telemetry data? > > > > 3. `PushIntervalMs` is set on the client side by a response from > `GetTelemetrySubscriptionsResponse`. If the broker sets this value to too > low, like 1msec, this may hog all of the clients activity and cause an > impact on the client side. I think we should introduce a configuration both > on the client and the broker side for the minimum and maximum numbers for > this value to fence out misconfigurations. > > > > 4. One of the important things I face during debugging the client side > failures is to understand the client side configurations. Can the client > sends these configs during the GetTelemetrySubscriptions request as well? > > > > Small comments: > > 5. Default PushIntervalMs is 5 minutes. Can we make it 1 minute instead? I > think 5 minutes of aggregated data is too not helpful in the world of > telemetry 😊 > > 6. UnsupportedCompressionType: Shall we fallback to non-compression mode in > that case? I think compression is nice to have, but non-compressed > telemetry data is valuable as well. Especially for low throughput clients, > compressing telemetry data may cause more CPU load then the actual data > plane work. > > > Thanks again. > > Doguscan > > > >> On Jun 13, 2023, at 8:06 AM, Andrew Schofield > >> <andrew_schofield_j...@outlook.com> wrote: > >> > >> Hi, > >> I would like to start a new discussion thread on KIP-714: Client metrics > and > >> observability. > >> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability > >> > >> I have edited the proposal significantly to reduce the scope. The overall > >> mechanism for client metric subscriptions is unchanged, but the > >> KIP is now based on the existing client metrics, rather than introducing > new > >> metrics. The purpose remains helping cluster operators > >> investigate performance problems experienced by clients without requiring > >> changes to the client application code or configuration. > >> > >> Thanks, > >> Andrew