Hi Doguscan,
Thanks for your comments. I’m glad to hear you’re interested in this KIP.

1) It is preferred that a client sends its metrics to the same broker connection
but actually it is able to send them to any broker. As a result, if a broker 
becomes
unhealthy, the client can push its metrics to any other broker. It seems to me 
that
pushing to KRaft controllers instead just has the effect of increasing the load 
on
the controllers, while still having the characteristic that an unhealthy 
controller
would present inconvenience for collecting metrics.

2) When the `PushTelemetryRequest.Terminating` flag is set, the standard request
throttling is not disabled. The metrics rate-limiting based on the push 
interval is
not applied in this case for a single request for the combination of client 
instance ID
and subscription ID.

(I have corrected the KIP text because it erroneously said “client ID and 
subscription ID”.

3) While this is a theoretical problem, I’m not keen on adding yet more 
configurations
to the broker or client. The `interval.ms` configuration on the CLIENT_METRICS
resource could perhaps have a minimum and maximum value to prevent accidental
misconfiguration.

4) One of the reasons that this KIP has taken so long to get to this stage is 
that
it tried to do many things all at once. So, it’s greatly simplified compared 
with
6 months ago. I can see the value of collecting client configurations for 
problem
determination, but I don’t want to make this KIP more complicated. I think the
idea has merit as a separate follow-on KIP. I would be happy to collaborate
with you on this.

5) The default is set to 5 minutes to minimise the load on the broker for 
situations
in which the administrator didn’t set an interval on a metrics subscription. To
use an interval of 1 minute, it is only necessary to set `interval.ms` on the 
metrics
subscription to 60000ms.

6) Uncompressed data is always supported. The KIP says:
 "The CompressionType of NONE will not be
"present in the response from the broker, though the broker does support 
uncompressed
"client telemetry if none of the accepted compression codecs are supported by 
the client.”
So in your example, the client need only use CompressionType=NONE.

Thanks,
Andrew

> On 4 Aug 2023, at 14:04, Doğuşcan Namal <namal.dogus...@gmail.com> wrote:
>
> Hi Andrew, thanks a lot for this KIP. I was thinking of something similar
> so thanks for writing this down 😊
>
>
>
> Couple of questions related to the design:
>
>
>
> 1. Can we investigate the option for using the Kraft controllers instead of
> the brokers for sending metrics? The disadvantage of sending these metrics
> directly to the brokers tightly couples metric observability to data plane
> availability. If the broker is unhealthy then the root cause of an incident
> is clear however on partial failures it makes it hard to debug these
> incidents from the brokers perspective.
>
>
>
> 2. Ratelimiting will be disable if the `PushTelemetryRequest.Terminating`
> flag is set. However, this may cause unavailability on the broker if too
> many clients are terminated at once, especially network threads could
> become busy and introduce latency on the produce/consume on other
> non-terminating clients connections. I think there is a room for
> improvement here. If the client is gracefully shutting down, it could wait
> for the request to be handled if it is being ratelimited, it doesn't need
> to "force push" the metrics. For that reason, maybe we could define a
> separate ratelimiting for telemetry data?
>
>
>
> 3. `PushIntervalMs` is set on the client side by a response from
> `GetTelemetrySubscriptionsResponse`. If the broker sets this value to too
> low, like 1msec, this may hog all of the clients activity and cause an
> impact on the client side. I think we should introduce a configuration both
> on the client and the broker side for the minimum and maximum numbers for
> this value to fence out misconfigurations.
>
>
>
> 4. One of the important things I face during debugging the client side
> failures is to understand the client side configurations. Can the client
> sends these configs during the GetTelemetrySubscriptions request as well?
>
>
>
> Small comments:
>
> 5. Default PushIntervalMs is 5 minutes. Can we make it 1 minute instead? I
> think 5 minutes of aggregated data is too not helpful in the world of
> telemetry 😊
>
> 6. UnsupportedCompressionType: Shall we fallback to non-compression mode in
> that case? I think compression is nice to have, but non-compressed
> telemetry data is valuable as well. Especially for low throughput clients,
> compressing telemetry data may cause more CPU load then the actual data
> plane work.
>
>
> Thanks again.
>
> Doguscan
>
>
>
>> On Jun 13, 2023, at 8:06 AM, Andrew Schofield
>
>> <andrew_schofield_j...@outlook.com> wrote:
>
>>
>
>> Hi,
>
>> I would like to start a new discussion thread on KIP-714: Client metrics
> and
>
>> observability.
>
>>
>
>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>
>>
>
>> I have edited the proposal significantly to reduce the scope. The overall
>
>> mechanism for client metric subscriptions is unchanged, but the
>
>> KIP is now based on the existing client metrics, rather than introducing
> new
>
>> metrics. The purpose remains helping cluster operators
>
>> investigate performance problems experienced by clients without requiring
>
>> changes to the client application code or configuration.
>
>>
>
>> Thanks,
>
>> Andrew

Reply via email to