Re: [DISCUSS] KIP-714: Client metrics and observability

Andrew Schofield Fri, 25 Aug 2023 12:31:10 -0700

Hi Tom,
Thanks for your comments. Sorry for the delay in responding. They’re still 
useful
comments in spite of the fact that the voting has begun.


1) This is a good idea. I expect the client will emit the client instance ID
as a log line.

2) I will add PROXY protocol support to the future work. I agree.

3) Thanks for the suggestion.

4) Yes, the client authenticates before it can send any of the RPCs in this KIP.

5) a) Yes, a rogue client could in principle send metrics to all brokers 
resulting
in a lot of data being exported to the back end. Of course, a proper deployment
of a client telemetry reporter plugin would be instrumented to help the operator
diagnose this situation.

b) There are already instances of the client sending compressed data to
the broker. I think it is prudent to limit the maximum metrics payload size.
I will update the KIP accordingly.

6) Yes, bundling and relocating.

7) I will add a broker metric to help with diagnosis.

8) I will add some clarifying text. If the broker does not have a configured
metrics reporter that supports the new interface, it should not push metrics
and will not receive a metrics subscription. I am thinking over the options for
achieving this cleanly and will update the KIP.

Thanks for your interest in the KIP.

Thanks,
Andrew

> On 11 Aug 2023, at 09:48, Tom Bentley <tbent...@redhat.com> wrote:
>
> Hi Andrew,
>
> Thanks for picking this KIP up. I know you've started a vote, so these are
> unhelpfully late... sorry about that, but hopefully they're still useful.
>
> 1. "The Kafka Java client provides an API for the application to read the
> generated client instance id to assist in mapping and identification of the
> client based on the collected metrics." In the multi-client, single-process
> model perhaps it would be desirable to have the option of including this in
> log messages emitted by the client library.
>
> 2. "Mapping the client instance id to an actual application instance
> running on a (virtual) machine can be done by inspecting the metrics
> resource labels, such as the client source address and source port, or
> security principal, all of which are added by the receiving broker." The
> specific example of correlation given here (source IP address) is
> problematic in environments where there may be network proxies (e.g.
> Kubernetes ingress) on the path between client and broker: The broker sees
> the IP address of the proxy. This is a rough edge which could be smoothed
> out if Kafka supported the PROXY protocol[1] which seems to have become
> something of a defacto standard. I'm not suggesting this need to be part of
> the KIP, but perhaps it could be added to Future Work?
> [1]: http://www.haproxy.org/download/2.9/doc/proxy-protocol.txt
>
> 3. Compression... just an idle idea, but I wonder if a useful further
> improvement in compression ratio could be achieve using zstd's support for
> dictionary compression[2]. I.e. a client could initially use training mode
> when sending metrics, but eventually include a dictionary to be used for
> subsequent metric sends. It's probably not worth the effort (at least for
> the initial implementation), but since you've gone to the effort of
> providing some numbers anyway, maybe it's not much additional effort to at
> least find out whether this makes a useful difference.
> [2]: http://facebook.github.io/zstd/#small-data
>
> 4. Maybe I didn't spot it, but I assume the initial
> GetTelemetrySubscriptionsRequest
> happens after authentication?
>
> 5. Rogue clients -- There are some interesting things to consider if we're
> trying to defend against a genuinely adversarial client.
>
> a) Client sends profiling information to all brokers at the maximum rate.
> Each broker forwards to the time series DB. Obviously this scales linearly
> with number of brokers, but it's clear that the load on the tsdb could be
> many times larger than users might expect.
> b) Client sends crafted compressed data which decompresses to require more
> memory that the broker can allocate.
>
> 6. Shadowing the OLTP and protobuf jars -- to be clear by this you mean
> both bundling _and_ relocating?
>
> 7. "In case the cluster load induced from metrics requests becomes
> unmanageable the remedy is to temporarily remove or limit configured
> metrics subscriptions.  " How would a user know that the observed load was
> due to handling metrics requests?
>
> 8. If I understand correctly, when the configured metrics reporter does not
> implement the new interface the client would still follow the described
> protocol only to have nowhere to send the metrics. Am I overlooking
> something?
>
> Thanks again,
>
> Tom
>
> On Fri, 11 Aug 2023 at 07:52, Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Doguscan,
>> Thanks for your question.
>>
>> If the target broker is unreachable, the client can send the metrics to
>> another
>> broker. It can select any of the other brokers for this purpose. What I
>> expect in
>> practice is that it loses connection to the broker it’s being using for
>> metrics,
>> chooses or establishes a connection to another broker, and then selects
>> that
>> broker for subsequent metrics pushes.
>>
>> Thanks,
>> Andrew
>>
>>> On 8 Aug 2023, at 08:34, Doğuşcan Namal <namal.dogus...@gmail.com>
>> wrote:
>>>
>>> Thanks for your answers Andrew. I share your pain that it took a while
>> for
>>> you to get this KIP approved and you want to reduce the scope of it, will
>>> be happy to help you with the implementation :)
>>>
>>> Could you help me walk through what happens if the target broker is
>>> unreachable? Is the client going to drop these metrics or is it going to
>>> send it to the other brokers it is connected to? This information is
>>> crucial to understand the client side impact on leadership failovers.
>>> Moreover, in case of partial outages, such as only the network between
>> the
>>> client and the broker is partitioned whereas the network within the
>> cluster
>>> is healthy, practically there is no other way than the client side
>> metrics
>>> to identify this problem.
>>>
>>> Doguscan
>>>
>>> On Fri, 4 Aug 2023 at 15:33, Andrew Schofield <
>>> andrew_schofield_j...@outlook.com> wrote:
>>>
>>>> Hi Doguscan,
>>>> Thanks for your comments. I’m glad to hear you’re interested in this
>> KIP.
>>>>
>>>> 1) It is preferred that a client sends its metrics to the same broker
>>>> connection
>>>> but actually it is able to send them to any broker. As a result, if a
>>>> broker becomes
>>>> unhealthy, the client can push its metrics to any other broker. It seems
>>>> to me that
>>>> pushing to KRaft controllers instead just has the effect of increasing
>> the
>>>> load on
>>>> the controllers, while still having the characteristic that an unhealthy
>>>> controller
>>>> would present inconvenience for collecting metrics.
>>>>
>>>> 2) When the `PushTelemetryRequest.Terminating` flag is set, the standard
>>>> request
>>>> throttling is not disabled. The metrics rate-limiting based on the push
>>>> interval is
>>>> not applied in this case for a single request for the combination of
>>>> client instance ID
>>>> and subscription ID.
>>>>
>>>> (I have corrected the KIP text because it erroneously said “client ID
>> and
>>>> subscription ID”.
>>>>
>>>> 3) While this is a theoretical problem, I’m not keen on adding yet more
>>>> configurations
>>>> to the broker or client. The `interval.ms` configuration on the
>>>> CLIENT_METRICS
>>>> resource could perhaps have a minimum and maximum value to prevent
>>>> accidental
>>>> misconfiguration.
>>>>
>>>> 4) One of the reasons that this KIP has taken so long to get to this
>> stage
>>>> is that
>>>> it tried to do many things all at once. So, it’s greatly simplified
>>>> compared with
>>>> 6 months ago. I can see the value of collecting client configurations
>> for
>>>> problem
>>>> determination, but I don’t want to make this KIP more complicated. I
>> think
>>>> the
>>>> idea has merit as a separate follow-on KIP. I would be happy to
>> collaborate
>>>> with you on this.
>>>>
>>>> 5) The default is set to 5 minutes to minimise the load on the broker
>> for
>>>> situations
>>>> in which the administrator didn’t set an interval on a metrics
>>>> subscription. To
>>>> use an interval of 1 minute, it is only necessary to set `interval.ms`
>> on
>>>> the metrics
>>>> subscription to 60000ms.
>>>>
>>>> 6) Uncompressed data is always supported. The KIP says:
>>>> "The CompressionType of NONE will not be
>>>> "present in the response from the broker, though the broker does support
>>>> uncompressed
>>>> "client telemetry if none of the accepted compression codecs are
>> supported
>>>> by the client.”
>>>> So in your example, the client need only use CompressionType=NONE.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>> On 4 Aug 2023, at 14:04, Doğuşcan Namal <namal.dogus...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi Andrew, thanks a lot for this KIP. I was thinking of something
>> similar
>>>>> so thanks for writing this down 😊
>>>>>
>>>>>
>>>>>
>>>>> Couple of questions related to the design:
>>>>>
>>>>>
>>>>>
>>>>> 1. Can we investigate the option for using the Kraft controllers
>> instead
>>>> of
>>>>> the brokers for sending metrics? The disadvantage of sending these
>>>> metrics
>>>>> directly to the brokers tightly couples metric observability to data
>>>> plane
>>>>> availability. If the broker is unhealthy then the root cause of an
>>>> incident
>>>>> is clear however on partial failures it makes it hard to debug these
>>>>> incidents from the brokers perspective.
>>>>>
>>>>>
>>>>>
>>>>> 2. Ratelimiting will be disable if the
>> `PushTelemetryRequest.Terminating`
>>>>> flag is set. However, this may cause unavailability on the broker if
>> too
>>>>> many clients are terminated at once, especially network threads could
>>>>> become busy and introduce latency on the produce/consume on other
>>>>> non-terminating clients connections. I think there is a room for
>>>>> improvement here. If the client is gracefully shutting down, it could
>>>> wait
>>>>> for the request to be handled if it is being ratelimited, it doesn't
>> need
>>>>> to "force push" the metrics. For that reason, maybe we could define a
>>>>> separate ratelimiting for telemetry data?
>>>>>
>>>>>
>>>>>
>>>>> 3. `PushIntervalMs` is set on the client side by a response from
>>>>> `GetTelemetrySubscriptionsResponse`. If the broker sets this value to
>> too
>>>>> low, like 1msec, this may hog all of the clients activity and cause an
>>>>> impact on the client side. I think we should introduce a configuration
>>>> both
>>>>> on the client and the broker side for the minimum and maximum numbers
>> for
>>>>> this value to fence out misconfigurations.
>>>>>
>>>>>
>>>>>
>>>>> 4. One of the important things I face during debugging the client side
>>>>> failures is to understand the client side configurations. Can the
>> client
>>>>> sends these configs during the GetTelemetrySubscriptions request as
>> well?
>>>>>
>>>>>
>>>>>
>>>>> Small comments:
>>>>>
>>>>> 5. Default PushIntervalMs is 5 minutes. Can we make it 1 minute
>> instead?
>>>> I
>>>>> think 5 minutes of aggregated data is too not helpful in the world of
>>>>> telemetry 😊
>>>>>
>>>>> 6. UnsupportedCompressionType: Shall we fallback to non-compression
>> mode
>>>> in
>>>>> that case? I think compression is nice to have, but non-compressed
>>>>> telemetry data is valuable as well. Especially for low throughput
>>>> clients,
>>>>> compressing telemetry data may cause more CPU load then the actual data
>>>>> plane work.
>>>>>
>>>>>
>>>>> Thanks again.
>>>>>
>>>>> Doguscan
>>>>>
>>>>>
>>>>>
>>>>>> On Jun 13, 2023, at 8:06 AM, Andrew Schofield
>>>>>
>>>>>> <andrew_schofield_j...@outlook.com> wrote:
>>>>>
>>>>>>
>>>>>
>>>>>> Hi,
>>>>>
>>>>>> I would like to start a new discussion thread on KIP-714: Client
>> metrics
>>>>> and
>>>>>
>>>>>> observability.
>>>>>
>>>>>>
>>>>>
>>>>>>
>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>
>>>>>>
>>>>>
>>>>>> I have edited the proposal significantly to reduce the scope. The
>>>> overall
>>>>>
>>>>>> mechanism for client metric subscriptions is unchanged, but the
>>>>>
>>>>>> KIP is now based on the existing client metrics, rather than
>> introducing
>>>>> new
>>>>>
>>>>>> metrics. The purpose remains helping cluster operators
>>>>>
>>>>>> investigate performance problems experienced by clients without
>>>> requiring
>>>>>
>>>>>> changes to the client application code or configuration.
>>>>>
>>>>>>
>>>>>
>>>>>> Thanks,
>>>>>
>>>>>> Andrew

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to