Hi Andrew,

Thanks for picking this KIP up. I know you've started a vote, so these are
unhelpfully late... sorry about that, but hopefully they're still useful.

1. "The Kafka Java client provides an API for the application to read the
generated client instance id to assist in mapping and identification of the
client based on the collected metrics." In the multi-client, single-process
model perhaps it would be desirable to have the option of including this in
log messages emitted by the client library.

2. "Mapping the client instance id to an actual application instance
running on a (virtual) machine can be done by inspecting the metrics
resource labels, such as the client source address and source port, or
security principal, all of which are added by the receiving broker." The
specific example of correlation given here (source IP address) is
problematic in environments where there may be network proxies (e.g.
Kubernetes ingress) on the path between client and broker: The broker sees
the IP address of the proxy. This is a rough edge which could be smoothed
out if Kafka supported the PROXY protocol[1], which seems to have become
something of a de facto standard. I'm not suggesting this needs to be part
of the KIP, but perhaps it could be added to Future Work?
[1]: http://www.haproxy.org/download/2.9/doc/proxy-protocol.txt
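For what it's worth, the v1 header is simple enough that broker-side support would be
a small amount of parsing. A rough Python sketch (field names are mine, not taken from
the spec's reference implementation):

```python
def parse_proxy_v1(header: bytes) -> dict:
    """Parse a PROXY protocol v1 header line, e.g.
    b'PROXY TCP4 192.0.2.1 198.51.100.1 56324 9092\r\n'.
    The source fields are the *original* client address as seen by the proxy,
    which is what a broker would want to attach to metrics resource labels.
    """
    parts = header.rstrip(b"\r\n").decode("ascii").split(" ")
    if parts[0] != "PROXY":
        raise ValueError("not a PROXY protocol v1 header")
    if parts[1] == "UNKNOWN":
        return {"proto": "UNKNOWN"}  # proxy could not determine the origin
    proto, src, dst, src_port, dst_port = parts[1:6]
    return {"proto": proto, "src": src, "dst": dst,
            "src_port": int(src_port), "dst_port": int(dst_port)}
```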

3. Compression... just an idle idea, but I wonder if a useful further
improvement in compression ratio could be achieved using zstd's support for
dictionary compression[2]. I.e. a client could initially use training mode
when sending metrics, but eventually include a dictionary to be used for
subsequent metric sends. It's probably not worth the effort (at least for
the initial implementation), but since you've gone to the effort of
providing some numbers anyway, maybe it's not much additional effort to at
least find out whether this makes a useful difference.
[2]: http://facebook.github.io/zstd/#small-data
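To illustrate why this might pay off (using stdlib zlib's preset-dictionary support as
a stand-in for zstd's, since the mechanics are the same): a payload compressed against
a dictionary of previously seen pushes can be far smaller than one compressed standalone.
The metric names and payloads here are made up for the example.

```python
import zlib

# A made-up OTLP-ish metrics payload: mostly repeated metric names and labels,
# which is exactly the redundancy a shared dictionary captures.
sample = (b'{"name":"org.apache.kafka.producer.record.queue.time.avg",'
          b'"unit":"ms","value":')
payloads = [sample + str(v).encode() + b"}" for v in range(5)]

# "Training": build a dictionary from earlier pushes, shared out-of-band.
zdict = b"".join(payloads[:4])

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()

plain = deflate(payloads[4])
with_dict = deflate(payloads[4], zdict)
# The dictionary-compressed form should be a small fraction of the plain one.
print(len(payloads[4]), len(plain), len(with_dict))
```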

4. Maybe I didn't spot it, but I assume the initial
GetTelemetrySubscriptionsRequest
happens after authentication?

5. Rogue clients -- There are some interesting things to consider if we're
trying to defend against a genuinely adversarial client.

a) Client sends profiling information to all brokers at the maximum rate.
Each broker forwards to the time series DB. Obviously this scales linearly
with the number of brokers, but it's clear that the load on the TSDB could be
many times larger than users might expect.
b) Client sends crafted compressed data which decompresses to require more
memory than the broker can allocate.
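For (b), the usual mitigation is to stream-decompress with a hard output cap rather than
trusting the payload. A sketch with stdlib zlib (the cap value is illustrative, not from
the KIP):

```python
import zlib

MAX_TELEMETRY_BYTES = 1024 * 1024  # illustrative 1 MiB broker-side cap

def bounded_decompress(payload: bytes, limit: int = MAX_TELEMETRY_BYTES) -> bytes:
    d = zlib.decompressobj()
    out = d.decompress(payload, limit)  # never produce more than `limit` bytes
    if not d.eof:
        # The stream didn't finish within the cap: reject rather than allocate.
        raise ValueError("telemetry payload exceeds decompressed size limit")
    return out

# A few KiB of compressed input that would expand to ~10 MiB if fully inflated.
bomb = zlib.compress(b"\x00" * (10 * 1024 * 1024))
```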

6. Shadowing the OTLP and protobuf jars -- to be clear, by this you mean
both bundling _and_ relocating?
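For context, a typical Gradle Shadow plugin setup that does both (the relocated package
names here are illustrative, not necessarily what the implementation would use):

```groovy
// build.gradle (Gradle Shadow plugin) -- bundle and relocate, so the
// client's copy of the OTLP/protobuf classes cannot clash with an
// application's own versions of the same libraries.
shadowJar {
    relocate 'com.google.protobuf', 'org.apache.kafka.shaded.com.google.protobuf'
    relocate 'io.opentelemetry.proto', 'org.apache.kafka.shaded.io.opentelemetry.proto'
}
```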

7. "In case the cluster load induced from metrics requests becomes
unmanageable the remedy is to temporarily remove or limit configured
metrics subscriptions." How would a user know that the observed load was
due to handling metrics requests?

8. If I understand correctly, when the configured metrics reporter does not
implement the new interface the client would still follow the described
protocol only to have nowhere to send the metrics. Am I overlooking
something?

Thanks again,

Tom

On Fri, 11 Aug 2023 at 07:52, Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Doguscan,
> Thanks for your question.
>
> If the target broker is unreachable, the client can send the metrics to
> another
> broker. It can select any of the other brokers for this purpose. What I
> expect in
> practice is that it loses connection to the broker it’s been using for
> metrics,
> chooses or establishes a connection to another broker, and then selects
> that
> broker for subsequent metrics pushes.
>
> Thanks,
> Andrew
>
> > On 8 Aug 2023, at 08:34, Doğuşcan Namal <namal.dogus...@gmail.com>
> wrote:
> >
> > Thanks for your answers Andrew. I share your pain that it took a while
> for
> > you to get this KIP approved and you want to reduce the scope of it; I'll
> > be happy to help you with the implementation :)
> >
> > Could you help me walk through what happens if the target broker is
> > unreachable? Is the client going to drop these metrics or is it going to
> > send it to the other brokers it is connected to? This information is
> > crucial to understand the client side impact on leadership failovers.
> > Moreover, in case of partial outages, such as only the network between
> the
> > client and the broker is partitioned whereas the network within the
> cluster
> > is healthy, practically there is no other way than the client side
> metrics
> > to identify this problem.
> >
> > Doguscan
> >
> > On Fri, 4 Aug 2023 at 15:33, Andrew Schofield <
> > andrew_schofield_j...@outlook.com> wrote:
> >
> >> Hi Doguscan,
> >> Thanks for your comments. I’m glad to hear you’re interested in this
> KIP.
> >>
> >> 1) It is preferred that a client sends its metrics to the same broker
> >> connection
> >> but actually it is able to send them to any broker. As a result, if a
> >> broker becomes
> >> unhealthy, the client can push its metrics to any other broker. It seems
> >> to me that
> >> pushing to KRaft controllers instead just has the effect of increasing
> the
> >> load on
> >> the controllers, while still having the characteristic that an unhealthy
> >> controller
> >> would present inconvenience for collecting metrics.
> >>
> >> 2) When the `PushTelemetryRequest.Terminating` flag is set, the standard
> >> request
> >> throttling is not disabled. The metrics rate-limiting based on the push
> >> interval is
> >> not applied in this case for a single request for the combination of
> >> client instance ID
> >> and subscription ID.
> >>
> >> (I have corrected the KIP text because it erroneously said “client ID
> and
> >> subscription ID”.)
> >>
> >> 3) While this is a theoretical problem, I’m not keen on adding yet more
> >> configurations
> >> to the broker or client. The `interval.ms` configuration on the
> >> CLIENT_METRICS
> >> resource could perhaps have a minimum and maximum value to prevent
> >> accidental
> >> misconfiguration.
> >>
> >> 4) One of the reasons that this KIP has taken so long to get to this
> stage
> >> is that
> >> it tried to do many things all at once. So, it’s greatly simplified
> >> compared with
> >> 6 months ago. I can see the value of collecting client configurations
> for
> >> problem
> >> determination, but I don’t want to make this KIP more complicated. I
> think
> >> the
> >> idea has merit as a separate follow-on KIP. I would be happy to
> collaborate
> >> with you on this.
> >>
> >> 5) The default is set to 5 minutes to minimise the load on the broker
> for
> >> situations
> >> in which the administrator didn’t set an interval on a metrics
> >> subscription. To
> >> use an interval of 1 minute, it is only necessary to set `interval.ms`
> on
> >> the metrics
> >> subscription to 60000ms.
> >>
> >> 6) Uncompressed data is always supported. The KIP says: "The
> >> CompressionType of NONE will not be present in the response from the
> >> broker, though the broker does support uncompressed client telemetry
> >> if none of the accepted compression codecs are supported by the client.”
> >> So in your example, the client need only use CompressionType=NONE.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>> On 4 Aug 2023, at 14:04, Doğuşcan Namal <namal.dogus...@gmail.com>
> >> wrote:
> >>>
> >>> Hi Andrew, thanks a lot for this KIP. I was thinking of something
> similar
> >>> so thanks for writing this down 😊
> >>>
> >>>
> >>>
> >>> Couple of questions related to the design:
> >>>
> >>>
> >>>
> >>> 1. Can we investigate the option for using the Kraft controllers
> instead
> >> of
> >>> the brokers for sending metrics? The disadvantage of sending these
> >> metrics
> >>> directly to the brokers tightly couples metric observability to data
> >> plane
> >>> availability. If the broker is unhealthy then the root cause of an
> >> incident
> >>> is clear however on partial failures it makes it hard to debug these
> >>> incidents from the brokers perspective.
> >>>
> >>>
> >>>
> >>> 2. Ratelimiting will be disabled if the
> `PushTelemetryRequest.Terminating`
> >>> flag is set. However, this may cause unavailability on the broker if
> too
> >>> many clients are terminated at once, especially network threads could
> >>> become busy and introduce latency on the produce/consume on other
> >>> non-terminating clients connections. I think there is a room for
> >>> improvement here. If the client is gracefully shutting down, it could
> >> wait
> >>> for the request to be handled if it is being ratelimited, it doesn't
> need
> >>> to "force push" the metrics. For that reason, maybe we could define a
> >>> separate ratelimiting for telemetry data?
> >>>
> >>>
> >>>
> >>> 3. `PushIntervalMs` is set on the client side by a response from
> >>> `GetTelemetrySubscriptionsResponse`. If the broker sets this value to
> too
> >>> low, like 1msec, this may hog all of the client's activity and cause an
> >>> impact on the client side. I think we should introduce a configuration
> >> both
> >>> on the client and the broker side for the minimum and maximum numbers
> for
> >>> this value to fence out misconfigurations.
> >>>
> >>>
> >>>
> >>> 4. One of the important things I face during debugging the client side
> >>> failures is to understand the client side configurations. Can the
> client
> >>> send these configs during the GetTelemetrySubscriptions request as
> well?
> >>>
> >>>
> >>>
> >>> Small comments:
> >>>
> >>> 5. Default PushIntervalMs is 5 minutes. Can we make it 1 minute
> instead?
> >> I
> >>> think 5 minutes of aggregated data is not that helpful in the world of
> >>> telemetry 😊
> >>>
> >>> 6. UnsupportedCompressionType: Shall we fallback to non-compression
> mode
> >> in
> >>> that case? I think compression is nice to have, but non-compressed
> >>> telemetry data is valuable as well. Especially for low throughput
> >> clients,
> >>> compressing telemetry data may cause more CPU load than the actual data
> >>> plane work.
> >>>
> >>>
> >>> Thanks again.
> >>>
> >>> Doguscan
> >>>
> >>>
> >>>
> >>>> On Jun 13, 2023, at 8:06 AM, Andrew Schofield
> >>>
> >>>> <andrew_schofield_j...@outlook.com> wrote:
> >>>
> >>>>
> >>>
> >>>> Hi,
> >>>
> >>>> I would like to start a new discussion thread on KIP-714: Client
> metrics
> >>> and
> >>>
> >>>> observability.
> >>>
> >>>>
> >>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>>
> >>>>
> >>>
> >>>> I have edited the proposal significantly to reduce the scope. The
> >> overall
> >>>
> >>>> mechanism for client metric subscriptions is unchanged, but the
> >>>
> >>>> KIP is now based on the existing client metrics, rather than
> introducing
> >>> new
> >>>
> >>>> metrics. The purpose remains helping cluster operators
> >>>
> >>>> investigate performance problems experienced by clients without
> >> requiring
> >>>
> >>>> changes to the client application code or configuration.
> >>>
> >>>>
> >>>
> >>>> Thanks,
> >>>
> >>>> Andrew
>
>
>
