Hi! I have a few thoughts on this KIP. First, I'd like to thank you for your 
work
and writeup, it's clear that a lot of thought went into this and it's very 
thorough!
However, I'm not convinced it's the right approach from a fundamental level.

Fundamentally, this KIP seems like somewhat of a solution to an organizational
problem. Metrics are organizational concerns, not Kafka operator concerns.
Clients should make it easy to plug in metrics (this is the approach I take in
my own client), and organizations should have processes such that all clients
gather and ship metrics how that organization desires. If an organization is
set up correctly, there is no reason for metrics to be forwarded through Kafka.
This feels like a solution to an organization not having properly set up how
its processes ship metrics; in some ways it is an overbroad solution, and in
other ways it does not cover the entire problem.
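
To make "easy to plug in" concrete: the Java client already exposes a pluggable
MetricsReporter interface (my own client takes a similar hook-based approach),
so an org can own a small reporter along the lines of the sketch below. This is
only a rough sketch: the group name used in the filter is just an example, and
pushGauge stands in for whatever exporter the org already runs; it is not a
Kafka API.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsReporter;

// Hypothetical org-owned reporter: the client hands us every metric it
// registers, we keep only what the org cares about, and a periodic scrape
// ships current values through the org's existing pipeline.
public class OrgMetricsReporter implements MetricsReporter {

    private final Map<MetricName, KafkaMetric> relevant = new ConcurrentHashMap<>();

    @Override
    public void configure(Map<String, ?> configs) {
        // Read org-specific settings (endpoint, allow-list, ...) if desired.
    }

    @Override
    public void init(List<KafkaMetric> metrics) {
        metrics.forEach(this::metricChange); // metrics registered before we attached
    }

    @Override
    public void metricChange(KafkaMetric metric) {
        // The org decides here what is relevant; nothing else ever leaves the app.
        // "producer-topic-metrics" is just an example filter.
        if ("producer-topic-metrics".equals(metric.metricName().group())) {
            relevant.put(metric.metricName(), metric);
        }
    }

    @Override
    public void metricRemoval(KafkaMetric metric) {
        relevant.remove(metric.metricName());
    }

    @Override
    public void close() {
        relevant.clear();
    }

    // Called from the org's existing scrape/flush loop (Prometheus, statsd,
    // OTel, ...); pushGauge is a stand-in for that exporter, not a Kafka API.
    public void scrape() {
        relevant.forEach((name, metric) -> {
            Object value = metric.metricValue();
            if (value instanceof Number) {
                pushGauge(name.name(), ((Number) value).doubleValue(), name.tags());
            }
        });
    }

    private void pushGauge(String name, double value, Map<String, String> tags) {
        // Ship the value however the org ships everything else.
    }
}

In practice the reporter would hand itself (or its scrape method) to the org's
metrics library, e.g. via a registry set up in configure(), but the point
stands: the org controls exactly what is gathered and how it is shipped.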

From the perspective of Kafka operators, it is easy to see that this KIP is
nice in that it just dictates what clients should support for metrics and that
the metrics should ship through Kafka. But, from the perspective of an
observability team, this workflow is basically hijacking the standard flow that
organizations may have. I would rather have applications collect metrics and
ship them the same way every other application does. I'd rather not have to
configure additional plugins within Kafka to take metrics and forward them.

More importantly, this KIP prescribes cardinality problems, requires that to
officially support the KIP a client must support all relevant metrics within
the KIP, and requires that a client cannot support other metrics unless those
other metrics also go through a KIP process. It is difficult to imagine all of
these metrics being relevant to every organization, and there is no way for an
organization to filter what is relevant within the client. Instead, the
filtering is pushed downstream, meaning more network IO and more CPU cost to
filter out what is irrelevant and aggregate what needs to be aggregated, and
more time for an organization to set up whatever component ends up doing that
filtering and aggregating. Contrast this with a client that enables hooking in
to capture the numbers that are relevant within an org itself: the org can
gather what they want, ship only what they want, and ship directly to the
observability system they have already set up. As an aside, it may also be
wise to avoid shipping metrics through Kafka about client interaction with
Kafka, because if Kafka is having problems, then orgs lose insight into those
problems. This would be like statuspage using itself for status on its own
systems.

Another downside is that by dictating the important metrics, this KIP has two
choices: try to choose what is important to every org and inevitably leave out
something important to somebody else, or just add everything and let the orgs
filter. This KIP mostly looks to go with the latter approach, meaning
orgs will be shipping & filtering. With hooks, an org would be able to gather
exactly what they want.

As well, I expect that org applications have metrics on their own state
outside of the Kafka client, and are already sending those
non-Kafka-client-related metrics outbound to observability systems. If a Kafka
client provided hooks, then users could just gather the additional relevant
Kafka client metrics and ship those metrics the same way they do all of their
other metrics. It feels a bit odd for a Kafka client to have its own separate
way of forwarding metrics. Another benefit of hooks in clients is that
organizations do not _have_ to set up additional plugins to forward metrics
from Kafka. Hooks avoid extra organizational work.
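
Wiring such a reporter in is plain client configuration, so the client's
metrics ride the exact same path as every other application metric. Again only
a sketch, assuming the reporter above lives at the (hypothetical) class name
com.example.metrics.OrgMetricsReporter and is enabled through the standard
metric.reporters config:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerWithOrgMetrics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Plug the org-owned reporter (sketched earlier) into the client.
        // No broker-side plugins, no separate metric forwarding path.
        props.put("metric.reporters", "com.example.metrics.OrgMetricsReporter");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... normal application code; Kafka client metrics now flow
            // through the org's existing observability pipeline.
        }
    }
}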

The option that the KIP provides for users of clients to opt out of metrics may
avoid some of the above issues (by just disabling things at the user level),
but that's not really great from the perspective of client authors, because the
existence of this KIP forces authors to either just not implement the KIP, or
take on the added complexity within their clients. Further, from an operator
perspective, if I prefer that clients ship metrics through the systems they
already have in place, I now have to expect that anything that uses librdkafka
or the official Java client will be shipping me metrics that I have to deal
with (since the KIP is enabled by default).

Lastly, I'm a little wary that this KIP may stem from a product goal of
Confluent: since nearly everything uses librdkafka or the Java client, then by
having clients send metrics by default, Confluent gets an easy way to provide
metric panels for a nice cloud UI. If any client does not want to support these
metrics, and then a user wonders why these hypothetical panels have no metrics,
then Confluent can just reply "use a supported client". Even if this
(potentially unlikely) scenario is true, then hooks would still be a great
alternative, because then Confluent could provide drop-in hooks for any client
and the end result of easy panels would be the same.

In summary,

- Metrics are more of an organizational concern, not specifically a broker
  operator concern.

- The proposal seems to hijack how metrics are gathered within organizations.

- I don't think KIPs should dictate which metrics should be gathered and which
  should not. Clients instead should make it easy for users to gather anything
  they could be interested in, and ignore anything they are not.

- I think hooks are more extensible, more exact, and fit better into
  organizational workflows.

On 2021/06/02 12:45:45, Magnus Edenhill <mag...@edenhill.se> wrote: 
> Hey all,
> 
> I'm proposing KIP-714 to add remote Client metrics and observability.
> This functionality will allow centralized monitoring and troubleshooting of
> clients and their internals.
> 
> Please see
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> 
> Looking forward to your feedback!
> 
> Regards,
> Magnus
> 
