Hi Magnus,

Thanks for the KIP. This is certainly something I've been wishing for for a 
while.

Maybe we should emphasize more that the metrics that are being gathered here 
are Kafka metrics, not general application business logic metrics. That seems 
like a point of confusion in some of the replies here. The analogy with a 
telecom gathering metrics about a DSL modem is a good one. These are really 
metrics about the Kafka cluster itself, very similar to the metrics we expose 
about the broker, controller, and so forth.

In my experience, most users want their Kafka clients to be "plug and play" -- 
they want to start up a Kafka client, and do some things. Their focus is on 
their application, not on the details of the infrastructure. If something 
goes wrong, they want the Kafka team to diagnose the problem and fix it, or at 
least tell them what the issue is. When the Kafka team tells them they need to 
install and maintain a third-party metrics system to diagnose the problem, this 
can be a very big disappointment. Many users don't have this level of expertise.

A few critiques:

- As I wrote above, I think this could benefit a lot by being split into 
several RPCs. A registration RPC, a report RPC, and an unregister RPC seem like 
logical choices.
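
To make that concrete, the split could look roughly like this in the 
protocol schema format. This is only an illustrative sketch -- the apiKeys, 
names, and fields here are invented, not a proposal for the actual schema:

```json
// Illustrative sketch only -- apiKeys, names, and fields are made up.
{ "apiKey": 100, "type": "request", "name": "RegisterClientMetricsRequest",
  "validVersions": "0",
  "fields": [
    { "name": "AcceptedContentTypes", "type": "[]int8", "versions": "0+",
      "about": "Content types the client can produce, as enum codes." }
  ]
},
{ "apiKey": 101, "type": "request", "name": "PushClientMetricsRequest",
  "validVersions": "0",
  "fields": [
    { "name": "ClientInstanceId", "type": "uuid", "versions": "0+",
      "about": "Broker-assigned instance id from the register response." },
    { "name": "Metrics", "type": "bytes", "versions": "0+",
      "about": "Serialized metrics in the negotiated content type." }
  ]
},
{ "apiKey": 102, "type": "request", "name": "UnregisterClientMetricsRequest",
  "validVersions": "0",
  "fields": [
    { "name": "ClientInstanceId", "type": "uuid", "versions": "0+" }
  ]
}
```

Each RPC then stays small, and the push request carries nothing but the 
instance id and the payload.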

- I don't think the client should be able to choose its own UUID. This adds 
complexity and introduces a chance that clients will choose an ID that is not 
unique. We already have an ID that the client itself supplies (clientID) so 
there is no need to introduce another such ID.

- I might be misunderstanding something here, but my reading of this is that 
the client chooses what metrics to send and the broker filters that on the 
broker-side. I think this is backwards -- the broker should inform the client 
about what it wants, and the client should send only that data. (Of course, the 
client may also not know what the broker is asking for, in which case it can 
choose to not send the data). We shouldn't have clients pumping out data that 
nobody wants to read. (sorry if I misinterpreted and this is already the 
case...)
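
To show the direction I mean, here is a rough client-side sketch of the 
broker-driven model, under the assumption that subscriptions are expressed 
as metric-name prefixes (none of these names come from the KIP):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: client-side filtering when the broker drives the subscription.
// The broker's response carries name prefixes; the client pushes only
// metrics that match, so nothing is sent that nobody wants to read.
public class MetricsSubscription {
    // Returns the subset of collected metrics whose names match any
    // broker-requested prefix.
    public static Map<String, Double> select(Map<String, Double> collected,
                                             List<String> requestedPrefixes) {
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, Double> e : collected.entrySet()) {
            for (String prefix : requestedPrefixes) {
                if (e.getKey().startsWith(prefix)) {
                    out.put(e.getKey(), e.getValue());
                    break;
                }
            }
        }
        return out;
    }
}
```

A metric the broker never subscribed to simply never leaves the client.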

- In general the schema seems to have a bad case of string-itis. UUID, content 
type, and requested metrics are all strings. Since these messages will be sent 
very frequently, it's quite costly to use strings for all these things. We have 
a type for UUID, which uses 16 bytes -- let's use that type for client instance 
ID, rather than a string which will be much larger. Also, since we already send 
clientID in the message header, there is no need to include it again in the 
instance ID.
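
For a sense of the size difference, here is a small illustration using plain 
java.util.UUID (standing in for Kafka's own Uuid type):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Sketch: the same id as a fixed 16-byte field vs. its string form.
public class InstanceIdSize {
    public static int binarySize(UUID id) {
        // Two 64-bit halves: always exactly 16 bytes on the wire.
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        return buf.position();
    }

    public static int stringSize(UUID id) {
        // The canonical textual form is 36 bytes, before counting the
        // length prefix a string field would also need.
        return id.toString().getBytes(StandardCharsets.UTF_8).length;
    }
}
```

That is more than double the bytes per message, on a message we expect to 
send very frequently.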

- I think it would also be nice to have an enum or something for 
AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to 
these categories will require KIPs, so it should be straightforward for the 
project to just have an enum that allows us to communicate these as ints.
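
A sketch of what that could look like -- the variants and codes here are 
invented for illustration; the real values would be fixed by the KIP:

```java
// Sketch: content types as a KIP-governed enum sent as an int8,
// instead of a free-form string on every request. Values are invented.
public enum MetricsContentType {
    OTLP_PROTOBUF((byte) 0),
    OTLP_JSON((byte) 1);

    private final byte code;

    MetricsContentType(byte code) {
        this.code = code;
    }

    public byte code() {
        return code;
    }

    // Decode the wire value; unknown codes fail fast, which is what
    // forces new additions through a KIP.
    public static MetricsContentType fromCode(byte code) {
        for (MetricsContentType t : values()) {
            if (t.code == code) {
                return t;
            }
        }
        throw new IllegalArgumentException("Unknown content type: " + code);
    }
}
```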

- Can you talk about whether you are adding any new library dependencies to the 
Kafka client? It seems like you'd want to add opencensus / opentelemetry, if we 
are using that format here.

- Standard client resource labels: can we send these only in the registration 
RPC?

best,
Colin

On Wed, Jun 16, 2021, at 08:27, Magnus Edenhill wrote:
> Hi Ryanne,
> 
> this proposal stems from a need to improve troubleshooting Kafka issues.
> 
> As it currently stands, when an application team is experiencing Kafka
> service degradation,
> or the Kafka operator is seeing misbehaving clients, there are plenty of
> steps that need
> to be taken before any client-side metrics can be observed at all, if at
> all:
>  - Is the application even collecting client metrics? If not it needs to be
> reconfigured or implemented, and restarted;
>    a restart may have business impact, and may also temporarily? remedy the
> problem without giving any further insight
>    into what was wrong.
>  - Are the desired metrics collected? Where are they stored? For how long?
> Is there enough correlating information
>    to map it to cluster-side metrics and events? Does the application
> on-call know how to find the collected metrics?
>  - Export and send these metrics to whoever knows how to interpret them. In
> what format? Are all relevant metadata fields
>    provided?
> 
> The KIP aims to solve all these obstacles by giving the Kafka operator the
> tools to collect this information.
> 
> Regards,
> Magnus
> 
> 
> Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan <ryannedo...@gmail.com>:
> 
> > Magnus, I think such a substantial change requires more motivation than is
> > currently provided. As I read it, the motivation boils down to this: you
> > want your clients to phone-home unless they opt-out. As stated in the KIP,
> > "there are plenty of existing solutions [...] to send metrics [...] to a
> > collector", so the opt-out appears to be the only motivation. Am I missing
> > something?
> >
> > Ryanne
> >
> > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill <mag...@edenhill.se> wrote:
> >
> > > Hey all,
> > >
> > > I'm proposing KIP-714 to add remote Client metrics and observability.
> > > This functionality will allow centralized monitoring and troubleshooting
> > of
> > > clients and their internals.
> > >
> > > Please see
> > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > >
> > > Looking forward to your feedback!
> > >
> > > Regards,
> > > Magnus
> > >
> >
> 
