On Thu, 17 Jun 2021 at 00:52, Colin McCabe <cmcc...@apache.org> wrote:

> Hi Magnus,
>
> Thanks for the KIP. This is certainly something I've been wishing for for
> a while.
>
> Maybe we should emphasize more that the metrics that are being gathered
> here are Kafka metrics, not general application business logic metrics.
> That seems like a point of confusion in some of the replies here. The
> analogy with a telecom gathering metrics about a DSL modem is a good one.
> These are really metrics about the Kafka cluster itself, very similar to
> the metrics we expose about the broker, controller, and so forth.
>

Good point, will make this more clear in the KIP.


>
> In my experience, most users want their Kafka clients to be "plug and
> play" -- they want to start up a Kafka client, and do some things. Their
> focus is on their application, not on the details of the infrastructure. If
> something goes wrong, they want the Kafka team to diagnose the problem
> and fix it, or at least tell them what the issue is. When the Kafka team
> tells them they need to install and maintain a third-party metrics system
> to diagnose the problem, this can be a very big disappointment. Many users
> don't have this level of expertise.
>
> A few critiques:
>
> - As I wrote above, I think this could benefit a lot by being split into
> several RPCs. A registration RPC, a report RPC, and an unregister RPC seem
> like logical choices.
>

Responded to this in your previous mail, but in short I think a single
request is sufficient and keeps the implementation complexity / state down.


>
> - I don't think the client should be able to choose its own UUID. This
> adds complexity and introduces a chance that clients will choose an ID that
> is not unique. We already have an ID that the client itself supplies
> (clientID) so there is no need to introduce another such ID.
>

The CLIENT_INSTANCE_ID (which is a combination of the client.id and a UUID)
is actually generated by the receiving broker on first contact.
The need for a new unique semi-random id is outlined in the KIP, but in
short: the client.id is not unique, and we need something unique that is
still prefix-matchable to the client.id. That way subscriptions can be
added either by prefix-matching on just the client.id (which may match one
or more client instances), or by exact matching, which will match one
specific client instance.
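To illustrate the matching semantics (this is just a sketch in Python; the
separator and the function name are mine, not from the KIP):

```python
def matches(pattern: str, client_instance_id: str) -> bool:
    """A pattern that is just a client.id prefix-matches every instance of
    that client.id; a full instance id matches exactly one instance.
    Assumes the instance id is client.id + "-" + UUID (illustrative)."""
    return (client_instance_id == pattern
            or client_instance_id.startswith(pattern + "-"))

# A client.id pattern selects any instance of that client...
print(matches("rdkafka", "rdkafka-1a2b3c"))          # True
# ...while a full instance id selects exactly one.
print(matches("rdkafka-1a2b3c", "rdkafka-1a2b3c"))   # True
# A different client.id does not prefix-match.
print(matches("rdkafka", "otherapp-9f8e7d"))         # False
```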



> - I might be misunderstanding something here, but my reading of this is
> that the client chooses what metrics to send and the broker filters that on
> the broker-side. I think this is backwards -- the broker should inform the
> client about what it wants, and the client should send only that data. (Of
> course, the client may also not know what the broker is asking for, in
> which case it can choose to not send the data). We shouldn't have clients
> pumping out data that nobody wants to read. (sorry if I misinterpreted and
> this is already the case...)
>

This is indeed completely controlled from the cluster side:
The cluster operator (et al.) configures client metrics subscriptions,
which are basically: what metrics to collect, at what interval, from which
client instance(s).
These subscriptions are then propagated to matching clients, which in turn
start pushing the requested metrics (but nothing else) to the broker.
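Roughly, a subscription could be thought of as the following sketch (field
and function names are illustrative only, not the KIP's schema):

```python
from dataclasses import dataclass

@dataclass
class MetricsSubscription:
    match: str       # client.id prefix or exact client instance id
    metrics: list    # metric name prefixes the operator wants collected
    interval_ms: int # requested push interval

def metrics_for_client(subscriptions, client_instance_id):
    # The client ends up pushing the union of all metrics requested by
    # the subscriptions whose match selects this client instance.
    requested = set()
    for sub in subscriptions:
        if client_instance_id.startswith(sub.match):
            requested.update(sub.metrics)
    return sorted(requested)

subs = [
    MetricsSubscription("rdkafka", ["client.producer.latency"], 60000),
    MetricsSubscription("rdkafka-1a2b", ["client.connection.errors"], 30000),
]
# This instance matches both subscriptions, so it pushes both metrics.
print(metrics_for_client(subs, "rdkafka-1a2b"))
# A non-matching client pushes nothing.
print(metrics_for_client(subs, "otherapp-9f8e"))
```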



> - In general the schema seems to have a bad case of string-itis. UUID,
> content type, and requested metrics are all strings. Since these messages
> will be sent very frequently, it's quite costly to use strings for all
> these things. We have a type for UUID, which uses 16 bytes -- let's use
> that type for client instance ID, rather than a string which will be much
> larger. Also, since we already send clientID in the message header, there
> is no need to include it again in the instance ID.
>

As explained above we need the client.id in the CLIENT_INSTANCE_ID. And I
don't think the overhead of this one string per request is going to be much
of an issue; typical metrics push intervals are probably in the >60s range.
If this becomes a problem we could use a per-connection identifier that the
broker translates to the client instance id before pushing metrics upwards
in the system.


> - I think it would also be nice to have an enum or something for
> AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to
> these categories will require KIPs, so it should be straightforward for the
> project to just have an enum that allows us to communicate these as ints.
>

I'm thinking this might be overly constraining. The broker doesn't parse or
handle the received metrics data itself, but just passes it on to the
metrics plugin. Using an enum would require a KIP and a broker upgrade
whenever the metrics plugin supports a newer version of OTLP.
It is probably better if we don't strictly control the metrics format
itself.
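In other words, the broker only shuttles an opaque (content type, payload)
pair to the plugin; only the plugin decodes it. A minimal sketch (the
interface and names here are mine, purely for illustration):

```python
class MetricsPlugin:
    """Hypothetical broker-side metrics plugin interface."""
    def export(self, client_instance_id, content_type, payload):
        raise NotImplementedError

class CollectingPlugin(MetricsPlugin):
    def __init__(self):
        self.received = []
    def export(self, client_instance_id, content_type, payload):
        # Only the plugin interprets content_type, so it can support a
        # newer OTLP version without any broker-side change.
        self.received.append((client_instance_id, content_type, len(payload)))

def broker_handle_push(plugin, client_instance_id, content_type, payload):
    # The broker forwards the serialized metrics blob without parsing it.
    plugin.export(client_instance_id, content_type, payload)

plugin = CollectingPlugin()
broker_handle_push(plugin, "rdkafka-1a2b", "application/x-otlp;v=0.9", b"\x00\x01\x02")
print(plugin.received)
```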



> - Can you talk about whether you are adding any new library dependencies
> to the Kafka client? It seems like you'd want to add opencensus /
> opentelemetry, if we are using that format here.
>

Yeah, as we get closer to consensus, more implementation-specific details
will be added to the KIP.



>
> - Standard client resource labels: can we send these only in the
> registration RPC?
>

These labels are part of the serialized OTLP data, which means they would
need to be unpacked and repacked (including compression) by the broker (or
metrics plugin), which I believe is more costly than sending them with each
request.

Thanks,
Magnus
