On Mon, Sep 20, 2021 at 20:41, Colin McCabe <cmcc...@apache.org> wrote:

> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for configured
> > metrics subscriptions,
> > and one API for pushing the collected metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed;
> > I’ve added a SubscriptionId for this purpose (which could be a checksum
> of
> > the configured metrics subscription), this id is sent to the client along
> > with the metrics subscription, and the client sends it back to the broker
> > when pushing metrics. If the broker finds the pushed subscription id to
> > differ from what is expected it will return an error to the client, which
> > triggers the client to retrieve the new subscribed metrics and an updated
> > subscription id. The generation of the subscriptionId is opaque to the
> > client.
> >
>
> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>

I'm not sure how it relates to ApiVersion?
The SubscriptionId is a rather simple and stateless way to make sure the
client is using the most recently configured metrics subscriptions.
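
To make that concrete, here's a rough sketch of what the broker-side check
could look like (Java; the CRC32 choice and all names are purely
illustrative, since the actual id generation is an implementation detail of
the broker and opaque to the client):

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch only: the id derivation is opaque to the client and an
// implementation detail of the broker; CRC32 is just one possible choice.
class SubscriptionIdSketch {
    // Derive the SubscriptionId as a checksum over the sorted metric name
    // prefixes of the currently configured metrics subscription.
    static long computeSubscriptionId(List<String> subscribedMetricPrefixes) {
        CRC32 crc = new CRC32();
        subscribedMetricPrefixes.stream().sorted()
            .forEach(p -> crc.update(p.getBytes(StandardCharsets.UTF_8)));
        return crc.getValue();
    }

    // On PushTelemetryRequest a plain inequality check is sufficient; on a
    // mismatch the broker returns an error and the client re-fetches the
    // subscription together with the new id.
    static boolean isCurrent(long pushedSubscriptionId, List<String> currentSubscription) {
        return pushedSubscriptionId == computeSubscriptionId(currentSubscription);
    }
}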


>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.


While the standard metrics subscription is rarely updated, the second
use-case of troubleshooting a specific client will require the metrics
subscription to be updated and propagated in a timely manner.
Closing all client connections on the broker side is quite an intrusive
thing to do, will create a thundering herd of reconnects, and doesn't
really serve much of a purpose since the metrics in
this proposal are explicitly not bound to a single broker connection, but
to a client instance, allowing any broker connection to be used. This is a
feature of the proposal.



> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >    StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> > newly generated instance id from the broker.
> > }
>
> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest to rename the RPC to RegisterClient.
>

Registering a client is perhaps a good idea, but it is a bigger undertaking
that also involves other parts of the protocol (we wouldn't need to send the
ClientId in the protocol header, for instance), and is thus outside the scope
of this proposal. What we are aiming to provide here is transmission of
client metrics through any broker that is as stateless as possible.


>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>

I believe clientInstanceId is descriptive of what it is: the identification
of a specific client instance or incarnation.
There's some prior art here, e.g., group.id vs group.instance.id (in that
case it made sense to have the instance configurable, but we want metrics to
be zero-conf).

+1 on using the UUID type as opposed to string.


>
> > 6. > PushTelemetryRequest{
> >        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
> >        SubscriptionId = 0x234adf34,
> >        ContentType = OTLPv8|ZSTD,
> >        Terminating = False,
> >        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
> >   }
>
> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
> should get rid of, as I said above.
>

The overhead of resending the client instance id (16 bytes) is minimal in
relation to the metrics data itself, and it is typically only sent every
60s or so.

As for caching it on the connection: the client only requests the
clientInstanceId once per client instance lifetime, not per broker
connection, so the client does not need to re-register itself on each
connection.
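
A minimal sketch of that client-side behaviour, assuming hypothetical
request/response types standing in for the real GetTelemetrySubscriptions RPC
(they are not its actual definition):

import java.util.UUID;
import java.util.function.Function;

// Hypothetical sketch; these records are stand-ins for the real RPC shapes.
record GetSubscriptionsRequest(UUID clientInstanceId) {}
record GetSubscriptionsResponse(UUID clientInstanceId, long subscriptionId) {}

class TelemetryState {
    private final Function<GetSubscriptionsRequest, GetSubscriptionsResponse> anyBroker;
    private UUID clientInstanceId; // assigned once per client instance, not per connection

    TelemetryState(Function<GetSubscriptionsRequest, GetSubscriptionsResponse> anyBroker) {
        this.anyBroker = anyBroker;
    }

    UUID clientInstanceId() {
        if (clientInstanceId == null) {
            // First call carries a null id; the broker generates and returns one.
            clientInstanceId = anyBroker.apply(new GetSubscriptionsRequest(null))
                                        .clientInstanceId();
        }
        return clientInstanceId; // re-used for every push, on any broker connection
    }
}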

The SubscriptionId is used to make sure the client's metrics subscription
is up to date. It is pretty much a configuration version, but without the
need for sequentiality checks; a simple inequality check by the broker is
sufficient.



> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we should try to avoid
> creating "turduckens" of protocol X containing a buffer serialized with
> protocol Y, containing a protocol serialized with protocol Z. These aren't
> conducive to a good implementation, and make it harder for people to write
> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>

This is covered in the Rejected Alternatives section; we do not want to
duplicate the efforts of the OpenTelemetry project, but rather reap the
benefits of their work.
There's also future functionality in this space that OpenTelemetry
provides: events and tracing.


> If we do compression on Kafka RPC, I would prefer that we do it a more
> generic way that applies to all control messages, not just this one. I also
> doubt we need to support lots and lots of different compression codecs, at
> first at least.
>


Generic compression as part of the Kafka framing makes sense, but is
outside the scope of this proposal.
However, as with ProduceRequest, there's a benefit to having just the data
parts of the request compressed, to avoid the need for decompression and
recompression.
A metrics plugin on the broker may, for example, simply forward the received
compressed metrics as-is to an upstream system.



>
> Another thing I'd like to understand is whether we truly need
> "terminating" (or anything like it). I'm still confused about how the
> backend could use this. Keep in mind that we may receive it on multiple
> brokers (or not receive it at all). We may receive more stuff about client
> XYZ from broker 1 after we have already received a "terminated" for client
> XYZ from broker 2.
>

The Terminating flag is useful for the receiving metrics system. E.g., it
can be used to disable alarms for missing data points.
From the broker's perspective it can be used to delete its metrics
subscription cache entry for the client, but those should have timeouts
anyway.

Maybe we should move this field to the metrics data itself as a label.
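
If we did move it, it could look roughly like this (OpenTelemetry Java API;
the "terminating" attribute name is made up for illustration):

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;

// Hypothetical sketch: the point is only that the final push could be marked
// in the metrics data itself rather than in the protocol.
class TerminatingLabelSketch {
    static Attributes finalPushLabels() {
        return Attributes.of(AttributeKey.booleanKey("terminating"), true);
    }
}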



> > Makes sense, in the updated proposal above I changed ContentType to a
> > bitmask.
> >
>
> Hmm. You might have forgotten to save your changes. It's still listed as a
> string in the KIP.
>

I just updated the proposal in this thread; I will update the KIP as we get
closer to agreement. :)



> > > Hmm, that data is about 10 fields, most of which are strings. It
> certainly
> > > adds a lot of overhead to resend it each time.
> > >
> > > I don't follow the comment about unpacking and repacking -- since the
> > > client registered with the broker it already knows all this
> information, so
> > > there's nothing to unpack or repack, except from memory. If it's more
> > > convenient to serialize it once rather than multiple times, that is an
> > > implementation detail of the broker side plugin, which we are not
> > > specifying here anyway.
> > >
> >
> > The current proposal is pretty much stateless on the broker, it does not
> > need to hold any state for a client (instance), and no state
> > synchronization is needed
> > between brokers in the cluster, which allows a client to seamlessly send
> > metrics to any broker it wants and keeps the API overhead down (no need
> to
> > re-register when
> > switching brokers for instance).
> >
> > We could remove the labels that are already available to the broker on a
> > per-request basis or that it already maintains state for:
> >  - client_id
> >  - client_instance_id
> >  - client_software_*
> >
> > Leaving the following to still be included:
> >  - group_id
> >  - group_instance_id
> >  - transactional_id
> >   etc..
> >
> > What do you think of that?
>
> Yes, that might be a reasonable balance. I don't think we currently
> associate group_id with a connection, so it would be reasonable to re-send
> it in the telemetry request.
>
> What I would suggest next is to set up a protocol definition for the
> metrics you are sending over the wire. For example, you could specify
> something like this for producer metrics:
>
> > { "name": "clientProducerRecordBytes", "type": "int64",
> > "versions": "0+", "taggedVersions": "0+", "tag": 0,
> >  "about": "client.producer.record.bytes" },
> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
> > "versions": "0+", "taggedVersions": "0+", "tag": 1,
> > "about": "client.producer.queue.max.messages" },
>
> This specifies what it is (int32, int64, etc.), how it's being sent, etc.
>

There's an extensive list of standard metrics and their types in the KIP,
but these will use the OpenTelemetry format, so there is no point in
describing them as Kafka protocol fields.
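
As a purely illustrative example of how one of those standard metrics maps to
OpenTelemetry rather than to a Kafka protocol field (using the OpenTelemetry
Java API only to show name, type and unit; a real client would serialize the
corresponding OTLP protobuf itself, and the scope name and description below
are made up):

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// Illustrative sketch only; not how a client would actually emit metrics.
class OtlpMetricSketch {
    public static void main(String[] args) {
        Meter meter = GlobalOpenTelemetry.get()
                                         .getMeterProvider()
                                         .get("org.apache.kafka.client");

        LongCounter recordBytes = meter.counterBuilder("client.producer.record.bytes")
                                       .setUnit("By") // bytes, per UCUM
                                       .setDescription("Serialized record bytes produced")
                                       .build();

        recordBytes.add(1024); // incremented as records are produced
    }
}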


Thanks for all your valuable input, Colin!

Regards,
Magnus

>
>
