Re: [DISCUSS] KIP-714: Client metrics and observability

2021-07-22 Thread Feng Min
On Wed, Jul 21, 2021 at 6:17 PM Colin McCabe  wrote:

> On Tue, Jun 29, 2021, at 07:22, Magnus Edenhill wrote:
> > Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe :
> > > A few critiques:
> > >
> > > - As I wrote above, I think this could benefit a lot by being split into
> > > several RPCs. A registration RPC, a report RPC, and an unregister RPC
> > > seem like logical choices.
> > >
> >
> > Responded to this in your previous mail, but in short I think a single
> > request is sufficient and keeps the implementation complexity / state down.
> >
>
> Hi Magnus,
>
> I still suspect that trying to do everything with a single RPC is more
> complex than using multiple RPCs.
>
> Can you go into more detail about how the client learns what metrics it
> should send? This was the purpose of the "registration" step in my scheme
> above.
>
> It seems quite awkward to combine an RPC for reporting metrics with an
> RPC for finding out what metrics are configured to be reported. For
> example, how would you build a tool to check what metrics are configured to
> be reported? Does the tool have to report fake metrics, just because
> there's no other way to get back that information? Seems wrong. (It would
> be a bit like combining createTopics and listTopics for "simplicity")
>

+1 on separate RPCs for metric discovery and metric report. I actually think
it brings the complexity/state down compared with a single RPC.


>
> > > - I don't think the client should be able to choose its own UUID. This
> > > adds complexity and introduces a chance that clients will choose an ID
> > > that is not unique. We already have an ID that the client itself supplies
> > > (clientID) so there is no need to introduce another such ID.
> > >
> >
> > The CLIENT_INSTANCE_ID (which is a combination of the client.id and a UUID)
> > is actually generated by the receiving broker on first contact.
> > The need for a new unique semi-random id is outlined in the KIP, but in
> > short; the client.id is not unique, and we need something unique that still
> > is prefix-matchable to the client.id so that we can add subscriptions
> > either using prefix-matching of just the client.id (which may match one or
> > more client instances), and exact matching which will match one specific
> > client instance.
>
> Hmm... the client id is already sent in every RPC as part of the header.
> It's not necessary to send it again as part of one of the other RPC fields,
> right?
>
> More generally, why does the client instance ID need to be
> prefix-matchable? That seems like an implementation detail of the metrics
> collection system used on the broker side. Maybe someone wants to group by
> things other than client IDs -- perhaps client versions, for instance. By
> the same argument, we should put the client version string in the client
> instance ID, since someone might want to group by that. Or maybe we should
> include the hostname, and the IP, and, and, and... You see the issue here.
> I think we shouldn't get involved in this kind of decision -- if we just
> pass a UUID, the broker-side software can group it or prefix it however it
> wants internally.
>
> > > - In general the schema seems to have a bad case of string-itis. UUID,
> > > content type, and requested metrics are all strings. Since these messages
> > > will be sent very frequently, it's quite costly to use strings for all
> > > these things. We have a type for UUID, which uses 16 bytes -- let's use
> > > that type for client instance ID, rather than a string which will be much
> > > larger. Also, since we already send clientID in the message header, there
> > > is no need to include it again in the instance ID.
> > >
> >
> > As explained above we need the client.id in the CLIENT_INSTANCE_ID. And I
> > don't think the overhead of this one string per request is going to be much
> > of an issue,
> > typical metric push intervals are probably in the >60s range.
> > If this becomes a problem we could use a per-connection identifier that the
> > broker translates to the client instance id before pushing metrics upwards
> > in the system.
> >
>
> This is actually an interesting design question -- why not use a
> per-TCP-connection identifier, rather than a per-client-instance
> identifier? If we are grouping by other things anyway (clientID, principal,
> etc.) on the server side, do we need to maintain a per-process identifier
> rather than a per-connection one?
>
> >
> > > - I think it would also be nice to have an enum or something for
> > > AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to
> > > these categories will require KIPs, so it should be straightforward for the
> > > project to just have an enum that allows us to communicate these as ints.
> > >
> >
> > I'm thinking this might be overly constraining. The broker doesn't parse or
> > handle the received metrics data itself but just pushes it to the metrics
> > plugin; using an enum would require a KIP and broker upgrade if the metrics
> > plugin supports a newer version of OTLP.

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-07-21 Thread Colin McCabe
On Tue, Jun 29, 2021, at 07:22, Magnus Edenhill wrote:
> Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe :
> > A few critiques:
> >
> > - As I wrote above, I think this could benefit a lot by being split into
> > several RPCs. A registration RPC, a report RPC, and an unregister RPC seem
> > like logical choices.
> >
> 
> Responded to this in your previous mail, but in short I think a single
> request is sufficient and keeps the implementation complexity / state down.
> 

Hi Magnus,

I still suspect that trying to do everything with a single RPC is more complex 
than using multiple RPCs.

Can you go into more detail about how the client learns what metrics it should 
send? This was the purpose of the "registration" step in my scheme above.

It seems quite awkward to combine an RPC for reporting metrics with an RPC for
finding out what metrics are configured to be reported. For example, how would 
you build a tool to check what metrics are configured to be reported? Does the 
tool have to report fake metrics, just because there's no other way to get back 
that information? Seems wrong. (It would be a bit like combining createTopics 
and listTopics for "simplicity")
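
For illustration, here is a rough sketch of the split suggested elsewhere in this
thread (RegisterClient, ClientMetricsReport, UnregisterClient), with a registration
response that carries the configured metrics; all field names and shapes below are
assumptions for discussion, not the KIP's actual schema.

import java.util.List;
import java.util.UUID;

// A client (or a read-only tool) registers and learns what is configured.
record RegisterClientRequest(String clientSoftwareName, String clientSoftwareVersion) {}
record RegisterClientResponse(UUID clientInstanceId,
                              List<String> requestedMetrics,
                              int pushIntervalMs) {}

// Reports carry only the metrics the broker asked for.
record ClientMetricsReportRequest(UUID clientInstanceId, byte[] serializedMetrics) {}
record ClientMetricsReportResponse(short errorCode) {}

// Clean shutdown.
record UnregisterClientRequest(UUID clientInstanceId) {}
record UnregisterClientResponse(short errorCode) {}

With a shape like that, the tool question above answers itself: the tool calls the
registration RPC, reads RequestedMetrics, and never has to report anything.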

> > - I don't think the client should be able to choose its own UUID. This
> > adds complexity and introduces a chance that clients will choose an ID that
> > is not unique. We already have an ID that the client itself supplies
> > (clientID) so there is no need to introduce another such ID.
> >
> 
> The CLIENT_INSTANCE_ID (which is a combination of the client.id and a UUID)
> is actually generated by the receiving broker on first contact.
> The need for a new unique semi-random id is outlined in the KIP, but in
> short; the client.id is not unique, and we need something unique that still
> is prefix-matchable to the client.id so that we can add subscriptions
> either using prefix-matching of just the client.id (which may match one or
> more client instances), and exact matching which will match one specific
> client instance.

Hmm... the client id is already sent in every RPC as part of the header. It's 
not necessary to send it again as part of one of the other RPC fields, right?

More generally, why does the client instance ID need to be prefix-matchable? 
That seems like an implementation detail of the metrics collection system used 
on the broker side. Maybe someone wants to group by things other than client 
IDs -- perhaps client versions, for instance. By the same argument, we should 
put the client version string in the client instance ID, since someone might 
want to group by that. Or maybe we should include the hostname, and the IP, 
and, and, and... You see the issue here. I think we shouldn't get involved in 
this kind of decision -- if we just pass a UUID, the broker-side software can 
group it or prefix it however it wants internally.

> > - In general the schema seems to have a bad case of string-itis. UUID,
> > content type, and requested metrics are all strings. Since these messages
> > will be sent very frequently, it's quite costly to use strings for all
> > these things. We have a type for UUID, which uses 16 bytes -- let's use
> > that type for client instance ID, rather than a string which will be much
> > larger. Also, since we already send clientID in the message header, there
> > is no need to include it again in the instance ID.
> >
> 
> As explained above we need the client.id in the CLIENT_INSTANCE_ID. And I
> don't think the overhead of this one string per request is going to be much
> of an issue,
> typical metric push intervals are probably in the >60s range.
> If this becomes a problem we could use a per-connection identifier that the
> broker translates to the client instance id before pushing metrics upwards
> in the system.
> 

This is actually an interesting design question -- why not use a 
per-TCP-connection identifier, rather than a per-client-instance identifier? If 
we are grouping by other things anyway (clientID, principal, etc.) on the 
server side, do we need to maintain a per-process identifier rather than a 
per-connection one?

> 
> > - I think it would also be nice to have an enum or something for
> > AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to
> > these categories will require KIPs, so it should be straightforward for the
> > project to just have an enum that allows us to communicate these as ints.
> >
> 
> I'm thinking this might be overly constraining. The broker doesn't parse or
> handle the received metrics data itself but just pushes it to the metrics
> plugin, using an enum would require a KIP and broker upgrade if the metrics 
> plugin
> supports a newer version of OTLP.
> It is probably better if we don't strictly control the metric format itself.
> 

Unfortunately, we have to strictly control the metrics format, because 
otherwise clients can't implement it. I agree that we don't need to specify how 
the broker-side code works, since that is pluggable.

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-07-09 Thread Xavier Léauté
>
> 1. Did you consider using a `default ClientTelemetryReceiver
> clientReceiver() { return null; }` method on the existing MetricsReporter
> interface, avoiding the need for the ClientTelemetry trait?


I did. Part of the motivation was to separate more clearly the
MetricsReporter methods which are more directly tied to the KafkaMetrics
framework from the metrics collected from clients by the broker.
It would also make it more explicit that this trait only makes sense in the
context of a broker, unlike more general MetricsReporters which can be run
inside client or connect plugins.
That being said, ClientTelemetry would typically still rely on the
configuration and context provided via the metrics reporter, so I agree
that there might not be much value in a separate interface yet.
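
A side-by-side sketch of the two shapes being discussed; these are simplified
stand-ins (the real MetricsReporter interface has more methods, and the
exportMetrics signature is an assumption, not the proposed API).

interface ClientTelemetryReceiver {
    // Invoked broker-side with the serialized metrics a client pushed.
    void exportMetrics(byte[] serializedMetrics, String contentType);
}

// Option 1: a separate trait (the current proposal). Only broker-side
// reporters implement it, which keeps the broker-only nature explicit.
interface ClientTelemetry {
    ClientTelemetryReceiver clientReceiver();
}

// Option 2: Tom's alternative -- a default method on MetricsReporter itself,
// returning null when client telemetry is not supported.
interface MetricsReporterWithDefault {
    void configure(java.util.Map<String, ?> configs); // stand-in for the real reporter methods
    default ClientTelemetryReceiver clientReceiver() { return null; }
}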

Maybe we'd be better served if we did a clean break like we did in KIP-504
with the Authorizer interface and revamped the interfaces altogether.
Currently the initialization of a metrics reporter is somewhat difficult,
due to the mix of context information being provided via Reconfigurable,
ClusterResourceListener, and MetricsContext.
There is a lack of a clear initialization sequence, and detecting whether
the reporter runs inside of a client, connect, or a broker is somewhat
brittle.
I felt that fixing those aspects would be outside of the scope of this KIP,
which is already quite large, and would instead keep changes to existing
interfaces minimal.

I don't have a strong feeling though, so if we decide that having a default
method is in line with our current conventions I'd be happy to change that.


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-29 Thread Magnus Edenhill
Hey Tom,

Den mån 21 juni 2021 kl 21:08 skrev Tom Bentley :

>
> 1. Did you consider using a `default ClientTelemetryReceiver
> clientReceiver() { return null; }` method on the existing MetricsReporter
> interface, avoiding the need for the ClientTelemetry trait?
>

I'll let Xavier answer this one since he designed the new interface.



> 2. On the metrics naming and format, I wasn't really clear about what's
> being proposed. I assume we're taking a subset of the existing client
> metrics and representing them as OpenTelemetry metrics, but it didn't
> really explain how the existing metric names would be mapped to meter and
> instrument names. Or did I misunderstand?
>

The KIP is approaching the set of standard metrics from a general viewpoint
rather
than what exactly is provided by the Java clients today, and this is
because we
want these standard metrics to make sense across all languages and all
client implementations.
They're loosely based on existing metrics across the dominant client
implementations.
It is up to each client maintainer to map its existing metrics to the
metrics defined here.
Also, not all metrics may make sense for all clients since the
implementations differ.


3. In the client behaviour section it doesn't explicitly say whether the
> client uses a dedicated thread for this work (I assume it does).
>

Client implementation details are currently left out of the KIP; the focus is
currently on the general protocol level and on high-level client and broker
semantics.

I'm not sure if it's best to add Java client specifics to KIP-714, or make
a new KIP
with the Java client implementation details once KIP-714 is accepted.


4. The description of the FunctionalityNotEnabled error code suggests that
> PushTelemetryRequest would only be included in an ApiVersions response if
> the broker was configured with a plugin. I think the ApiVersionsResponse is
> normally a constant response (not dependent on broker config), so I wonder
> whether this is really a precedent we want to set here? Surely in a broker
> without a plugin configured it could just return an empty set of
> RequestedMetrics and a maxint NextPushMs in the PushTelemetryResponse?
>

Yes, that's a good idea. That would also solve the (future) issue with
enabling a metrics plugin
while the broker was running.
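
A minimal sketch of that behaviour, assuming a hypothetical broker-side handler;
the field names mirror the RequestedMetrics and NextPushMs protocol fields, but the
Java shapes here are purely illustrative.

import java.util.Collections;
import java.util.List;

// Stand-in for the RequestedMetrics / NextPushMs fields of PushTelemetryResponse.
record TelemetrySubscription(List<String> requestedMetrics, int nextPushMs) {}

class PushTelemetryHandler {
    private final boolean metricsPluginConfigured;

    PushTelemetryHandler(boolean metricsPluginConfigured) {
        this.metricsPluginConfigured = metricsPluginConfigured;
    }

    TelemetrySubscription currentSubscription(List<String> subscribedMetrics, int intervalMs) {
        if (!metricsPluginConfigured) {
            // No plugin configured: request nothing and push the next interval so
            // far out that the client effectively stops pushing. ApiVersions can
            // then keep advertising the API unconditionally.
            return new TelemetrySubscription(Collections.emptyList(), Integer.MAX_VALUE);
        }
        return new TelemetrySubscription(subscribedMetrics, intervalMs);
    }
}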


> 5. Maybe the AcceptedContentTypes should be documented to be in priority
> order. That would simplify the action for UnsupportedCompressionType.
>

Good idea!


> 6. """As the client will not know the broker id of its bootstrap servers
> the broker_id label should be set to “bootstrap”.""" Maybe using the same
> convention as is used in the NetworkClient, where bootstrap servers are the
> id of the negative of their index in the list?
>

This too!
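
For reference, the NetworkClient convention mentioned in the question assigns
bootstrap servers synthetic negative node ids by position (the first entry becoming
-1, the second -2, and so on); a tiny sketch of that mapping, in case the same
scheme is reused for the broker_id label:

class BootstrapBrokerIds {
    // First bootstrap.servers entry -> -1, second -> -2, ...
    static int bootstrapBrokerId(int indexInBootstrapList) {
        return -(indexInBootstrapList + 1);
    }
}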


> 7. Maybe call it "client.process.rss.bytes" rather than
> "client.process.memory.bytes",
> to be explicit?
>

Yeah I started out with rss but then went with something more generic.
Don't really have a strong opinion.


8. It's a little confusing that --id option to kafka-client-metrics.sh can
> be a prefix or an exact match. Perhaps --id and --id-prefix would be
> clearer.
>

Makes sense.


> 9. Maybe I missed it, but does the client continue to push metrics to the
> same broker as it randomly picked initially? If it gets disconnected from
> that broker what happens, does it just randomly pick another?
>

Yep, and the new broker must accept the already assigned CLIENT_INSTANCE_ID
that the client is using.


10. To subscribe to all metrics I assume I can just do
> `kafka-client-metrics.sh ... --metric ''`? It might be worth saying this
> explicitly. AFAICS this is the only way to find out all the metrics
> supported by a client if you don't already know from the client's software
> version.
>

Will make a note of that.


Thanks for the valuable input, will update the KIP accordingly.

/Magnus


>


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-29 Thread Magnus Edenhill
Den fre 18 juni 2021 kl 22:32 skrev Travis Bischel :

> Hi Colin (and Magnus),
>
> Thanks for the replies!
>
> I think the biggest concern I have is the cardinality bits. I'm
> sympathetic to the aspect of this making it easier for Kafka brokers to
> understand *every* aspect of the Kafka ecosystem. I am not sure this will
> 100% solve the need there, though: if a client is unable to connect to a
> broker, visibility disappears immediately, no?
>

At the end of the day this is an unsolvable problem, but what the proposed
approach gives us is a channel that is operational when Kafka is
operational, regardless of external systems.
If a Kafka client can't connect to Kafka, its internal Kafka metrics are
not the main interest; the interest is rather on the connectivity/networking side.



>
> I do still think that the problem of difficulty of monitoring within an
> organization results from issues within organizations themselves: orgs
> should have proper processes in place such that anything talking to Kafka
> has the org's plug-in monitoring libraries. Kafka operators can define
> those libraries, such that all clients in the org have the libraries the
> operators require. This satisfies the same goals this KIP aims to provide,
> albeit with the increased org cost of not just having something defined to
> be plugged in.
>

Yeah, that would be great, and some orgs do indeed come close to this. But
most don't, and then there's the case of multi-org, where the client
developers and cluster operators reside in different organizations.



>
> If Kafka operators themselves can choose which metrics they want, so that the
> broker can tell the client "only send these metrics", then my biggest
> concern is removed.
>

That's indeed how it works: the metrics that a client pushes are set up by
the cluster operator (et al.) by configuring metrics subscriptions. The
client will not send any metrics that have not been centrally
requested/subscribed; it is all controlled from the cluster: which clients
send which metrics at what interval.



>
> I do still think that hooks can be a cleaner abstraction to this same
> goal, and then pre-provided libraries (say, "this library provides X,Y,Z
> and sends to prometheus from your client") could exist that more exactly
> satisfy what this KIP aims to provide. This would also avoid the
> kitchen-sink vs. not-comprehensive-enough issue I brought up previously.
> This would also avoid requiring KIPs for any supported metrics.
>


This defeats the generally-available, always-on goal of the KIP though:
client metrics available on demand, out of the box.


Thanks for your comments Travis.

/Magnus





> On 2021/06/16 22:27:55, "Colin McCabe"  wrote:
> > On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
> > > Hi! I have a few thoughts on this KIP. First, I'd like to thank you
> for
> > > the writeup,
> > > clearly a lot of thought has gone into it and it is very thorough.
> > > However, I'm not
> > > convinced it's the right approach from a fundamental level.
> > >
> > > Fundamentally, this KIP seems like somewhat of a solution to an
> organizational
> > > problem. Metrics are organizational concerns, not Kafka operator
> concerns.
> >
> > Hi Travis,
> >
> > Metrics are certainly Kafka operator concerns. It is very important for
> cluster operators to know things like how many clients there are, what they
> clients are doing, and so forth. This information is needed to administer
> Kafka. Therefore it certainly falls in the domain of the Kafka operations
> team (and the Kafka development team.)
> >
> > We have added many metrics in the past to make it easier to monitor
> clients. I think this is just another step in that direction.
> >
> > > Clients should make it easy to plug in metrics (this is the approach I
> take in
> > > my own client), and organizations should have processes such that all
> clients
> > > gather and ship metrics how that organization desires.
> > >
> > > If an organization is set up correctly, there is no reason for metrics
> to be
> > > forwarded through Kafka. This feels like a solution to an organization
> not
> > > properly setting up how processes ship metrics, and in some ways, it's
> an
> > > overbroad solution, and in other ways, it doesn't cover the entire
> problem.
> >
> > I think the reason was explained pretty clearly: many admins find it
> difficult to set up monitoring for every client in the organization. In
> general the team which maintains a Kafka cluster is often separate from the
> teams that use the cluster. Therefore rolling out monitoring for clients
> can be very difficult to coordinate.
> >
> > No metrics will ever cover every possible use-case, but the set proposed
> here does seem useful.
> >
> > >
> > > From the perspective of Kafka operators, it is easy to see that this
> KIP is
> > > nice in that it just dictates what clients should support for metrics
> and that
> > > the metrics should ship through Kafka. But, from the perspective of an
> > > observabil

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-29 Thread Magnus Edenhill
Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe :

> Hi Magnus,
>
> Thanks for the KIP. This is certainly something I've been wishing for for
> a while.
>
> Maybe we should emphasize more that the metrics that are being gathered
> here are Kafka metrics, not general application business logic metrics.
> That seems like a point of confusion in some of the replies here. The
> analogy with a telecom gathering metrics about a DSL modem is a good one.
> These are really metrics about the Kafka cluster itself, very similar to
> the metrics we expose about the broker, controller, and so forth.
>

Good point, will make this more clear in the KIP.


>
> In my experience, most users want their Kafka clients to be "plug and
> play" -- they want to start up a Kafka client, and do some things. Their
> focus is on their application, not on the details of the infrastructure. If
> something goes wrong, they want the Kafka team to diagnose the problem
> and fix it, or at least tell them what the issue is. When the Kafka team
> tells them they need to install and maintain a third-party metrics system
> to diagnose the problem, this can be a very big disappointment. Many users
> don't have this level of expertise.
>
> A few critiques:
>
> - As I wrote above, I think this could benefit a lot by being split into
> several RPCs. A registration RPC, a report RPC, and an unregister RPC seem
> like logical choices.
>

Responded to this in your previous mail, but in short I think a single
request is sufficient and keeps the implementation complexity / state down.


>
> - I don't think the client should be able to choose its own UUID. This
> adds complexity and introduces a chance that clients will choose an ID that
> is not unique. We already have an ID that the client itself supplies
> (clientID) so there is no need to introduce another such ID.
>

The CLIENT_INSTANCE_ID (which is a combination of the client.id and a UUID)
is actually generated by the receiving broker on first contact.
The need for a new unique semi-random id is outlined in the KIP, but in
short; the client.id is not unique, and we need something unique that still
is prefix-matchable to the client.id so that we can add subscriptions
either using prefix-matching of just the client.id (which may match one or
more client instances), and exact matching which will match one specific
client instance.
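
A minimal sketch of the matching described above. The broker generates the instance
id on first contact, so the local generation shown here, the separator, and the
exact id format are illustrative assumptions.

import java.util.UUID;

public class ClientInstanceIdMatching {

    // e.g. "payments-producer-3f2504e0-4f89-11d3-9a0c-0305e82c3301"
    static String newClientInstanceId(String clientId) {
        return clientId + "-" + UUID.randomUUID();
    }

    // Prefix subscription: matches every instance sharing a client.id.
    static boolean matchesPrefix(String subscriptionPrefix, String clientInstanceId) {
        return clientInstanceId.startsWith(subscriptionPrefix);
    }

    // Exact subscription: matches one specific client instance.
    static boolean matchesExact(String subscriptionId, String clientInstanceId) {
        return clientInstanceId.equals(subscriptionId);
    }

    public static void main(String[] args) {
        String instance = newClientInstanceId("payments-producer");
        System.out.println(matchesPrefix("payments-producer", instance)); // true
        System.out.println(matchesExact(instance, instance));             // true
    }
}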



> - I might be misunderstanding something here, but my reading of this is
> that the client chooses what metrics to send and the broker filters that on
> the broker-side. I think this is backwards -- the broker should inform the
> client about what it wants, and the client should send only that data. (Of
> course, the client may also not know what the broker is asking for, in
> which case it can choose to not send the data). We shouldn't have clients
> pumping out data that nobody wants to read. (sorry if I misinterpreted and
> this is already the case...)
>

This is indeed completely controlled from the cluster side:
The cluster operator (et al.) configures client metric subscriptions, which
are basically: what metrics to collect, at what interval, from what client
instance(s).
These subscriptions are then propagated to matching clients, which in turn
start pushing the requested metrics (but nothing else) to the broker.



> - In general the schema seems to have a bad case of string-itis. UUID,
> content type, and requested metrics are all strings. Since these messages
> will be sent very frequently, it's quite costly to use strings for all
> these things. We have a type for UUID, which uses 16 bytes -- let's use
> that type for client instance ID, rather than a string which will be much
> larger. Also, since we already send clientID in the message header, there
> is no need to include it again in the instance ID.
>

As explained above we need the client.id in the CLIENT_INSTANCE_ID. And I
don't think the overhead of this one string per request is going to be much
of an issue,
typical metric push intervals are probably in the >60s range.
If this becomes a problem we could use a per-connection identifier that the
broker translates to the client instance id before pushing metrics upwards
in the system.


> - I think it would also be nice to have an enum or something for
> AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to
> these categories will require KIPs, so it should be straightforward for the
> project to just have an enum that allows us to communicate these as ints.
>

I'm thinking this might be overly constraining. The broker doesn't parse or
handle the received metrics data itself but just pushes it to the metrics
plugin;
using an enum would require a KIP and broker upgrade if the metrics plugin
supports a newer version of OTLP.
It is probably better if we don't strictly control the metric format itself.
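
A sketch of that pass-through, assuming an illustrative plugin interface; the point
is that the content type travels as an opaque string from client to plugin, so a
newer OTLP serialization only needs a plugin change.

interface ClientMetricsPlugin {
    void exportMetrics(String contentType, byte[] payload);
}

class MetricsPassThrough {
    private final ClientMetricsPlugin plugin;

    MetricsPassThrough(ClientMetricsPlugin plugin) {
        this.plugin = plugin;
    }

    // The broker never parses the payload; it forwards the bytes and the
    // client-declared content type to whatever plugin is configured.
    void onPushTelemetry(String contentType, byte[] payload) {
        plugin.exportMetrics(contentType, payload);
    }
}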



> - Can you talk about whether you are adding any new library dependencies
> to the Kafka client? It seems like you'd want to add opencensus / opentelemetry,
> if we are using that format here.

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-29 Thread Magnus Edenhill
Thanks for your feedback, Colin, see response below.


Den tors 17 juni 2021 kl 00:28 skrev Colin McCabe :

> On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
>

...

> > Another downside is that by dictating the important metrics, this KIP either
> > has two choices: try to choose what is important to every org, and inevitably
> > leave out something important to somebody else, or just add everything and let
> > the orgs filter. This KIP mostly looks to go with the latter approach, meaning
> > orgs will be shipping & filtering. With hooks, an org would be able to gather
> > exactly what they want.
>
> I actually do agree with this criticism to some extent. It would be good
> if the broker could specify what metrics it wants, and the clients would
> send only those metrics.
>

The metrics to collect are indeed controlled by the cluster operator (or
whoever has access); this is done by setting up metrics subscriptions (a new
Admin ConfigEntry) that are propagated to the client through the
PushTelemetryResponse, telling the client exactly what metrics to push and
at what interval.
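
For concreteness, a sketch of creating such a subscription with the Admin client.
The CLIENT_METRICS resource type, the config keys ("metrics", "interval.ms"), and
the metric-name prefix shown are assumptions for discussion, not the KIP's final
names.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateMetricsSubscription {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // One subscription, asking matching clients for the client.process.*
            // metrics every 60 seconds.
            ConfigResource resource =
                new ConfigResource(ConfigResource.Type.CLIENT_METRICS, "process-metrics");
            List<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("metrics", "client.process."),
                    AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("interval.ms", "60000"),
                    AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(resource, ops)).all().get();
        }
    }
}

The kafka-client-metrics.sh tool discussed elsewhere in the thread would presumably
be a thin wrapper over the same kind of config entries.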



> More generally, I'd like to see this split up into several RPCs rather
> than one mega-RPC.
>
> Maybe something like
> 1. RegisterClient{Request,Response}
> 2. ClientMetricsReport{Request,Response}
> 3. UnregisterClient{Request,Response}
>
> Then the broker can communicate which metrics it wants in
> RegisterClientResponse. It can also assign a client instance ID (which I
> think should be a UUID, not another string).
>

All this functionality is covered by the single PushTelemetryRequest which
is used both
for pushing metrics to the broker (in the request) and propagating metrics
subscriptions
to the client (in the response). Using a single request type for both these
operations allows
piggy-backing either metrics or subscriptions (depending on direction) in a
request that
is sent at regular intervals, sort of like a recurring poll.
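
A sketch of that recurring poll from the client's side; the request/response shapes
below are illustrative stand-ins for the actual protocol messages.

import java.util.List;

record PushTelemetryRequest(String clientInstanceId, byte[] metrics) {}
record PushTelemetryResponse(List<String> requestedMetrics, int nextPushMs) {}

interface TelemetryTransport {
    PushTelemetryResponse send(PushTelemetryRequest request);
}

class TelemetryPushLoop {
    private List<String> subscribedMetrics = List.of(); // empty until the broker says otherwise
    private int pushIntervalMs = 60_000;

    void run(TelemetryTransport transport, String clientInstanceId) throws InterruptedException {
        while (true) {
            // Push only what is currently subscribed (possibly nothing).
            byte[] payload = serialize(subscribedMetrics);
            PushTelemetryResponse response =
                transport.send(new PushTelemetryRequest(clientInstanceId, payload));
            // The response piggy-backs the subscription state, so changes made on
            // the cluster reach the client on its next push.
            subscribedMetrics = response.requestedMetrics();
            pushIntervalMs = response.nextPushMs();
            Thread.sleep(pushIntervalMs);
        }
    }

    private byte[] serialize(List<String> metrics) {
        return new byte[0]; // placeholder for the actual metrics serialization
    }
}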

I think something like RegisterClientRequest makes sense for deconfliction
and fencing,
such as with InitProducerIdRequest, but we don't have any need for that so
I don't think the
added complexity gives us much.

/Magnus


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-21 Thread Tom Bentley
Hi Magnus,

Thanks for the KIP.

1. Did you consider using a `default ClientTelemetryReceiver
clientReceiver() { return null; }` method on the existing MetricsReporter
interface, avoiding the need for the ClientTelemetry trait?
2. On the metrics naming and format, I wasn't really clear about what's
being proposed. I assume we're taking a subset of the existing client
metrics and representing them as OpenTelemetry metrics, but it didn't
really explain how the existing metric names would be mapped to meter and
instrument names. Or did I misunderstand?
3. In the client behaviour section it doesn't explicitly say whether the
client uses a dedicated thread for this work (I assume it does).
4. The description of the FunctionalityNotEnabled error code suggests that
PushTelemetryRequest would only be included in an ApiVersions response if
the broker was configured with a plugin. I think the ApiVersionsResponse is
normally a constant response (not dependent on broker config), so I wonder
whether this is really a precedent we want to set here? Surely in a broker
without a plugin configured it could just return an empty set of
RequestedMetrics and a maxint NextPushMs in the PushTelemetryResponse?
5. Maybe the AcceptedContentTypes should be documented to be in priority
order. That would simplify the action for UnsupportedCompressionType.
6. """As the client will not know the broker id of its bootstrap servers
the broker_id label should be set to “bootstrap”.""" Maybe using the same
convention as is used in the NetworkClient, where bootstrap servers are the
id of the negative of their index in the list?
7. Maybe call it "client.process.rss.bytes" rather than
"client.process.memory.bytes",
to be explicit?
8. It's a little confusing that --id option to kafka-client-metrics.sh can
be a prefix or an exact match. Perhaps --id and --id-prefix would be
clearer.
9. Maybe I missed it, but does the client continue to push metrics to the
same broker as it randomly picked initially? If it gets disconnected from
that broker what happens, does it just randomly pick another?
10. To subscribe to all metrics I assume I can just do
`kafka-client-metrics.sh ... --metric ''`? It might be worth saying this
explicitly. AFAICS this is the only way to find out all the metrics
supported by a client if you don't already know from the client's software
version.

Kind regards,

Tom

On Fri, Jun 18, 2021 at 9:39 PM Travis Bischel 
wrote:

> Hi Colin (and Magnus),
>
> Thanks for the replies!
>
> I think the biggest concern I have is the cardinality bits. I'm
> sympathetic to the aspect of this making it easier for Kafka brokers to
> understand *every* aspect of the Kafka ecosystem. I am not sure this will
> 100% solve the need there, though: if a client is unable to connect to a
> broker, visibility disappears immediately, no?
>
> I do still think that the problem of difficulty of monitoring within an
> organization results from issues within organizations themselves: orgs
> should have proper processes in place such that anything talking to Kafka
> has the org's plug-in monitoring libraries. Kafka operators can define
> those libraries, such that all clients in the org have the libraries the
> operators require. This satisfies the same goals this KIP aims to provide,
> albeit with the increased org cost of not just having something defined to
> be plugged in.
>
> If Kafka operators themselves can choose which metrics they want, so that the
> broker can tell the client "only send these metrics", then my biggest
> concern is removed.
>
> I do still think that hooks can be a cleaner abstraction to this same
> goal, and then pre-provided libraries (say, "this library provides X,Y,Z
> and sends to prometheus from your client") could exist that more exactly
> satisfy what this KIP aims to provide. This would also avoid the
> kitchen-sink vs. not-comprehensive-enough issue I brought up previously.
> This would also avoid requiring KIPs for any supported metrics.
>
> On 2021/06/16 22:27:55, "Colin McCabe"  wrote:
> > On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
> > > Hi! I have a few thoughts on this KIP. First, I'd like to thank you
> for
> > > the writeup,
> > > clearly a lot of thought has gone into it and it is very thorough.
> > > However, I'm not
> > > convinced it's the right approach from a fundamental level.
> > >
> > > Fundamentally, this KIP seems like somewhat of a solution to an
> organizational
> > > problem. Metrics are organizational concerns, not Kafka operator
> concerns.
> >
> > Hi Travis,
> >
> > Metrics are certainly Kafka operator concerns. It is very important for
> cluster operators to know things like how many clients there are, what the
> clients are doing, and so forth. This information is needed to administer
> Kafka. Therefore it certainly falls in the domain of the Kafka operations
> team (and the Kafka development team.)
> >
> > We have added many metrics in the past to make it easier to monitor
> > clients. I think this is just another step in that direction.

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-18 Thread Travis Bischel
Hi Colin (and Magnus),

Thanks for the replies!

I think the biggest concern I have is the cardinality bits. I'm sympathetic to 
the aspect of this making it easier for Kafka brokers to understand *every* 
aspect of the Kafka ecosystem. I am not sure this will 100% solve the need 
there, though: if a client is unable to connect to a broker, visibility 
disappears immediately, no?

I do still think that the problem of difficulty of monitoring within an 
organization results from issues within organizations themselves: orgs should 
have proper processes in place such that anything talking to Kafka has the 
org's plug-in monitoring libraries. Kafka operators can define those libraries, 
such that all clients in the org have the libraries the operators require. This 
satisfies the same goals this KIP aims to provide, albeit with the increased 
org cost of not just having something defined to be plugged in.

If Kafka operators themselves can choose which metrics they want, so that the broker 
can tell the client "only send these metrics", then my biggest concern is 
removed.

I do still think that hooks can be a cleaner abstraction to this same goal, and 
then pre-provided libraries (say, "this library provides X,Y,Z and sends to 
prometheus from your client") could exist that more exactly satisfy what this 
KIP aims to provide. This would also avoid the kitchen-sink vs. 
not-comprehensive-enough issue I brought up previously. This would also avoid 
requiring KIPs for any supported metrics.

On 2021/06/16 22:27:55, "Colin McCabe"  wrote: 
> On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
> > Hi! I have a few thoughts on this KIP. First, I'd like to thank you for 
> > the writeup,
> > clearly a lot of thought has gone into it and it is very thorough. 
> > However, I'm not
> > convinced it's the right approach from a fundamental level.
> > 
> > Fundamentally, this KIP seems like somewhat of a solution to an 
> > organizational
> > problem. Metrics are organizational concerns, not Kafka operator concerns.
> 
> Hi Travis,
> 
> Metrics are certainly Kafka operator concerns. It is very important for 
> cluster operators to know things like how many clients there are, what the 
> clients are doing, and so forth. This information is needed to administer 
> Kafka. Therefore it certainly falls in the domain of the Kafka operations 
> team (and the Kafka development team.)
> 
> We have added many metrics in the past to make it easier to monitor clients. 
> I think this is just another step in that direction.
> 
> > Clients should make it easy to plug in metrics (this is the approach I take 
> > in
> > my own client), and organizations should have processes such that all 
> > clients
> > gather and ship metrics how that organization desires.
> >
> > If an organization is set up correctly, there is no reason for metrics to be
> > forwarded through Kafka. This feels like a solution to an organization not
> > properly setting up how processes ship metrics, and in some ways, it's an
> > overbroad solution, and in other ways, it doesn't cover the entire problem.
> 
> I think the reason was explained pretty clearly: many admins find it 
> difficult to set up monitoring for every client in the organization. In 
> general the team which maintains a Kafka cluster is often separate from the 
> teams that use the cluster. Therefore rolling out monitoring for clients can 
> be very difficult to coordinate.
> 
> No metrics will ever cover every possible use-case, but the set proposed here 
> does seem useful.
> 
> > 
> > From the perspective of Kafka operators, it is easy to see that this KIP is
> > nice in that it just dictates what clients should support for metrics and 
> > that
> > the metrics should ship through Kafka. But, from the perspective of an
> > observability team, this workflow is basically hijacking the standard flow 
> > that
> > organizations may have. I would rather have applications collect metrics and
> > ship them the same way every other application does. I'd rather not have to
> > configure additional plugins within Kafka to take metrics and forward them.
> 
> This change doesn't remove any functionality. If you don't want to use 
> KIP-714 metrics collection, you can simply turn it off and continue 
> collecting metrics the way you always have.
> 
> > 
> > More importantly, this KIP prescribes cardinality problems, requires that to
> > officially support the KIP a client must support all relevant metrics within
> > the KIP, and requires that a client cannot support other metrics unless 
> > those
> > other metrics also go through a KIP process. It is difficult to imagine all 
> > of
> > these metrics being relevant to every organization, and there is no way for 
> > an
> > organization to filter what is relevant within the client. Instead, the
> > filtering is pushed downwards, meaning more network IO and more CPU costs to
> > filter what is irrelevant and aggregate what needs to be aggregated, and 
> > more
> > tim

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-18 Thread Colin McCabe
On Thu, Jun 17, 2021, at 12:13, Ryanne Dolan wrote:
> Colin,
> 
> > lack of support for collecting client metrics
> 
> ...but kafka is not a metrics collector. There are lots of things kafka
> doesn't support. Should it also collect clients' logs for the same reasons?
> What other side channels should it proxy through brokers?
> 

Hi Ryanne,

Kafka already is a metrics collector. 

Take a look at KIP-511: "Collect and Expose Client's Name and Version in the 
Brokers," which aggregates metrics from various clients and re-exposes it as a 
broker metric. Or KIP-607: "Add Metrics to Kafka Streams to Report Properties 
of RocksDB" which aggregates metrics from the local RocksDB instances and 
re-exposes them. Or KIP-608 - "Expose Kafka Metrics in Authorizer". Or lots of 
other KIPs.

This has been the direction we've been moving for a while. It's a direction 
motivated by our experiences in the field with users, who find it cumbersome to 
set up dedicated infra to monitor individual Kafka clients. Magnus, especially, 
has a huge amount of experience here.

>
> > He mentioned the fact that configuring client metrics usually involves
> > setting up a separate metrics collection infrastructure.
> 
> This is not changed with the KIP. It's just a matter of who owns that
> infra, which I don't think should matter to Apache Kafka.
> 

Magnus and I explained a few times the reasons why it does matter. Within most 
organizations, there are usually several teams using clients, which are 
separate from the team which maintains the Kafka cluster. The Kafka team has 
the Kafka experts, which makes it the best place to centralize collecting and 
analyzing Kafka metrics.

In a sense the whole concept of cloud computing is "just a matter of who owns 
infra." It is quite important to users.

> We already have MetricsReporter. I still don't see specific motivation
> beyond the "opt-out" part?
> 
> I think we need exceptional motivation for such a proposal.
> 

 As I've said earlier, if you are happy with the current metrics setup, then 
you can continue using it -- nothing in this KIP means you have to change what 
you're doing.

best,
Colin


> On Thu, Jun 17, 2021, 1:43 PM Colin McCabe  wrote:
> 
> > Hi Ryan,
> >
> > These are not "arguments for observability in general" but descriptions of
> > specific issues that come up due to Kafka's lack of support for collecting
> > client metrics. He mentioned the fact that configuring client metrics
> > usually involves setting up a separate metrics collection infrastructure.
> > Even if this is easy and straightforward to do (which is not the case for
> > most organizations), it still requires reconfiguring and restarting the
> > application, which is disruptive. Correlating client metrics with server
> > metrics is also often hard. These issues are all mitigated by centralizing
> > metrics collection on the broker.
> >
> > best,
> > Colin
> >
> >
> > On Wed, Jun 16, 2021, at 19:03, Ryanne Dolan wrote:
> > > Magnus, I think these are arguments for observability in general, but not
> > > why kafka should sit between a client and a metics collector.
> > >
> > > Ryanne
> > >
> > > On Wed, Jun 16, 2021, 10:27 AM Magnus Edenhill 
> > wrote:
> > >
> > > > Hi Ryanne,
> > > >
> > > > this proposal stems from a need to improve troubleshooting Kafka
> > issues.
> > > >
> > > > As it currently stands, when an application team is experiencing Kafka
> > > > service degradation,
> > > > or the Kafka operator is seeing misbehaving clients, there are plenty
> > of
> > > > steps that needs
> > > > to be taken before any client-side metrics can be observed at all, if
> > at
> > > > all:
> > > >  - Is the application even collecting client metrics? If not it needs
> > to be
> > > > reconfigured or implemented, and restarted;
> > > >a restart may have business impact, and may also temporarily?
> > remedy the
> > > > problem without giving any further insight
> > > >into what was wrong.
> > > >  - Are the desired metrics collected? Where are they stored? For how
> > long?
> > > > Is there enough correlating information
> > > >to map it to cluster-side metrics and events? Does the application
> > > > on-call know how to find the collected metrics?
> > > >  - Export and send these metrics to whoever knows how to interpret
> > them. In
> > > > what format? Are all relevant metadata fields
> > > >provided?
> > > >
> > > > The KIP aims to solve all these obstacles by giving the Kafka operator
> > the
> > > > tools to collect this information.
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > > >
> > > > Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan <
> > ryannedo...@gmail.com>:
> > > >
> > > > > Magnus, I think such a substantial change requires more motivation
> > than
> > > > is
> > > > > currently provided. As I read it, the motivation boils down to this:
> > you
> > > > > want your clients to phone-home unless they opt-out. As stated in the
> > > > KIP,
> > > > > "there are plenty of e

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-17 Thread Ryanne Dolan
Colin,

> lack of support for collecting client metrics

...but kafka is not a metrics collector. There are lots of things kafka
doesn't support. Should it also collect clients' logs for the same reasons?
What other side channels should it proxy through brokers?

> He mentioned the fact that configuring client metrics usually involves
setting up a separate metrics collection infrastructure.

This is not changed with the KIP. It's just a matter of who owns that
infra, which I don't think should matter to Apache Kafka.

We already have MetricsReporter. I still don't see specific motivation
beyond the "opt-out" part?

I think we need exceptional motivation for such a proposal.

On Thu, Jun 17, 2021, 1:43 PM Colin McCabe  wrote:

> Hi Ryan,
>
> These are not "arguments for observability in general" but descriptions of
> specific issues that come up due to Kafka's lack of support for collecting
> client metrics. He mentioned the fact that configuring client metrics
> usually involves setting up a separate metrics collection infrastructure.
> Even if this is easy and straightforward to do (which is not the case for
> most organizations), it still requires reconfiguring and restarting the
> application, which is disruptive. Correlating client metrics with server
> metrics is also often hard. These issues are all mitigated by centralizing
> metrics collection on the broker.
>
> best,
> Colin
>
>
> On Wed, Jun 16, 2021, at 19:03, Ryanne Dolan wrote:
> > Magnus, I think these are arguments for observability in general, but not
> > why kafka should sit between a client and a metics collector.
> >
> > Ryanne
> >
> > On Wed, Jun 16, 2021, 10:27 AM Magnus Edenhill 
> wrote:
> >
> > > Hi Ryanne,
> > >
> > > this proposal stems from a need to improve troubleshooting Kafka
> issues.
> > >
> > > As it currently stands, when an application team is experiencing Kafka
> > > service degradation,
> > > or the Kafka operator is seeing misbehaving clients, there are plenty
> of
> > > steps that needs
> > > to be taken before any client-side metrics can be observed at all, if
> at
> > > all:
> > >  - Is the application even collecting client metrics? If not it needs
> to be
> > > reconfigured or implemented, and restarted;
> > >a restart may have business impact, and may also temporarily?
> remedy the
> > > problem without giving any further insight
> > >into what was wrong.
> > >  - Are the desired metrics collected? Where are they stored? For how
> long?
> > > Is there enough correlating information
> > >to map it to cluster-side metrics and events? Does the application
> > > on-call know how to find the collected metrics?
> > >  - Export and send these metrics to whoever knows how to interpret
> them. In
> > > what format? Are all relevant metadata fields
> > >provided?
> > >
> > > The KIP aims to solve all these obstacles by giving the Kafka operator
> the
> > > tools to collect this information.
> > >
> > > Regards,
> > > Magnus
> > >
> > >
> > > Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan <
> ryannedo...@gmail.com>:
> > >
> > > > Magnus, I think such a substantial change requires more motivation
> than
> > > is
> > > > currently provided. As I read it, the motivation boils down to this:
> you
> > > > want your clients to phone-home unless they opt-out. As stated in the
> > > KIP,
> > > > "there are plenty of existing solutions [...] to send metrics [...]
> to a
> > > > collector", so the opt-out appears to be the only motivation. Am I
> > > missing
> > > > something?
> > > >
> > > > Ryanne
> > > >
> > > > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill 
> > > wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > I'm proposing KIP-714 to add remote Client metrics and
> observability.
> > > > > This functionality will allow centralized monitoring and
> > > troubleshooting
> > > > of
> > > > > clients and their internals.
> > > > >
> > > > > Please see
> > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > > > >
> > > > > Looking forward to your feedback!
> > > > >
> > > > > Regards,
> > > > > Magnus
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-17 Thread Colin McCabe
Hi Ryanne,

These are not "arguments for observability in general" but descriptions of 
specific issues that come up due to Kafka's lack of support for collecting  
client metrics. He mentioned the fact that configuring client metrics usually 
involves setting up a separate metrics collection infrastructure. Even if this 
is easy and straightforward to do (which is not the case for most 
organizations), it still requires reconfiguring and restarting the application, 
which is disruptive. Correlating client metrics with server metrics is also 
often hard. These issues are all mitigated by centralizing metrics collection 
on the broker.

best,
Colin


On Wed, Jun 16, 2021, at 19:03, Ryanne Dolan wrote:
> Magnus, I think these are arguments for observability in general, but not
> why kafka should sit between a client and a metrics collector.
> 
> Ryanne
> 
> On Wed, Jun 16, 2021, 10:27 AM Magnus Edenhill  wrote:
> 
> > Hi Ryanne,
> >
> > this proposal stems from a need to improve troubleshooting Kafka issues.
> >
> > As it currently stands, when an application team is experiencing Kafka
> > service degradation,
> > or the Kafka operator is seeing misbehaving clients, there are plenty of
> > steps that needs
> > to be taken before any client-side metrics can be observed at all, if at
> > all:
> >  - Is the application even collecting client metrics? If not it needs to be
> > reconfigured or implemented, and restarted;
> >a restart may have business impact, and may also temporarily? remedy the
> > problem without giving any further insight
> >into what was wrong.
> >  - Are the desired metrics collected? Where are they stored? For how long?
> > Is there enough correlating information
> >to map it to cluster-side metrics and events? Does the application
> > on-call know how to find the collected metrics?
> >  - Export and send these metrics to whoever knows how to interpret them. In
> > what format? Are all relevant metadata fields
> >provided?
> >
> > The KIP aims to solve all these obstacles by giving the Kafka operator the
> > tools to collect this information.
> >
> > Regards,
> > Magnus
> >
> >
> > Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan :
> >
> > > Magnus, I think such a substantial change requires more motivation than
> > is
> > > currently provided. As I read it, the motivation boils down to this: you
> > > want your clients to phone-home unless they opt-out. As stated in the
> > KIP,
> > > "there are plenty of existing solutions [...] to send metrics [...] to a
> > > collector", so the opt-out appears to be the only motivation. Am I
> > missing
> > > something?
> > >
> > > Ryanne
> > >
> > > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill 
> > wrote:
> > >
> > > > Hey all,
> > > >
> > > > I'm proposing KIP-714 to add remote Client metrics and observability.
> > > > This functionality will allow centralized monitoring and
> > troubleshooting
> > > of
> > > > clients and their internals.
> > > >
> > > > Please see
> > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > > >
> > > > Looking forward to your feedback!
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > >
> >
> 


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-16 Thread Ryanne Dolan
Magnus, I think these are arguments for observability in general, but not
why kafka should sit between a client and a metrics collector.

Ryanne

On Wed, Jun 16, 2021, 10:27 AM Magnus Edenhill  wrote:

> Hi Ryanne,
>
> this proposal stems from a need to improve troubleshooting Kafka issues.
>
> As it currently stands, when an application team is experiencing Kafka
> service degradation,
> or the Kafka operator is seeing misbehaving clients, there are plenty of
> steps that need
> to be taken before any client-side metrics can be observed at all, if at
> all:
>  - Is the application even collecting client metrics? If not it needs to be
> reconfigured or implemented, and restarted;
>a restart may have business impact, and may also temporarily? remedy the
> problem without giving any further insight
>into what was wrong.
>  - Are the desired metrics collected? Where are they stored? For how long?
> Is there enough correlating information
>to map it to cluster-side metrics and events? Does the application
> on-call know how to find the collected metrics?
>  - Export and send these metrics to whoever knows how to interpret them. In
> what format? Are all relevant metadata fields
>provided?
>
> The KIP aims to solve all these obstacles by giving the Kafka operator the
> tools to collect this information.
>
> Regards,
> Magnus
>
>
> Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan :
>
> > Magnus, I think such a substantial change requires more motivation than
> is
> > currently provided. As I read it, the motivation boils down to this: you
> > want your clients to phone-home unless they opt-out. As stated in the
> KIP,
> > "there are plenty of existing solutions [...] to send metrics [...] to a
> > collector", so the opt-out appears to be the only motivation. Am I
> missing
> > something?
> >
> > Ryanne
> >
> > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill 
> wrote:
> >
> > > Hey all,
> > >
> > > I'm proposing KIP-714 to add remote Client metrics and observability.
> > > This functionality will allow centralized monitoring and
> troubleshooting
> > of
> > > clients and their internals.
> > >
> > > Please see
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > >
> > > Looking forward to your feedback!
> > >
> > > Regards,
> > > Magnus
> > >
> >
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-16 Thread Colin McCabe
Hi Magnus,

Thanks for the KIP. This is certainly something I've been wishing for for a 
while.

Maybe we should emphasize more that the metrics that are being gathered here 
are Kafka metrics, not general application business logic metrics. That seems 
like a point of confusion in some of the replies here. The analogy with a 
telecom gathering metrics about a DSL modem is a good one. These are really 
metrics about the Kafka cluster itself, very similar to the metrics we expose 
about the broker, controller, and so forth.

In my experience, most users want their Kafka clients to be "plug and play" -- 
they want to start up a Kafka client, and do some things. Their focus is on 
their application, not on the details of the infrastructure. If something 
goes wrong, they want the Kafka team to diagnose the problem and fix it, or at 
least tell them what the issue is. When the Kafka team tells them they need to 
install and maintain a third-party metrics system to diagnose the problem, this 
can be a very big disappointment. Many users don't have this level of expertise.

A few critiques:

- As I wrote above, I think this could benefit a lot by being split into 
several RPCs. A registration RPC, a report RPC, and an unregister RPC seem like 
logical choices.

- I don't think the client should be able to choose its own UUID. This adds 
complexity and introduces a chance that clients will choose an ID that is not 
unique. We already have an ID that the client itself supplies (clientID) so 
there is no need to introduce another such ID.

- I might be misunderstanding something here, but my reading of this is that 
the client chooses what metrics to send and the broker filters that on the 
broker-side. I think this is backwards -- the broker should inform the client 
about what it wants, and the client should send only that data. (Of course, the 
client may also not know what the broker is asking for, in which case it can 
choose to not send the data). We shouldn't have clients pumping out data that 
nobody wants to read. (sorry if I misinterpreted and this is already the 
case...)

- In general the schema seems to have a bad case of string-itis. UUID, content 
type, and requested metrics are all strings. Since these messages will be sent 
very frequently, it's quite costly to use strings for all these things. We have 
a type for UUID, which uses 16 bytes -- let's use that type for client instance 
ID, rather than a string which will be much larger. Also, since we already send 
clientID in the message header, there is no need to include it again in the 
instance ID.

- I think it would also be nice to have an enum or something for 
AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to 
these categories will require KIPs, so it should be straightforward for the 
project to just have an enum that allows us to communicate these as ints.

- Can you talk about whether you are adding any new library dependencies to the 
Kafka client? It seems like you'd want to add opencensus / opentelemetry, if we 
are using that format here.

- Standard client resource labels: can we send these only in the registration 
RPC?

best,
Colin

On Wed, Jun 16, 2021, at 08:27, Magnus Edenhill wrote:
> Hi Ryanne,
> 
> this proposal stems from a need to improve troubleshooting Kafka issues.
> 
> As it currently stands, when an application team is experiencing Kafka
> service degradation,
> or the Kafka operator is seeing misbehaving clients, there are plenty of
> steps that need
> to be taken before any client-side metrics can be observed at all, if at
> all:
>  - Is the application even collecting client metrics? If not, it needs to be
> reconfigured or implemented, and restarted;
>    a restart may have business impact, and may also (temporarily?) remedy the
> problem without giving any further insight
>    into what was wrong.
>  - Are the desired metrics collected? Where are they stored? For how long?
> Is there enough correlating information
>    to map it to cluster-side metrics and events? Does the application
> on-call know how to find the collected metrics?
>  - Export and send these metrics to whoever knows how to interpret them. In
> what format? Are all relevant metadata fields
>    provided?
> 
> The KIP aims to solve all these obstacles by giving the Kafka operator the
> tools to collect this information.
> 
> Regards,
> Magnus
> 
> 
> Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan :
> 
> > Magnus, I think such a substantial change requires more motivation than is
> > currently provided. As I read it, the motivation boils down to this: you
> > want your clients to phone-home unless they opt-out. As stated in the KIP,
> > "there are plenty of existing solutions [...] to send metrics [...] to a
> > collector", so the opt-out appears to be the only motivation. Am I missing
> > something?
> >
> > Ryanne
> >
> > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill  wrote:
> >
> > > Hey all,
> > >
> > > I'm proposing

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-16 Thread Colin McCabe
On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
> Hi! I have a few thoughts on this KIP. First, I'd like to thank you for 
> the writeup,
> clearly a lot of thought has gone into it and it is very thorough. 
> However, I'm not
> convinced it's the right approach from a fundamental level.
> 
> Fundamentally, this KIP seems like somewhat of a solution to an organizational
> problem. Metrics are organizational concerns, not Kafka operator concerns.

Hi Travis,

Metrics are certainly Kafka operator concerns. It is very important for cluster 
operators to know things like how many clients there are, what the clients are 
doing, and so forth. This information is needed to administer Kafka. Therefore 
it certainly falls in the domain of the Kafka operations team (and the Kafka 
development team.)

We have added many metrics in the past to make it easier to monitor clients. I 
think this is just another step in that direction.

> Clients should make it easy to plug in metrics (this is the approach I take in
> my own client), and organizations should have processes such that all clients
> gather and ship metrics how that organization desires.
>
> If an organization is set up correctly, there is no reason for metrics to be
> forwarded through Kafka. This feels like a solution to an organization not
> properly setting up how processes ship metrics, and in some ways, it's an
> overbroad solution, and in other ways, it doesn't cover the entire problem.

I think the reason was explained pretty clearly: many admins find it difficult 
to set up monitoring for every client in the organization. In general the team 
which maintains a Kafka cluster is often separate from the teams that use the 
cluster. Therefore rolling out monitoring for clients can be very difficult to 
coordinate.

No metrics will ever cover every possible use-case, but the set proposed here 
does seem useful.

> 
> From the perspective of Kafka operators, it is easy to see that this KIP is
> nice in that it just dictates what clients should support for metrics and that
> the metrics should ship through Kafka. But, from the perspective of an
> observability team, this workflow is basically hijacking the standard flow 
> that
> organizations may have. I would rather have applications collect metrics and
> ship them the same way every other application does. I'd rather not have to
> configure additional plugins within Kafka to take metrics and forward them.

This change doesn't remove any functionality. If you don't want to use KIP-714 
metrics collection, you can simply turn it off and continue collecting metrics 
the way you always have.
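
For example, the opt-out would be a single client configuration switch, along
these lines (the property name below is just an assumed placeholder for
illustration; the actual name is whatever the KIP defines):

import java.util.Properties;

// Sketch of the client-side opt-out. "enable.metrics.push" is an assumed
// placeholder name, not a confirmed config key.
public class MetricsOptOut {
    public static void main(String[] args) {
        Properties clientProps = new Properties();
        clientProps.put("bootstrap.servers", "broker:9092");
        clientProps.put("enable.metrics.push", "false"); // opt out of pushing metrics
        // Pass clientProps to the producer/consumer constructor as usual;
        // with the switch off, the client never pushes metrics to the brokers.
        System.out.println(clientProps);
    }
}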

> 
> More importantly, this KIP prescribes cardinality problems, requires that to
> officially support the KIP a client must support all relevant metrics within
> the KIP, and requires that a client cannot support other metrics unless those
> other metrics also go through a KIP process. It is difficult to imagine all of
> these metrics being relevant to every organization, and there is no way for an
> organization to filter what is relevant within the client. Instead, the
> filtering is pushed downwards, meaning more network IO and more CPU costs to
> filter what is irrelevant and aggregate what needs to be aggregated, and more
> time for an organization to set up whatever it is that will be doing this
> filtering and aggregating. Contrast this with a client that enables hooking in
> to capture numbers that are relevant within an org itself: the org can gather
> what they want, ship only what they want, and ship directly to the
> observability system they have already set up. As an aside, it may also be
> wise to avoid shipping metrics through Kafka about client interaction with
> Kafka, because if Kafka is having problems, then orgs lose insight into those
> problems. This would be like statuspage using itself for status on its own
> systems.
> 
> Another downside is that by dictating the important metrics, this KIP either
> has two choices: try to choose what is important to every org, and inevitably
> leave out something important to somebody else, or just add everything and let
> the orgs filter. This KIP mostly looks to go with the latter approach, meaning
> orgs will be shipping & filtering. With hooks, an org would be able to gather
> exactly what they want.

I actually do agree with this criticism to some extent. It would be good if the 
broker could specify what metrics it wants, and the clients would send only 
those metrics.

More generally, I'd like to see this split up into several RPCs rather than one 
mega-RPC.

Maybe something like 
1. RegisterClient{Request,Response}
2. ClientMetricsReport{Request,Response}
3. UnregisterClient{Request,Response}

Then the broker can communicate which metrics it wants in 
RegisterClientResponse. It can also assign a client instance ID (which I think 
should be a UUID, not another string).
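
To sketch the shape of that flow (field names are purely illustrative, not
proposed schemas; Java records are used only for brevity):

import java.util.List;
import java.util.UUID;

// Illustrative only: rough shapes for the three RPC pairs, not actual Kafka
// message schemas.
record RegisterClientRequest(String clientSoftwareName,
                             String clientSoftwareVersion) {}

record RegisterClientResponse(UUID clientInstanceId,          // assigned by the broker
                              List<String> requestedMetrics,  // what the broker wants
                              int pushIntervalMs) {}

record ClientMetricsReportRequest(UUID clientInstanceId,
                                  byte[] serializedMetrics) {}

record ClientMetricsReportResponse(short errorCode) {}

record UnregisterClientRequest(UUID clientInstanceId) {}

record UnregisterClientResponse(short errorCode) {}

The key point is that the broker hands back both the instance id and the set of
wanted metrics in the registration response, so the client only ever reports
what was asked for.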

> 
> As well, I expect that org applications have metrics on the state of the
>

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-16 Thread Magnus Edenhill
Hi Ryanne,

this proposal stems from a need to improve troubleshooting Kafka issues.

As it currently stands, when an application team is experiencing Kafka
service degradation,
or the Kafka operator is seeing misbehaving clients, there are plenty of
steps that need
to be taken before any client-side metrics can be observed at all, if at
all:
 - Is the application even collecting client metrics? If not, it needs to be
reconfigured or implemented, and restarted;
   a restart may have business impact, and may also (temporarily?) remedy the
problem without giving any further insight
   into what was wrong.
 - Are the desired metrics collected? Where are they stored? For how long?
Is there enough correlating information
   to map it to cluster-side metrics and events? Does the application
on-call know how to find the collected metrics?
 - Export and send these metrics to whoever knows how to interpret them. In
what format? Are all relevant metadata fields
   provided?

The KIP aims to solve all these obstacles by giving the Kafka operator the
tools to collect this information.

Regards,
Magnus


Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan :

> Magnus, I think such a substantial change requires more motivation than is
> currently provided. As I read it, the motivation boils down to this: you
> want your clients to phone-home unless they opt-out. As stated in the KIP,
> "there are plenty of existing solutions [...] to send metrics [...] to a
> collector", so the opt-out appears to be the only motivation. Am I missing
> something?
>
> Ryanne
>
> On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill  wrote:
>
> > Hey all,
> >
> > I'm proposing KIP-714 to add remote Client metrics and observability.
> > This functionality will allow centralized monitoring and troubleshooting
> of
> > clients and their internals.
> >
> > Please see
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >
> > Looking forward to your feedback!
> >
> > Regards,
> > Magnus
> >
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-16 Thread Magnus Edenhill
Thanks for your feedback, Travis!

I believe there are different audiences and uses for application (business
logic)
and client (infrastructure) metrics. Kafka clients are part of the
infrastructure,
not the business logic, and should be monitored as such by the organization,
sub-organization, or team that knows Kafka best and already does Kafka
monitoring - the Kafka operators.


So to be clear, this KIP does not cover application metrics, but Kafka
client metrics.
It in no way replaces or changes the way application metrics are
collected; they are
not relevant to the intended use.

An analogy from the telco space is CPEs (customer premises equipment),
e.g. an ADSL router in the customer's home. The network owner - the
infrastructure operator -
monitors the ADSL router metrics for queue pressure, latencies, error
rates, etc., which allows
the operator to effectively troubleshoot customer issues, scale the
network, and foresee
issues, completely without any intervention needed by the end user itself.
This is what we want to achieve with this KIP, extending the infrastructure
operator's
(aka the Kafka cluster operator) monitoring abilities to allow for
end-to-end troubleshooting and observability.


The collection model in the KIP is subscription-based; no metrics will be
collected by default.
Two things need to happen before anything is collected:
 - a metrics plugin needs to be configured on the brokers. This is a custom
plugin to
   serve whatever needs the operator might have for the metrics.
 - client metric subscriptions need to be configured through the Kafka
Admin API to
   select which metrics to collect (see the sketch below). The subscription
defines what metrics to collect and at
   what interval; this effectively puts filtering at the edge (client) to
spare central resources.
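
To give an idea of what configuring a subscription could look like on the
operator side, here is a rough sketch using the Admin API; the CLIENT_METRICS
resource type and the config keys shown are assumptions for illustration, not
a finalized interface:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ClientMetricsSubscriptionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // A named subscription: collect one metric prefix every 60s from
            // clients whose client.id starts with "order-service".
            ConfigResource subscription = new ConfigResource(
                ConfigResource.Type.CLIENT_METRICS, "producer-latency");

            Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("metrics",
                        "org.apache.kafka.producer.request.latency."),
                    AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("interval.ms", "60000"),
                    AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("match", "client_id=order-service.*"),
                    AlterConfigOp.OpType.SET));

            Map<ConfigResource, Collection<AlterConfigOp>> request =
                Map.of(subscription, ops);
            admin.incrementalAlterConfigs(request).all().get();
        }
    }
}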

This functionality is thus opt-in on the cluster side, and opt-out on the
client side, and
great care is taken not to expose any sensitive information in the metrics.


As for what needs to be implemented by a supporting client:
a supporting client does not need to implement all the defined metrics;
each client maintainer may choose
her own subset that makes sense for that given client implementation, and
it is fine to add metrics not
listed in the KIP as long as they're in the client's namespace.
But there's obviously value in having a shared set of common metrics that
all clients provide.
The goal is for all client implementations to support this.


Regards,
Magnus

Den mån 14 juni 2021 kl 16:24 skrev Travis Bischel :

> Hi! I have a few thoughts on this KIP. First, I'd like to thank you for
> the writeup,
> clearly a lot of thought has gone into it and it is very thorough.
> However, I'm not
> convinced it's the right approach from a fundamental level.
>
> Fundamentally, this KIP seems like somewhat of a solution to an
> organizational
> problem. Metrics are organizational concerns, not Kafka operator concerns.
> Clients should make it easy to plug in metrics (this is the approach I
> take in
> my own client), and organizations should have processes such that all
> clients
> gather and ship metrics how that organization desires. If an organization
> is
> set up correctly, there is no reason for metrics to be forwarded through
> Kafka.
> This feels like a solution to an organization not properly setting up how
> processes ship metrics, and in some ways, it's an overbroad solution, and
> in
> other ways, it doesn't cover the entire problem.
>
> From the perspective of Kafka operators, it is easy to see that this KIP is
> nice in that it just dictates what clients should support for metrics and
> that
> the metrics should ship through Kafka. But, from the perspective of an
> observability team, this workflow is basically hijacking the standard flow
> that
> organizations may have. I would rather have applications collect metrics
> and
> ship them the same way every other application does. I'd rather not have to
> configure additional plugins within Kafka to take metrics and forward them.
>
> More importantly, this KIP prescribes cardinality problems, requires that to
> officially support the KIP a client must support all relevant metrics
> within
> the KIP, and requires that a client cannot support other metrics unless
> those
> other metrics also go through a KIP process. It is difficult to imagine
> all of
> these metrics being relevant to every organization, and there is no way
> for an
> organization to filter what is relevant within the client. Instead, the
> filtering is pushed downwards, meaning more network IO and more CPU costs
> to
> filter what is irrelevant and aggregate what needs to be aggregated, and
> more
> time for an organization to set up whatever it is that will be doing this
> filtering and aggregating. Contrast this with a client that enables
> hooking in
> to capture numbers that are relevant within an org itself: the org can
> gather
> what they want, ship only what they want, and ship directly to the
> observability sys

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-14 Thread Ryanne Dolan
Magnus, I think such a substantial change requires more motivation than is
currently provided. As I read it, the motivation boils down to this: you
want your clients to phone-home unless they opt-out. As stated in the KIP,
"there are plenty of existing solutions [...] to send metrics [...] to a
collector", so the opt-out appears to be the only motivation. Am I missing
something?

Ryanne

On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill  wrote:

> Hey all,
>
> I'm proposing KIP-714 to add remote Client metrics and observability.
> This functionality will allow centralized monitoring and troubleshooting of
> clients and their internals.
>
> Please see
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>
> Looking forward to your feedback!
>
> Regards,
> Magnus
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-14 Thread Travis Bischel
Apologies for this duplicate reply, I did not notice the success confirmation 
on the first submission.

On 2021/06/14 04:52:11, Travis Bischel  wrote: 
> Hi! I have a few thoughts on this KIP. First, I'd like to thank you for your 
> work
> and writeup, it's clear that a lot of thought went into this and it's very 
> thorough!
> However, I'm not convinced it's the right approach from a fundamental level.
> 
> Fundamentally, this KIP seems like somewhat of a solution to an organizational
> problem. Metrics are organizational concerns, not Kafka operator concerns.
> Clients should make it easy to plug in metrics (this is the approach I take in
> my own client), and organizations should have processes such that all clients
> gather and ship metrics how that organization desires. If an organization is
> set up correctly, there is no reason for metrics to be forwarded through 
> Kafka.
> This feels like a solution to an organization not properly setting up how
> processes ship metrics, and in some ways, it's an overbroad solution, and in
> other ways, it doesn't cover the entire problem.
> 
> From the perspective of Kafka operators, it is easy to see that this KIP is
> nice in that it just dictates what clients should support for metrics and that
> the metrics should ship through Kafka. But, from the perspective of an
> observability team, this workflow is basically hijacking the standard flow 
> that
> organizations may have. I would rather have applications collect metrics and
> ship them the same way every other application does. I'd rather not have to
> configure additional plugins within Kafka to take metrics and forward them.
> 
> More importantly, this KIP prescribes cardinality problems, requires that to
> officially support the KIP a client must support all relevant metrics within
> the KIP, and requires that a client cannot support other metrics unless those
> other metrics also go through a KIP process. It is difficult to imagine all of
> these metrics being relevant to every organization, and there is no way for an
> organization to filter what is relevant within the client. Instead, the
> filtering is pushed downwards, meaning more network IO and more CPU costs to
> filter what is irrelevant and aggregate what needs to be aggregated, and more
> time for an organization to set up whatever it is that will be doing this
> filtering and aggregating. Contrast this with a client that enables hooking in
> to capture numbers that are relevant within an org itself: the org can gather
> what they want, ship only what they want, and ship directly to the
> observability system they have already set up. As an aside, it may also be
> wise to avoid shipping metrics through Kafka about client interaction with
> Kafka, because if Kafka is having problems, then orgs lose insight into those
> problems. This would be like statuspage using itself for status on its own
> systems.
> 
> Another downside is that by dictating the important metrics, this KIP either
> has two choices: try to choose what is important to every org, and inevitably
> leave out something important to somebody else, or just add everything and let
> the orgs filter. This KIP mostly looks to go with the latter approach, meaning
> orgs will be shipping & filtering. With hooks, an org would be able to gather
> exactly what they want.
> 
> As well, I expect that org applications have metrics on the state of the
> applications outside of the Kafka client. Applications are already sending
> non-Kafka-client related metrics outbound to observability systems. If a Kafka
> client provided hooks, then users could just gather the additional relevant
> Kafka client metrics and ship those metrics the same way they do all of their
> other metrics. It feels a bit odd for a Kafka client to have its own separate
> way of forwarding metrics. Another benefit of hooks in clients is that
> organizations do not _have_ to set up additional plugins to forward metrics
> from Kafka. Hooks avoid extra organizational work.
> 
> The option that the KIP provides for users of clients to opt out of metrics 
> may
> avoid some of the above issues (by just disabling things at the user level),
> but that's not really great from the perspective of client authors, because 
> the
> existence of this KIP forces authors to either just not implement the KIP, or
> increase complexity within the KIP. Further, from an operator perspective, if 
> I
> would prefer clients to ship metrics through the systems they already have in
> place, now I have to expect that anything that uses librdkafka or the official
> Java client will be shipping me metrics that I have to deal with (since the 
> KIP
> is default enabled).
> 
> Lastly, I'm a little wary that this KIP may stem from a product goal of
> Confluent: since most everything uses librdkafka or the Java client, then by
> defaulting clients sending metrics, Confluent gets an easy way to provide
> metric panels for a nice cloud UI. If any client does no

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-14 Thread Travis Bischel
Hi! I have a few thoughts on this KIP. First, I'd like to thank you for the 
writeup,
clearly a lot of thought has gone into it and it is very thorough. However, I'm 
not
convinced it's the right approach from a fundamental level.

Fundamentally, this KIP seems like somewhat of a solution to an organizational
problem. Metrics are organizational concerns, not Kafka operator concerns.
Clients should make it easy to plug in metrics (this is the approach I take in
my own client), and organizations should have processes such that all clients
gather and ship metrics how that organization desires. If an organization is
set up correctly, there is no reason for metrics to be forwarded through Kafka.
This feels like a solution to an organization not properly setting up how
processes ship metrics, and in some ways, it's an overbroad solution, and in
other ways, it doesn't cover the entire problem.

From the perspective of Kafka operators, it is easy to see that this KIP is
nice in that it just dictates what clients should support for metrics and that
the metrics should ship through Kafka. But, from the perspective of an
observability team, this workflow is basically hijacking the standard flow that
organizations may have. I would rather have applications collect metrics and
ship them the same way every other application does. I'd rather not have to
configure additional plugins within Kafka to take metrics and forward them.

More importantly, this KIP prescribes cardinality problems, requires that to
officially support the KIP a client must support all relevant metrics within
the KIP, and requires that a client cannot support other metrics unless those
other metrics also go through a KIP process. It is difficult to imagine all of
these metrics being relevant to every organization, and there is no way for an
organization to filter what is relevant within the client. Instead, the
filtering is pushed downwards, meaning more network IO and more CPU costs to
filter what is irrelevant and aggregate what needs to be aggregated, and more
time for an organization to set up whatever it is that will be doing this
filtering and aggregating. Contrast this with a client that enables hooking in
to capture numbers that are relevant within an org itself: the org can gather
what they want, ship only what they want, and ship directly to the
observability system they have already set up. As an aside, it may also be
wise to avoid shipping metrics through Kafka about client interaction with
Kafka, because if Kafka is having problems, then orgs lose insight into those
problems. This would be like statuspage using itself for status on its own
systems.

Another downside is that by dictating the important metrics, this KIP either
has two choices: try to choose what is important to every org, and inevitably
leave out something important to somebody else, or just add everything and let
the orgs filter. This KIP mostly looks to go with the latter approach, meaning
orgs will be shipping & filtering. With hooks, an org would be able to gather
exactly what they want.

As well, I expect that org applications have metrics on the state of the
applications outside of the Kafka client. Applications are already sending
non-Kafka-client related metrics outbound to observability systems. If a Kafka
client provided hooks, then users could just gather the additional relevant
Kafka client metrics and ship those metrics the same way they do all of their
other metrics. It feels a bit odd for a Kafka client to have its own separate
way of forwarding metrics. Another benefit of hooks in clients is that
organizations do not _have_ to set up additional plugins to forward metrics
from Kafka. Hooks avoid extra organizational work.
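
For reference, the kind of hook being described here could be as small as the
following (a purely hypothetical interface, not the API of any existing
client):

import java.util.Map;

// Purely hypothetical: a minimal metrics hook an application could register
// with its Kafka client and wire into whatever observability pipeline it
// already uses.
public interface ClientMetricsHook {
    /** Invoked by the client each time it records a metric sample. */
    void onMetricRecorded(String name, double value, Map<String, String> tags);
}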

The option that the KIP provides for users of clients to opt out of metrics may
avoid some of the above issues (by just disabling things at the user level),
but that's not really great from the perspective of client authors, because the
existence of this KIP forces authors to either just not implement the KIP, or
increase complexity within the KIP. Further, from an operator perspective, if I
would prefer clients to ship metrics through the systems they already have in
place, now I have to expect that anything that uses librdkafka or the official
Java client will be shipping me metrics that I have to deal with (since the KIP
is default enabled).

Lastly, I'm a little wary that this KIP may stem from a product goal of
Confluent: since most everything uses librdkafka or the Java client, then by
defaulting clients sending metrics, Confluent gets an easy way to provide
metric panels for a nice cloud UI. If any client does not want to support these
metrics, and then a user wonders why these hypothetical panels have no metrics,
then Confluent can just reply "use a supported client".  Even if this
(potentially unlikely) scenario is true, then hooks would still be a great
alternative, because then Confluent could provide drop-in hooks for any client
and

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-06-14 Thread Travis Bischel
Hi! I have a few thoughts on this KIP. First, I'd like to thank you for your 
work
and writeup, it's clear that a lot of thought went into this and it's very 
thorough!
However, I'm not convinced it's the right approach from a fundamental level.

Fundamentally, this KIP seems like somewhat of a solution to an organizational
problem. Metrics are organizational concerns, not Kafka operator concerns.
Clients should make it easy to plug in metrics (this is the approach I take in
my own client), and organizations should have processes such that all clients
gather and ship metrics how that organization desires. If an organization is
set up correctly, there is no reason for metrics to be forwarded through Kafka.
This feels like a solution to an organization not properly setting up how
processes ship metrics, and in some ways, it's an overbroad solution, and in
other ways, it doesn't cover the entire problem.

From the perspective of Kafka operators, it is easy to see that this KIP is
nice in that it just dictates what clients should support for metrics and that
the metrics should ship through Kafka. But, from the perspective of an
observability team, this workflow is basically hijacking the standard flow that
organizations may have. I would rather have applications collect metrics and
ship them the same way every other application does. I'd rather not have to
configure additional plugins within Kafka to take metrics and forward them.

More importantly, this KIP prescribes cardinality problems, requires that to
officially support the KIP a client must support all relevant metrics within
the KIP, and requires that a client cannot support other metrics unless those
other metrics also go through a KIP process. It is difficult to imagine all of
these metrics being relevant to every organization, and there is no way for an
organization to filter what is relevant within the client. Instead, the
filtering is pushed downwards, meaning more network IO and more CPU costs to
filter what is irrelevant and aggregate what needs to be aggregated, and more
time for an organization to set up whatever it is that will be doing this
filtering and aggregating. Contrast this with a client that enables hooking in
to capture numbers that are relevant within an org itself: the org can gather
what they want, ship only what they want, and ship directly to the
observability system they have already set up. As an aside, it may also be
wise to avoid shipping metrics through Kafka about client interaction with
Kafka, because if Kafka is having problems, then orgs lose insight into those
problems. This would be like statuspage using itself for status on its own
systems.

Another downside is that by dictating the important metrics, this KIP either
has two choices: try to choose what is important to every org, and inevitably
leave out something important to somebody else, or just add everything and let
the orgs filter. This KIP mostly looks to go with the latter approach, meaning
orgs will be shipping & filtering. With hooks, an org would be able to gather
exactly what they want.

As well, I expect that org applications have metrics on the state of the
applications outside of the Kafka client. Applications are already sending
non-Kafka-client related metrics outbound to observability systems. If a Kafka
client provided hooks, then users could just gather the additional relevant
Kafka client metrics and ship those metrics the same way they do all of their
other metrics. It feels a bit odd for a Kafka client to have its own separate
way of forwarding metrics. Another benefit of hooks in clients is that
organizations do not _have_ to set up additional plugins to forward metrics
from Kafka. Hooks avoid extra organizational work.

The option that the KIP provides for users of clients to opt out of metrics may
avoid some of the above issues (by just disabling things at the user level),
but that's not really great from the perspective of client authors, because the
existence of this KIP forces authors to either just not implement the KIP, or
increase complexity within the KIP. Further, from an operator perspective, if I
would prefer clients to ship metrics through the systems they already have in
place, now I have to expect that anything that uses librdkafka or the official
Java client will be shipping me metrics that I have to deal with (since the KIP
is default enabled).

Lastly, I'm a little wary that this KIP may stem from a product goal of
Confluent: since most everything uses librdkafka or the Java client, then by
defaulting clients sending metrics, Confluent gets an easy way to provide
metric panels for a nice cloud UI. If any client does not want to support these
metrics, and then a user wonders why these hypothetical panels have no metrics,
then Confluent can just reply "use a supported client".  Even if this
(potentially unlikely) scenario is true, then hooks would still be a great
alternative, because then Confluent could provide drop-in hooks for
