+1 (binding)
On 9/13/23 5:48 PM, Jason Gustafson wrote:
Hey Andrew,
+1 on the KIP. For many users of Kafka, it may not be fully understood how
much of a challenge client monitoring is. With tens of clients in a
cluster, it is already difficult to coordinate metrics collection. When
there are thousands of clients, and when the cluster operator has no
control over them, it is essentially impossible. For the fat clients that
we have, the lack of useful telemetry is a huge operational gap.
Consistency between clients has also been a major challenge. I think the
effort toward standardization in this KIP will have some positive impact
even in deployments which have effective client-side monitoring. Overall, I
think this proposal will provide a lot of value across the board.
Best,
Jason
On Wed, Sep 13, 2023 at 9:50 AM Philip Nee <philip...@gmail.com> wrote:
Hey Andrew -
Thank you for taking the time to reply to my questions. I'm just adding
some notes to this discussion.
1. epoch: It can be helpful to know the delta between the client-side epoch and the actual leader epoch. It helps to understand why a commit sometimes fails or the client is not making progress.
2. Client connection: If the client selects the "wrong" connection to push out the data, I assume the request would time out, which should lead to disconnecting from the node and reselecting another node, as you mentioned, via the least-loaded node.
Cheers,
P
On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:
Hi Philip,
Thanks for your vote and interest in the KIP.
KIP-714 does not introduce any new client metrics, and that’s intentional. It does describe how all of the existing client metrics can have their names transformed into equivalent “telemetry metric names”, which can then potentially be used in metrics subscriptions.
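To give a flavour of the naming scheme (the exact transformation rules and the full list of resulting names are defined in the KIP, so please treat this as a rough, hypothetical sketch rather than the definitive mapping):

// Hypothetical sketch only -- the authoritative naming rules are in KIP-714.
// Assumes the general idea of prefixing "org.apache.kafka.", dropping the
// "-metrics" suffix from the Kafka metric group, and turning '-' into '.'.
public final class TelemetryMetricNames {
    private TelemetryMetricNames() { }

    public static String toTelemetryName(String group, String name) {
        String groupPart = group.endsWith("-metrics")
                ? group.substring(0, group.length() - "-metrics".length())
                : group;
        return "org.apache.kafka." + groupPart.replace('-', '.') + "." + name.replace('-', '.');
    }

    public static void main(String[] args) {
        // e.g. producer-metrics / connection-count -> org.apache.kafka.producer.connection.count
        System.out.println(toTelemetryName("producer-metrics", "connection-count"));
    }
}

A metrics subscription can then refer to these telemetry names, or prefixes of them.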
I am interested in the idea of the client’s leader epoch in this context, but I don’t have an immediate plan for how best to do this, and it would take another KIP to enhance existing metrics or introduce some new ones. Those would then naturally be applicable to the metrics push introduced in KIP-714.
In a similar vein, there are no existing client metrics specifically for auto-commit. We could add them to Kafka, but I really think this is just an example of asynchronous commit in which the application has decided not to specify when the commit should begin.
It is possible to increase the cadence of pushing by modifying the interval.ms configuration property of the CLIENT_METRICS resource.
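For example, something along these lines using the Admin client might do it, assuming the CLIENT_METRICS resource type and the "metrics"/"interval.ms" subscription properties are surfaced through the Admin API as described in the KIP (the subscription name and values below are purely illustrative):

// Sketch only: assumes KIP-714's CLIENT_METRICS resource type and the
// "metrics" and "interval.ms" subscription properties are exposed through
// the Admin API as the KIP describes. Names and values are illustrative.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ClientMetricsSubscriptionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // "consumer-metrics-subscription" is just an illustrative name.
            ConfigResource resource = new ConfigResource(
                    ConfigResource.Type.CLIENT_METRICS, "consumer-metrics-subscription");

            Collection<AlterConfigOp> ops = List.of(
                    // Subscribe to all consumer telemetry metrics by prefix.
                    new AlterConfigOp(new ConfigEntry("metrics", "org.apache.kafka.consumer."),
                                      AlterConfigOp.OpType.SET),
                    // Push every 30 seconds instead of the default interval.
                    new AlterConfigOp(new ConfigEntry("interval.ms", "30000"),
                                      AlterConfigOp.OpType.SET));

            admin.incrementalAlterConfigs(Map.of(resource, ops)).all().get();
        }
    }
}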
There is an “assigned-partitions” metric for each consumer, but not one for active partitions. We could add one, again as a follow-on KIP.
I take your point about holding on to a connection in a channel which might experience congestion. Do you have a suggestion for how to improve on this? For example, the client does have the concept of a least-loaded node. Maybe this is something we should investigate in the implementation and decide on the best approach.
In general, I think sticking with the same node for consecutive pushes is best, but if you choose the “wrong” node to start with, it’s not ideal.
Thanks,
Andrew
On 8 Sep 2023, at 19:29, Philip Nee <philip...@gmail.com> wrote:
Hey Andrew -
+1 but I don't have a binding vote!
It took me a while to go through the KIP. Here are some of my notes from reading it:
*Metrics*
- Should we care about the client's leader epoch? There is a case where the user recreates the topic, but the consumer thinks it is still the same topic and therefore attempts to start from an offset that doesn't exist. KIP-848 addresses this issue, but I can still see some potential benefits from knowing the client's epoch information.
- I assume poll idle is similar to poll interval: I needed to read the
description a few times.
- I don't have a clear use case in mind for the commit latency, but I do think people sometimes lack clarity about how much progress was tracked by the auto-commit. Would tracking auto-commit-related metrics be useful? I was thinking of the last offset committed or the actual cadence in ms.
- Are there cases where we need to increase the cadence of telemetry data pushes, i.e. a variable interval?
- Thanks for implementing the randomized initial metric push; I think it is really important.
- Is there a potential use case for tracking the number of active partitions? The consumer can pause partitions via the API, during revocation, or during an offset reset for streams.
*Connections*:
- The KIP states that the client will keep the same connection until it is disconnected. I wonder if that could potentially cause congestion if it is already a busy channel, which could lead to a connection timeout and subsequently disconnection.
Thanks,
P
On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:
Bumping the voting thread for KIP-714.
So far, we have:
Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)
Thanks,
Andrew
On 4 Aug 2023, at 09:45, Andrew Schofield <andrew_schofi...@live.com> wrote:
Hi,
After almost 2 1/2 years in the making, I would like to call a vote for KIP-714 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability).
This KIP aims to improve monitoring and troubleshooting of client
performance by enabling clients to push metrics to brokers.
I’d like to thank everyone who participated in the discussion, especially the librdkafka team, since one of the aims of the KIP is to enable any client to participate, not just the Apache Kafka project’s Java clients.
Thanks,
Andrew