Hi Tom, Thanks for your comments. Sorry for the delay in responding. They’re still useful comments in spite of the fact that the voting has begun.
1) This is a good idea. I expect the client will emit the client instance ID as a log line. 2) I will add PROXY protocol support to the future work. I agree. 3) Thanks for the suggestion. 4) Yes, the client authenticates before it can send any of the RPCs in this KIP. 5) a) Yes, a rogue client could in principle send metrics to all brokers resulting in a lot of data being exported to the back end. Of course, a proper deployment of a client telemetry reporter plugin would be instrumented to help the operator diagnose this situation. b) There are already instances of the client sending compressed data to the broker. I think it is prudent to limit the maximum metrics payload size. I will update the KIP accordingly. 6) Yes, bundling and relocating. 7) I will add a broker metric to help with diagnosis. 8) I will add some clarifying text. If the broker does not have a configured metrics reporter that supports the new interface, it should not push metrics and will not receive a metrics subscription. I am thinking over the options for achieving this cleanly and will update the KIP. Thanks for your interest in the KIP. Thanks, Andrew > On 11 Aug 2023, at 09:48, Tom Bentley <tbent...@redhat.com> wrote: > > Hi Andrew, > > Thanks for picking this KIP up. I know you've started a vote, so these are > unhelpfully late... sorry about that, but hopefully they're still useful. > > 1. "The Kafka Java client provides an API for the application to read the > generated client instance id to assist in mapping and identification of the > client based on the collected metrics." In the multi-client, single-process > model perhaps it would be desirable to have the option of including this in > log messages emitted by the client library. > > 2. "Mapping the client instance id to an actual application instance > running on a (virtual) machine can be done by inspecting the metrics > resource labels, such as the client source address and source port, or > security principal, all of which are added by the receiving broker." The > specific example of correlation given here (source IP address) is > problematic in environments where there may be network proxies (e.g. > Kubernetes ingress) on the path between client and broker: The broker sees > the IP address of the proxy. This is a rough edge which could be smoothed > out if Kafka supported the PROXY protocol[1] which seems to have become > something of a defacto standard. I'm not suggesting this need to be part of > the KIP, but perhaps it could be added to Future Work? > [1]: http://www.haproxy.org/download/2.9/doc/proxy-protocol.txt > > 3. Compression... just an idle idea, but I wonder if a useful further > improvement in compression ratio could be achieve using zstd's support for > dictionary compression[2]. I.e. a client could initially use training mode > when sending metrics, but eventually include a dictionary to be used for > subsequent metric sends. It's probably not worth the effort (at least for > the initial implementation), but since you've gone to the effort of > providing some numbers anyway, maybe it's not much additional effort to at > least find out whether this makes a useful difference. > [2]: http://facebook.github.io/zstd/#small-data > > 4. Maybe I didn't spot it, but I assume the initial > GetTelemetrySubscriptionsRequest > happens after authentication? > > 5. Rogue clients -- There are some interesting things to consider if we're > trying to defend against a genuinely adversarial client. > > a) Client sends profiling information to all brokers at the maximum rate. > Each broker forwards to the time series DB. Obviously this scales linearly > with number of brokers, but it's clear that the load on the tsdb could be > many times larger than users might expect. > b) Client sends crafted compressed data which decompresses to require more > memory that the broker can allocate. > > 6. Shadowing the OLTP and protobuf jars -- to be clear by this you mean > both bundling _and_ relocating? > > 7. "In case the cluster load induced from metrics requests becomes > unmanageable the remedy is to temporarily remove or limit configured > metrics subscriptions. " How would a user know that the observed load was > due to handling metrics requests? > > 8. If I understand correctly, when the configured metrics reporter does not > implement the new interface the client would still follow the described > protocol only to have nowhere to send the metrics. Am I overlooking > something? > > Thanks again, > > Tom > > On Fri, 11 Aug 2023 at 07:52, Andrew Schofield < > andrew_schofield_j...@outlook.com> wrote: > >> Hi Doguscan, >> Thanks for your question. >> >> If the target broker is unreachable, the client can send the metrics to >> another >> broker. It can select any of the other brokers for this purpose. What I >> expect in >> practice is that it loses connection to the broker it’s being using for >> metrics, >> chooses or establishes a connection to another broker, and then selects >> that >> broker for subsequent metrics pushes. >> >> Thanks, >> Andrew >> >>> On 8 Aug 2023, at 08:34, Doğuşcan Namal <namal.dogus...@gmail.com> >> wrote: >>> >>> Thanks for your answers Andrew. I share your pain that it took a while >> for >>> you to get this KIP approved and you want to reduce the scope of it, will >>> be happy to help you with the implementation :) >>> >>> Could you help me walk through what happens if the target broker is >>> unreachable? Is the client going to drop these metrics or is it going to >>> send it to the other brokers it is connected to? This information is >>> crucial to understand the client side impact on leadership failovers. >>> Moreover, in case of partial outages, such as only the network between >> the >>> client and the broker is partitioned whereas the network within the >> cluster >>> is healthy, practically there is no other way than the client side >> metrics >>> to identify this problem. >>> >>> Doguscan >>> >>> On Fri, 4 Aug 2023 at 15:33, Andrew Schofield < >>> andrew_schofield_j...@outlook.com> wrote: >>> >>>> Hi Doguscan, >>>> Thanks for your comments. I’m glad to hear you’re interested in this >> KIP. >>>> >>>> 1) It is preferred that a client sends its metrics to the same broker >>>> connection >>>> but actually it is able to send them to any broker. As a result, if a >>>> broker becomes >>>> unhealthy, the client can push its metrics to any other broker. It seems >>>> to me that >>>> pushing to KRaft controllers instead just has the effect of increasing >> the >>>> load on >>>> the controllers, while still having the characteristic that an unhealthy >>>> controller >>>> would present inconvenience for collecting metrics. >>>> >>>> 2) When the `PushTelemetryRequest.Terminating` flag is set, the standard >>>> request >>>> throttling is not disabled. The metrics rate-limiting based on the push >>>> interval is >>>> not applied in this case for a single request for the combination of >>>> client instance ID >>>> and subscription ID. >>>> >>>> (I have corrected the KIP text because it erroneously said “client ID >> and >>>> subscription ID”. >>>> >>>> 3) While this is a theoretical problem, I’m not keen on adding yet more >>>> configurations >>>> to the broker or client. The `interval.ms` configuration on the >>>> CLIENT_METRICS >>>> resource could perhaps have a minimum and maximum value to prevent >>>> accidental >>>> misconfiguration. >>>> >>>> 4) One of the reasons that this KIP has taken so long to get to this >> stage >>>> is that >>>> it tried to do many things all at once. So, it’s greatly simplified >>>> compared with >>>> 6 months ago. I can see the value of collecting client configurations >> for >>>> problem >>>> determination, but I don’t want to make this KIP more complicated. I >> think >>>> the >>>> idea has merit as a separate follow-on KIP. I would be happy to >> collaborate >>>> with you on this. >>>> >>>> 5) The default is set to 5 minutes to minimise the load on the broker >> for >>>> situations >>>> in which the administrator didn’t set an interval on a metrics >>>> subscription. To >>>> use an interval of 1 minute, it is only necessary to set `interval.ms` >> on >>>> the metrics >>>> subscription to 60000ms. >>>> >>>> 6) Uncompressed data is always supported. The KIP says: >>>> "The CompressionType of NONE will not be >>>> "present in the response from the broker, though the broker does support >>>> uncompressed >>>> "client telemetry if none of the accepted compression codecs are >> supported >>>> by the client.” >>>> So in your example, the client need only use CompressionType=NONE. >>>> >>>> Thanks, >>>> Andrew >>>> >>>>> On 4 Aug 2023, at 14:04, Doğuşcan Namal <namal.dogus...@gmail.com> >>>> wrote: >>>>> >>>>> Hi Andrew, thanks a lot for this KIP. I was thinking of something >> similar >>>>> so thanks for writing this down 😊 >>>>> >>>>> >>>>> >>>>> Couple of questions related to the design: >>>>> >>>>> >>>>> >>>>> 1. Can we investigate the option for using the Kraft controllers >> instead >>>> of >>>>> the brokers for sending metrics? The disadvantage of sending these >>>> metrics >>>>> directly to the brokers tightly couples metric observability to data >>>> plane >>>>> availability. If the broker is unhealthy then the root cause of an >>>> incident >>>>> is clear however on partial failures it makes it hard to debug these >>>>> incidents from the brokers perspective. >>>>> >>>>> >>>>> >>>>> 2. Ratelimiting will be disable if the >> `PushTelemetryRequest.Terminating` >>>>> flag is set. However, this may cause unavailability on the broker if >> too >>>>> many clients are terminated at once, especially network threads could >>>>> become busy and introduce latency on the produce/consume on other >>>>> non-terminating clients connections. I think there is a room for >>>>> improvement here. If the client is gracefully shutting down, it could >>>> wait >>>>> for the request to be handled if it is being ratelimited, it doesn't >> need >>>>> to "force push" the metrics. For that reason, maybe we could define a >>>>> separate ratelimiting for telemetry data? >>>>> >>>>> >>>>> >>>>> 3. `PushIntervalMs` is set on the client side by a response from >>>>> `GetTelemetrySubscriptionsResponse`. If the broker sets this value to >> too >>>>> low, like 1msec, this may hog all of the clients activity and cause an >>>>> impact on the client side. I think we should introduce a configuration >>>> both >>>>> on the client and the broker side for the minimum and maximum numbers >> for >>>>> this value to fence out misconfigurations. >>>>> >>>>> >>>>> >>>>> 4. One of the important things I face during debugging the client side >>>>> failures is to understand the client side configurations. Can the >> client >>>>> sends these configs during the GetTelemetrySubscriptions request as >> well? >>>>> >>>>> >>>>> >>>>> Small comments: >>>>> >>>>> 5. Default PushIntervalMs is 5 minutes. Can we make it 1 minute >> instead? >>>> I >>>>> think 5 minutes of aggregated data is too not helpful in the world of >>>>> telemetry 😊 >>>>> >>>>> 6. UnsupportedCompressionType: Shall we fallback to non-compression >> mode >>>> in >>>>> that case? I think compression is nice to have, but non-compressed >>>>> telemetry data is valuable as well. Especially for low throughput >>>> clients, >>>>> compressing telemetry data may cause more CPU load then the actual data >>>>> plane work. >>>>> >>>>> >>>>> Thanks again. >>>>> >>>>> Doguscan >>>>> >>>>> >>>>> >>>>>> On Jun 13, 2023, at 8:06 AM, Andrew Schofield >>>>> >>>>>> <andrew_schofield_j...@outlook.com> wrote: >>>>> >>>>>> >>>>> >>>>>> Hi, >>>>> >>>>>> I would like to start a new discussion thread on KIP-714: Client >> metrics >>>>> and >>>>> >>>>>> observability. >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability >>>>> >>>>>> >>>>> >>>>>> I have edited the proposal significantly to reduce the scope. The >>>> overall >>>>> >>>>>> mechanism for client metric subscriptions is unchanged, but the >>>>> >>>>>> KIP is now based on the existing client metrics, rather than >> introducing >>>>> new >>>>> >>>>>> metrics. The purpose remains helping cluster operators >>>>> >>>>>> investigate performance problems experienced by clients without >>>> requiring >>>>> >>>>>> changes to the client application code or configuration. >>>>> >>>>>> >>>>> >>>>>> Thanks, >>>>> >>>>>> Andrew