Hi Jun, thanks for your initiated questions, see my answers below. There's been a number of clarifications to the KIP.
Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao <j...@confluent.io.invalid>: > Hi, Magnus, > > Thanks for updating the KIP. The overall approach makes sense to me. A few > more detailed comments below. > > 20. ClientTelemetry: Should it be extending configurable and closable? > I'll pass this question to Sarat and/or Xavier. > 21. Compression of the metrics on the client: what's the default? > How about we specify a prioritized list: zstd, lz4, snappy, gzip? But ultimately it is up to what the client supports. 23. A client instance is considered a metric resource and the > resource-level (thus client instance level) labels could include: > client_software_name=confluent-kafka-python > client_software_version=v2.1.3 > client_instance_id=B64CD139-3975-440A-91D4 > transactional_id=someTxnApp > Are those labels added in PushTelemetryRequest? If so, are they per metric > or per request? > client_software* and client_instance_id are not added by the client, but available to the broker-side metrics plugin for adding as it see fits, remove them from the KIP. As for transactional_id, group_id, etc, which I believe will be useful in troubleshooting, are included only once (per push) as resource-level attributes (the client instance is a singular resource). > > 24. "the broker will only send > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" : > 24.1 If it's always true, does it need to be part of the protocol? > We're anticipating that it will take a lot longer to upgrade the majority of clients than the broker/plugin side, which is why we want the client to support both temporalities out-of-the-box so that cumulative reporting can be turned on seamlessly in the future. > 24.2 Does delta only apply to Counter type? > And Histograms. More details in Xavier's OTLP link. > 24.3 In the delta representation, the first request needs to send the full > value, how does the broker plugin know whether a value is full or delta? > The client may (should) send the start time for each metric sample, indicating when the metric began to be collected. We've discussed whether this should be the client instance start time or the time when a matching metric subscription for that metric is received. For completeness we recommend using the former, the client instance start time. > 25. quota: > 25.1 Since we are fitting PushTelemetryRequest into the existing request > quota, it would be useful to document the impact, i.e. client metric > throttling causes the data from the same client to be delayed. > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the > producer? > Yes, it should be, as to protect the cluster from rogue clients. But, in practice the size of metrics will be quite low (e.g., 1-10kb per 60s interval), so I don't think this will pose a problem. The KIP has been updated with more details on quota/throttling behaviour, see the "Throttling and rate-limiting" section. 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when > the request/bandwidth quota is exceeded since those requests are not > rejected. We only set this error when the request is rejected (e.g., topic > creation). It would be useful to clarify when this error is used. > Right, I was trying to reuse an existing error-code. We can introduce a new one for the case where a client pushes metrics at a higher frequency than the than the configured push interval (e.g., out-of-profile sends). This causes the broker to drop those metrics and send this error code back to the client. There will be no connection throttling / channel-muting in this case (unless the standard quotas are exceeded). > 27. kafka-client-metrics.sh: Could we add an example on how to disable a > bad client? > There's now a --block option to kafka-client-metrics.sh which overrides all subscriptions for the matched client(s). This allows silencing metrics for one or more clients without having to remove existing subscriptions. From the client's perspective it will look like it no longer has any subscriptions. # Block metrics collection for a specific client instance $ kafka-client-metrics.sh --bootstrap-server $BROKERS \ --add \ --name 'Disabe_b69cc35a' \ # A descriptive name makes it easier to clean up old subscriptions. --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \ # Match this specific client instance --block > 28. New broker side metrics: Could we spell out the details of the metrics > (e.g., group, tags, etc)? > KIP has been updated accordingly (thanks Sarat). > > 29. Client instance-level metrics: client.io.wait.time is a gauge not a > histogram. > I believe a population/distribution should preferably be represented as a histogram, space permitting, and only secondarily as a Gauge average. While we might not want to maintain a bunch of histograms for each partition, since that could be quite space consuming, this client.io.wait.time is a single metric per client instance and can thus afford a Histogram representation. Thanks, Magnus > Thanks, > > Jun > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <mag...@edenhill.se> > wrote: > > > Hi all, > > > > I've updated the KIP with responses to the latest comments: Java client > > dependencies (Thanks Kirk!), alternate designs (separate cluster, > separate > > producer, etc), etc. > > > > I will revive the vote thread. > > > > Thanks, > > Magnus > > > > > > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <ryannedo...@gmail.com > >: > > > > > I think we should be very careful about introducing new runtime > > > dependencies into the clients. Historically this has been rare and > > > essentially necessary (e.g. compression libs). > > > > > > Ryanne > > > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <k...@mustardgrain.com> wrote: > > > > > > > Hi Jun, > > > > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote: > > > > > 13. Using OpenTelemetry. Does that require runtime dependency > > > > > on OpenTelemetry library? How good is the compatibility story > > > > > of OpenTelemetry? This is important since an application could have > > > other > > > > > OpenTelemetry dependencies than the Kafka client. > > > > > > > > The current design is that the OpenTelemetry JARs would ship with the > > > > client. Perhaps we can design the client such that the JARs aren't > even > > > > loaded if the user has opted out. The user could even exclude the > JARs > > > from > > > > their dependencies if they so wished. > > > > > > > > I can't speak to the compatibility of the libraries. Is it possible > > that > > > > we include a shaded version? > > > > > > > > Thanks, > > > > Kirk > > > > > > > > > > > > > > 14. The proposal listed idempotence=true. This is more of a > > > configuration > > > > > than a metric. Are we including that as a metric? What other > > > > configurations > > > > > are we including? Should we separate the configurations from the > > > metrics? > > > > > > > > > > Thanks, > > > > > > > > > > Jun > > > > > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill < > mag...@edenhill.se> > > > > wrote: > > > > > > > > > > > Hey Bob, > > > > > > > > > > > > That's a good point. > > > > > > > > > > > > Request type labels were considered but since they're already > > tracked > > > > by > > > > > > broker-side metrics > > > > > > they were left out as to avoid metric duplication, however those > > > > metrics > > > > > > are not per connection, > > > > > > so they won't be that useful in practice for troubleshooting > > specific > > > > > > client instances. > > > > > > > > > > > > I'll add the request_type label to the relevant metrics. > > > > > > > > > > > > Thanks, > > > > > > Magnus > > > > > > > > > > > > > > > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett > > > > > > <bob.barr...@confluent.io.invalid>: > > > > > > > > > > > > > Hi Magnus, > > > > > > > > > > > > > > Thanks for the thorough KIP, this seems very useful. > > > > > > > > > > > > > > Would it make sense to include the request type as a label for > > the > > > > > > > `client.request.success`, `client.request.errors` and > > > > > > `client.request.rtt` > > > > > > > metrics? I think it would be very useful to see which specific > > > > requests > > > > > > are > > > > > > > succeeding and failing for a client. One specific case I can > > think > > > of > > > > > > where > > > > > > > this could be useful is producer batch timeouts. If a Java > > > > application > > > > > > does > > > > > > > not enable producer client logs (unfortunately, in my > experience > > > this > > > > > > > happens more often than it should), the application logs will > > only > > > > > > contain > > > > > > > the expiration error message, but no information about what is > > > > causing > > > > > > the > > > > > > > timeout. The requests might all be succeeding but taking too > long > > > to > > > > > > > process batches, or metadata requests might be failing, or some > > or > > > > all > > > > > > > produce requests might be failing (if the bootstrap servers are > > > > reachable > > > > > > > from the client but one or more other brokers are not, for > > > example). > > > > If > > > > > > the > > > > > > > cluster operator is able to identify the specific requests that > > are > > > > slow > > > > > > or > > > > > > > failing for a client, they will be better able to diagnose the > > > issue > > > > > > > causing batch timeouts. > > > > > > > > > > > > > > One drawback I can think of is that this will increase the > > > > cardinality of > > > > > > > the request metrics. But any given client is only going to use > a > > > > small > > > > > > > subset of the request types, and since we already have > partition > > > > labels > > > > > > for > > > > > > > the topic-level metrics, I think request labels will still make > > up > > > a > > > > > > > relatively small percentage of the set of metrics. > > > > > > > > > > > > > > Thanks, > > > > > > > Bob > > > > > > > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass < > > > > > > > viktorsomo...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Magnus, > > > > > > > > > > > > > > > > I think this is a very useful addition. We also have a > similar > > > (but > > > > > > much > > > > > > > > more simplistic) implementation of this. Maybe I missed it in > > the > > > > KIP > > > > > > but > > > > > > > > what about adding metrics about the subscription cache > itself? > > > > That I > > > > > > > think > > > > > > > > would improve its usability and debuggability as we'd be able > > to > > > > see > > > > > > its > > > > > > > > performance, hit/miss rates, eviction counts and others. > > > > > > > > > > > > > > > > Best, > > > > > > > > Viktor > > > > > > > > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill < > > > > mag...@edenhill.se> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Mickael, > > > > > > > > > > > > > > > > > > see inline. > > > > > > > > > > > > > > > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison < > > > > > > > > > mickael.mai...@gmail.com > > > > > > > > > >: > > > > > > > > > > > > > > > > > > > Hi Magnus, > > > > > > > > > > > > > > > > > > > > I see you've addressed some of the points I raised above > > but > > > > some > > > > > > (4, > > > > > > > > > > 5) have not been addressed yet. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Re 4) How will the user/app know metrics are being sent. > > > > > > > > > > > > > > > > > > One possibility is to add a JMX metric (thus for user > > > > consumption) > > > > > > for > > > > > > > > the > > > > > > > > > number of metric pushes the > > > > > > > > > client has performed, or perhaps the number of metrics > > > > subscriptions > > > > > > > > > currently being collected. > > > > > > > > > Would that be sufficient? > > > > > > > > > > > > > > > > > > Re 5) Metric sizes and rates > > > > > > > > > > > > > > > > > > A worst case scenario for a producer that is producing to > 50 > > > > unique > > > > > > > > topics > > > > > > > > > and emitting all standard metrics yields > > > > > > > > > a serialized size of around 100KB prior to compression, > which > > > > > > > compresses > > > > > > > > > down to about 20-30% of that depending > > > > > > > > > on compression type and topic name uniqueness. > > > > > > > > > The numbers for a consumer would be similar. > > > > > > > > > > > > > > > > > > In practice the number of unique topics would be far less, > > and > > > > the > > > > > > > > > subscription set would typically be for a subset of > metrics. > > > > > > > > > So we're probably closer to 1kb, or less, compressed size > per > > > > client > > > > > > > per > > > > > > > > > push interval. > > > > > > > > > > > > > > > > > > As both the subscription set and push intervals are > > controlled > > > > by the > > > > > > > > > cluster operator it shouldn't be too hard > > > > > > > > > to strike a good balance between metrics overhead and > > > > granularity. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm really uneasy with this being enabled by default on > the > > > > client > > > > > > > > > > side. When collecting data, I think the best practice is > to > > > > ensure > > > > > > > > > > users are explicitly enabling it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Requiring metrics to be explicitly enabled on clients > > severely > > > > > > cripples > > > > > > > > its > > > > > > > > > usability and value. > > > > > > > > > > > > > > > > > > One of the problems that this KIP aims to solve is for > useful > > > > metrics > > > > > > > to > > > > > > > > be > > > > > > > > > available on demand > > > > > > > > > regardless of the technical expertise of the user. As > Ryanne > > > > points, > > > > > > > out > > > > > > > > a > > > > > > > > > savvy user/organization > > > > > > > > > will typically have metrics collection and monitoring in > > place > > > > > > already, > > > > > > > > and > > > > > > > > > the benefits of this KIP > > > > > > > > > are then more of a common set and format metrics across > > client > > > > > > > > > implementations and languages. > > > > > > > > > But that is not the typical Kafka user in my experience, > > > they're > > > > not > > > > > > > > Kafka > > > > > > > > > experts and they don't have the > > > > > > > > > knowledge of how to best instrument their clients. > > > > > > > > > Having metrics enabled by default for this user base allows > > the > > > > Kafka > > > > > > > > > operators to proactively and reactively > > > > > > > > > monitor and troubleshoot client issues, without the need > for > > > the > > > > less > > > > > > > > savvy > > > > > > > > > user to do anything. > > > > > > > > > It is often too late to tell a user to enable metrics when > > the > > > > > > problem > > > > > > > > has > > > > > > > > > already occurred. > > > > > > > > > > > > > > > > > > Now, to be clear, even though metrics are enabled by > default > > on > > > > > > clients > > > > > > > > it > > > > > > > > > is not enabled by default > > > > > > > > > on the brokers; the Kafka operator needs to build and set > up > > a > > > > > > metrics > > > > > > > > > plugin and add metrics subscriptions > > > > > > > > > before anything is sent from the client. > > > > > > > > > It is opt-out on the clients and opt-in on the broker. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > You mentioned brokers already have > > > > > > > > > > some(most?) of the information contained in metrics, if > so > > > > then why > > > > > > > > > > are we collecting it again? Surely there must be some new > > > > > > information > > > > > > > > > > in the client metrics. > > > > > > > > > > > > > > > > > > > > > > > > > > > > From the user's perspective the Kafka infrastructure > extends > > > from > > > > > > > > > producer.send() to > > > > > > > > > messages being returned from consumer.poll(), a giant black > > box > > > > where > > > > > > > > > there's a lot going on between those > > > > > > > > > two points. The brokers currently only see what happens > once > > > > those > > > > > > > > requests > > > > > > > > > and messages hits the broker, > > > > > > > > > but as Kafka clients are complex pieces of machinery > there's > > a > > > > myriad > > > > > > > of > > > > > > > > > queues, timers, and state > > > > > > > > > that's critical to the operation and infrastructure that's > > not > > > > > > > currently > > > > > > > > > visible to the operator. > > > > > > > > > Relying on the user to accurately and timely provide this > > > missing > > > > > > > > > information is not generally feasible. > > > > > > > > > > > > > > > > > > > > > > > > > > > Most of the standard metrics listed in the KIP are data > > points > > > > that > > > > > > the > > > > > > > > > broker does not have. > > > > > > > > > Only a small number of metrics are duplicates (like the > > request > > > > > > counts > > > > > > > > and > > > > > > > > > sizes), but they are included > > > > > > > > > to ease correlation when inspecting these client metrics. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Moreover this is a brand new feature so it's even harder > to > > > > justify > > > > > > > > > > enabling it and forcing onto all our users. If disabled > by > > > > default, > > > > > > > > > > it's relatively easy to enable in a new release if we > > decide > > > > to, > > > > > > but > > > > > > > > > > once enabled by default it's much harder to disable. Also > > > this > > > > > > > feature > > > > > > > > > > will apply to all future metrics we will add. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think maturity of a feature implementation should be the > > > > deciding > > > > > > > > factor, > > > > > > > > > rather than > > > > > > > > > the design of it (which this KIP is). I.e., if the > > > > implementation is > > > > > > > not > > > > > > > > > deemed mature enough > > > > > > > > > for release X.Y it will be disabled. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Overall I think it's an interesting feature but I'd > prefer > > to > > > > be > > > > > > > > > > slightly defensive and see how it works in practice > before > > > > enabling > > > > > > > it > > > > > > > > > > everywhere. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Right, and I agree on being defensive, but since this > feature > > > > still > > > > > > > > > requires manual > > > > > > > > > enabling on the brokers before actually being used, I think > > > that > > > > > > gives > > > > > > > > > enough control > > > > > > > > > to opt-in or out of this feature as needed. > > > > > > > > > > > > > > > > > > Thanks for your comments! > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > Magnus > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Mickael > > > > > > > > > > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill < > > > > mag...@edenhill.se > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > Thanks David for pointing this out, > > > > > > > > > > > I've updated the KIP to include client_id as a matching > > > > selector. > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > Magnus > > > > > > > > > > > > > > > > > > > > > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao > > > > > > > > > <d...@confluent.io.invalid > > > > > > > > > > >: > > > > > > > > > > > > > > > > > > > > > > > Hey Magnus, > > > > > > > > > > > > > > > > > > > > > > > > I noticed that the KIP outlines the initial selectors > > > > supported > > > > > > > as: > > > > > > > > > > > > > > > > > > > > > > > > - client_instance_id - CLIENT_INSTANCE_ID UUID > > string > > > > > > > > > > representation. > > > > > > > > > > > > - client_software_name - client software > > > implementation > > > > > > name. > > > > > > > > > > > > - client_software_version - client software > > > > implementation > > > > > > > > > version. > > > > > > > > > > > > > > > > > > > > > > > > In the given reactive monitoring workflow, we mention > > > that > > > > the > > > > > > > > > > application > > > > > > > > > > > > user does not know their client's client instance ID, > > but > > > > it's > > > > > > > > > outlined > > > > > > > > > > > > that the operator can add a metrics subscription > > > selecting > > > > for > > > > > > > > > > clientId. I > > > > > > > > > > > > don't see clientId as one of the supported selectors. > > > > > > > > > > > > I can see how this would have made sense in a > previous > > > > > > iteration > > > > > > > > > given > > > > > > > > > > that > > > > > > > > > > > > the previous client instance ID proposal was to > > construct > > > > the > > > > > > > > client > > > > > > > > > > > > instance ID using clientId as a prefix. Now that the > > > client > > > > > > > > instance > > > > > > > > > > ID is > > > > > > > > > > > > a UUID, would we want to add clientId as a supported > > > > selector? > > > > > > > > > > > > Let me know what you think. > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill < > > > > > > > > mag...@edenhill.se > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Hi Mickael! > > > > > > > > > > > > > > > > > > > > > > > > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison > < > > > > > > > > > > > > > mickael.mai...@gmail.com > > > > > > > > > > > > > >: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Magnus, > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the proposal. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't > > > > > > "ClientInstanceId" > > > > > > > > > > expected > > > > > > > > > > > > > > to be a field in > > GetTelemetrySubscriptionsResponseV0? > > > > > > > > Otherwise, > > > > > > > > > > how > > > > > > > > > > > > > > does a client retrieve this value? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Good catch, it got removed by mistake in one of the > > > > edits. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. In the client API section, you mention a new > > > method > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which > > > interfaces > > > > are > > > > > > > > > > affected? > > > > > > > > > > > > > > Is it only Consumer and Producer? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > And Admin. Will update the KIP. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by > default. > > > > Even if > > > > > > > the > > > > > > > > > data > > > > > > > > > > > > > > collected is supposed to be not sensitive, I > think > > > > this can > > > > > > > be > > > > > > > > > > > > > > problematic in some environments. Also users > don't > > > > seem to > > > > > > > have > > > > > > > > > the > > > > > > > > > > > > > > choice to only expose some metrics. Knowing how > > much > > > > data > > > > > > > > transit > > > > > > > > > > > > > > through some applications can be considered > > critical. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The broker already knows how much data transits > > through > > > > the > > > > > > > > client > > > > > > > > > > > > though, > > > > > > > > > > > > > right? > > > > > > > > > > > > > Care has been taken not to expose information in > the > > > > standard > > > > > > > > > metrics > > > > > > > > > > > > that > > > > > > > > > > > > > might > > > > > > > > > > > > > reveal sensitive information. > > > > > > > > > > > > > > > > > > > > > > > > > > Do you have an example of how the proposed metrics > > > could > > > > leak > > > > > > > > > > sensitive > > > > > > > > > > > > > information? > > > > > > > > > > > > > As for limiting the what metrics to export; I guess > > > that > > > > > > could > > > > > > > > make > > > > > > > > > > sense > > > > > > > > > > > > > in some > > > > > > > > > > > > > very sensitive use-cases, but those users might > > disable > > > > > > metrics > > > > > > > > > > > > altogether > > > > > > > > > > > > > for now. > > > > > > > > > > > > > Could these concerns be addressed by a later KIP? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. As a user, how do you know if your application > > is > > > > > > actively > > > > > > > > > > sending > > > > > > > > > > > > > > metrics? Are there new metrics exposing what's > > going > > > > on, > > > > > > like > > > > > > > > how > > > > > > > > > > much > > > > > > > > > > > > > > data is being sent? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > That's a good question. > > > > > > > > > > > > > Since the proposed metrics interface is not aimed > at, > > > or > > > > > > > directly > > > > > > > > > > > > available > > > > > > > > > > > > > to, the application > > > > > > > > > > > > > I guess there's little point of adding it here, but > > > > instead > > > > > > > > adding > > > > > > > > > > > > > something to the > > > > > > > > > > > > > existing JMX metrics? > > > > > > > > > > > > > Do you have any suggestions? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. If all metrics are enabled on a regular > Consumer > > > or > > > > > > > > Producer, > > > > > > > > > do > > > > > > > > > > > > > > you have an idea how much throughput this would > > use? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It depends on the number of partition/topics/etc > the > > > > client > > > > > > is > > > > > > > > > > producing > > > > > > > > > > > > > to/consuming from. > > > > > > > > > > > > > I'll add some sizes to the KIP for some typical > > > > use-cases. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > Magnus > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill < > > > > > > > > > > mag...@edenhill.se> > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom > Bentley < > > > > > > > > > > tbent...@redhat.com > > > > > > > > > > > > >: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Magnus, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I reviewed the KIP since you called the vote > > > > (sorry for > > > > > > > not > > > > > > > > > > > > reviewing > > > > > > > > > > > > > > when > > > > > > > > > > > > > > > > you announced your intention to call the > > vote). I > > > > have > > > > > > a > > > > > > > > few > > > > > > > > > > > > > questions > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > some of the details. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. There's no Javadoc on > > > > ClientTelemetryPayload.data(), > > > > > > > so > > > > > > > > I > > > > > > > > > > don't > > > > > > > > > > > > > know > > > > > > > > > > > > > > > > whether the payload is exposed through this > > > method > > > > as > > > > > > > > > > compressed or > > > > > > > > > > > > > > not. > > > > > > > > > > > > > > > > Later on you say "Decompression of the > payloads > > > > will be > > > > > > > > > > handled by > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > broker metrics plugin, the broker should > > expose a > > > > > > > suitable > > > > > > > > > > > > > > decompression > > > > > > > > > > > > > > > > API to the metrics plugin for this purpose.", > > > which > > > > > > > > suggests > > > > > > > > > > it's > > > > > > > > > > > > the > > > > > > > > > > > > > > > > compressed data in the buffer, but then we > > don't > > > > know > > > > > > > which > > > > > > > > > > codec > > > > > > > > > > > > was > > > > > > > > > > > > > > used, > > > > > > > > > > > > > > > > nor the API via which the plugin should > > > decompress > > > > it > > > > > > if > > > > > > > > > > required > > > > > > > > > > > > for > > > > > > > > > > > > > > > > forwarding to the ultimate metrics store. > > Should > > > > the > > > > > > > > > > > > > > ClientTelemetryPayload > > > > > > > > > > > > > > > > expose a method to get the compression and a > > > > > > > decompressor? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Good point, updated. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. The client-side API is expressed as > > > > StringOrError > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int > > > timeout_ms). I > > > > > > > > > understand > > > > > > > > > > that > > > > > > > > > > > > > > you're > > > > > > > > > > > > > > > > thinking about the librdkafka implementation, > > but > > > > it > > > > > > > would > > > > > > > > be > > > > > > > > > > good > > > > > > > > > > > > to > > > > > > > > > > > > > > show > > > > > > > > > > > > > > > > the API as it would appear on the Apache > Kafka > > > > clients. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed it > > to > > > > Java. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol > > > > request > > > > > > used > > > > > > > > by > > > > > > > > > > the > > > > > > > > > > > > > > client to > > > > > > > > > > > > > > > > send metrics to any broker it is connected > to." > > > To > > > > be > > > > > > > > clear, > > > > > > > > > > this > > > > > > > > > > > > > means > > > > > > > > > > > > > > > > that the client can choose any of the > connected > > > > brokers > > > > > > > and > > > > > > > > > > push to > > > > > > > > > > > > > > just > > > > > > > > > > > > > > > > one of them? What should a supporting client > do > > > if > > > > it > > > > > > > gets > > > > > > > > an > > > > > > > > > > error > > > > > > > > > > > > > > when > > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending to > > the > > > > same > > > > > > > > broker > > > > > > > > > > or > > > > > > > > > > > > try > > > > > > > > > > > > > > > > pushing to another broker, or drop the > metrics? > > > > Should > > > > > > > > > > supporting > > > > > > > > > > > > > > clients > > > > > > > > > > > > > > > > send successive requests to a single broker, > or > > > > round > > > > > > > > robin, > > > > > > > > > > or is > > > > > > > > > > > > > > that up > > > > > > > > > > > > > > > > to the client author? I'm guessing the > > behaviour > > > > should > > > > > > > be > > > > > > > > > > sticky > > > > > > > > > > > > to > > > > > > > > > > > > > > > > support the rate limiting features, but I > think > > > it > > > > > > would > > > > > > > be > > > > > > > > > > good > > > > > > > > > > > > for > > > > > > > > > > > > > > client > > > > > > > > > > > > > > > > authors if this section were explicit on the > > > > > > recommended > > > > > > > > > > behaviour. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > You are right, I've updated the KIP to make > this > > > > clearer. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an > actual > > > > > > > application > > > > > > > > > > > > instance > > > > > > > > > > > > > > > > running on a (virtual) machine can be done by > > > > > > inspecting > > > > > > > > the > > > > > > > > > > > > metrics > > > > > > > > > > > > > > > > resource labels, such as the client source > > > address > > > > and > > > > > > > > source > > > > > > > > > > port, > > > > > > > > > > > > > or > > > > > > > > > > > > > > > > security principal, all of which are added by > > the > > > > > > > receiving > > > > > > > > > > broker. > > > > > > > > > > > > > > This > > > > > > > > > > > > > > > > will allow the operator together with the > user > > to > > > > > > > identify > > > > > > > > > the > > > > > > > > > > > > actual > > > > > > > > > > > > > > > > application instance." Is this really always > > > true? > > > > The > > > > > > > > source > > > > > > > > > > IP > > > > > > > > > > > > and > > > > > > > > > > > > > > port > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some setups. > > The > > > > > > > > principal, > > > > > > > > > as > > > > > > > > > > > > > already > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared between > > > > multiple > > > > > > > > > > > > applications. > > > > > > > > > > > > > > So at > > > > > > > > > > > > > > > > worst the organization running the clients > > might > > > > have > > > > > > to > > > > > > > > > > consult > > > > > > > > > > > > the > > > > > > > > > > > > > > logs > > > > > > > > > > > > > > > > of a set of client applications, right? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed > > mapping > > > > from > > > > > > > > > > > > > > client_instance_id > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > an actual instance, that's why the KIP > recommends > > > > client > > > > > > > > > > > > > implementations > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > log the client instance id > > > > > > > > > > > > > > > upon retrieval, and also provide an API for the > > > > > > application > > > > > > > > to > > > > > > > > > > > > retrieve > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > instance id programmatically > > > > > > > > > > > > > > > if it has a better way of exposing it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up > to > > > > 10x is > > > > > > > > > > possible for > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > standard metrics." Client authors might > > > appreciate > > > > your > > > > > > > > > > mentioning > > > > > > > > > > > > > > which > > > > > > > > > > > > > > > > compression codec got these results. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Good point. Updated. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6. "Should the client send a push request > prior > > > to > > > > > > expiry > > > > > > > > of > > > > > > > > > > the > > > > > > > > > > > > > > previously > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will > > discard > > > > the > > > > > > > > metrics > > > > > > > > > > and > > > > > > > > > > > > > > return a > > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set > to > > > > > > > > RateLimited." > > > > > > > > > > Is > > > > > > > > > > > > this > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not > > mentioned > > > > in > > > > > > the > > > > > > > > "New > > > > > > > > > > Error > > > > > > > > > > > > > > Codes" > > > > > > > > > > > > > > > > section. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > That's a leftover, it should be using the > > standard > > > > > > > > ThrottleTime > > > > > > > > > > > > > > mechanism. > > > > > > > > > > > > > > > Fixed. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. In the section "Standard client resource > > > labels" > > > > > > > > > > application_id > > > > > > > > > > > > is > > > > > > > > > > > > > > > > described as Kafka Streams only, but the > > section > > > of > > > > > > > "Client > > > > > > > > > > > > > > Identification" > > > > > > > > > > > > > > > > talks about "application instance id as an > > > optional > > > > > > > future > > > > > > > > > > > > > nice-to-have > > > > > > > > > > > > > > > > that may be included as a metrics label if it > > has > > > > been > > > > > > > set > > > > > > > > by > > > > > > > > > > the > > > > > > > > > > > > > > user", so > > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams > clients > > > > should > > > > > > set > > > > > > > > an > > > > > > > > > > > > > > application_id > > > > > > > > > > > > > > > > or not. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically we > > > would > > > > need > > > > > > > to > > > > > > > > > add > > > > > > > > > > an ` > > > > > > > > > > > > > > > application.id` config > > > > > > > > > > > > > > > property for non-streams clients for this > > purpose, > > > > and > > > > > > > that's > > > > > > > > > > outside > > > > > > > > > > > > > the > > > > > > > > > > > > > > > scope of this KIP since we want to make it > > > > zero-conf:ish > > > > > > on > > > > > > > > the > > > > > > > > > > > > client > > > > > > > > > > > > > > side. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Kind regards, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Tom > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the review, > > > > > > > > > > > > > > > Magnus > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus > Edenhill > > < > > > > > > > > > > mag...@edenhill.se > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I've updated the KIP following our recent > > > > discussions > > > > > > > on > > > > > > > > > the > > > > > > > > > > > > > mailing > > > > > > > > > > > > > > > > list: > > > > > > > > > > > > > > > > > - split the protocol in two, one for > getting > > > the > > > > > > > metrics > > > > > > > > > > > > > > subscriptions, > > > > > > > > > > > > > > > > > and one for pushing the metrics. > > > > > > > > > > > > > > > > > - simplifications: initially only one > > > supported > > > > > > > metrics > > > > > > > > > > format, > > > > > > > > > > > > no > > > > > > > > > > > > > > > > > client.id in the instance id, etc. > > > > > > > > > > > > > > > > > - made CLIENT_METRICS subscription > > > configuration > > > > > > > entries > > > > > > > > > > more > > > > > > > > > > > > > > structured > > > > > > > > > > > > > > > > > and allowing better client matching > > > selectors > > > > (not > > > > > > > > only > > > > > > > > > > on the > > > > > > > > > > > > > > > > instance > > > > > > > > > > > > > > > > > id, but also the other > > > > > > > > > > > > > > > > > client resource labels, such as > > > > > > > client_software_name, > > > > > > > > > > etc.). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Unless there are further comments I'll call > > the > > > > vote > > > > > > > in a > > > > > > > > > > day or > > > > > > > > > > > > > two. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > Magnus > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus > > > > Edenhill < > > > > > > > > > > > > > > mag...@edenhill.se>: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Gwen, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the > last > > > > couple > > > > > > of > > > > > > > > > > discussion > > > > > > > > > > > > > > points > > > > > > > > > > > > > > > > in > > > > > > > > > > > > > > > > > > this thread > > > > > > > > > > > > > > > > > > and will call the Vote later this week. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > > > > > Magnus > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen > > > Shapira > > > > > > > > > > > > > > > > > <g...@confluent.io.invalid > > > > > > > > > > > > > > > > > > >: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> Hey, > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> I noticed that there was no discussion > for > > > the > > > > > > last > > > > > > > 10 > > > > > > > > > > days, > > > > > > > > > > > > > but I > > > > > > > > > > > > > > > > > >> couldn't > > > > > > > > > > > > > > > > > >> find the vote thread. Is there one that > > I'm > > > > > > missing? > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> Gwen > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus > > > > Edenhill < > > > > > > > > > > > > > > mag...@edenhill.se> > > > > > > > > > > > > > > > > > >> wrote: > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev > > Colin > > > > > > McCabe < > > > > > > > > > > > > > > > > cmcc...@apache.org > > > > > > > > > > > > > > > > > >: > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng > > Min > > > > > > wrote: > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the > > > > discussion. > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless > design, > > > > Client > > > > > > > can > > > > > > > > > > pretty > > > > > > > > > > > > > much > > > > > > > > > > > > > > use > > > > > > > > > > > > > > > > > any > > > > > > > > > > > > > > > > > >> > > > connection to any broker to send > > > > metrics. We > > > > > > > are > > > > > > > > > not > > > > > > > > > > > > > > associating > > > > > > > > > > > > > > > > > >> > > connection > > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my > > > > > > understanding > > > > > > > > > > correct? > > > > > > > > > > > > If > > > > > > > > > > > > > > yes, > > > > > > > > > > > > > > > > > how > > > > > > > > > > > > > > > > > >> > about > > > > > > > > > > > > > > > > > >> > > > the following two scenarios > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) > registers > > > two > > > > > > > > different > > > > > > > > > > client > > > > > > > > > > > > > > > > instance > > > > > > > > > > > > > > > > > id > > > > > > > > > > > > > > > > > >> > via > > > > > > > > > > > > > > > > > >> > > > separate registration. Is it > > > permitted? > > > > If > > > > > > OK, > > > > > > > > how > > > > > > > > > > to > > > > > > > > > > > > > > > > distinguish > > > > > > > > > > > > > > > > > >> them > > > > > > > > > > > > > > > > > >> > > from > > > > > > > > > > > > > > > > > >> > > > the case 2 below. > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > Hi Feng, > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can > > > > clarify I > > > > > > > > guess, > > > > > > > > > is > > > > > > > > > > > > that > > > > > > > > > > > > > > you > > > > > > > > > > > > > > > > > could > > > > > > > > > > > > > > > > > >> > have > > > > > > > > > > > > > > > > > >> > > something like two Producer > instances > > > > running > > > > > > > with > > > > > > > > > the > > > > > > > > > > > > same > > > > > > > > > > > > > > > > > client.id > > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the > > same > > > > config > > > > > > > > file, > > > > > > > > > > for > > > > > > > > > > > > > > example). > > > > > > > > > > > > > > > > > >> They > > > > > > > > > > > > > > > > > >> > > could even be in the same process. > But > > > > they > > > > > > > would > > > > > > > > > get > > > > > > > > > > > > > separate > > > > > > > > > > > > > > > > > UUIDs. > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term > client > > to > > > > mean > > > > > > > > > > "Producer or > > > > > > > > > > > > > > > > > Consumer". > > > > > > > > > > > > > > > > > >> So > > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a > > > > Consumer in > > > > > > > your > > > > > > > > > > > > > > application I > > > > > > > > > > > > > > > > > would > > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for > > > both. > > > > > > Again > > > > > > > > > > Magnus can > > > > > > > > > > > > > > chime > > > > > > > > > > > > > > > > in > > > > > > > > > > > > > > > > > >> > here, I > > > > > > > > > > > > > > > > > >> > > guess. > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > That's correct. > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client > restarting? > > > > What's > > > > > > the > > > > > > > > > > > > > expectation? > > > > > > > > > > > > > > > > Should > > > > > > > > > > > > > > > > > >> the > > > > > > > > > > > > > > > > > >> > > > server expect the client to carry > a > > > > > > persisted > > > > > > > > > client > > > > > > > > > > > > > > instance id > > > > > > > > > > > > > > > > > or > > > > > > > > > > > > > > > > > >> > > should > > > > > > > > > > > > > > > > > >> > > > the client be treated as a new > > > instance? > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any > mechanism > > > for > > > > > > > > > > persistence, > > > > > > > > > > > > so I > > > > > > > > > > > > > > would > > > > > > > > > > > > > > > > > >> assume > > > > > > > > > > > > > > > > > >> > > that when you restart the client you > > get > > > > a new > > > > > > > > > UUID. I > > > > > > > > > > > > agree > > > > > > > > > > > > > > that > > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > > > >> > would > > > > > > > > > > > > > > > > > >> > > be good to spell this out. > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since > a > > > > client > > > > > > > > > instance > > > > > > > > > > > > can't > > > > > > > > > > > > > be > > > > > > > > > > > > > > > > > >> restarted. > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make this > > clearer. > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > /Magnus > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> -- > > > > > > > > > > > > > > > > > >> Gwen Shapira > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >