Hi Jorge,

Thanks for the updates, and apologies for the delay. The new diagram
directly under the "Proposed Changes" section is absolutely gorgeous!


Follow-ups:

RE 2: Good point. We can use the same level for these metrics; it's not a
big deal.

RE 3: As long as all the per-record metrics are kept at DEBUG level, it
should be fine to leave JMH benchmarking for a follow-up. If we want to add
new per-record, INFO-level metrics, I would be more comfortable with
including benchmarking as part of the testing plan for the KIP. One
possible compromise could be to propose that these features be merged at
DEBUG level, and then possibly upgraded to INFO level in the future pending
benchmarks to guard against performance degradation.
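
For context on what such a benchmark would be measuring: with Max and Avg in
one sensor, each per-record Sensor::record call boils down to a couple of
primitive updates. A simplified, self-contained model of that hot path (an
assumption-laden sketch, not the real Sensor/SampledStat API):

```java
// Simplified model of what a sensor with Max and Avg stats does per record.
// This is NOT the real Kafka Sensor API; it's just a sketch of the hot-path
// cost a JMH benchmark would measure.
final class LatencySensorModel {
    private double max = Double.NEGATIVE_INFINITY; // Max-style stat
    private double sum = 0.0;                      // Avg keeps a running sum...
    private long count = 0;                        // ...and a count

    void record(double latencyMs) {
        // Two cheap updates per call; the open question is whether this is
        // still negligible when invoked once per record in a large batch.
        max = Math.max(max, latencyMs);
        sum += latencyMs;
        count++;
    }

    double max() { return max; }

    double avg() { return count == 0 ? Double.NaN : sum / count; }
}
```

Even if each call costs only tens of nanoseconds, per-record recording
multiplies that by the batch size relative to the existing per-batch metrics,
which is exactly what a JMH run would quantify.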

RE 4: I think for a true "end-to-end" metric, it'd be useful to include the
time taken by the task to actually deliver the record. However, with the
new metric names and descriptions provided in the KIP, I have no objections
to what's currently proposed, and a new "end-to-end" metric can be taken
on later in a follow-up KIP.

RE 6: You're right, existing producer metrics should be enough for now. We
can revisit this later if/when we add delivery-centric metrics for sink
tasks as well.

RE 7: The new metric names in the KIP LGTM; I don't see any need to expand
beyond those, but if you'd still like to pursue others, LMK.


New thoughts:

One small thought: instead of "alias" in alias="{transform_alias}" for the
per-transform metrics, could we use "transform"? IMO it's clearer, since we
don't use "alias" in the names of transform-related properties, and "alias"
may be confused with the classloading term, where you can use, e.g.,
"FileStreamSource" as the name of a connector class in a connector config
instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
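
To make the suggestion concrete, a hypothetical before/after of the tag
(metric group and tag values here are illustrative, not final names from the
KIP):

```
# Current proposal:
kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}",alias="{transform_alias}"

# Suggested:
kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}",transform="{transform_alias}"
```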


Cheers,

Chris

On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks Mickael!
>
>
> On Wed, 9 Nov 2022 at 15:54, Mickael Maison <mickael.mai...@gmail.com>
> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the KIP, it is a nice improvement.
> >
> > 1) The per transformation metrics still have a question mark next to
> > them in the KIP. Do you want to include them? If so, we'll want to tag
> > them; we should be able to include the aliases in TransformationChain
> > and use them.
> >
>
> Yes, I have added the changes to TransformChain that will be needed to add
> these metrics.
>
>
> >
> > 2) I see no references to predicates. If we don't want to measure
> > their latency, can we say it explicitly?
> >
>
> Good question, I haven't considered these. Though as these are materialized
> as PredicatedTransformation, they should be covered by these changes.
> Adding a note about this.
>
>
> >
> > 3) Should we have sink-record-batch-latency-avg-ms? All other metrics
> > have both the maximum and average values.
> >
> >
> Good question. I will remove it and change the record latency from
> DEBUG->INFO as it already covers the maximum metric.
>
> Hope it's clearer now, let me know if there is any additional feedback.
> Thanks!
>
>
>
> > Thanks,
> > Mickael
> >
> > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > <quilcate.jo...@gmail.com> wrote:
> > >
> > > Thanks, Chris! Great feedback! Please, find my comments below:
> > >
> > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton <chr...@aiven.io.invalid>
> > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the KIP. I agree with the overall direction and think this
> > would
> > > > be a nice improvement to Kafka Connect. Here are my initial thoughts
> > on the
> > > > details:
> > > >
> > > > 1. The motivation section outlines the gaps in Kafka Connect's task
> > metrics
> > > > nicely. I think it'd be useful to include more concrete details on
> why
> > > > these gaps need to be filled in, and in which cases additional
> metrics
> > > > would be helpful. One goal could be to provide enhanced monitoring of
> > > > production deployments that allows for cluster administrators to set
> up
> > > > automatic alerts for latency spikes and, if triggered, quickly
> > identify the
> > > > root cause of those alerts, reducing the time to remediation. Another
> > goal
> > > > could be to provide more insight to developers or cluster
> > administrators
> > > > who want to do performance testing on connectors in non-production
> > > > environments. It may help guide our decision making process to have a
> > > > clearer picture of the goals we're trying to achieve.
> > > >
> > >
> > > Agree. The Motivation section has been updated.
> > > Thanks for the examples, I see both of them being covered by the KIP.
> > > I see how these could give us a good distinction on whether to position
> > > some metrics at INFO or DEBUG level.
> > >
> > >
> > > > 2. If we're trying to address the alert-and-diagnose use case, it'd
> be
> > > > useful to have as much information as possible at INFO level, rather
> > than
> > > > forcing cluster administrators to possibly reconfigure a connector to
> > emit
> > > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > > production-impacting performance bottleneck. I can see the rationale
> > for
> > > > emitting per-record metrics that track an average value at DEBUG
> > level, but
> > > > for per-record metrics that track a maximum value, is there any
> reason
> > not
> > > > to provide this information at INFO level?
> > > >
> > >
> > > Agree. Though with Max and Avg metrics being part of the same sensor —
> > > where Metric Level is defined — both metrics get the same level.
> > >
> > >
> > > > 3. I'm also curious about the performance testing suggested by Yash
> to
> > > > gauge the potential impact of this change. Have you been able to do
> any
> > > > testing with your draft implementation yet?
> > > >
> > >
> > > No, not so far.
> > > I think it would be valuable to discuss the scope of this testing and
> > maybe
> > > tackle it
> > > in a separate issue as Sensors and Metrics are used all over the place.
> > > My initial understanding is that these tests should be placed in the
> > > jmh-benchmarks[1].
> > > Then, we could target testing Sensors and Metrics, and validate how
> much
> > > overhead
> > > is added by having only Max vs Max,Avg(,Min), etc.
> > > On the other hand, we could extend this to Transformers or other
> Connect
> > > layers.
> > >
> > > Here are some pointers to the Sensors and Metrics implementations that
> > > could be considered:
> > > Path to metric recording:
> > > -
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > -
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > >
> > > ```
> > > // increment all the stats
> > > for (StatAndConfig statAndConfig : this.stats) {
> > >    statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
> > > }
> > > ```
> > >
> > > SampledStats:
> > > - Avg:
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > - Max:
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > - Min:
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > >
> > > `stat#record()` is implemented by the `update` method in SampledStat:
> > >
> > > ```Max.java
> > >     @Override
> > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > >         sample.value = Math.max(sample.value, value);
> > >     }
> > > ```
> > >
> > > ```Avg.java
> > >     @Override
> > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > >         sample.value += value;
> > >     }
> > > ```
> > >
> > > As far as I understand, most of the work of the stats happens in the
> > > `combine` method, which is not part of the connector execution but called
> > > when metrics are queried.
> > >
> > > I wonder whether we should consider Avg and Max for all metrics
> proposed
> > as
> > > the impact on the execution path seems minimal, and even see if Min is
> > also
> > > valuable, and use DEBUG only for more granular metrics.
> > >
> > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > >
> > >
> > > > 4. Just to make sure I understand correctly--does "time when it has
> > been
> > > > received by the Sink task" refer to the wallclock time directly
> after a
> > > > call to SinkTask::put has been completed (as opposed to directly
> before
> > > > that call is made, or something else entirely)?
> > > >
> > >
> > > It currently means when it has been received by the Sink task
> > > right after consumer poll and before conversions.
> > > Would it be valuable to have it after put-sink-records?
> > >
> > >
> > > > 5. If the goal is to identify performance bottlenecks (either in
> > production
> > > > or pre-production environments), would it make sense to introduce
> > metrics
> > > > for each individual converter (i.e., key/value/header) and
> > transformation?
> > > > It's definitely an improvement to be able to identify the total time
> > for
> > > > conversion and transformation, but then the immediate follow-up
> > question if
> > > > a bottleneck is found in that phase is "which
> converter/transformation
> > is
> > > > responsible?" It'd be nice if we could provide a way to quickly
> answer
> > that
> > > > question.
> > > >
> > >
> > > This is a great idea. I'd like to consider this as well, though maybe
> > > these more granular metrics would be good to have at DEBUG level.
> > >
> > >
> > > > 6. Any thoughts about offering latency metrics for source tasks
> between
> > > > receipt of the record from the task and delivery of the record to
> Kafka
> > > > (which would be tracked by producer callback)? We could also use the
> > record
> > > > timestamp either instead of or in addition to receipt time if the
> task
> > > > provides a timestamp with its records.
> > > >
> > >
> > > With source transform and convert metrics we get part of that latency.
> > > Looking at the Producer metrics, `request-latency` (though a very
> generic
> > > metric)
> > > sort of captures the time between send request and ack — if my
> > understanding
> > > is correct.
> > > Would these be enough, or are you thinking about another approach?
> > > maybe a custom metric to cover the producer side?
> > >
> > >
> > > > 7. We may end up introducing a way for sink tasks to record
> per-record
> > > > delivery to the sink system (see KIP-767 [1]). I'd like it if we
> could
> > keep
> > > > the names of our metrics very precise in order to avoid confusing
> users
> > > > (who may think that we're providing metrics on actual delivery to the
> > sink
> > > > system, which may not be the case if the connector performs
> > asynchronous
> > > > writes), and in order to leave room for a metrics on true delivery
> > time by
> > > > sink tasks. It'd also be nice if we could remain consistent with
> > existing
> > > > metrics such as "put-batch-avg-time-ms". With that in mind, what do
> you
> > > > think about renaming these metrics:
> > > > - "sink-record-batch-latency-max-ms" to "put-batch-avg-latency-ms"
> > > > - "sink-record-latency-max-ms" to "put-sink-record-latency-max-ms"
> > > > - "sink-record-latency-avg-ms" to "put-sink-record-latency-avg-ms"
> > > > - "sink-record-convert-transform-time-max-ms" to
> > > > "convert-transform-sink-record-time-max-ms"
> > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > "convert-transform-sink-record-time-avg-ms"
> > > > - "source-record-transform-convert-time-max-ms" to
> > > > "transform-convert-source-record-time-max-ms"
> > > > - "source-record-transform-convert-time-avg-ms" to
> > > > "transform-convert-source-record-time-avg-ms"
> > > >
> > >
> > > Makes sense, thanks! I have updated the list of metrics, grouping them
> > > by sensor and applying these suggestions.
> > > The only ones that I want to review are: sink-record-* to put-batch-*
> > > (first 3). Not sure if put-batch/put-sink-record describes the purpose of
> > > the metric — nor does `sink-record-latency`, to be honest.
> > > My initial thought was to have something like Kafka Streams
> e2e-latency.
> > > Based on 4. and 6. questions, an idea could be to add:
> > > - source-batch-e2e-latency-before-send: measure wallclock - source
> record
> > > timestamp after source connector poll.
> > > - source-batch-e2e-latency-after-send: measure wallclock - record
> > timestamp
> > > on producer send callback
> > > - sink-batch-e2e-latency-before-put: measure time wallclock - record
> > > timestamp after consumer poll
> > > - sink-batch-e2e-latency-after-put: measure time wallclock - record
> > > timestamp after sink connector put.
> > >
> > >
> > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > [1] -
> > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > >
> > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jo...@gmail.com> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I've made a slight addition to the KIP based on Yash feedback:
> > > > >
> > > > > - A new metric is added at INFO level to record the max latency
> from
> > the
> > > > > batch timestamp, by keeping the oldest record timestamp per batch.
> > > > > - A draft implementation is linked.
> > > > >
> > > > > Looking forward to your feedback.
> > > > > Also, a kindly reminder that the vote thread is open.
> > > > >
> > > > > Thanks!
> > > > > Jorge.
> > > > >
> > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jo...@gmail.com> wrote:
> > > > >
> > > > > > Great. I have updated the KIP to reflect this.
> > > > > >
> > > > > > Cheers,
> > > > > > Jorge.
> > > > > >
> > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <yash.ma...@gmail.com>
> > wrote:
> > > > > >
> > > > > >> Thanks, I think it makes sense to define these metrics at a
> DEBUG
> > > > > >> recording
> > > > > >> level.
> > > > > >>
> > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
> > > > > >> quilcate.jo...@gmail.com> wrote:
> > > > > >>
> > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> yash.ma...@gmail.com>
> > > > > wrote:
> > > > > >> >
> > > > > >> > > Hi Jorge,
> > > > > >> > >
> > > > > >> > > Thanks for the changes. With regard to having per batch vs
> per
> > > > > record
> > > > > >> > > metrics, the additional overhead I was referring to wasn't
> > about
> > > > > >> whether
> > > > > >> > or
> > > > > >> > > not we would need to iterate over all the records in a
> batch.
> > I
> > > > was
> > > > > >> > > referring to the potential additional overhead caused by the
> > > > higher
> > > > > >> > volume
> > > > > >> > > of calls to Sensor::record on the sensors for the new
> metrics
> > (as
> > > > > >> > compared
> > > > > >> > > to the existing batch only metrics), especially for high
> > > > throughput
> > > > > >> > > connectors where batch sizes could be large. I guess we may
> > want
> > > > to
> > > > > do
> > > > > >> > some
> > > > > >> > > sort of performance testing and get concrete numbers to
> verify
> > > > > whether
> > > > > >> > this
> > > > > >> > > is a valid concern or not?
> > > > > >> > >
> > > > > >> >
> > > > > >> > 6.1. Got it, thanks for clarifying. I guess there could be a
> > > > benchmark
> > > > > >> test
> > > > > >> > of the `Sensor::record` to get an idea of the performance
> > impact.
> > > > > >> > Regardless, the fact that these are single-record metrics
> > compared
> > > > to
> > > > > >> > existing batch-only could be explicitly defined by setting
> these
> > > > > >> metrics at
> > > > > >> > a DEBUG or TRACE metric recording level, leaving the existing
> at
> > > > INFO
> > > > > >> > level.
> > > > > >> > wdyt?
> > > > > >> >
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Yash
> > > > > >> > >
> > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya
> <
> > > > > >> > > quilcate.jo...@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > > > Hi Sagar and Yash,
> > > > > >> > > >
> > > > > >> > > > > the way it's defined in
> > > > > >> > > >
> https://kafka.apache.org/documentation/#connect_monitoring
> > for
> > > > > the
> > > > > >> > > metrics
> > > > > >> > > >
> > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > >> > > >
> > > > > >> > > > > The only thing I would argue is do we need
> > > > > >> sink-record-latency-min?
> > > > > >> > > Maybe
> > > > > >> > > > we
> > > > > >> > > > > could remove this min metric as well and make all of the
> > 3 e2e
> > > > > >> > metrics
> > > > > >> > > > > consistent
> > > > > >> > > >
> > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > >> > > >
> > > > > >> > > > > Probably users can track the metrics at their end to
> > > > > >> > > > > figure that out. Do you think that makes sense?
> > > > > >> > > >
> > > > > >> > > > 4.3. Yes, agree. With these new metrics it should be
> easier
> > for
> > > > > >> users
> > > > > >> > to
> > > > > >> > > > track this.
> > > > > >> > > >
> > > > > >> > > > > I think it makes sense to not have a min metric for
> > either to
> > > > > >> remain
> > > > > >> > > > > consistent with the existing put-batch and poll-batch
> > metrics
> > > > > >> > > >
> > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > >> > > >
> > > > > >> > > > > Another naming related suggestion I had was with the
> > > > > >> > > > > "convert-time" metrics - we should probably include
> > > > > >> transformations
> > > > > >> > in
> > > > > >> > > > the
> > > > > >> > > > > name since SMTs could definitely be attributable to a
> > sizable
> > > > > >> chunk
> > > > > >> > of
> > > > > >> > > > the
> > > > > >> > > > > latency depending on the specific transformation chain.
> > > > > >> > > >
> > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > >> > `sink-record-convert-transform...`
> > > > > >> > > > and `source-record-transform-convert...` to represent
> > correctly
> > > > > the
> > > > > >> > order
> > > > > >> > > > of operations.
> > > > > >> > > >
> > > > > >> > > > > it seems like both source and sink tasks only record
> > metrics
> > > > at
> > > > > a
> > > > > >> > > "batch"
> > > > > >> > > > > level, not on an individual record level. I think it
> > might be
> > > > > >> > > additional
> > > > > >> > > > > overhead if we want to record these new metrics all at
> the
> > > > > record
> > > > > >> > > level?
> > > > > >> > > >
> > > > > >> > > > 5.3. I considered at the beginning to implement all
> metrics
> > at
> > > > the
> > > > > >> > batch
> > > > > >> > > > level, but given how the framework process records, I
> > fallback
> > > > to
> > > > > >> the
> > > > > >> > > > proposed approach:
> > > > > >> > > > - Sink Task:
> > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> iterates
> > over
> > > > > >> > records,
> > > > > >> > > > so there is no additional overhead to capture record
> > latency per
> > > > > >> > record.
> > > > > >> > > >     -
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)`
> > actually
> > > > > >> happens
> > > > > >> > > > individually. Measuring this operation per batch would
> > include
> > > > > >> > processing
> > > > > >> > > > that is not strictly part of "convert and transform"
> > > > > >> > > >     -
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > >> > > > - Source Task:
> > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a
> > batch
> > > > > and
> > > > > >> > > > applies transforms and convert record individually as
> well:
> > > > > >> > > >     -
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > >> > > >
> > > > > >> > > > > This might require some additional changes -
> > > > > >> > > > > for instance, with the "sink-record-latency" metric, we
> > might
> > > > > only
> > > > > >> > want
> > > > > >> > > > to
> > > > > >> > > > > have a "max" metric since "avg" would require recording
> a
> > > > value
> > > > > on
> > > > > >> > the
> > > > > >> > > > > sensor for each record (whereas we can get a "max" by
> only
> > > > > >> recording
> > > > > >> > a
> > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > >> > > >
> > > > > >> > > > 5.4. Recording record-latency per batch may not be as
> > useful as
> > > > > >> there
> > > > > >> > is
> > > > > >> > > no
> > > > > >> > > > guarantee that the oldest record will be representative of
> > the
> > > > > >> batch.
> > > > > >> > > >
> > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > yash.ma...@gmail.com>
> > > > > >> wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi Jorge and Sagar,
> > > > > >> > > > >
> > > > > >> > > > > I think it makes sense to not have a min metric for
> > either to
> > > > > >> remain
> > > > > >> > > > > consistent with the existing put-batch and poll-batch
> > metrics
> > > > > (it
> > > > > >> > > doesn't
> > > > > >> > > > > seem particularly useful either anyway). Also, the new
> > > > > >> > > > > "sink-record-latency" metric name looks fine to me,
> > thanks for
> > > > > >> making
> > > > > >> > > the
> > > > > >> > > > > changes! Another naming related suggestion I had was
> with
> > the
> > > > > >> > > > > "convert-time" metrics - we should probably include
> > > > > >> transformations
> > > > > >> > in
> > > > > >> > > > the
> > > > > >> > > > > name since SMTs could definitely be attributable to a
> > sizable
> > > > > >> chunk
> > > > > >> > of
> > > > > >> > > > the
> > > > > >> > > > > latency depending on the specific transformation chain.
> > > > > >> > > > >
> > > > > >> > > > > I have one high level question with respect to
> > implementation
> > > > -
> > > > > >> > > > currently,
> > > > > >> > > > > it seems like both source and sink tasks only record
> > metrics
> > > > at
> > > > > a
> > > > > >> > > "batch"
> > > > > >> > > > > level, not on an individual record level. I think it
> > might be
> > > > > >> > > additional
> > > > > >> > > > > overhead if we want to record these new metrics all at
> the
> > > > > record
> > > > > >> > > level?
> > > > > >> > > > > Could we instead make all of these new metrics for
> > batches of
> > > > > >> records
> > > > > >> > > > > rather than individual records in order to remain
> > consistent
> > > > > with
> > > > > >> the
> > > > > >> > > > > existing task level metrics? This might require some
> > > > additional
> > > > > >> > > changes -
> > > > > >> > > > > for instance, with the "sink-record-latency" metric, we
> > might
> > > > > only
> > > > > >> > want
> > > > > >> > > > to
> > > > > >> > > > > have a "max" metric since "avg" would require recording
> a
> > > > value
> > > > > on
> > > > > >> > the
> > > > > >> > > > > sensor for each record (whereas we can get a "max" by
> only
> > > > > >> recording
> > > > > >> > a
> > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > > Yash
> > > > > >> > > > >
> > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > sagarmeansoc...@gmail.com
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi Jorge,
> > > > > >> > > > > >
> > > > > >> > > > > > Thanks for the changes.
> > > > > >> > > > > >
> > > > > >> > > > > > Regarding the metrics, I meant something like this:
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > >> > > > > >
> > > > > >> > > > > > the way it's defined in
> > > > > >> > > > > >
> > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > for
> > > > > >> the
> > > > > >> > > > > > metrics.
> > > > > >> > > > > >
> > > > > >> > > > > > I see what you mean by the 3 metrics and how it can be
> > > > > >> interpreted.
> > > > > >> > > The
> > > > > >> > > > > > only thing I would argue is do we need
> > > > > sink-record-latency-min?
> > > > > >> > Maybe
> > > > > >> > > > we
> > > > > >> > > > > > could remove this min metric as well and make all of
> > the 3
> > > > e2e
> > > > > >> > > metrics
> > > > > >> > > > > > consistent(since put-batch also doesn't expose a min
> > which
> > > > > makes
> > > > > >> > > sense
> > > > > >> > > > to
> > > > > >> > > > > > me). I think this is in contrast to what Yash pointed
> > out
> > > > > above
> > > > > >> so
> > > > > >> > I
> > > > > >> > > > > would
> > > > > >> > > > > > like to hear his thoughts as well.
> > > > > >> > > > > >
> > > > > >> > > > > > The other point Yash mentioned about the slightly
> flawed
> > > > > >> definition
> > > > > >> > > of
> > > > > >> > > > > e2e
> > > > > >> > > > > > is also true in a sense. But I have a feeling that's
> > one the
> > > > > >> > records
> > > > > >> > > > are
> > > > > >> > > > > > polled by the connector tasks, it would be difficult
> to
> > > > track
> > > > > >> the
> > > > > >> > > final
> > > > > >> > > > > leg
> > > > > >> > > > > > via the framework. Probably users can track the
> metrics
> > at
> > > > > their
> > > > > >> > end
> > > > > >> > > to
> > > > > >> > > > > > figure that out. Do you think that makes sense?
> > > > > >> > > > > >
> > > > > >> > > > > > Thanks!
> > > > > >> > > > > > Sagar.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate
> > > > Otoya <
> > > > > >> > > > > > quilcate.jo...@gmail.com> wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > >> > > > > > >
> > > > > >> > > > > > > Thanks for your feedback!
> > > > > >> > > > > > >
> > > > > >> > > > > > > > 1) I am assuming the new metrics would be task
> level
> > > > > metric.
> > > > > >> > > > > > >
> > > > > >> > > > > > > 1.1 Yes, it will be a task level metric, implemented
> > on
> > > > the
> > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Could you specify the way it's done for other
> > > > sink/source
> > > > > >> > > > connector?
> > > > > >> > > > > > >
> > > > > >> > > > > > > 1.2. Not sure what do you mean by this. Could you
> > > > elaborate
> > > > > a
> > > > > >> bit
> > > > > >> > > > more?
> > > > > >> > > > > > >
> > > > > >> > > > > > > > 2. I am slightly confused about the e2e latency
> > > > metric...
> > > > > >> > > > > > >
> > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar
> > concept
> > > > as
> > > > > in
> > > > > >> > > > Streams
> > > > > >> > > > > > with
> > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > translatable.
> > > > > >> > > > > > > We could keep it as `sink-record-latency` to avoid
> > > > > conflating
> > > > > >> > > > > concepts. A
> > > > > >> > > > > > > similar metric naming was proposed in KIP-489 but at
> > the
> > > > > >> consumer
> > > > > >> > > > > level —
> > > > > >> > > > > > > though it seems dormant for a couple of years.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > >> > > > > > > > time to put a batch of records to external sink.
> > So, I
> > > > > would
> > > > > >> > > assume
> > > > > >> > > > > > the 2
> > > > > >> > > > > > > > can't be added as is to compute the e2e latency.
> > Maybe I
> > > > > am
> > > > > >> > > missing
> > > > > >> > > > > > > > something here. Could you plz clarify this.
> > > > > >> > > > > > >
> > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with
> the 3
> > > > > >> latencies
> > > > > >> > > > (poll,
> > > > > >> > > > > > > convert, putBatch) will be clearer where the
> > bottleneck
> > > > may
> > > > > >> be,
> > > > > >> > and
> > > > > >> > > > > > > represent the internal processing.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > however, as per the KIP it looks like it will be
> > > > > >> > > > > > > > the latency between when the record was written to
> > Kafka
> > > > > and
> > > > > >> > when
> > > > > >> > > > the
> > > > > >> > > > > > > > record is returned by a sink task's consumer's
> poll?
> > > > > >> > > > > > >
> > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > One more thing - I was wondering
> > > > > >> > > > > > > > if there's a particular reason for having a min
> > metric
> > > > for
> > > > > >> e2e
> > > > > >> > > > > latency
> > > > > >> > > > > > > but
> > > > > >> > > > > > > > not for convert time?
> > > > > >> > > > > > >
> > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which seems
> useful
> > to
> > > > > >> compare
> > > > > >> > > with
> > > > > >> > > > > > Max a
> > > > > >> > > > > > > get an idea of the window of results, though current
> > > > > >> latencies in
> > > > > >> > > > > > Connector
> > > > > >> > > > > > > do not include Min, and that's why I haven't added
> it
> > for
> > > > > >> convert
> > > > > >> > > > > > latency.
> > > > > >> > > > > > > Do you think it make sense to extend latency metrics
> > with
> > > > > Min?
> > > > > >> > > > > > >
> > > > > >> > > > > > > KIP is updated to clarify some of these changes.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Many thanks,
> > > > > >> > > > > > > Jorge.
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > >> yash.ma...@gmail.com>
> > > > > >> > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Hi Jorge,
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Thanks for the KIP! I have the same confusion with
> > > > > >> > > > > > > > the e2e-latency metrics as Sagar above. "e2e" would
> > > > > >> > > > > > > > seem to indicate the latency between when the record
> > > > > >> > > > > > > > was written to Kafka and when the record was written
> > > > > >> > > > > > > > to the sink system by the connector - however, as
> > > > > >> > > > > > > > per the KIP it looks like it will be the latency
> > > > > >> > > > > > > > between when the record was written to Kafka and
> > > > > >> > > > > > > > when the record is returned by a sink task's
> > > > > >> > > > > > > > consumer's poll? I think that metric will be a
> > > > > >> > > > > > > > little confusing to interpret. One more thing - I
> > > > > >> > > > > > > > was wondering if there's a particular reason for
> > > > > >> > > > > > > > having a min metric for e2e latency but not for
> > > > > >> > > > > > > > convert time?
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Thanks,
> > > > > >> > > > > > > > Yash
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > >> > sagarmeansoc...@gmail.com>
> > > > > >> > > > > > wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > Hi Jorge,
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks for the KIP. It looks like a very good
> > > > > >> > > > > > > > > addition. I skimmed through it once and had a
> > > > > >> > > > > > > > > couple of questions:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 1) I am assuming the new metrics would be
> > > > > >> > > > > > > > > task-level metrics. Could you specify them the
> > > > > >> > > > > > > > > way it's done for the other sink/source connector
> > > > > >> > > > > > > > > metrics?
> > > > > >> > > > > > > > > 2) I am slightly confused about the e2e latency
> > > > > >> > > > > > > > > metric. Let's consider the sink connector metric.
> > > > > >> > > > > > > > > The way it's supposed to be calculated, i.e. the
> > > > > >> > > > > > > > > difference between the record timestamp and the
> > > > > >> > > > > > > > > wall clock time, makes it look like a per-record
> > > > > >> > > > > > > > > metric. However, the put-batch time measures the
> > > > > >> > > > > > > > > time to put a batch of records to the external
> > > > > >> > > > > > > > > sink, so I would assume the two can't be added
> > > > > >> > > > > > > > > as-is to compute the e2e latency. Maybe I am
> > > > > >> > > > > > > > > missing something here. Could you please clarify
> > > > > >> > > > > > > > > this?
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks!
> > > > > >> > > > > > > > > Sagar.
> > > > > >> > > > > > > > >
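[Editor's note: the distinction raised above can be sketched as follows. All names are hypothetical illustrations, not Connect APIs: the e2e latency is a per-record sample (wall-clock time minus the record's timestamp), while put-batch time is a single measurement per put() call, so the two are not directly additive.]

```python
import time

def record_e2e_latency_ms(record_timestamp_ms, now_ms=None):
    # Per-record sample: time between the record's timestamp (set by the
    # producer or broker) and the moment the sink task observes it.
    # Hypothetical helper, not a Connect API.
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - record_timestamp_ms

def timed_put_ms(records, put_fn):
    # Per-batch sample: one put-batch time measurement for the whole batch,
    # which is why it cannot simply be added to each record's e2e latency.
    start = time.monotonic()
    put_fn(records)
    return (time.monotonic() - start) * 1000.0

# One e2e sample per record, one put-batch sample per batch:
batch = [{"timestamp_ms": 1_700_000_000_000},
         {"timestamp_ms": 1_700_000_000_500}]
e2e_samples = [record_e2e_latency_ms(r["timestamp_ms"]) for r in batch]
batch_ms = timed_put_ms(batch, lambda rs: None)  # stand-in for SinkTask.put()
```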
> > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban
> > Quilcate
> > > > > >> Otoya
> > > > > >> > <
> > > > > >> > > > > > > > > quilcate.jo...@gmail.com> wrote:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > > Hi all,
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > I'd like to start a discussion thread on
> > > > > >> > > > > > > > > > KIP-864: Add End-To-End Latency Metrics to
> > > > > >> > > > > > > > > > Connectors.
> > > > > >> > > > > > > > > > This KIP aims to improve the metrics available
> > > > > >> > > > > > > > > > on Source and Sink Connectors to measure
> > > > > >> > > > > > > > > > end-to-end latency, including source and sink
> > > > > >> > > > > > > > > > record conversion time, and sink record e2e
> > > > > >> > > > > > > > > > latency (similar to KIP-613 for Streams).
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > The KIP is here:
> > > > > >> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Please take a look and let me know what you
> > think.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Cheers,
> > > > > >> > > > > > > > > > Jorge.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> >
>