Re: [DISCUSS] FLIP-XXX Support currentFetchEventTimeLag and processingLag metrics

jialiang tan Wed, 24 Apr 2024 05:00:32 -0700

Hi lorenzo, thanks for your feedback!


> There can be major discrepancies between the absolute time got by the
> TaskManagers (if clocks are not synchronized via ntp for example), and the
> results of the metric might be quite distant for different TMs.


I don't know much about this. can more experienced contributors help? If it
exists, perhaps `currentEmitEventTimeLag` has the same problem?

Furthermore, comparing the time on the Flink cluster with the event time of
> records might introduce completely inaccurate results.


I think this has been discussed in the FLIP-33 lists thread[2].

2 - I don't think the name `processingLag` represents the processing time
> spent, I would rather see `processingTime` just for the semantics of the
> name itself.


+1, it makes sense to me.

3 - Do you really think the `processingTime` should be a gauge? I
> understand your justification for the fetch lag, but I think the processing
> time should be an histogram. For the inefficiency of this, how about some
> sampling (e.g.: only update the histogram 1 every 1000 events?)


I went back to the discussion of the FLIP-33 lists thread[2]. At first,
`emitEventTimeLag` and `fetchEventTimeLag` were defined as histogram, and
they felt that histogram was too expensive. So they imported
`currentFetchEventTimeLag` and `currentEmitEventTimeLag` instead, and put
`fetchEventTimeLag` and `emitEventTimeLag` in the future work. I think we
need to create another FLIP to discuss it. Now I want to change
`processingTime` to `currentProcessingTime`, still using gauge, and remain
`processingTime` in the future work, WDYT?

4 - At this point, if we have the processing time and number of records, we
> could also add throughput as a metric, so that the user would know how many
> records/second the source is able to produce.


Do you mean `numRecordsInPerSecond` and `numRecordsOutPerSecond`? They were
defined in FLIP-33[1] and used in Flink-1.14.

5 - For the "Kafka Connector" section: can this be generalized for
> connectors in general? Can you provide an example to better understand your
> statement about reflection?


Good idea, I have extended my FLIP.

6 - Does this introduce any UI change for representing the metric?


I'm just adding some new metrics, users can get them from REST API or
collect them into Prometheus, and no change for Flink WEB UI.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
[2] https://lists.apache.org/thread/r47zrqto4k8tsc3xvfg392zblfx6dczl

Best,
TanJiaLiang

<lorenzo.affe...@ververica.com.invalid> 于2024年4月24日周三 15:02写道：

> Hello jialiang tan, and thank you for your contribution!
>
> Here are my questions:
>
> 1 - I don't know if exposing this as metric to the user could create some
> harm. There can be major discrepancies between the absolute time got by the
> TaskManagers (if clocks are not synchronized via ntp for example), and the
> results of the metric might be quite distant for different TMs.
> Furthermore, comparing the time on the Flink cluster with the event time of
> records might introduce completely inaccurate results. I think providing
> this should come with many disclaimers to the user. Maybe, more experienced
> contributors can comment on this as well.
>
> 2 - I don't think the name `processingLag` represents the processing time
> spent, I would rather see `processingTime` just for the semantics of the
> name itself.
>
> 3 - Do you really think the `processingTime` should be a gauge? I
> understand your justification for the fetch lag, but I think the processing
> time should be an histogram. For the inefficiency of this, how about some
> sampling (e.g.: only update the histogram 1 every 1000 events?)
>
> 4 - At this point, if we have the processing time and number of records,
> we could also add throughput as a metric, so that the user would know how
> many records/second the source is able to produce.
>
> 5 - For the "Kafka Connector" section: can this be generalized for
> connectors in general? Can you provide an example to better understand your
> statement about reflection?
>
> 6 - Does this introduce any UI change for representing the metric?
>
> Thank you!
> On Apr 22, 2024 at 12:26 +0200, jialiang tan <tanjialiang1...@gmail.com>,
> wrote:
> > Sorry all, it seems bad formatting in my email message, now I send it
> again
> > gently and hope it work.
> >
> > I would like to start a discussion about FLIP-XXX:
> > SupportcurrentFetchEventTimeLag and processingLag metrics [1].
> >
> > The main motivation for this change was that I had some difficulties
> > inimplementing the currentFetchEventTimeLag metrics for KafkaSource [2].
> >
> > So I proposed to let the SourceReaderMetricGroup provide an interface to
> > capturethe FetchTime so that all the FLIP-27 [3] sources can easily
> > implement thecurrentFetchEventTimeLag metrics.
> >
> > In addition, I propose to support the processingLag metric for the
> > FLIP-27sources to measure the current processing latency of the source.
> >
> > See the FLIP [1] and Jira [2] for more details.
> >
> > Looking forward to your comments and opinions!
> >
> > Thanks,
> > TanJiaLiang.
> >
> > [1]
> >
> https://docs.google.com/document/d/1nPhh1A-v-a7zyQyl1A5-K5DeUqbfxNXdjr2TVBT-QMs/edit?usp=sharing
> > [2] https://issues.apache.org/jira/browse/FLINK-33173
> > [3]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
> >
> > >
>

Re: [DISCUSS] FLIP-XXX Support currentFetchEventTimeLag and processingLag metrics

Reply via email to