Hi Zakelly,

Kindly pinging here to see whether you have any remaining concerns regarding
the FLIP.

If there are no further questions or concerns from anyone, I plan to close
this discussion thread and proceed with the vote thread.

Thanks,
Weiqing

On Tue, Mar 3, 2026 at 2:58 PM Weiqing Yang <[email protected]>
wrote:

> Hi Zakelly,
>
>
> Thanks for the feedback and sorry for the late response - I am now picking
> it back up.
>
> You raised a great point about the performance overhead, referencing
> FLINK-16444 <https://issues.apache.org/jira/browse/FLINK-16444>. I've
> updated the FLIP to adopt the same counter-based sampling approach used by
> Flink's state latency tracking (FLINK-21736
> <https://issues.apache.org/jira/browse/FLINK-21736>). Specifically:
>
>   1. New config: table.exec.udf-metric.sample-interval (default: 100 [1]) -
>      only every Nth invocation is measured
>   2. Fast path: Non-sampled invocations are a single integer increment -
>      negligible overhead
>   3. Sampled path: System.nanoTime() around the UDF call, stored in a
>      DescriptiveStatisticsHistogram with a bounded 128-entry circular buffer [2]
>   4. Metric type change: udfProcessingTime is now a Histogram (reports
>      p50/p75/p95/p99/mean/min/max) instead of the original Gauge
>   5. Exception counting: Not sampled, since exceptions are rare events and
>      counting each one has negligible cost
>
> Combined with the existing feature gate (table.exec.udf-metric-enabled
> defaulting to false), users have two layers of protection: the feature is
> off by default, and when enabled, sampling keeps overhead minimal.
> The updated FLIP is here: link
> <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
>
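For illustration, the counter-based sampling and the feature gate described above could look roughly like the sketch below. It is only a sketch under assumptions: SampledUdfTimer and its methods are hypothetical names, not classes from Flink or the FLIP, and a plain long[] ring buffer stands in for DescriptiveStatisticsHistogram.

```java
import java.util.Arrays;
import java.util.function.Supplier;

/** Hypothetical helper for illustration only; not part of Flink or the FLIP. */
final class SampledUdfTimer {
    private final boolean enabled;      // mirrors table.exec.udf-metric-enabled
    private final int sampleInterval;   // mirrors table.exec.udf-metric.sample-interval
    private final long[] window;        // bounded circular buffer of latencies in ns
    private int sinceLastSample;        // invocations since the last measured call
    private long recorded;              // total number of samples ever recorded
    private long exceptionCount;        // counted on every failure, never sampled

    SampledUdfTimer(boolean enabled, int sampleInterval, int historySize) {
        if (sampleInterval < 1 || historySize < 1) {
            throw new IllegalArgumentException("interval and history size must be >= 1");
        }
        this.enabled = enabled;
        this.sampleInterval = sampleInterval;
        this.window = new long[historySize];
    }

    /** Runs the UDF call, timing only every sampleInterval-th invocation. */
    <T> T call(Supplier<T> udf) {
        if (!enabled) {
            return udf.get();               // feature gate off: no bookkeeping at all
        }
        try {
            if (++sinceLastSample < sampleInterval) {
                return udf.get();           // fast path: a single integer increment
            }
            sinceLastSample = 0;
            long start = System.nanoTime();
            T result = udf.get();
            // Sampled path: overwrite the oldest slot once the buffer is full.
            window[(int) (recorded++ % window.length)] = System.nanoTime() - start;
            return result;
        } catch (RuntimeException e) {
            exceptionCount++;               // every exception is counted
            throw e;
        }
    }

    int sampleCount() {
        return (int) Math.min(recorded, window.length);
    }

    long exceptionCount() {
        return exceptionCount;
    }

    /** Nearest-rank percentile over the current window; 0 if nothing was sampled. */
    long percentile(double p) {
        int n = sampleCount();
        if (n == 0) {
            return 0L;
        }
        long[] sorted = Arrays.copyOf(window, n);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * n);
        return sorted[Math.max(0, Math.min(n - 1, rank - 1))];
    }
}
```

With an interval of 100 and a history size of 128, 1,000 calls through call(...) would record 10 samples, while a thrown exception is counted regardless of whether that invocation was sampled. The actual implementation would register a Histogram with Flink's metric group instead of exposing percentile() directly.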
> Would this address your concern? If so, it would be great to have your
> vote on the vote thread [3].
>
> [1] 100: state.latency-track.sample-interval default value
> [2] 128: state.latency-track.history-size default value (line 55), which
> is the circular buffer size for the DescriptiveStatisticsHistogram
> [3] https://lists.apache.org/thread/d0sv36839p5h03t3okv89pco2jy6vbg3
>
> Thanks,
> Weiqing
>
> On Thu, Aug 21, 2025 at 12:24 AM Zakelly Lan <[email protected]>
> wrote:
>
>> Hi Weiqing,
>>
>> Sorry for the late reply. I have one question:
>>
>> I'm wondering whether the UDF processing time is measured for every
>> individual UDF invocation, with the average then reported, or if sampling
>> is used instead? I'm concerned about the potential overhead if we measure
>> every single invocation. We've encountered similar performance issues when
>> implementing state latency tracking [1].
>>
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-16444
>>
>> Best,
>> Zakelly
>>
>> On Fri, Aug 15, 2025 at 5:04 AM Weiqing Yang <[email protected]>
>> wrote:
>>
>> > Cool - I’ll proceed to start the VOTE.
>> > Thanks!
>> >
>> > Weiqing
>> >
>> > On Thu, Aug 14, 2025 at 12:53 AM Shengkai Fang <[email protected]> wrote:
>> >
>> > > I don't have any more comments.
>> > >
>> > > Best,
>> > > Shengkai
>> > >
>> > > Weiqing Yang <[email protected]> wrote on Thu, Aug 14, 2025 at 14:47:
>> > >
>> > > > Thanks, Shengkai. I’ve updated the proposal doc with the recommended
>> > > > configuration name. Please let me know if you have any additional feedback.
>> > > >
>> > > > Best,
>> > > > Weiqing
>> > > >
>> > > > On Wed, Aug 13, 2025 at 6:58 PM Shengkai Fang <[email protected]> wrote:
>> > > >
>> > > > > Sorry for the late response. I prefer to use
>> > > > > `table.exec.udf-metric-enabled` as the option name.
>> > > > >
>> > > > > Best,
>> > > > > Shengkai
>> > > > >
>> > > > > Weiqing Yang <[email protected]> wrote on Wed, Aug 13, 2025 at 23:54:
>> > > > >
>> > > > > > Hi Shengkai, Alan, Xuyang, and all,
>> > > > > >
>> > > > > > Since there have been no further objections, I’ll proceed to start the VOTE
>> > > > > > on this proposal shortly.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Weiqing
>> > > > > >
>> > > > > > On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <[email protected]> wrote:
>> > > > > >
>> > > > > > > Hi Shengkai, Alan and Xuyang,
>> > > > > > >
>> > > > > > > Just checking in - do you have any concerns or feedback?
>> > > > > > >
>> > > > > > > If there are no further objections from anyone, I’ll mark the FLIP as
>> > > > > > > ready for voting.
>> > > > > > >
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Weiqing
>> > > > > > >
>> > > > > > >
>> > > > > > > On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <[email protected]> wrote:
>> > > > > > >
>> > > > > > >> Hi Xuyang,
>> > > > > > >>
>> > > > > > >> Thank you for reviewing the proposal!
>> > > > > > >>
>> > > > > > >> I’m planning to use: *udf.metrics.process-time* and
>> > > > > > >> *udf.metrics.exception-count*. These follow the naming convention used
>> > > > > > >> in Flink (e.g., RocksDB native metrics
>> > > > > > >> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
>> > > > > > >> I’ve added these names to the proposal doc.
>> > > > > > >>
>> > > > > > >> Alternatively, I also considered: *metrics.udf.process-time.enabled* and
>> > > > > > >> *metrics.udf.exception-count.enabled*.
>> > > > > > >>
>> > > > > > >> Happy to hear any feedback on which style might be more appropriate.
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> Best,
>> > > > > > >> Weiqing
>> > > > > > >>
>> > > > > > >> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <[email protected]> wrote:
>> > > > > > >>
>> > > > > > >>> Hi, Weiqing.
>> > > > > > >>>
>> > > > > > >>> Thanks for driving this improvement. I just have one question. I noticed
>> > > > > > >>> that this FLIP introduces a new configuration option, and I wonder what
>> > > > > > >>> its name is. Could you please include the full name of this
>> > > > > > >>> configuration option (similar to the other names in MetricOptions)?
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> --
>> > > > > > >>>
>> > > > > > >>>     Best!
>> > > > > > >>>     Xuyang
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> At 2025-07-13 12:03:59, "Weiqing Yang" <[email protected]> wrote:
>> > > > > > >>> >Hi Alan,
>> > > > > > >>> >
>> > > > > > >>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
>> > > > > > >>> >work.
>> > > > > > >>> >
>> > > > > > >>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE.
>> > > > > > >>> >For async UDFs, the plan is to instrument both the invokeAsync() call and
>> > > > > > >>> >the async callback handler to measure the full end-to-end latency until the
>> > > > > > >>> >result or error is returned from the future.
>> > > > > > >>> >
>> > > > > > >>> >Let me know if you have any further questions or suggestions.
>> > > > > > >>> >
>> > > > > > >>> >Best,
>> > > > > > >>> >Weiqing
>> > > > > > >>> >
>> > > > > > >>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
>> > > > > > >>> ><[email protected]> wrote:
>> > > > > > >>> >
>> > > > > > >>> >> Hi Weiqing,
>> > > > > > >>> >>
>> > > > > > >>> >> From your doc, the entrypoint for UDF calls in the codegen is
>> > > > > > >>> >> ExprCodeGenerator which should invoke BridgingSqlFunctionCallGen, which
>> > > > > > >>> >> could be instrumented with metrics. This works well for synchronous calls,
>> > > > > > >>> >> but what about ASYNC_SCALAR and the soon to be merged ASYNC_TABLE
>> > > > > > >>> >> (https://github.com/apache/flink/pull/26567)? Timing metrics would only
>> > > > > > >>> >> account for what it takes to call invokeAsync, not for the result to
>> > > > > > >>> >> complete (with a result or error from the future object).
>> > > > > > >>> >>
>> > > > > > >>> >> There are appropriate places which can handle the async callbacks, but they
>> > > > > > >>> >> are in other locations. Will you be able to support those as well?
>> > > > > > >>> >>
>> > > > > > >>> >> Thanks,
>> > > > > > >>> >> Alan
>> > > > > > >>> >>
>> > > > > > >>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <[email protected]> wrote:
>> > > > > > >>> >>
>> > > > > > >>> >> > I just have some questions:
>> > > > > > >>> >> >
>> > > > > > >>> >> > 1. The current metrics hierarchy shows that the UDF metric group belongs to
>> > > > > > >>> >> > the TaskMetricGroup. I think it would be better for the UDF metric group to
>> > > > > > >>> >> > belong to the OperatorMetricGroup instead, because a UDF might be used by
>> > > > > > >>> >> > multiple operators.
>> > > > > > >>> >> > 2. What are the naming conventions for UDF metrics? Could you provide an
>> > > > > > >>> >> > example? Does the metric name contain the UDF name?
>> > > > > > >>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
>> > > > > > >>> >> > exception, the job fails immediately. Why do we need to track this value?
>> > > > > > >>> >> >
>> > > > > > >>> >> > Best
>> > > > > > >>> >> > Shengkai
>> > > > > > >>> >> >
>> > > > > > >>> >> >
>> > > > > > >>> >> > Weiqing Yang <[email protected]> wrote on Wed, Jul 9, 2025 at 12:59:
>> > > > > > >>> >> >
>> > > > > > >>> >> > > Hi all,
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > I’d like to initiate a discussion about adding UDF metrics.
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > *Motivation*
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > User-defined functions (UDFs) are essential for custom logic in Flink jobs
>> > > > > > >>> >> > > but often act as black boxes, making debugging and performance tuning
>> > > > > > >>> >> > > difficult. When issues like high latency or frequent exceptions occur, it's
>> > > > > > >>> >> > > hard to pinpoint the root cause inside UDFs.
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Flink currently lacks built-in metrics for key UDF aspects such as
>> > > > > > >>> >> > > per-record processing time or exception count. This limits observability
>> > > > > > >>> >> > > and complicates:
>> > > > > > >>> >> > >
>> > > > > > >>> >> > >    - Debugging production issues
>> > > > > > >>> >> > >    - Performance tuning and resource allocation
>> > > > > > >>> >> > >    - Supplying reliable signals to autoscaling systems
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Introducing standard, opt-in UDF metrics will improve platform
>> > > > > > >>> >> > > observability and overall health.
>> > > > > > >>> >> > > Here’s the proposal document: Link
>> > > > > > >>> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Your feedback and ideas are welcome to refine this feature.
>> > > > > > >>> >> > >
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Thanks,
>> > > > > > >>> >> > > Weiqing
>> > > > > > >>> >> > >
>> > > > > > >>> >> >
>> > > > > > >>> >>
>> > > > > > >>>
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
