Hi Zakelly,

Kindly pinging here to see if you had any remaining concerns regarding the FLIP.

If there are no further questions or concerns from anyone, I plan to close this discussion thread and proceed with the vote thread.

Thanks,
Weiqing

On Tue, Mar 3, 2026 at 2:58 PM Weiqing Yang <[email protected]> wrote:

> Hi Zakelly,
>
> Thanks for the feedback and sorry for the late response - I am now picking it back up.
>
> You raised a great point about the performance overhead, referencing FLINK-16444 <https://issues.apache.org/jira/browse/FLINK-16444>. I've updated the FLIP to adopt the same counter-based sampling approach used by Flink's state latency tracking (FLINK-21736 <https://issues.apache.org/jira/browse/FLINK-21736>). Specifically:
>
> 1. New config: table.exec.udf-metric.sample-interval (default: 100 [1]) - only every Nth invocation is measured
> 2. Fast path: Non-sampled invocations are a single integer increment - negligible overhead
> 3. Sampled path: System.nanoTime() around the UDF call, stored in a DescriptiveStatisticsHistogram with a bounded 128-entry circular buffer [2]
> 4. Metric type change: udfProcessingTime is now a Histogram (reports p50/p75/p95/p99/mean/min/max) instead of the original Gauge
> 5. Exception counting: Not sampled, since exceptions are rare events and counting each one has negligible cost
>
> Combined with the existing feature gate (table.exec.udf-metric-enabled defaulting to false), users have two layers of protection: the feature is off by default, and when enabled, sampling keeps overhead minimal.
>
> The updated FLIP is here: link <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
>
> Would this address your concern? If so, it would be great to have your vote on the vote thread [3].
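
For readers following the thread, the counter-based sampling in items 1-3 and 5 above could look roughly like the minimal Java sketch below. It is an illustration under stated assumptions, not the FLIP's implementation: SampledUdfTimer and its methods are hypothetical names, a plain long[] ring buffer stands in for DescriptiveStatisticsHistogram, and the sketch assumes single-threaded use.

```java
// Illustrative sketch only: SampledUdfTimer is a hypothetical name, not a Flink class.
final class SampledUdfTimer {
    private final int sampleInterval;      // assumed to come from table.exec.udf-metric.sample-interval
    private final long[] latenciesNanos;   // bounded circular buffer (stand-in for the histogram)
    private int sinceLastSample;           // invocations seen since the last timed call
    private long recordedSamples;          // total number of timed calls so far
    private long exceptionCount;           // every exception is counted, never sampled

    SampledUdfTimer(int sampleInterval, int historySize) {
        if (sampleInterval < 1 || historySize < 1) {
            throw new IllegalArgumentException("sampleInterval and historySize must be >= 1");
        }
        this.sampleInterval = sampleInterval;
        this.latenciesNanos = new long[historySize];
    }

    // Wraps one UDF call. Only every Nth call is timed with System.nanoTime().
    <T> T invoke(java.util.function.Supplier<T> udf) {
        // Fast path: for non-sampled calls the only bookkeeping is this increment.
        boolean sampled = ++sinceLastSample >= sampleInterval;
        long start = 0L;
        if (sampled) {
            sinceLastSample = 0;
            start = System.nanoTime();
        }
        try {
            return udf.get();
        } catch (RuntimeException e) {
            exceptionCount++;   // counted on every call, sampled or not
            throw e;
        } finally {
            if (sampled) {
                record(System.nanoTime() - start);
            }
        }
    }

    // Overwrites the oldest entry once the buffer is full.
    private void record(long nanos) {
        latenciesNanos[(int) (recordedSamples % latenciesNanos.length)] = nanos;
        recordedSamples++;
    }

    long recordedSamples() {
        return recordedSamples;
    }

    int retainedSamples() {
        return (int) Math.min(recordedSamples, (long) latenciesNanos.length);
    }

    long exceptionCount() {
        return exceptionCount;
    }
}
```

With a sample interval of 100, invocations 1 to 99 only increment the counter and the 100th is timed, after which the counter resets. A real histogram would additionally derive percentiles from the retained samples, which this sketch leaves out.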
>
> [1] 100: state.latency-track.sample-interval default value
>
> [2] 128: state.latency-track.history-size default value (line 55), which is the circular buffer size for the DescriptiveStatisticsHistogram
>
> [3] https://lists.apache.org/thread/d0sv36839p5h03t3okv89pco2jy6vbg3
>
> Thanks,
> Weiqing
>
> On Thu, Aug 21, 2025 at 12:24 AM Zakelly Lan <[email protected]> wrote:
>
>> Hi Weiqing,
>>
>> Sorry for the late reply. And I have one question:
>>
>> I'm wondering whether the UDF processing time is measured for every individual UDF invocation, with the average then reported, or if sampling is used instead? I'm concerned about the potential overhead if we measure every single invocation. We've encountered similar performance issues when implementing state latency tracking [1].
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-16444
>>
>> Best,
>> Zakelly
>>
>> On Fri, Aug 15, 2025 at 5:04 AM Weiqing Yang <[email protected]> wrote:
>>
>> > Cool - I’ll proceed to start the VOTE.
>> > Thanks!
>> >
>> > Weiqing
>> >
>> > On Thu, Aug 14, 2025 at 12:53 AM Shengkai Fang <[email protected]> wrote:
>> >
>> > > I don't have any more comments.
>> > >
>> > > Best,
>> > > Shengkai
>> > >
>> > > Weiqing Yang <[email protected]> wrote on Thu, Aug 14, 2025 at 14:47:
>> > >
>> > > > Thanks, Shengkai. I’ve updated the proposal doc with the recommended configuration name. Please let me know if you have any additional feedback.
>> > > >
>> > > > Best,
>> > > > Weiqing
>> > > >
>> > > > On Wed, Aug 13, 2025 at 6:58 PM Shengkai Fang <[email protected]> wrote:
>> > > >
>> > > > > Sorry for the late response. I prefer to use `table.exec.udf-metric-enabled` as the option name.
>> > > > >
>> > > > > Best,
>> > > > > Shengkai
>> > > > >
>> > > > > Weiqing Yang <[email protected]> wrote on Wed, Aug 13, 2025 at 23:54:
>> > > > >
>> > > > > > Hi Shengkai, Alan, Xuyang, and all,
>> > > > > >
>> > > > > > Since there have been no further objections, I’ll proceed to start the VOTE on this proposal shortly.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Weiqing
>> > > > > >
>> > > > > > On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <[email protected]> wrote:
>> > > > > >
>> > > > > > > Hi Shengkai, Alan and Xuyang,
>> > > > > > >
>> > > > > > > Just checking in - do you have any concerns or feedback?
>> > > > > > >
>> > > > > > > If there are no further objections from anyone, I’ll mark the FLIP as ready for voting.
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Weiqing
>> > > > > > >
>> > > > > > > On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <[email protected]> wrote:
>> > > > > > >
>> > > > > > >> Hi Xuyang,
>> > > > > > >>
>> > > > > > >> Thank you for reviewing the proposal!
>> > > > > > >>
>> > > > > > >> I’m planning to use: *udf.metrics.process-time* and *udf.metrics.exception-count*. These follow the naming convention used in Flink (e.g., RocksDB native metrics <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>). I’ve added these names to the proposal doc.
>> > > > > > >>
>> > > > > > >> Alternatively, I also considered: *metrics.udf.process-time.enabled* and *metrics.udf.exception-count.enabled*.
>> > > > > > >>
>> > > > > > >> Happy to hear any feedback on which style might be more appropriate.
>> > > > > > >>
>> > > > > > >> Best,
>> > > > > > >> Weiqing
>> > > > > > >>
>> > > > > > >> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <[email protected]> wrote:
>> > > > > > >>
>> > > > > > >>> Hi, Weiqing.
>> > > > > > >>>
>> > > > > > >>> Thanks for driving to improve this. I just have one question. I notice a new configuration is introduced in this flip. I just wonder what the configuration name is. Could you please include the full name of this configuration? (just similar to the other names in MetricOptions?)
>> > > > > > >>>
>> > > > > > >>> --
>> > > > > > >>>
>> > > > > > >>> Best!
>> > > > > > >>> Xuyang
>> > > > > > >>>
>> > > > > > >>> At 2025-07-13 12:03:59, "Weiqing Yang" <[email protected]> wrote:
>> > > > > > >>> >Hi Alan,
>> > > > > > >>> >
>> > > > > > >>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE work.
>> > > > > > >>> >
>> > > > > > >>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE. For async UDFs, the plan is to instrument both the invokeAsync() call and the async callback handler to measure the full end-to-end latency until the result or error is returned from the future.
>> > > > > > >>> >
>> > > > > > >>> >Let me know if you have any further questions or suggestions.
>> > > > > > >>> >
>> > > > > > >>> >Best,
>> > > > > > >>> >Weiqing
>> > > > > > >>> >
>> > > > > > >>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg <[email protected]> wrote:
>> > > > > > >>> >
>> > > > > > >>> >> Hi Weiqing,
>> > > > > > >>> >>
>> > > > > > >>> >> From your doc, the entrypoint for UDF calls in the codegen is ExprCodeGenerator which should invoke BridgingSqlFunctionCallGen, which could be instrumented with metrics. This works well for synchronous calls, but what about ASYNC_SCALAR and the soon to be merged ASYNC_TABLE (https://github.com/apache/flink/pull/26567)? Timing metrics would only account for what it takes to call invokeAsync, not for the result to complete (with a result or error from the future object).
>> > > > > > >>> >>
>> > > > > > >>> >> There are appropriate places which can handle the async callbacks, but they are in other locations. Will you be able to support those as well?
>> > > > > > >>> >>
>> > > > > > >>> >> Thanks,
>> > > > > > >>> >> Alan
>> > > > > > >>> >>
>> > > > > > >>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <[email protected]> wrote:
>> > > > > > >>> >>
>> > > > > > >>> >> > I just have some questions:
>> > > > > > >>> >> >
>> > > > > > >>> >> > 1. The current metrics hierarchy shows that the UDF metric group belongs to the TaskMetricGroup. I think it would be better for the UDF metric group to belong to the OperatorMetricGroup instead, because a UDF might be used by multiple operators.
>> > > > > > >>> >> > 2. What are the naming conventions for UDF metrics? Could you provide an example? Does the metric name contain the UDF name?
>> > > > > > >>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an exception, the job fails immediately. Why do we need to track this value?
>> > > > > > >>> >> >
>> > > > > > >>> >> > Best
>> > > > > > >>> >> > Shengkai
>> > > > > > >>> >> >
>> > > > > > >>> >> > Weiqing Yang <[email protected]> wrote on Wed, Jul 9, 2025 at 12:59:
>> > > > > > >>> >> >
>> > > > > > >>> >> > > Hi all,
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > I’d like to initiate a discussion about adding UDF metrics.
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > *Motivation*
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > User-defined functions (UDFs) are essential for custom logic in Flink jobs but often act as black boxes, making debugging and performance tuning difficult. When issues like high latency or frequent exceptions occur, it's hard to pinpoint the root cause inside UDFs.
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Flink currently lacks built-in metrics for key UDF aspects such as per-record processing time or exception count. This limits observability and complicates:
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > - Debugging production issues
>> > > > > > >>> >> > > - Performance tuning and resource allocation
>> > > > > > >>> >> > > - Supplying reliable signals to autoscaling systems
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Introducing standard, opt-in UDF metrics will improve platform observability and overall health.
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Here’s the proposal document: Link <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Your feedback and ideas are welcome to refine this feature.
>> > > > > > >>> >> > >
>> > > > > > >>> >> > > Thanks,
>> > > > > > >>> >> > > Weiqing
