Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Yufei Gu Mon, 16 Jun 2025 17:42:18 -0700

Thanks for the summary, Ajantha!

Multi-statement UDFs are definitely useful, but whether those statements
run within a single transaction should be treated as an engine-level
concern. The Iceberg UDF spec can spell out the expectation, yet the actual
guarantee still depends on the runtime. Even if a UDF declares itself
transactional, the engine may or may not enforce it.
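
A rough sketch of that split, in Python: the metadata only records what the UDF declares, and each engine decides whether it can honor the declaration. The `transactional` property name, the `UdfMetadata` class, and the `Engine` class are all hypothetical names for illustration; none of them come from the proposal document.

```python
from dataclasses import dataclass, field


@dataclass
class UdfMetadata:
    """Hypothetical stand-in for a stored UDF definition (not the proposed spec)."""
    name: str
    statements: list[str]
    # Declared expectation only; the metadata itself cannot enforce it.
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class Engine:
    """Hypothetical engine; `supports_transactions` models the runtime capability."""
    name: str
    supports_transactions: bool

    def execution_mode(self, udf: UdfMetadata) -> str:
        declared = udf.properties.get("transactional", "false") == "true"
        if declared and self.supports_transactions:
            return "single-transaction"
        if declared:
            # The declaration is visible, but this runtime cannot guarantee it.
            return "best-effort"
        return "per-statement"


udf = UdfMetadata(
    name="example_udf",
    statements=["stmt_1", "stmt_2"],  # placeholder statements
    properties={"transactional": "true"},
)
```

The same stored definition yields different behavior per engine, which is why the guarantee cannot live in the spec alone.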


One more thing: should we also introduce a “secure UDF” option supported by
some engines[1], so the body and any sensitive details stay hidden from
callers?

[1] https://docs.snowflake.com/en/developer-guide/secure-udf-procedure
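
One way to picture the secure option, as a sketch only: the catalog returns the full definition to the owner and a redacted copy to everyone else. The `secure` flag, the `describe_udf` helper, and the dictionary layout are assumptions for illustration; they are not part of the proposal or of Snowflake's API.

```python
def describe_udf(udf: dict, caller: str) -> dict:
    """Return the UDF definition as `caller` is allowed to see it (hypothetical helper)."""
    if not udf.get("secure", False) or caller == udf["owner"]:
        return dict(udf)
    # Hide the body from non-owners; keep what is needed to call the function.
    visible = {k: v for k, v in udf.items() if k != "body"}
    visible["body"] = None
    return visible


udf = {
    "name": "mask_email",
    "owner": "alice",
    "secure": True,  # assumed flag name
    "signature": "(email string) -> string",
    "body": "regexp_replace(email, '^[^@]+', '***')",
}
```

Whether redaction happens in the catalog or in the engine is an open design choice; this sketch places it in a catalog-side helper purely for brevity.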

Yufei


On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat <[email protected]> wrote:

> Thanks to everyone who joined the sync.
> Here is the meeting recording:
> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing
> Summary:
>
>    - We went through the SQL UDF syntax supported by different
>    engines (Snowflake, Databricks, Dremio, Trino, OSS Spark 4.0).
>    - Each engine uses its own block separator, like $$ or '' or none.
>    Action item was to check whether engines support multi-statement
>    (transactional) UDF bodies.
>    - Discussed function overloading. Need to check whether these
>    engines support function overloading for SQL UDFs. Postgres supports it! If
>    yes, we need to adapt the spec to handle it.
>    - Started the online spec review and discussed the deterministic flag. We
>    concluded that we keep independent fields (like deterministic) in the spec
>    only if the majority of engines support them. Otherwise they will be passed
>    in a property bag (engine specific), and it is the engine's responsibility to
>    honor those optional properties.
>
> Feel free to review the current proposal document here
> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>
> Final spec will be put to review and vote once it is ready.
>
> Details for next Iceberg UDF sync:
>
> *Monday, June 30 · 9:00 – 10:00am*
> Time zone: America/Los_Angeles
> Google Meet joining info
> Video call link: https://meet.google.com/aui-czix-nbh
>
> - Ajantha
>
> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat <[email protected]> wrote:
>
>> Thanks to everyone who joined the sync.
>> Here is the meeting recording:
>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing
>>
>> Summary:
>>
>>    - We discussed including Python support; the majority agreed *not to*
>>    (see recording for details).
>>    - No strong opposition to versioning; it will be included to support
>>    change tracking and similar use cases.
>>    - Suggestions were made to document how each catalog resolves UDFs,
>>    similar to views and tables.
>>    - We agreed not to deviate from the existing table/view spec (e.g.,
>>    location will remain *required* for cross-catalog compatibility).
>>    - We also briefly discussed view interoperability, since the same
>>    considerations apply here.
>>
>>    Feel free to review the proposal document here:
>>    <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0>
>>    With the current scope, it is similar to the view/table spec now.
>>    Final spec will be put to review and vote once it is ready.
>>
>> Details for next Iceberg UDF sync:
>>
>> *Monday, June 16 · 9:00 – 10:00am*
>> Time zone: America/Los_Angeles
>> Google Meet joining info
>> Video call link: https://meet.google.com/aui-czix-nbh
>>
>> - Ajantha
>>
>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <[email protected]> wrote:
>>
>>> Hi folks,
>>>
>>> We’ve set up a dedicated bi-weekly community sync for the UDF project.
>>> Everyone’s welcome to drop in and share ideas! Here is the meeting link:
>>>
>>> Iceberg UDF sync
>>> Monday, June 2 · 9:00 – 10:00am
>>> Time zone: America/Los_Angeles
>>> Google Meet joining info
>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>
>>> Yufei
>>>
>>>
>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat <[email protected]>
>>> wrote:
>>>
>>>> Update on the progress.
>>>>
>>>> I had a meeting today with Yufei and Yun.zou to discuss the UDF
>>>> proposal. We covered several key points, though some are still open for
>>>> further discussion:
>>>>
>>>> a) *UDF Versioning*: Do we truly need versioning for UDFs at this
>>>> stage? We explored the possibility of simplifying the specification by
>>>> avoiding view replication, and potentially introducing versioning support
>>>> later. UDTFs, being a superset of views in some ways, may not require
>>>> versioning initially.
>>>>
>>>> b) *VarArgs Support*: While some query engines may not support vararg
>>>> syntax in CREATE FUNCTION, Iceberg UDFs could represent such arguments
>>>> as lists when supported by the engine.
>>>>
>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t support generic
>>>> types (e.g., object), we can only map engine-specific types to Iceberg
>>>> types. As a result, generic data types will not be supported in the initial
>>>> version.
>>>>
>>>> d) *Python Support*: Incorporating Python as a language for SQL UDFs
>>>> seems promising, especially given its potential to resolve interoperability
>>>> challenges. Some engines, however, require platform version and package
>>>> dependency details to execute Python code—this should be captured in the
>>>> specification.
>>>>
>>>> *Next Steps*
>>>> I will update the proposal document with two primary UDF use cases:
>>>>
>>>>    - Policy exchange between engines
>>>>    - UDTF as a superset of view functionality
>>>>
>>>> The update will include corresponding syntax examples in both SQL and
>>>> Python, and detail how each use case is represented in Iceberg metadata.
>>>>
>>>> We also plan to set up regular syncs (open to more interested
>>>> participants) to continue refining and finalizing the UDF specification.
>>>> - Ajantha
>>>>
>>>>
>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I've updated the design document[1] based on the previous comments.
>>>>> Additionally, I've included the SQL UDF syntax supported by various
>>>>> vendors, including Dremio, Snowflake, Databricks, and Trino.
>>>>>
>>>>> I'm happy to schedule a separate sync if a deeper discussion is
>>>>> needed. Let's keep moving forward, especially with the renewed interest
>>>>> from the community.
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing
>>>>>
>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> During the last catalog community sync, there was significant
>>>>>> interest in storing UDFs in Iceberg and adding endpoints for UDF handling
>>>>>> in the REST catalog spec.
>>>>>>
>>>>>> I recently discussed this with Yufei to better understand the new
>>>>>> requirement of using UDFs for fine-grained access control policies. This
>>>>>> expands the use cases beyond just versioned and interoperable UDFs.
>>>>>> Additionally, I learnt that many vendors are interested in this feature.
>>>>>>
>>>>>> Given the strong community interest and support, I’d like to take
>>>>>> ownership of this effort and revive the work. I'll be revisiting the
>>>>>> document I proposed a while back and will share an updated proposal by
>>>>>> next week.
>>>>>>
>>>>>> Looking forward to storing UDFs in Iceberg!
>>>>>> - Ajantha
>>>>>>
>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> The UDF spec does not require representations to be SQL. It merely
>>>>>>> does not specify (in this revision) how other representations are to be
>>>>>>> written.
>>>>>>>
>>>>>>> This seems like an easy extension (adding a new type in the
>>>>>>> "Representations" section).
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Dmitri.
>>>>>>>
>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Right now, SQL is an explicit requirement of the spec. It leaves a
>>>>>>>> way for future versions to add different representations later, but
>>>>>>>> only SQL is supported. That was also the feedback to my initial
>>>>>>>> skepticism about how it would work to add functions.
>>>>>>>>
>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I do not think the spec is meant to allow only SQL
>>>>>>>>> representations, although it is certainly favouring SQL in
>>>>>>>>> examples... It would be nice to add a non-SQL example, indeed.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Dmitri.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Coming from PyIceberg, I have concerns as this proposal focuses
>>>>>>>>>> on SQL-based engines, while Python-based systems often work with data
>>>>>>>>>> frames. Adding imperative languages like Python would make this
>>>>>>>>>> proposal more inclusive.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Fokko
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Op do 8 aug 2024 om 10:27 schreef Piotr Findeisen <
>>>>>>>>>> [email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Walaa, thanks for asking!
>>>>>>>>>>> In the design doc linked earlier in this thread [1], I read
>>>>>>>>>>> "Without a common standard, the UDFs are hard to share among
>>>>>>>>>>> different engines."
>>>>>>>>>>> ("Background and Motivation" section).
>>>>>>>>>>> I agree with this statement. I don't fully understand yet how
>>>>>>>>>>> the proposed design addresses shareability between the engines,
>>>>>>>>>>> though. I could use some help to understand this better.
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>> Piotr
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [1] SQL User-Defined Function Spec
>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Piotr, what do you mean by making user-created functions
>>>>>>>>>>>> shareable
>>>>>>>>>>>> between engines? Do you mean UDFs written in imperative code?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi,
>>>>>>>>>>>> >
>>>>>>>>>>>> > Thank you Ajantha for creating this thread. The Iceberg UDFs
>>>>>>>>>>>> > are an interesting idea!
>>>>>>>>>>>> > Is there a plan to make the user-created functions sharable
>>>>>>>>>>>> > between the engines?
>>>>>>>>>>>> > If so, what would a CREATE FUNCTION statement look like in
>>>>>>>>>>>> > e.g. Spark or Trino?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Meanwhile, added a few comments in the doc.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Best
>>>>>>>>>>>> > Piotr
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> I just looked through the proposal and added comments. I
>>>>>>>>>>>> >> think it would be helpful to also have a design doc that covers
>>>>>>>>>>>> >> the choices from the draft spec. For instance, the choice to
>>>>>>>>>>>> >> enumerate all possible function input structs rather than
>>>>>>>>>>>> >> allowing generics and varargs.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Here’s a quick summary of my feedback:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> - I think that the choice to enumerate function signatures is
>>>>>>>>>>>> >>   limiting. It would be nice to see a discussion of the trade-offs
>>>>>>>>>>>> >>   and a rationale for the choice. I think it would also be very
>>>>>>>>>>>> >>   helpful to have a few representative use cases for this included
>>>>>>>>>>>> >>   in the doc. That way the proposal can demonstrate that it solves
>>>>>>>>>>>> >>   those use cases with reasonable trade-offs.
>>>>>>>>>>>> >> - There are a few instances where this is inconsistent with
>>>>>>>>>>>> >>   conventions in other specs. For example, using string IDs rather
>>>>>>>>>>>> >>   than an integer.
>>>>>>>>>>>> >> - This uses a very different model for spec versioning than
>>>>>>>>>>>> >>   the Iceberg view and table specs. It requires readers to fail if
>>>>>>>>>>>> >>   there are any unknown fields, which prevents the spec from adding
>>>>>>>>>>>> >>   things that are fully backward-compatible. Other Iceberg specs
>>>>>>>>>>>> >>   only require a version change to introduce forward-incompatible
>>>>>>>>>>>> >>   changes and I think that this should do the same to avoid
>>>>>>>>>>>> >>   confusion.
>>>>>>>>>>>> >> - It looks like the intent is to allow multiple function
>>>>>>>>>>>> >>   signatures per version, but it is unclear how to encode them
>>>>>>>>>>>> >>   because a version is associated with a single function signature.
>>>>>>>>>>>> >> - There is no review of SQL syntax for creating functions
>>>>>>>>>>>> >>   across engines, so this doesn’t show that the metadata proposed is
>>>>>>>>>>>> >>   sufficient for cross-engine use cases.
>>>>>>>>>>>> >> - The example for a table-valued function shows a SELECT
>>>>>>>>>>>> >>   statement and it isn’t clear how this is distinct from a view.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on this.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> We didn't find any blocker for the spec.
>>>>>>>>>>>> >>> I will wait for a week and, if there are no more review comments,
>>>>>>>>>>>> >>> I will raise a PR for the spec addition next week.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> If anyone else is interested, please have a look at the
>>>>>>>>>>>> proposal
>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> - Ajantha
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Hi Ajantha,
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> I have left some comments. It is an interesting direction,
>>>>>>>>>>>> >>>> but there might be some details that need to be fine-tuned.
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> The doc is here [1] for others who might be interested.
>>>>>>>>>>>> >>>> Resharing since I do not think it was directly linked in the
>>>>>>>>>>>> >>>> thread.
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> [1]
>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Thanks,
>>>>>>>>>>>> >>>> Walaa.
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>
>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't get any review
>>>>>>>>>>>> on the proposal.
>>>>>>>>>>>> >>>>> Initially proposed on June 4.
>>>>>>>>>>>> >>>>>
>>>>>>>>>>>> >>>>> - Ajantha
>>>>>>>>>>>> >>>>>
>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>> >>>>>> Hi everyone,
>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>> >>>>>> We've only received one review so far (from Benny).
>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this.
>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>> >>>>>> - Ajantha
>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>> >>>>>>> Hi All,
>>>>>>>>>>>> >>>>>>> Please find the proposal link
>>>>>>>>>>>> >>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>> >>>>>>> Google doc link is attached in the proposal.
>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on it.
>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>> >>>>>>> Hope it gives more clarity for deciding
>>>>>>>>>>>> >>>>>>> how we want to implement it.
>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>> >>>>>>> - Ajantha
>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table
>>>>>>>>>>>> user defined functions. Here are some examples of what I meant in 
>>>>>>>>>>>> (2):
>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>> >>>>>>>> Hive GenericUDF:
>>>>>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>>>>>>> >>>>>>>> Trino user defined functions:
>>>>>>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>>>>>>> >>>>>>>> Flink user defined functions:
>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a variation of (1)
>>>>>>>>>>>> where the API is data flow/data pipeline API instead of SQL (e.g., 
>>>>>>>>>>>> Spark
>>>>>>>>>>>> Scala). Yes, that is also possible in the very long run :)
>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in imperative function
>>>>>>>>>>>> according to a Java/Scala/Python API, etc.
>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>> >>>>>>>>> I think we could still explore some long term
>>>>>>>>>>>> >>>>>>>>> opportunities in this case. Consider you register a Spark temp
>>>>>>>>>>>> >>>>>>>>> view as some sort of data frame read, then it could still be
>>>>>>>>>>>> >>>>>>>>> resolved to a Spark plan that is representable by an intermediate
>>>>>>>>>>>> >>>>>>>>> representation. But I agree this gets very complicated very soon,
>>>>>>>>>>>> >>>>>>>>> and just having the case (1) covered would already be a huge step
>>>>>>>>>>>> >>>>>>>>> forward.
>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>> >>>>>>>>> -Jack
>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can
>>>>>>>>>>>> be used to build a parameterized view.  So, there's definitely a 
>>>>>>>>>>>> lot in
>>>>>>>>>>>> common between UDFs and views.
>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>> Thanks
>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about what is
>>>>>>>>>>>> perceived as a "UDF". There are 2 flavors:
>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by the user whose
>>>>>>>>>>>> definition is a composition of other built-in functions/SQL 
>>>>>>>>>>>> expressions.
>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in imperative function
>>>>>>>>>>>> according to a Java/Scala/Python API, etc.
>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's references are pretty
>>>>>>>>>>>> much from (1) and I think those have more analogy to views due to 
>>>>>>>>>>>> their SQL
>>>>>>>>>>>> nature. Agree (2) is not practical to maintain by Iceberg, but I 
>>>>>>>>>>>> think
>>>>>>>>>>>> Ajantha's use cases are around (1), and may be worth evaluating.
>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>> Thanks,
>>>>>>>>>>>> >>>>>>>>>>> Walaa.
>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post the
>>>>>>>>>>>> proposal, but I think this would be a very difficult area to 
>>>>>>>>>>>> tackle across
>>>>>>>>>>>> engines, languages, and memory models without having a huge 
>>>>>>>>>>>> performance
>>>>>>>>>>>> penalty.
>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL
>>>>>>>>>>>> representations of UDFs (similar to views as shared by the 
>>>>>>>>>>>> reference links
>>>>>>>>>>>> above), the complexity involved will be similar to managing views.
>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the draft spec
>>>>>>>>>>>> (inspired by the view spec) this week to facilitate further 
>>>>>>>>>>>> discussions.
>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha
>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have a common set of
>>>>>>>>>>>> functions across engines, I don't see how that is practical when 
>>>>>>>>>>>> those
>>>>>>>>>>>> engines are implemented so differently. Plugging in code -- and 
>>>>>>>>>>>> especially
>>>>>>>>>>>> custom user-supplied code -- seems inherently specialized to me 
>>>>>>>>>>>> and should
>>>>>>>>>>>> be part of the engines' design.
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the views? I feel we
>>>>>>>>>>>> >>>>>>>>>>>>> can say exactly the same thing for Iceberg views, but yet we have
>>>>>>>>>>>> >>>>>>>>>>>>> Iceberg multi-dialect views implemented. Maybe it sounds like we
>>>>>>>>>>>> >>>>>>>>>>>>> are trying to draw a line between SQL vs other programming
>>>>>>>>>>>> >>>>>>>>>>>>> language as "code"? But I think SQL is just another type of code,
>>>>>>>>>>>> >>>>>>>>>>>>> and we are already talking about compiling all these different
>>>>>>>>>>>> >>>>>>>>>>>>> code dialects to an intermediate representation (using projects
>>>>>>>>>>>> >>>>>>>>>>>>> like Coral, Substrait), which will be stored as another type of
>>>>>>>>>>>> >>>>>>>>>>>>> representation of Iceberg view. I think the same functionality
>>>>>>>>>>>> >>>>>>>>>>>>> can be used for UDFs if developed.
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF support is a good
>>>>>>>>>>>> >>>>>>>>>>>>> idea, even just a multi-dialect one like views. That can allow
>>>>>>>>>>>> >>>>>>>>>>>>> engines to, for example, parse a view SQL and, when a referenced
>>>>>>>>>>>> >>>>>>>>>>>>> function cannot be resolved, look for a multi-dialect UDF
>>>>>>>>>>>> >>>>>>>>>>>>> definition.
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we have the
>>>>>>>>>>>> actual proposal published.
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>> Best,
>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and
>>>>>>>>>>>> "non-centralized" as views are. The same performance concerns 
>>>>>>>>>>>> apply to
>>>>>>>>>>>> views as well.
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base upon which
>>>>>>>>>>>> engines can build, so the argument that UDFs aren't practical, 
>>>>>>>>>>>> because
>>>>>>>>>>>> engines are different, is probably only a temporary concern.
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to
>>>>>>>>>>>> tackle the idea to make views portable, which is conceptually not 
>>>>>>>>>>>> that much
>>>>>>>>>>>> different from portable UDFs.
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to
>>>>>>>>>>>> the idea of having UDFs in Iceberg, especially not in this early 
>>>>>>>>>>>> stage.
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to
>>>>>>>>>>>> add UDFs tracked by Iceberg catalogs. I think that Iceberg 
>>>>>>>>>>>> primarily deals
>>>>>>>>>>>> with things that are centralized, like tables of data. While it 
>>>>>>>>>>>> would be
>>>>>>>>>>>> great to have a common set of functions across engines, I don't 
>>>>>>>>>>>> see how
>>>>>>>>>>>> that is practical when those engines are implemented so 
>>>>>>>>>>>> differently.
>>>>>>>>>>>> Plugging in code -- and especially custom user-supplied code -- 
>>>>>>>>>>>> seems
>>>>>>>>>>>> inherently specialized to me and should be part of the engines' 
>>>>>>>>>>>> design.
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post the
>>>>>>>>>>>> proposal, but I think this would be a very difficult area to 
>>>>>>>>>>>> tackle across
>>>>>>>>>>>> engines, languages, and memory models without having a huge 
>>>>>>>>>>>> performance
>>>>>>>>>>>> penalty.
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the community
>>>>>>>>>>>> interest in storing the Versioned SQL UDFs in Iceberg.
>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec addition for
>>>>>>>>>>>> storing the versioned UDFs in Iceberg (inspired by view spec).
>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in
>>>>>>>>>>>> that they are associated with tables, but they can accept 
>>>>>>>>>>>> arguments and
>>>>>>>>>>>> produce return values, or even function as inline expressions.
>>>>>>>>>>>> >>>>>>>>>>>>>>> Many query engines like Dremio, Trino,
>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake, and Databricks Spark support SQL UDFs at the catalog
>>>>>>>>>>>> >>>>>>>>>>>>>>> level [1].
>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the engines.
>>>>>>>>>>>> >>>>>>>>>>>>>>>   Potentially engines can understand the UDFs written by other
>>>>>>>>>>>> >>>>>>>>>>>>>>>   engines (with a translation layer).
>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating this feature into
>>>>>>>>>>>> Iceberg would be a valuable addition, and we're eager to 
>>>>>>>>>>>> collaborate with
>>>>>>>>>>>> the community to develop a UDF specification.
>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a
>>>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>> [1]
>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio -
>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino -
>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks -
>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular
>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> --
>>>>>>>>>>>> >> Ryan Blue
>>>>>>>>>>>> >> Databricks
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Databricks
>>>>>>>>
>>>>>>>
