Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Yufei Gu Fri, 20 Jun 2025 15:57:47 -0700

Hi Ryan,

Here are the main use cases for secure UDFs:


   1. Hiding UDF Definitions: This includes concealing the UDF body and
   details such as the list of imports, some of which aren’t applicable to
   SQL UDFs.
   2. Sandboxed Execution: Ensuring the UDF runs in an isolated environment.
   Again, this typically doesn’t apply to SQL UDFs.
   3. Preventing Data Leakage at Execution Time: For example, secure UDFs may
   disable certain optimizations, such as predicate pushdown, to avoid
   exposing sensitive data indirectly. [1]

Given these scenarios, I agree with your point that the secure flag is
primarily an instruction to the engine to behave differently. While it's
largely an engine-side behavior, we still need to include this flag in the
UDF definition to indicate whether a UDF is secure, especially considering
the perf penalty introduced by scenario #3. We should clearly recommend
that users avoid marking UDFs as secure unless it's truly necessary.
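
To make scenario #3 concrete, here is a minimal Python sketch of how pushing
a caller's predicate below a row-level policy can expose hidden data. It is
not engine code: the rows, the policy, and every function name below are
made up for illustration, and a real optimizer is far more involved. The
caller's predicate fails loudly on a targeted value, so if it is evaluated
before the policy filter, the error itself reveals that a hidden row exists.

```python
# Hypothetical data: alice may only see rows she owns.
ROWS = [
    {"id": 1, "owner": "alice", "color": "red"},
    {"id": 2, "owner": "bob", "color": "purple"},  # hidden from alice
]


def policy(row, user):
    """Row-level policy (illustrative): a caller sees only their own rows."""
    return row["owner"] == user


def probe(row):
    """Caller-supplied predicate that divides by zero on a targeted value."""
    return 1 / (0 if row["color"] == "purple" else 1) == 1


def scan_secure(rows, user, predicate):
    """Policy first, caller predicate second: no pushdown past the policy."""
    return [r for r in rows if policy(r, user) and predicate(r)]


def scan_pushed_down(rows, user, predicate):
    """Caller predicate pushed below the policy: it now runs on hidden rows."""
    return [r for r in rows if predicate(r) and policy(r, user)]


# With the secure ordering, the probe never touches bob's row.
visible = scan_secure(ROWS, "alice", probe)

# With pushdown, the probe runs on bob's row and the error leaks its content.
try:
    scan_pushed_down(ROWS, "alice", probe)
    leaked = False
except ZeroDivisionError:
    leaked = True
```

The secure ordering is also the slower one in general, since the engine
cannot use the caller's predicate to prune data early. That is the perf
penalty mentioned above.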

[1]
https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown

Yufei


On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <rdb...@gmail.com> wrote:

> Yufei, could you make the argument for supporting a "secure" UDF? What use
> case are you addressing and what specifically changes about how the UDF is
> handled? If the idea is to hide the UDF definition, do we need to include
> it?
>
> I think this would be a signal to a "trusted engine". When the engine
> interacts with the catalog it sends authorization information about itself
> in addition to the user that it is acting on behalf of. That way the
> catalog knows that the secure UDF can be sent to the engine and won't be
> shown to the user. The majority of this logic is on the REST server side,
> and the only part that is communicated to the client is the request not to
> show the UDF to the user, right? In that case should this be a property
> rather than part of the definition? Even if we state that the client "must"
> suppress the UDF definition, it's really just a request. Only trusted
> engines can be passed the UDF definition, so a spec requirement to suppress
> the definition isn't very meaningful.
>
> On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Thanks for the summary, Ajantha!
>>
>> Multi-statement UDFs are definitely useful, but whether those statements
>> run within a single transaction should be treated as an engine-level
>> concern. The Iceberg UDF spec can spell out the expectation, yet the actual
>> guarantee still depends on the runtime. Even if a UDF declares itself
>> transactional, the engine may or may not enforce it.
>>
>> One more thing: should we also introduce a “secure UDF” option supported
>> by some engines[1], so the body and any sensitive details stay hidden from
>> callers?
>>
>> [1] https://docs.snowflake.com/en/developer-guide/secure-udf-procedure
>>
>> Yufei
>>
>>
>> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> Thanks to everyone who joined the sync.
>>> Here is the meeting recording:
>>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing
>>> Summary:
>>>
>>>    - We went through the SQL UDF syntax supported by different
>>>    engines (Snowflake, Databricks, Dremio, Trino, OSS Spark 4.0).
>>>    - Each engine uses its own block separator, like $$ or '' or none.
>>>    Action item was to check whether engines support multi-statement
>>>    (transactional) UDF bodies.
>>>    - Discussed function overloading. Need to check whether these
>>>    engines support function overloading for SQL UDFs. Postgres supports it!
>>>    If yes, we need to adapt the spec to handle it.
>>>    - Started online spec review and discussed the deterministic flag.
>>>    We concluded that we keep independent fields (like deterministic) in
>>>    the spec only if the majority of engines support them. Otherwise they
>>>    will be passed in an engine-specific property bag, and it is the
>>>    engine's responsibility to honor those optional properties.
>>>
>>> Feel free to review the current proposal document here
>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>
>>> Final spec will be put to review and vote once it is ready.
>>>
>>> Details for next Iceberg UDF sync:
>>>
>>> *Monday, June 30 · 9:00 – 10:00am*
>>> Time zone: America/Los_Angeles
>>> Google Meet joining info
>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>
>>> - Ajantha
>>>
>>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Thanks to everyone who joined the sync.
>>>> Here is the meeting recording:
>>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing
>>>>
>>>> Summary:
>>>>
>>>>    - We discussed including Python support; the majority agreed *not to*
>>>>    (see recording for details).
>>>>    - No strong opposition to versioning; it will be included to support
>>>>    change tracking and similar use cases.
>>>>    - Suggestions were made to document how each catalog resolves UDFs,
>>>>    similar to views and tables.
>>>>    - We agreed not to deviate from the existing table/view spec; e.g.,
>>>>    location will remain *required* for cross-catalog compatibility.
>>>>    - We also briefly discussed view interoperability, since the same
>>>>    considerations apply here.
>>>>
>>>>    Feel free to review the proposal document here:
>>>>    <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0>
>>>>    With the current scope, it is similar to the view/table spec now.
>>>>    Final spec will be put to review and vote once it is ready.
>>>>
>>>> Details for next Iceberg UDF sync:
>>>>
>>>> *Monday, June 16 · 9:00 – 10:00am*
>>>> Time zone: America/Los_Angeles
>>>> Google Meet joining info
>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>
>>>> - Ajantha
>>>>
>>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> We’ve set up a dedicated bi-weekly community sync for the UDF project.
>>>>> Everyone’s welcome to drop in and share ideas! Here is the meeting link:
>>>>>
>>>>> Iceberg UDF sync
>>>>> Monday, June 2 · 9:00 – 10:00am
>>>>> Time zone: America/Los_Angeles
>>>>> Google Meet joining info
>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Update on the progress.
>>>>>>
>>>>>> I had a meeting today with Yufei and Yun.zou to discuss the UDF
>>>>>> proposal. We covered several key points, though some are still open for
>>>>>> further discussion:
>>>>>>
>>>>>> a) *UDF Versioning*: Do we truly need versioning for UDFs at this
>>>>>> stage? We explored the possibility of simplifying the specification by
>>>>>> avoiding view replication, and potentially introducing versioning support
>>>>>> later. UDTFs, being a superset of views in some ways, may not require
>>>>>> versioning initially.
>>>>>>
>>>>>> b) *VarArgs Support*: While some query engines may not support
>>>>>> vararg syntax in CREATE FUNCTION, Iceberg UDFs could represent such
>>>>>> arguments as lists when supported by the engine.
>>>>>>
>>>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t support
>>>>>> generic types (e.g., object), we can only map engine-specific types
>>>>>> to Iceberg types. As a result, generic data types will not be supported
>>>>>> in the initial version.
>>>>>>
>>>>>> d) *Python Support*: Incorporating Python as a language for SQL UDFs
>>>>>> seems promising, especially given its potential to resolve
>>>>>> interoperability challenges. Some engines, however, require platform
>>>>>> version and package dependency details to execute Python code; this
>>>>>> should be captured in the specification.
>>>>>>
>>>>>> *Next Steps*
>>>>>> I will update the proposal document with two primary UDF use cases:
>>>>>>
>>>>>>    - Policy exchange between engines
>>>>>>    - UDTF as a superset of view functionality
>>>>>>
>>>>>> The update will include corresponding syntax examples in both SQL and
>>>>>> Python, and detail how each use case is represented in Iceberg metadata.
>>>>>>
>>>>>> We also plan to set up regular syncs (open to more interested
>>>>>> participants) to continue refining and finalizing the UDF specification.
>>>>>> - Ajantha
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I've updated the design document[1] based on the previous comments.
>>>>>>> Additionally, I've included the SQL UDF syntax supported by various
>>>>>>> vendors, including Dremio, Snowflake, Databricks, and Trino.
>>>>>>>
>>>>>>> I'm happy to schedule a separate sync if a deeper discussion is
>>>>>>> needed. Let's keep moving forward, especially with the renewed interest
>>>>>>> from the community.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing
>>>>>>>
>>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> During the last catalog community sync, there was significant
>>>>>>>> interest in storing UDFs in Iceberg and adding endpoints for UDF
>>>>>>>> handling in the REST catalog spec.
>>>>>>>>
>>>>>>>> I recently discussed this with Yufei to better understand the new
>>>>>>>> requirement of using UDFs for fine-grained access control policies.
>>>>>>>> This expands the use cases beyond just versioned and interoperable
>>>>>>>> UDFs. Additionally, I learnt that many vendors are interested in
>>>>>>>> this feature.
>>>>>>>>
>>>>>>>> Given the strong community interest and support, I’d like to take
>>>>>>>> ownership of this effort and revive the work. I'll be revisiting the
>>>>>>>> document I proposed a while back and will share an updated proposal
>>>>>>>> by next week.
>>>>>>>>
>>>>>>>> Looking forward to storing UDFs in Iceberg!
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov
>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> The UDF spec does not require representations to be SQL. It merely
>>>>>>>>> does not specify (in this revision) how other representations are to
>>>>>>>>> be written.
>>>>>>>>>
>>>>>>>>> This seems like an easy extension (adding a new type in the
>>>>>>>>> "Representations" section).
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Dmitri.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue
>>>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Right now, SQL is an explicit requirement of the spec. It leaves
>>>>>>>>>> a way for future versions to add different representations later,
>>>>>>>>>> but only SQL is supported. That was also the feedback to my initial
>>>>>>>>>> skepticism about how it would work to add functions.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov
>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> I do not think the spec is meant to allow only SQL
>>>>>>>>>>> representations, although it is certainly favouring SQL in
>>>>>>>>>>> examples... It would be nice to add a non-SQL example, indeed.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Dmitri.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <
>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Coming from PyIceberg, I have concerns as this proposal focuses
>>>>>>>>>>>> on SQL-based engines, while Python-based systems often work with
>>>>>>>>>>>> data frames. Adding imperative languages like Python would make
>>>>>>>>>>>> this proposal more inclusive.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>> Fokko
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 8 Aug 2024 at 10:27, Piotr Findeisen <
>>>>>>>>>>>> piotr.findei...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Walaa, thanks for asking!
>>>>>>>>>>>>> In the design doc linked earlier in this thread [1] I read
>>>>>>>>>>>>> "Without a common standard, the UDFs are hard to share among
>>>>>>>>>>>>> different engines."
>>>>>>>>>>>>> ("Background and Motivation" section).
>>>>>>>>>>>>> I agree with this statement. I don't fully understand yet how
>>>>>>>>>>>>> the proposed design addresses shareability between the engines,
>>>>>>>>>>>>> though. I could use some help to understand this better.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best
>>>>>>>>>>>>> Piotr
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] SQL User-Defined Function Spec
>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <
>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Piotr, what do you mean by making user-created functions
>>>>>>>>>>>>>> shareable
>>>>>>>>>>>>>> between engines? Do you mean UDFs written in imperative code?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>>>>>>>>>>>>>> <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thank you Ajantha for creating this thread. The Iceberg
>>>>>>>>>>>>>> > UDFs are an interesting idea!
>>>>>>>>>>>>>> > Is there a plan to make the user-created functions sharable
>>>>>>>>>>>>>> > between the engines?
>>>>>>>>>>>>>> > If so, what would a CREATE FUNCTION statement look like in,
>>>>>>>>>>>>>> > e.g., Spark or Trino?
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Best
>>>>>>>>>>>>>> > Piotr
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue
>>>>>>>>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> I just looked through the proposal and added comments. I
>>>>>>>>>>>>>> >> think it would be helpful to also have a design doc that covers
>>>>>>>>>>>>>> >> the choices from the draft spec. For instance, the choice to
>>>>>>>>>>>>>> >> enumerate all possible function input structs rather than
>>>>>>>>>>>>>> >> allowing generics and varargs.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Here’s a quick summary of my feedback:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> - I think that the choice to enumerate function signatures
>>>>>>>>>>>>>> >>   is limiting. It would be nice to see a discussion of the
>>>>>>>>>>>>>> >>   trade-offs and a rationale for the choice. I think it would
>>>>>>>>>>>>>> >>   also be very helpful to have a few representative use cases
>>>>>>>>>>>>>> >>   for this included in the doc. That way the proposal can
>>>>>>>>>>>>>> >>   demonstrate that it solves those use cases with reasonable
>>>>>>>>>>>>>> >>   trade-offs.
>>>>>>>>>>>>>> >> - There are a few instances where this is inconsistent with
>>>>>>>>>>>>>> >>   conventions in other specs. For example, using string IDs
>>>>>>>>>>>>>> >>   rather than an integer.
>>>>>>>>>>>>>> >> - This uses a very different model for spec versioning than
>>>>>>>>>>>>>> >>   the Iceberg view and table specs. It requires readers to fail
>>>>>>>>>>>>>> >>   if there are any unknown fields, which prevents the spec from
>>>>>>>>>>>>>> >>   adding things that are fully backward-compatible. Other
>>>>>>>>>>>>>> >>   Iceberg specs only require a version change to introduce
>>>>>>>>>>>>>> >>   forward-incompatible changes and I think that this should do
>>>>>>>>>>>>>> >>   the same to avoid confusion.
>>>>>>>>>>>>>> >> - It looks like the intent is to allow multiple function
>>>>>>>>>>>>>> >>   signatures per version, but it is unclear how to encode them
>>>>>>>>>>>>>> >>   because a version is associated with a single function
>>>>>>>>>>>>>> >>   signature.
>>>>>>>>>>>>>> >> - There is no review of SQL syntax for creating functions
>>>>>>>>>>>>>> >>   across engines, so this doesn’t show that the metadata
>>>>>>>>>>>>>> >>   proposed is sufficient for cross-engine use cases.
>>>>>>>>>>>>>> >> - The example for a table-valued function shows a SELECT
>>>>>>>>>>>>>> >>   statement and it isn’t clear how this is distinct from a view.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <
>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on this.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> We didn't find any blocker for the spec.
>>>>>>>>>>>>>> >>> I will wait for a week and, if there are no more review
>>>>>>>>>>>>>> >>> comments, I will raise a PR for the spec addition next week.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> If anyone else is interested, please have a look at the
>>>>>>>>>>>>>> >>> proposal:
>>>>>>>>>>>>>> >>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> - Ajantha
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Hi Ajantha,
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> I have left some comments. It is an interesting
>>>>>>>>>>>>>> >>>> direction, but there might be some details that need to be
>>>>>>>>>>>>>> >>>> fine-tuned.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> The doc is here [1] for others who might be interested.
>>>>>>>>>>>>>> >>>> Resharing since I do not think it was directly linked in the
>>>>>>>>>>>>>> >>>> thread.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> [1]
>>>>>>>>>>>>>> >>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Thanks,
>>>>>>>>>>>>>> >>>> Walaa.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <
>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't get any
>>>>>>>>>>>>>> review on the proposal.
>>>>>>>>>>>>>> >>>>> Initially proposed on June 4.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> - Ajantha
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <
>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Hi everyone,
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> We've only received one review so far (from Benny).
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> - Ajantha
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <
>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> Hi All,
>>>>>>>>>>>>>> >>>>>>> Please find the proposal link
>>>>>>>>>>>>>> >>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the proposal.
>>>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on it.
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity to take the decisions and
>>>>>>>>>>>>>> how we want to implement it.
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> - Ajantha
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table
>>>>>>>>>>>>>> user defined functions. Here are some examples of what I meant 
>>>>>>>>>>>>>> in (2):
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF:
>>>>>>>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>>>>>>>>> >>>>>>>> Trino user defined functions:
>>>>>>>>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>>>>>>>>> >>>>>>>> Flink user defined functions:
>>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a variation of (1)
>>>>>>>>>>>>>> where the API is data flow/data pipeline API instead of SQL 
>>>>>>>>>>>>>> (e.g., Spark
>>>>>>>>>>>>>> Scala). Yes, that is also possible in the very long run :)
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <
>>>>>>>>>>>>>> yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in imperative function
>>>>>>>>>>>>>> according to a Java/Scala/Python API, etc.
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some long term
>>>>>>>>>>>>>> opportunities in this case. Consider you register a Spark temp 
>>>>>>>>>>>>>> view as some
>>>>>>>>>>>>>> sort of data frame read, then it could still be resolved to a 
>>>>>>>>>>>>>> Spark plan
>>>>>>>>>>>>>> that is representable by an intermediate representation. But I 
>>>>>>>>>>>>>> agree this
>>>>>>>>>>>>>> gets very complicated very soon, and just having the case (1) 
>>>>>>>>>>>>>> covered would
>>>>>>>>>>>>>> already be a huge step forward.
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>> -Jack
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <
>>>>>>>>>>>>>> btc...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF
>>>>>>>>>>>>>> can be used to build a parameterized view.  So, there's 
>>>>>>>>>>>>>> definitely a lot in
>>>>>>>>>>>>>> common between UDFs and views.
>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>> Thanks
>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin
>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about what is
>>>>>>>>>>>>>> perceived as a "UDF". There are 2 flavors:
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by the user whose
>>>>>>>>>>>>>> definition is a composition of other built-in functions/SQL 
>>>>>>>>>>>>>> expressions.
>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in imperative function
>>>>>>>>>>>>>> according to a Java/Scala/Python API, etc.
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's references are
>>>>>>>>>>>>>> pretty much from (1) and I think those have more analogy to 
>>>>>>>>>>>>>> views due to
>>>>>>>>>>>>>> their SQL nature. Agree (2) is not practical to maintain by 
>>>>>>>>>>>>>> Iceberg, but I
>>>>>>>>>>>>>> think Ajantha's use cases are around (1), and may be worth 
>>>>>>>>>>>>>> evaluating.
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> >>>>>>>>>>> Walaa.
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <
>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post the
>>>>>>>>>>>>>> proposal, but I think this would be a very difficult area to 
>>>>>>>>>>>>>> tackle across
>>>>>>>>>>>>>> engines, languages, and memory models without having a huge 
>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>> penalty.
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL
>>>>>>>>>>>>>> representations of UDFs (similar to views as shared by the 
>>>>>>>>>>>>>> reference links
>>>>>>>>>>>>>> above), the complexity involved will be similar to managing 
>>>>>>>>>>>>>> views.
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the draft spec
>>>>>>>>>>>>>> (inspired by the view spec) this week to facilitate further 
>>>>>>>>>>>>>> discussions.
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <
>>>>>>>>>>>>>> yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have a common set
>>>>>>>>>>>>>> of functions across engines, I don't see how that is practical 
>>>>>>>>>>>>>> when those
>>>>>>>>>>>>>> engines are implemented so differently. Plugging in code -- and 
>>>>>>>>>>>>>> especially
>>>>>>>>>>>>>> custom user-supplied code -- seems inherently specialized to me 
>>>>>>>>>>>>>> and should
>>>>>>>>>>>>>> be part of the engines' design.
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the views? I feel we
>>>>>>>>>>>>>> can say exactly the same thing for Iceberg views, but yet we 
>>>>>>>>>>>>>> have Iceberg
>>>>>>>>>>>>>> multi-dialect views implemented. Maybe it sounds like we are 
>>>>>>>>>>>>>> trying to draw
>>>>>>>>>>>>>> a line between SQL vs other programming language as "code"? but 
>>>>>>>>>>>>>> I think SQL
>>>>>>>>>>>>>> is just another type of code, and we are already talking about 
>>>>>>>>>>>>>> compiling
>>>>>>>>>>>>>> all these different code dialects to an intermediate 
>>>>>>>>>>>>>> representation (using
>>>>>>>>>>>>>> projects like Coral, Substrait), which will be stored as another 
>>>>>>>>>>>>>> type of
>>>>>>>>>>>>>> representation of Iceberg view. I think the same functionality 
>>>>>>>>>>>>>> can be used
>>>>>>>>>>>>>> for UDFs if developed.
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF support is a good
>>>>>>>>>>>>>> >>>>>>>>>>>>> idea, even just a multi-dialect one like views. That would
>>>>>>>>>>>>>> >>>>>>>>>>>>> allow engines to, for example, parse a view's SQL and, when a
>>>>>>>>>>>>>> >>>>>>>>>>>>> referenced function cannot be resolved, look for a
>>>>>>>>>>>>>> >>>>>>>>>>>>> multi-dialect UDF definition.
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we have the
>>>>>>>>>>>>>> actual proposal published.
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <
>>>>>>>>>>>>>> sn...@snazy.de> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and
>>>>>>>>>>>>>> "non-centralized" as views are. The same performance concerns 
>>>>>>>>>>>>>> apply to
>>>>>>>>>>>>>> views as well.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base upon which
>>>>>>>>>>>>>> engines can build, so the argument that UDFs aren't practical, 
>>>>>>>>>>>>>> because
>>>>>>>>>>>>>> engines are different, is probably only a temporary concern.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to
>>>>>>>>>>>>>> tackle the idea to make views portable, which is conceptually 
>>>>>>>>>>>>>> not that much
>>>>>>>>>>>>>> different from portable UDFs.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch
>>>>>>>>>>>>>> to the idea of having UDFs in Iceberg, especially not in this 
>>>>>>>>>>>>>> early stage.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea
>>>>>>>>>>>>>> to add UDFs tracked by Iceberg catalogs. I think that Iceberg 
>>>>>>>>>>>>>> primarily
>>>>>>>>>>>>>> deals with things that are centralized, like tables of data. 
>>>>>>>>>>>>>> While it would
>>>>>>>>>>>>>> be great to have a common set of functions across engines, I 
>>>>>>>>>>>>>> don't see how
>>>>>>>>>>>>>> that is practical when those engines are implemented so 
>>>>>>>>>>>>>> differently.
>>>>>>>>>>>>>> Plugging in code -- and especially custom user-supplied code -- 
>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>> inherently specialized to me and should be part of the engines' 
>>>>>>>>>>>>>> design.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post the
>>>>>>>>>>>>>> proposal, but I think this would be a very difficult area to 
>>>>>>>>>>>>>> tackle across
>>>>>>>>>>>>>> engines, languages, and memory models without having a huge 
>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>> penalty.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the community
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> interest in storing versioned SQL UDFs in Iceberg.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose a spec addition for
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> storing versioned UDFs in Iceberg (inspired by the view spec).
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> that they are associated with tables, but they can accept
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> arguments and produce return values, or even function as
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> inline expressions.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many query engines, like Dremio, Trino,
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake, and Databricks Spark, support SQL UDFs at the
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> catalog level [1].
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the engines.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>   Potentially, engines can understand the UDFs written by
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>   other engines (with a translation layer).
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating this feature into
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Iceberg would be a valuable addition, and we're eager to
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> collaborate with the community to develop a UDF specification.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio -
>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino -
>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks -
>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> --
>>>>>>>>>>>>>> >> Ryan Blue
>>>>>>>>>>>>>> >> Databricks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Databricks
>>>>>>>>>>
>>>>>>>>>
