Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Yufei Gu Mon, 30 Jun 2025 08:51:43 -0700

Yes, hiding the definition and disabling pushdown are required. We will need
a named key (e.g., secure) somewhere, whether it is a top-level
property or a key within the UDF properties, so that both the UDF creator
and the consumer can recognize it.
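
As a rough illustration of the point above (a sketch under stated
assumptions, not part of the proposal): a consumer could look for the flag in
either location. The metadata layout, the top-level `secure` field, the
`properties` map, and the helper names are all hypothetical; the thread
leaves the actual placement open.

```python
from typing import Any, Mapping

SECURE_KEY = "secure"  # hypothetical key name, taken from the "e.g., secure" above


def _as_bool(value: Any) -> bool:
    # Property bags usually carry strings, so accept "true"/"false" as well as booleans.
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def is_secure_udf(udf_metadata: Mapping[str, Any]) -> bool:
    """Return True if the UDF metadata marks the function as secure.

    Checks an assumed top-level field first, then falls back to an assumed
    `properties` map. Neither location is defined by the spec yet.
    """
    top_level = udf_metadata.get(SECURE_KEY)
    if top_level is not None:
        return _as_bool(top_level)
    properties = udf_metadata.get("properties") or {}
    return _as_bool(properties.get(SECURE_KEY, False))


def describe_for_user(udf_metadata: Mapping[str, Any]) -> dict:
    """Build the view of a UDF that an engine might expose to an end user."""
    visible = {"name": udf_metadata.get("name")}
    if is_secure_udf(udf_metadata):
        visible["definition"] = None       # hide the body from callers
        visible["allow_pushdown"] = False  # disable pushdown through this UDF
    else:
        visible["definition"] = udf_metadata.get("definition")
        visible["allow_pushdown"] = True
    return visible
```

Either placement works for the consumer as long as the key name is agreed on;
the sketch only shows that both sides need one well-known name to check.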


Yufei


On Thu, Jun 26, 2025 at 4:27 PM Ryan Blue <rdb...@gmail.com> wrote:

> Thanks for the extra detail. What do you think the spec would require?
> Would it require hiding the UDF definition from users and require specific
> pushdown cases be disabled? The use cases seem valid, but I'm trying to
> understand the requirements this places on engines and why it needs to be
> part of the spec, rather than part of the properties of the UDF.
>
> On Fri, Jun 20, 2025 at 3:56 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> Here are the main use cases for secure UDFs:
>>
>>    1. Hiding UDF Definitions: This includes concealing the UDF body and
>>    details like the list of imports, some of which aren’t applicable to
>>    SQL UDFs.
>>    2. Sandboxed Execution: Ensuring the UDF runs in an isolated
>>    environment. Again, this typically doesn’t apply to SQL UDFs.
>>    3. Preventing Data Leakage at Execution Time: For example, secure UDFs
>>    may disable certain optimizations, such as predicate pushdown, to avoid
>>    exposing sensitive data indirectly. [1]
>>
>> Given these scenarios, I agree with your point that the secure flag is
>> primarily an instruction to the engine to behave differently. While it's
>> largely an engine-side behavior, we still need to include this flag in the
>> UDF definition to indicate whether a UDF is secure, especially considering
>> the perf penalty introduced by scenario #3. We should clearly recommend
>> that users avoid marking UDFs as secure unless it's truly necessary.
>>
>> [1]
>> https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown
>> Yufei
>>
>>
>> On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> Yufei, could you make the argument for supporting a "secure" UDF? What
>>> use case are you addressing and what specifically changes about how the UDF
>>> is handled? If the idea is to hide the UDF definition, do we need to
>>> include it?
>>>
>>> I think this would be a signal to a "trusted engine". When the engine
>>> interacts with the catalog it sends authorization information about itself
>>> in addition to the user that it is acting on behalf of. That way the
>>> catalog knows that the secure UDF can be sent to the engine and won't be
>>> shown to the user. The majority of this logic is on the REST server side,
>>> and the only part that is communicated to the client is the request not to
>>> show the UDF to the user, right? In that case should this be a property
>>> rather than part of the definition? Even if we state that the client "must"
>>> suppress the UDF definition, it's really just a request. Only trusted
>>> engines can be passed the UDF definition, so a spec requirement to suppress
>>> the definition isn't very meaningful.
>>>
>>> On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>
>>>> Thanks for the summary, Ajantha!
>>>>
>>>> Multi-statement UDFs are definitely useful, but whether those
>>>> statements run within a single transaction should be treated as an
>>>> engine-level concern. The Iceberg UDF spec can spell out the expectation,
>>>> yet the actual guarantee still depends on the runtime. Even if a UDF
>>>> declares itself transactional, the engine may or may not enforce it.
>>>>
>>>> One more thing: should we also introduce a “secure UDF” option
>>>> supported by some engines[1], so the body and any sensitive details stay
>>>> hidden from callers?
>>>>
>>>> [1] https://docs.snowflake.com/en/developer-guide/secure-udf-procedure
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks to everyone who joined the sync.
>>>>> Here is the meeting recording:
>>>>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing
>>>>> Summary:
>>>>>
>>>>>    - We went through the SQL UDF syntax supported by different
>>>>>    engines (Snowflake, Databricks, Dremio, Trino, OSS Spark 4.0).
>>>>>    - Each engine uses its own block separator, like $$ or '' or none.
>>>>>    Action item: check whether engines support multi-statement
>>>>>    (transactional) UDF bodies.
>>>>>    - Discussed function overloading. We need to check whether
>>>>>    these engines support function overloading for SQL UDFs. Postgres
>>>>>    supports it! If they do, we need to adapt the spec to handle it.
>>>>>    - Started the online spec review and discussed the deterministic
>>>>>    flag. We concluded that we keep independent fields (like
>>>>>    deterministic) in the spec only if the majority of engines support
>>>>>    them. Otherwise they will be passed in a property bag (engine
>>>>>    specific), and it is the engine's responsibility to honor those
>>>>>    optional properties.
>>>>>
>>>>> Feel free to review the current proposal document here
>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>>>
>>>>> Final spec will be put to review and vote once it is ready.
>>>>>
>>>>> Details for next Iceberg UDF sync:
>>>>>
>>>>> *Monday, June 30 · 9:00 – 10:00am*
>>>>> Time zone: America/Los_Angeles
>>>>> Google Meet joining info
>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks to everyone who joined the sync.
>>>>>> Here is the meeting recording:
>>>>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing
>>>>>>
>>>>>> Summary:
>>>>>>
>>>>>>    - We discussed including Python support; the majority agreed *not
>>>>>>    to* (see recording for details).
>>>>>>    - No strong opposition to versioning; it will be included to
>>>>>>    support change tracking and similar use cases.
>>>>>>    - Suggestions were made to document how each catalog resolves UDFs,
>>>>>>    similar to views and tables.
>>>>>>    - We agreed not to deviate from the existing table/view spec;
>>>>>>    e.g., location will remain *required* for cross-catalog
>>>>>>    compatibility.
>>>>>>    - We also briefly discussed view interoperability, as the same
>>>>>>    considerations apply here.
>>>>>>
>>>>>>    Feel free to review the proposal document here:
>>>>>>    <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0>
>>>>>>    With the current scope, it is similar to the view/table spec now.
>>>>>>    Final spec will be put to review and vote once it is ready.
>>>>>>
>>>>>> Details for next Iceberg UDF sync:
>>>>>>
>>>>>> *Monday, June 16 · 9:00 – 10:00am*
>>>>>> Time zone: America/Los_Angeles
>>>>>> Google Meet joining info
>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <flyrain...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> We’ve set up a dedicated bi-weekly community sync for the UDF
>>>>>>> project. Everyone’s welcome to drop in and share ideas! Here is the
>>>>>>> meeting link:
>>>>>>>
>>>>>>> Iceberg UDF sync
>>>>>>> Monday, June 2 · 9:00 – 10:00am
>>>>>>> Time zone: America/Los_Angeles
>>>>>>> Google Meet joining info
>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Update on the progress.
>>>>>>>>
>>>>>>>> I had a meeting today with Yufei and Yun.zou to discuss the UDF
>>>>>>>> proposal. We covered several key points, though some are still open for
>>>>>>>> further discussion:
>>>>>>>>
>>>>>>>> a) *UDF Versioning*: Do we truly need versioning for UDFs at this
>>>>>>>> stage? We explored the possibility of simplifying the specification by
>>>>>>>> avoiding view replication, and potentially introducing versioning
>>>>>>>> support later. UDTFs, being a superset of views in some ways, may not
>>>>>>>> require versioning initially.
>>>>>>>>
>>>>>>>> b) *VarArgs Support*: While some query engines may not support
>>>>>>>> vararg syntax in CREATE FUNCTION, Iceberg UDFs could represent
>>>>>>>> such arguments as lists when supported by the engine.
>>>>>>>>
>>>>>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t support
>>>>>>>> generic types (e.g., object), we can only map engine-specific
>>>>>>>> types to Iceberg types. As a result, generic data types will not be
>>>>>>>> supported in the initial version.
>>>>>>>>
>>>>>>>> d) *Python Support*: Incorporating Python as a language for SQL
>>>>>>>> UDFs seems promising, especially given its potential to resolve
>>>>>>>> interoperability challenges. Some engines, however, require platform
>>>>>>>> version and package dependency details to execute Python code; this
>>>>>>>> should be captured in the specification.
>>>>>>>>
>>>>>>>> *Next Steps*
>>>>>>>> I will update the proposal document with two primary UDF use cases:
>>>>>>>>
>>>>>>>>    - Policy exchange between engines
>>>>>>>>    - UDTF as a superset of view functionality
>>>>>>>>
>>>>>>>> The update will include corresponding syntax examples in both SQL
>>>>>>>> and Python, and detail how each use case is represented in Iceberg
>>>>>>>> metadata.
>>>>>>>>
>>>>>>>> We also plan to set up regular syncs (open to more interested
>>>>>>>> participants) to continue refining and finalizing the UDF
>>>>>>>> specification.
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I've updated the design document[1] based on the previous
>>>>>>>>> comments. Additionally, I've included the SQL UDF syntax supported by
>>>>>>>>> various vendors, including Dremio, Snowflake, Databricks, and Trino.
>>>>>>>>>
>>>>>>>>> I'm happy to schedule a separate sync if a deeper discussion is
>>>>>>>>> needed. Let's keep moving forward, especially with the renewed
>>>>>>>>> interest from the community.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat <
>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> During the last catalog community sync, there was significant
>>>>>>>>>> interest in storing UDFs in Iceberg and adding endpoints for UDF
>>>>>>>>>> handling in the REST catalog spec.
>>>>>>>>>>
>>>>>>>>>> I recently discussed this with Yufei to better understand the new
>>>>>>>>>> requirement of using UDFs for fine-grained access control policies.
>>>>>>>>>> This expands the use cases beyond just versioned and interoperable
>>>>>>>>>> UDFs. Additionally, I learnt that many vendors are interested in
>>>>>>>>>> this feature.
>>>>>>>>>>
>>>>>>>>>> Given the strong community interest and support, I’d like to take
>>>>>>>>>> ownership of this effort and revive the work. I'll be revisiting the
>>>>>>>>>> document I proposed a while back and will share an updated proposal
>>>>>>>>>> by next week.
>>>>>>>>>>
>>>>>>>>>> Looking forward to storing UDFs in Iceberg!
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov
>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> The UDF spec does not require representations to be SQL. It
>>>>>>>>>>> merely does not specify (in this revision) how other
>>>>>>>>>>> representations are to be written.
>>>>>>>>>>>
>>>>>>>>>>> This seems like an easy extension (adding a new type in the
>>>>>>>>>>> "Representations" section).
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Dmitri.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue
>>>>>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Right now, SQL is an explicit requirement of the spec. It
>>>>>>>>>>>> leaves a way for future versions to add different representations
>>>>>>>>>>>> later, but only SQL is supported. That was also the feedback to my
>>>>>>>>>>>> initial skepticism about how it would work to add functions.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov
>>>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I do not think the spec is meant to allow only SQL
>>>>>>>>>>>>> representations, although it is certainly favouring SQL in
>>>>>>>>>>>>> examples... It would be nice to add a non-SQL example, indeed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <
>>>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Coming from PyIceberg, I have concerns as this proposal
>>>>>>>>>>>>>> focuses on SQL-based engines, while Python-based systems often
>>>>>>>>>>>>>> work with data frames. Adding imperative languages like Python
>>>>>>>>>>>>>> would make this proposal more inclusive.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Op do 8 aug 2024 om 10:27 schreef Piotr Findeisen <
>>>>>>>>>>>>>> piotr.findei...@gmail.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Walaa, thanks for asking!
>>>>>>>>>>>>>>> In the design doc linked earlier in this thread [1] I read
>>>>>>>>>>>>>>> "Without a common standard, the UDFs are hard to share among
>>>>>>>>>>>>>>> different engines."
>>>>>>>>>>>>>>> ("Background and Motivation" section).
>>>>>>>>>>>>>>> I agree with this statement. I don't fully understand yet
>>>>>>>>>>>>>>> how the proposed design addresses shareability between the
>>>>>>>>>>>>>>> engines, though. I could use some help to understand this better.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>> Piotr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] SQL User-Defined Function Spec
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <
>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Piotr, what do you mean by making user-created functions
>>>>>>>>>>>>>>>> shareable
>>>>>>>>>>>>>>>> between engines? Do you mean UDFs written in imperative
>>>>>>>>>>>>>>>> code?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>>>>>>>>>>>>>>>> <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Thank you Ajantha for creating this thread. The Iceberg
>>>>>>>>>>>>>>>> UDFs are an interesting idea!
>>>>>>>>>>>>>>>> > Is there a plan to make the user-created functions
>>>>>>>>>>>>>>>> > sharable between the engines?
>>>>>>>>>>>>>>>> > If so, what would a CREATE FUNCTION statement look like in,
>>>>>>>>>>>>>>>> > e.g., Spark or Trino?
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Best
>>>>>>>>>>>>>>>> > Piotr
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue
>>>>>>>>>>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> I just looked through the proposal and added comments. I
>>>>>>>>>>>>>>>> >> think it would be helpful to also have a design doc that
>>>>>>>>>>>>>>>> >> covers the choices from the draft spec. For instance, the
>>>>>>>>>>>>>>>> >> choice to enumerate all possible function input structs
>>>>>>>>>>>>>>>> >> rather than allowing generics and varargs.
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Here’s a quick summary of my feedback:
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> - I think that the choice to enumerate function signatures
>>>>>>>>>>>>>>>> >>   is limiting. It would be nice to see a discussion of the
>>>>>>>>>>>>>>>> >>   trade-offs and a rationale for the choice. I think it
>>>>>>>>>>>>>>>> >>   would also be very helpful to have a few representative
>>>>>>>>>>>>>>>> >>   use cases for this included in the doc. That way the
>>>>>>>>>>>>>>>> >>   proposal can demonstrate that it solves those use cases
>>>>>>>>>>>>>>>> >>   with reasonable trade-offs.
>>>>>>>>>>>>>>>> >> - There are a few instances where this is inconsistent
>>>>>>>>>>>>>>>> >>   with conventions in other specs. For example, using string
>>>>>>>>>>>>>>>> >>   IDs rather than an integer.
>>>>>>>>>>>>>>>> >> - This uses a very different model for spec versioning
>>>>>>>>>>>>>>>> >>   than the Iceberg view and table specs. It requires readers
>>>>>>>>>>>>>>>> >>   to fail if there are any unknown fields, which prevents
>>>>>>>>>>>>>>>> >>   the spec from adding things that are fully
>>>>>>>>>>>>>>>> >>   backward-compatible. Other Iceberg specs only require a
>>>>>>>>>>>>>>>> >>   version change to introduce forward-incompatible changes
>>>>>>>>>>>>>>>> >>   and I think that this should do the same to avoid
>>>>>>>>>>>>>>>> >>   confusion.
>>>>>>>>>>>>>>>> >> - It looks like the intent is to allow multiple function
>>>>>>>>>>>>>>>> >>   signatures per version, but it is unclear how to encode
>>>>>>>>>>>>>>>> >>   them because a version is associated with a single
>>>>>>>>>>>>>>>> >>   function signature.
>>>>>>>>>>>>>>>> >> - There is no review of SQL syntax for creating functions
>>>>>>>>>>>>>>>> >>   across engines, so this doesn’t show that the metadata
>>>>>>>>>>>>>>>> >>   proposed is sufficient for cross-engine use cases.
>>>>>>>>>>>>>>>> >> - The example for a table-valued function shows a SELECT
>>>>>>>>>>>>>>>> >>   statement and it isn’t clear how this is distinct from a
>>>>>>>>>>>>>>>> >>   view.
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <
>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on this.
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> We didn't find any blocker for the spec.
>>>>>>>>>>>>>>>> >>> I will wait for a week and, if there are no more review
>>>>>>>>>>>>>>>> >>> comments, I will raise a PR for the spec addition next week.
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> If anyone else is interested, please have a look at the
>>>>>>>>>>>>>>>> proposal
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> - Ajantha
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> Hi Ajantha,
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> I have left some comments. It is an interesting
>>>>>>>>>>>>>>>> direction, but there might be some details that need to be 
>>>>>>>>>>>>>>>> fine tuned.
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> The doc is here [1] for others who might be
>>>>>>>>>>>>>>>> interested. Resharing since I do not think it was directly 
>>>>>>>>>>>>>>>> linked in the
>>>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> [1]
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> Thanks,
>>>>>>>>>>>>>>>> >>>> Walaa.
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <
>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't get any
>>>>>>>>>>>>>>>> review on the proposal.
>>>>>>>>>>>>>>>> >>>>> Initially proposed on June 4.
>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>> >>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <
>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> Hi everyone,
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> We've only received one review so far (from Benny).
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this.
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <
>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> Hi All,
>>>>>>>>>>>>>>>> >>>>>>> Please find the proposal link
>>>>>>>>>>>>>>>> >>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the proposal.
>>>>>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on it.
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity to take the decisions
>>>>>>>>>>>>>>>> and how we want to implement it.
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin
>>>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant
>>>>>>>>>>>>>>>> scalar/aggregate/table user defined functions. Here are some 
>>>>>>>>>>>>>>>> examples of
>>>>>>>>>>>>>>>> what I meant in (2):
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF:
>>>>>>>>>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>>>>>>>>>>> >>>>>>>> Trino user defined functions:
>>>>>>>>>>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>>>>>>>>>>> >>>>>>>> Flink user defined functions:
>>>>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a variation of
>>>>>>>>>>>>>>>> (1) where the API is data flow/data pipeline API instead of 
>>>>>>>>>>>>>>>> SQL (e.g.,
>>>>>>>>>>>>>>>> Spark Scala). Yes, that is also possible in the very long run 
>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <
>>>>>>>>>>>>>>>> yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in imperative function
>>>>>>>>>>>>>>>> according to a Java/Scala/Python API, etc.
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some long term
>>>>>>>>>>>>>>>> opportunities in this case. Consider you register a Spark temp 
>>>>>>>>>>>>>>>> view as some
>>>>>>>>>>>>>>>> sort of data frame read, then it could still be resolved to a 
>>>>>>>>>>>>>>>> Spark plan
>>>>>>>>>>>>>>>> that is representable by an intermediate representation. But I 
>>>>>>>>>>>>>>>> agree this
>>>>>>>>>>>>>>>> gets very complicated very soon, and just having the case (1) 
>>>>>>>>>>>>>>>> covered would
>>>>>>>>>>>>>>>> already be a huge step forward.
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>> -Jack
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <
>>>>>>>>>>>>>>>> btc...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF
>>>>>>>>>>>>>>>> can be used to build a parameterized view.  So, there's 
>>>>>>>>>>>>>>>> definitely a lot in
>>>>>>>>>>>>>>>> common between UDFs and views.
>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin
>>>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about what is
>>>>>>>>>>>>>>>> perceived as a "UDF". There are 2 flavors:
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by the user
>>>>>>>>>>>>>>>> whose definition is a composition of other built-in 
>>>>>>>>>>>>>>>> functions/SQL
>>>>>>>>>>>>>>>> expressions.
>>>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in imperative function
>>>>>>>>>>>>>>>> according to a Java/Scala/Python API, etc.
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's references are
>>>>>>>>>>>>>>>> pretty much from (1) and I think those have more analogy to 
>>>>>>>>>>>>>>>> views due to
>>>>>>>>>>>>>>>> their SQL nature. Agree (2) is not practical to maintain by 
>>>>>>>>>>>>>>>> Iceberg, but I
>>>>>>>>>>>>>>>> think Ajantha's use cases are around (1), and may be worth 
>>>>>>>>>>>>>>>> evaluating.
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> >>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <
>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post the
>>>>>>>>>>>>>>>> proposal, but I think this would be a very difficult area to 
>>>>>>>>>>>>>>>> tackle across
>>>>>>>>>>>>>>>> engines, languages, and memory models without having a huge 
>>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>> penalty.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL
>>>>>>>>>>>>>>>> representations of UDFs (similar to views as shared by the 
>>>>>>>>>>>>>>>> reference links
>>>>>>>>>>>>>>>> above), the complexity involved will be similar to managing 
>>>>>>>>>>>>>>>> views.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the draft spec
>>>>>>>>>>>>>>>> (inspired by the view spec) this week to facilitate further 
>>>>>>>>>>>>>>>> discussions.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <
>>>>>>>>>>>>>>>> yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have a common
>>>>>>>>>>>>>>>> set of functions across engines, I don't see how that is 
>>>>>>>>>>>>>>>> practical when
>>>>>>>>>>>>>>>> those engines are implemented so differently. Plugging in code 
>>>>>>>>>>>>>>>> -- and
>>>>>>>>>>>>>>>> especially custom user-supplied code -- seems inherently 
>>>>>>>>>>>>>>>> specialized to me
>>>>>>>>>>>>>>>> and should be part of the engines' design.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the views? I feel
>>>>>>>>>>>>>>>> we can say exactly the same thing for Iceberg views, but yet 
>>>>>>>>>>>>>>>> we have
>>>>>>>>>>>>>>>> Iceberg multi-dialect views implemented. Maybe it sounds like 
>>>>>>>>>>>>>>>> we are trying
>>>>>>>>>>>>>>>> to draw a line between SQL vs other programming language as 
>>>>>>>>>>>>>>>> "code"? but I
>>>>>>>>>>>>>>>> think SQL is just another type of code, and we are already 
>>>>>>>>>>>>>>>> talking about
>>>>>>>>>>>>>>>> compiling all these different code dialects to an intermediate
>>>>>>>>>>>>>>>> representation (using projects like Coral, Substrait), which 
>>>>>>>>>>>>>>>> will be stored
>>>>>>>>>>>>>>>> as another type of representation of Iceberg view. I think the 
>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>> functionality can be used for UDFs if developed.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF support is a good
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> idea, even just a multi-dialect one like views. That
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> would allow engines to, for example, parse a view
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> SQL and, when a referenced function cannot be
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> resolved, look for a multi-dialect UDF definition.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we have the
>>>>>>>>>>>>>>>> actual proposal published.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <
>>>>>>>>>>>>>>>> sn...@snazy.de> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and
>>>>>>>>>>>>>>>> "non-centralized" as views are. The same performance concerns 
>>>>>>>>>>>>>>>> apply to
>>>>>>>>>>>>>>>> views as well.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base upon
>>>>>>>>>>>>>>>> which engines can build, so the argument that UDFs aren't 
>>>>>>>>>>>>>>>> practical,
>>>>>>>>>>>>>>>> because engines are different, is probably only a temporary 
>>>>>>>>>>>>>>>> concern.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to
>>>>>>>>>>>>>>>> tackle the idea to make views portable, which is conceptually 
>>>>>>>>>>>>>>>> not that much
>>>>>>>>>>>>>>>> different from portable UDFs.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch
>>>>>>>>>>>>>>>> to the idea of having UDFs in Iceberg, especially not in this 
>>>>>>>>>>>>>>>> early stage.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea
>>>>>>>>>>>>>>>> to add UDFs tracked by Iceberg catalogs. I think that Iceberg 
>>>>>>>>>>>>>>>> primarily
>>>>>>>>>>>>>>>> deals with things that are centralized, like tables of data. 
>>>>>>>>>>>>>>>> While it would
>>>>>>>>>>>>>>>> be great to have a common set of functions across engines, I 
>>>>>>>>>>>>>>>> don't see how
>>>>>>>>>>>>>>>> that is practical when those engines are implemented so 
>>>>>>>>>>>>>>>> differently.
>>>>>>>>>>>>>>>> Plugging in code -- and especially custom user-supplied code 
>>>>>>>>>>>>>>>> -- seems
>>>>>>>>>>>>>>>> inherently specialized to me and should be part of the 
>>>>>>>>>>>>>>>> engines' design.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post the
>>>>>>>>>>>>>>>> proposal, but I think this would be a very difficult area to 
>>>>>>>>>>>>>>>> tackle across
>>>>>>>>>>>>>>>> engines, languages, and memory models without having a huge 
>>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>> penalty.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat
>>>>>>>>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the community
>>>>>>>>>>>>>>>> interest in storing the Versioned SQL UDFs in Iceberg.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec addition for
>>>>>>>>>>>>>>>> storing the versioned UDFs in Iceberg (inspired by view spec).
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views
>>>>>>>>>>>>>>>> in that they are associated with tables, but they can accept 
>>>>>>>>>>>>>>>> arguments and
>>>>>>>>>>>>>>>> produce return values, or even function as inline expressions.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many query engines like Dremio, Trino,
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake, and Databricks Spark support SQL UDFs at
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> the catalog level [1].
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the engines.
>>>>>>>>>>>>>>>> Potentially engines can understand the UDFs written by other 
>>>>>>>>>>>>>>>> engines (with
>>>>>>>>>>>>>>>> the translate layer).
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating this feature
>>>>>>>>>>>>>>>> into Iceberg would be a valuable addition, and we're eager to 
>>>>>>>>>>>>>>>> collaborate
>>>>>>>>>>>>>>>> with the community to develop a UDF specification.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a
>>>>>>>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio -
>>>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino -
>>>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks -
>>>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> --
>>>>>>>>>>>>>>>> >> Ryan Blue
>>>>>>>>>>>>>>>> >> Databricks
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Databricks
>>>>>>>>>>>>
>>>>>>>>>>>
