Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Yufei Gu Fri, 19 Sep 2025 09:42:54 -0700

Hi folks,

I really appreciate the feedback from you all over the past few months.
I've filed the initial PR for the UDF spec:
https://github.com/apache/iceberg/pull/14117. It captures the consensus
we've built and addresses the write amplification concern raised in our
last discussion.


Please take a look and share your thoughts. Happy to discuss it further
during Monday's meeting as well.

Yufei


On Mon, Sep 8, 2025 at 6:33 PM Yufei Gu <[email protected]> wrote:

> Hi folks, thanks for joining today’s UDF sync.
>
> We covered the UDF metadata structure, captured in this doc:
> https://docs.google.com/document/d/1khPKL6zvWjYc5Is8HeVau6sff8FD-jNc2eLKXgit3X8/edit?usp=sharing
> .
>
> We also discussed a way to avoid copying every overload into the new
> metadata JSON when creating a new version. One idea is to introduce a
> global version array. This is not yet reflected in the doc, but I’ll update
> it shortly. Other key points:
>
>    - The latest UDF version will typically be used in most scenarios, but
>    engines retain the flexibility to choose which version to execute.
>    - Keeping the version while referring to a UDF probably isn't a good
>    idea. Users are responsible for updating downstream views if they reference
>    older UDF versions.
>
> You can watch the recording here:
> https://www.youtube.com/watch?v=6ResT-ODelI&ab_channel=ApacheIceberg
>
> Yufei
>
>
> On Mon, Aug 25, 2025 at 6:36 PM Yufei Gu <[email protected]> wrote:
>
>> Hi folks, thanks for attending today’s UDF sync. In general, we discussed
>> the UDF metadata structure, captured in this doc (
>> https://docs.google.com/document/d/1khPKL6zvWjYc5Is8HeVau6sff8FD-jNc2eLKXgit3X8/edit?usp=sharing
>> ). Here is the detailed summary:
>>
>>    1. Each UDF overload has its own return type, e.g., `add(int, int)`
>>    returns `int`, while `add(long, long)` returns `long`.
>>    2. The return type should be explicitly specified; no implicit or
>>    statement-based return type inference should be allowed.
>>    3. Adding explicit properties, such as deterministic and doc, at
>>    the overload level.
>>    4. Adding property “secure” at the top level.
>>    5. Introducing a dedicated signature definitions section to
>>    centralize metadata (Function parameters, Return type, Parameter
>>    descriptions). Each overload would reference a signature definition by ID.
>>    This decoupling allows signature-related updates (like modifying parameter
>>    descriptions) without requiring a new UDF version, similar to how updating
>>    a table schema doesn’t create a new snapshot.
>>    6. Whether to have versioned open properties or not. Versioned
>>    properties can lead to unnecessary copying of a bag of properties into
>>    each version, but they provide a clear history of properties for any
>>    future debugging and understanding of the UDF behavior at a specific
>>    point in time.
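
To make points 1 to 5 above concrete, here is a minimal sketch of how overloads
might reference shared signature definitions by ID. The class and field names
(`Signature`, `Overload`, `Udf`, `signature_id`, `param_docs`) are hypothetical
and chosen for illustration only; they are not taken from the proposal doc.

```python
from dataclasses import dataclass, field


@dataclass
class Signature:
    """One entry in the (hypothetical) signature definitions section."""
    signature_id: int
    param_types: tuple          # Iceberg type names, e.g. ("int", "int")
    return_type: str            # always explicit, never inferred from the body
    param_docs: dict = field(default_factory=dict)


@dataclass
class Overload:
    signature_id: int           # reference into Udf.signatures
    body: str
    deterministic: bool = True  # overload-level property
    doc: str = ""               # overload-level property


@dataclass
class Udf:
    name: str
    secure: bool                # top-level property
    version: int
    signatures: dict            # signature_id -> Signature
    overloads: list

    def return_type_of(self, *arg_types):
        """Look up the declared return type for an exact argument-type match."""
        for overload in self.overloads:
            sig = self.signatures[overload.signature_id]
            if sig.param_types == arg_types:
                return sig.return_type
        raise LookupError(f"no overload {self.name}{arg_types}")


add = Udf(
    name="add",
    secure=False,
    version=1,
    signatures={
        1: Signature(1, ("int", "int"), "int"),
        2: Signature(2, ("long", "long"), "long"),
    },
    overloads=[Overload(1, "x + y"), Overload(2, "x + y")],
)

# Editing a parameter description touches only the signature definition;
# the UDF version stays where it was.
add.signatures[1].param_docs["x"] = "left operand"
```

In this sketch the description update leaves `add.version` untouched, which
mirrors the analogy with table schema updates not creating a new snapshot.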
>>
>> Watch the recording here:
>> https://www.youtube.com/watch?v=p7CvuGZKLSo&list=PLkifVhhWtccwzc3oRWjy5XiYJl0R6kdQL
>>
>> Yufei
>>
>>
>> On Thu, Aug 21, 2025 at 4:18 PM Yufei Gu <[email protected]> wrote:
>>
>>> Hi everyone, here’s the summary from our last sync on 8/11. Apologies
>>> for the delay!
>>>
>>>    - One UDF entity for all overloads
>>>       - We agreed to combine overloads with the same name into a single
>>>       UDF entity, which shares a common metadata.json file.
>>>       - Listing UDFs will return a list of UDF names, not a list of
>>>       individual signatures.
>>>       - Loading a UDF by name will return all of its overloads.
>>>    - Versioning Strategy
>>>       - A global version number will track changes across the entire
>>>       UDF entity; it increments monotonically.
>>>       - Each overload will also maintain its own version (e.g.,
>>>       updated_at_version) to trace changes specific to that overload.
>>>    - For simplicity, the load API will not support argument-based
>>>    filtering in the initial release. It will always return all overloads
>>>    for a given UDF name; overload-level loading is not supported at this
>>>    stage.
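
The agreed behavior can be sketched with a small in-memory catalog. This is an
illustration under stated assumptions, not the proposed API: the class
`InMemoryUdfCatalog` and its method names are hypothetical, and only
`updated_at_version` comes from the summary above.

```python
class InMemoryUdfCatalog:
    """Toy catalog: one entity per UDF name, holding all of its overloads."""

    def __init__(self):
        # name -> {"version": int, "overloads": {arg_types: {...}}}
        self._udfs = {}

    def put_overload(self, name, arg_types, body):
        """Create or replace one overload and bump the global version."""
        udf = self._udfs.setdefault(name, {"version": 0, "overloads": {}})
        udf["version"] += 1  # global version only ever moves forward
        udf["overloads"][tuple(arg_types)] = {
            "body": body,
            "updated_at_version": udf["version"],  # per-overload trace
        }
        return udf["version"]

    def list_udfs(self):
        """Listing returns UDF names, not individual signatures."""
        return sorted(self._udfs)

    def load_udf(self, name):
        """Loading by name returns every overload; no argument filtering."""
        return self._udfs[name]


catalog = InMemoryUdfCatalog()
catalog.put_overload("add", ["int", "int"], "x + y")
catalog.put_overload("add", ["long", "long"], "x + y")
catalog.put_overload("add", ["int", "int"], "y + x")  # replaces the first one
loaded = catalog.load_udf("add")
```

After these three calls the entity is at global version 3, while each overload
records the version at which it last changed.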
>>>
>>> Watch the recording here:
>>> https://drive.google.com/file/d/10G2HjUH2DaKSjGufEOjMu0bBuNd7sCzO/view
>>>
>>> Yufei
>>>
>>>
>>> On Fri, Aug 8, 2025 at 3:11 PM Yufei Gu <[email protected]> wrote:
>>>
>>>> To recap and add my thoughts, we want to support UDFs with multiple
>>>> signatures under the same name, which can serve both overload-aware and
>>>> overload-naive engines.
>>>>
>>>> Per my investigation[1], most engines support overloading by arguments
>>>> and allow implicit conversions like numeric widening (e.g., INT →
>>>> BIGINT/FLOAT). This resolution approach can cause issues such as silent
>>>> behavior changes. Here is an example:
>>>>
>>>>    - Initially, only foo(DOUBLE) exists.
>>>>    - foo(42::INT) widens INT → DOUBLE and runs expected code.
>>>>    - Later, a malicious user creates foo(BIGINT).
>>>>    - Engine’s best-match resolution now binds the same call to the new
>>>>    overload, changing behavior without modifying the query.
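
The example above can be reproduced with a toy best-match resolver. The
widening order `INT -> BIGINT -> DOUBLE` and the "fewest widening steps wins"
rule are simplifying assumptions made for this sketch; real engines use richer
coercion rules, and the function names here are hypothetical.

```python
# Assumed toy widening chain: a type may widen to any type to its right.
WIDENING = ["INT", "BIGINT", "DOUBLE"]


def widening_cost(arg_type, param_type):
    """Number of widening steps from arg_type to param_type, or None."""
    steps = WIDENING.index(param_type) - WIDENING.index(arg_type)
    return steps if steps >= 0 else None


def resolve(overloads, arg_type):
    """Pick the single-parameter overload needing the fewest widening steps."""
    candidates = []
    for param_type in overloads:
        cost = widening_cost(arg_type, param_type)
        if cost is not None:
            candidates.append((cost, param_type))
    if not candidates:
        raise LookupError(f"no overload accepts {arg_type}")
    return min(candidates)[1]


before = resolve(["DOUBLE"], "INT")           # only foo(DOUBLE) exists
after = resolve(["DOUBLE", "BIGINT"], "INT")  # foo(BIGINT) added later
```

The call site is identical in both cases, yet `before` binds to the `DOUBLE`
overload and `after` binds to the `BIGINT` overload.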
>>>>
>>>> To mitigate this issue, we have to choose between these two access
>>>> control models:
>>>>
>>>>    1. Model A – Name-Level ACL: Grants apply to all overloads of a
>>>>    function name.
>>>>    2. Model B – Signature-Level ACL: Grants tied to specific
>>>>    signatures.
>>>>
>>>> The general recommendation is to adopt *Model A.* It trades some
>>>> precision for safety and simplicity, while eliminating the silent behavior
>>>> change problem. More details are in this doc[1].
>>>>
>>>> 1.
>>>> https://docs.google.com/document/d/1E8mR-vInbQ8LDa5Lv3f22i6f8sceHojnEzxEJ6s6cvc/edit?tab=t.0
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Tue, Jul 29, 2025 at 1:07 AM Ajantha Bhat <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks to everyone who joined the sync.
>>>>> Here is the meeting recording:
>>>>> https://drive.google.com/file/d/1L5S6nb-C_pzBwFlClwO_sG1AVBA_ROKo/view
>>>>>
>>>>> Summary:
>>>>> We have discussed how to define function identifiers (should also
>>>>> handle function overloading). Ryan suggested that we should check how
>>>>> Spark does it. We can refer to functions using an identifier and then
>>>>> bind the different signatures to it, so that access policies can be
>>>>> applied per identifier. This is also linked to how we want to version
>>>>> the functions when overloading is supported.
>>>>>
>>>>> I will check more about this and update the proposal doc.
>>>>>
>>>>> Please check/subscribe to the dev events calendar for the next
>>>>> meeting link (Aug 11).
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Sun, Jul 27, 2025 at 10:46 PM Kevin Liu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Ajantha,
>>>>>>
>>>>>> I see that the UDF Sync is scheduled in the "Iceberg Dev Events"
>>>>>> calendar for tomorrow 7/28 at 9AM PT. I missed the last one, but
>>>>>> I'll be at this one.
>>>>>>
>>>>>> Best,
>>>>>> Kevin Liu
>>>>>>
>>>>>> On Mon, Jul 14, 2025 at 9:22 AM Ajantha Bhat <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> No one joined the sync today. I came to know that Yufei is on
>>>>>>> holiday, and Ryan and others couldn't make it, similar to the last
>>>>>>> sync. It seems Yufei might have forgotten to transfer meeting ownership
>>>>>>> as well, as new members needed admin approval and couldn't join
>>>>>>> automatically this week. Also, I can understand it is summer holiday
>>>>>>> season for many.
>>>>>>>
>>>>>>> I've updated the function signature schema and other open points. I
>>>>>>> believe we're very close to the final version of the spec. A meeting is
>>>>>>> indeed necessary to finalize this, but we don't have to wait for it to
>>>>>>> finish the review process. We had many meetings on this in the past
>>>>>>> already. So, please review the document at your earliest convenience.
>>>>>>> If we agree on the spec by next week, I can raise a PR.
>>>>>>>
>>>>>>> - Ajantha
>>>>>>>
>>>>>>> On Thu, Jul 3, 2025 at 4:03 AM Yufei Gu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I’d propose to move the field `properties` from a top level field
>>>>>>>> to a field inside “version” along with a representation, so that
>>>>>>>> properties are versioned. A property like “deterministic” could change
>>>>>>>> along with the representation over time. For example, we need to change
>>>>>>>> “deterministic” from true to false when adding a non-deterministic SQL
>>>>>>>> expression/function (e.g., now()) inside a UDF. Otherwise, rollback
>>>>>>>> won't be safe.
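
A minimal sketch of why versioned properties keep rollback safe. The field
names (`current-version-id`, `versions`, `version-id`, `representation`,
`properties`) and the SQL bodies are hypothetical, picked for illustration
rather than taken from the spec.

```python
udf = {
    "current-version-id": 2,
    "versions": [
        {
            "version-id": 1,
            "representation": "x + 1",
            "properties": {"deterministic": "true"},
        },
        {
            # now() makes this body non-deterministic, so the flag flips too.
            "version-id": 2,
            "representation": "x + extract(second from now())",
            "properties": {"deterministic": "false"},
        },
    ],
}


def current_version(udf):
    """Return the version entry that current-version-id points at."""
    for version in udf["versions"]:
        if version["version-id"] == udf["current-version-id"]:
            return version
    raise LookupError("current version not found")


def rollback(udf, version_id):
    """Return a copy of the metadata pointing at an earlier version."""
    if all(v["version-id"] != version_id for v in udf["versions"]):
        raise LookupError(f"unknown version {version_id}")
    return {**udf, "current-version-id": version_id}


rolled_back = rollback(udf, 1)
```

Because the flag travels with the representation, rolling back to version 1
also restores `deterministic = true`; a single top-level property would have
kept the stale `false` value.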
>>>>>>>>
>>>>>>>> That said, it's still an open question whether we need any
>>>>>>>> non-versioned properties. We can introduce them later if a use case
>>>>>>>> arises.
>>>>>>>>
>>>>>>>> Yufei
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 2, 2025 at 3:06 PM Yufei Gu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the summary, Ajantha!
>>>>>>>>>
>>>>>>>>> I’d prefer to keep the signature list separate from the
>>>>>>>>> representation history. Here are the reasons:
>>>>>>>>>
>>>>>>>>>    1. Each version still enforces a single signature. Although
>>>>>>>>>    the signatures array is global to the UDF, each version references
>>>>>>>>>    just one signature ID. Rollbacks to historical versions remain safe.
>>>>>>>>>    2. We’ve separated the less frequently changing component
>>>>>>>>>    (signatures) from the more dynamic one (representations) to reduce
>>>>>>>>>    metadata file size.
>>>>>>>>>    3. Since signatures use Iceberg data types, they should remain
>>>>>>>>>    unaffected by multi-dialect representation differences.
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 30, 2025 at 11:28 AM Ajantha Bhat <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>> https://drive.google.com/file/d/1FcOSbHo9ZIVeZXdUlmoG42o-chB7Q15P/view?usp=sharing
>>>>>>>>>>
>>>>>>>>>> Summary:
>>>>>>>>>> We have discussed the action items from the last sync (*see
>>>>>>>>>> Appendix C* in the proposal doc)
>>>>>>>>>>
>>>>>>>>>>    - Function overloading: Supported by a few of the engines and
>>>>>>>>>>    on the roadmaps of many engines. Iceberg will support it. We will
>>>>>>>>>>    maintain the `FunctionIdentifier` (extends `TableIdentifier` but
>>>>>>>>>>    also has a member containing the function argument's type list).
>>>>>>>>>>    All operations like load, rename, list, create and drop are based
>>>>>>>>>>    on `FunctionIdentifier`.
>>>>>>>>>>    - Secure UDF: If we store it as a property in a bag, we need
>>>>>>>>>>    to standardize the property name. Iceberg encryption may be
>>>>>>>>>>    orthogonal to this discussion.
>>>>>>>>>>    - UDFs with multi-statement and procedural bodies are
>>>>>>>>>>    supported by some engines. Iceberg will support them. The body is
>>>>>>>>>>    stored as is when the engine creates the function.
>>>>>>>>>>
>>>>>>>>>> New discussions around:
>>>>>>>>>>
>>>>>>>>>>    - Standardizing the property names (deterministic, secure).
>>>>>>>>>>    - About the rename function.
>>>>>>>>>>    - Replace function. To check up to what level replace is
>>>>>>>>>>    supported (considering function overloading).
>>>>>>>>>>    - Should the signature be associated with the representation?
>>>>>>>>>>
>>>>>>>>>> I think we are close on the spec. Please review the proposal
>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>>>>>>>>
>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>
>>>>>>>>>> *Monday, July 14 · 9:00 – 10:00am*
>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>> Google Meet joining info
>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 30, 2025 at 9:27 PM Ajantha Bhat <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Can it be handled by Iceberg encryption? If the whole metadata
>>>>>>>>>>> is encrypted, we don't have to worry about just hiding the UDF
>>>>>>>>>>> body? Let us discuss more on the sync today.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 30, 2025 at 9:22 PM Yufei Gu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, hiding the definition and disabling pushdown are
>>>>>>>>>>>> required. We will need a named key (e.g., secure) somewhere, no
>>>>>>>>>>>> matter if it is a top level property or a key as a part of the UDF
>>>>>>>>>>>> properties, so that both UDF creator and consumer can recognize it.
>>>>>>>>>>>>
>>>>>>>>>>>> Yufei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 26, 2025 at 4:27 PM Ryan Blue <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the extra detail. What do you think the spec would
>>>>>>>>>>>>> require? Would it require hiding the UDF definition from users
>>>>>>>>>>>>> and require specific pushdown cases be disabled? The use cases seem
>>>>>>>>>>>>> valid, but I'm trying to understand the requirements this places on
>>>>>>>>>>>>> engines and why it needs to be part of the spec, rather than part of
>>>>>>>>>>>>> the properties of the UDF.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 20, 2025 at 3:56 PM Yufei Gu <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are the main use cases for secure UDFs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. Hiding UDF Definitions: This includes concealing the UDF
>>>>>>>>>>>>>>    body and details like the list of imports, some of which
>>>>>>>>>>>>>>    aren’t applicable to SQL UDFs.
>>>>>>>>>>>>>>    2. Sandboxed Execution: Ensuring the UDF runs in an isolated
>>>>>>>>>>>>>>    environment. Again, this typically doesn’t apply to SQL UDFs.
>>>>>>>>>>>>>>    3. Preventing Data Leakage at Execution Time: For example,
>>>>>>>>>>>>>>    secure UDFs may disable certain optimizations, such as
>>>>>>>>>>>>>>    predicate pushdown, to avoid exposing sensitive data
>>>>>>>>>>>>>>    indirectly. [1]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given these scenarios, I agree with your point that the
>>>>>>>>>>>>>> secure flag is primarily an instruction to the engine to
>>>>>>>>>>>>>> behave differently. While it's largely an engine-side behavior,
>>>>>>>>>>>>>> we still need to include this flag in the UDF definition to
>>>>>>>>>>>>>> indicate whether a UDF is secure, especially considering the perf
>>>>>>>>>>>>>> penalty introduced by scenario #3. We should clearly recommend that
>>>>>>>>>>>>>> users avoid marking UDFs as secure unless it's truly necessary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown
>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yufei, could you make the argument for supporting a "secure"
>>>>>>>>>>>>>>> UDF? What use case are you addressing and what specifically
>>>>>>>>>>>>>>> changes about how the UDF is handled? If the idea is to hide the
>>>>>>>>>>>>>>> UDF definition, do we need to include it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this would be a signal to a "trusted engine". When
>>>>>>>>>>>>>>> the engine interacts with the catalog it sends authorization
>>>>>>>>>>>>>>> information about itself in addition to the user that it is acting
>>>>>>>>>>>>>>> on behalf of. That way the catalog knows that the secure UDF can be
>>>>>>>>>>>>>>> sent to the engine and won't be shown to the user. The majority of
>>>>>>>>>>>>>>> this logic is on the REST server side, and the only part that is
>>>>>>>>>>>>>>> communicated to the client is the request not to show the UDF to
>>>>>>>>>>>>>>> the user, right? In that case should this be a property rather than
>>>>>>>>>>>>>>> part of the definition? Even if we state that the client "must"
>>>>>>>>>>>>>>> suppress the UDF definition, it's really just a request. Only
>>>>>>>>>>>>>>> trusted engines can be passed the UDF definition, so a spec
>>>>>>>>>>>>>>> requirement to suppress the definition isn't very meaningful.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the summary, Ajantha!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Multi-statement UDFs are definitely useful, but whether
>>>>>>>>>>>>>>>> those statements run within a single transaction should be
>>>>>>>>>>>>>>>> treated as an engine-level concern. The Iceberg UDF spec can spell
>>>>>>>>>>>>>>>> out the expectation, yet the actual guarantee still depends on the
>>>>>>>>>>>>>>>> runtime. Even if a UDF declares itself transactional, the engine
>>>>>>>>>>>>>>>> may or may not enforce it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One more thing: should we also introduce a “secure UDF”
>>>>>>>>>>>>>>>> option supported by some engines[1], so the body and any
>>>>>>>>>>>>>>>> sensitive details stay hidden from callers?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/secure-udf-procedure
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>>>>>>>>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing
>>>>>>>>>>>>>>>>> Summary:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - We have gone through the SQL UDF syntax supported by
>>>>>>>>>>>>>>>>>    different engines (Snowflake, Databricks, Dremio, Trino,
>>>>>>>>>>>>>>>>>    OSS Spark 4.0).
>>>>>>>>>>>>>>>>>    - Each engine uses its own block separator, like $$ or
>>>>>>>>>>>>>>>>>    '' or none. Action item was to check whether engines
>>>>>>>>>>>>>>>>>    support multi-statement (transactional) UDF bodies.
>>>>>>>>>>>>>>>>>    - Discussed function overloading. Need to check
>>>>>>>>>>>>>>>>>    whether these engines support function overloading for SQL
>>>>>>>>>>>>>>>>>    UDFs. Postgres supports it! If yes, need to adapt the spec to
>>>>>>>>>>>>>>>>>    handle it.
>>>>>>>>>>>>>>>>>    - Started online spec review and discussed the
>>>>>>>>>>>>>>>>>    deterministic flag and concluded that we keep the
>>>>>>>>>>>>>>>>>    independent fields (like deterministic) in the spec only if
>>>>>>>>>>>>>>>>>    the majority of engines support them. Otherwise they will be
>>>>>>>>>>>>>>>>>    passed in a property bag (engine specific), and it is the
>>>>>>>>>>>>>>>>>    engine's responsibility to honor those optional properties.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Feel free to review the current proposal document here
>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Final spec will be put to review and vote once it is
>>>>>>>>>>>>>>>>> ready.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Monday, June 30 · 9:00 – 10:00am*
>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Summary:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - We discussed including Python support; the majority
>>>>>>>>>>>>>>>>>>    agreed *not to* (see recording for details).
>>>>>>>>>>>>>>>>>>    - No strong opposition to versioning; it will be
>>>>>>>>>>>>>>>>>>    included to support change tracking and similar use cases.
>>>>>>>>>>>>>>>>>>    - Suggestions were made to document how each catalog
>>>>>>>>>>>>>>>>>>    resolves UDFs, similar to views and tables.
>>>>>>>>>>>>>>>>>>    - We agreed not to deviate from the existing table/view
>>>>>>>>>>>>>>>>>>    spec, e.g., location will remain *required* for
>>>>>>>>>>>>>>>>>>    cross-catalog compatibility.
>>>>>>>>>>>>>>>>>>    - We also discussed a bit about view interoperability
>>>>>>>>>>>>>>>>>>    as the same things are applicable here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Feel free to review the proposal document here
>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0>.
>>>>>>>>>>>>>>>>>> With the current scope, it is similar to the view/table spec now.
>>>>>>>>>>>>>>>>>> Final spec will be put to review and vote once it is ready.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Monday, June 16 · 9:00 – 10:00am*
>>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We’ve set up a dedicated bi-weekly community sync for
>>>>>>>>>>>>>>>>>>> the UDF project. Everyone’s welcome to drop in and share
>>>>>>>>>>>>>>>>>>> ideas! Here is the meeting link:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Iceberg UDF sync
>>>>>>>>>>>>>>>>>>> Monday, June 2 · 9:00 – 10:00am
>>>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Update on the progress.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I had a meeting today with Yufei and Yun.zou to discuss
>>>>>>>>>>>>>>>>>>>> the UDF proposal. We covered several key points, though
>>>>>>>>>>>>>>>>>>>> some are still open for further discussion:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> a) *UDF Versioning*: Do we truly need versioning for
>>>>>>>>>>>>>>>>>>>> UDFs at this stage? We explored the possibility of
>>>>>>>>>>>>>>>>>>>> simplifying the specification by avoiding view replication,
>>>>>>>>>>>>>>>>>>>> and potentially introducing versioning support later. UDTFs,
>>>>>>>>>>>>>>>>>>>> being a superset of views in some ways, may not require
>>>>>>>>>>>>>>>>>>>> versioning initially.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> b) *VarArgs Support*: While some query engines may not
>>>>>>>>>>>>>>>>>>>> support vararg syntax in CREATE FUNCTION, Iceberg UDFs
>>>>>>>>>>>>>>>>>>>> could represent such arguments as lists when supported by
>>>>>>>>>>>>>>>>>>>> the engine.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t
>>>>>>>>>>>>>>>>>>>> support generic types (e.g., object), we can only map
>>>>>>>>>>>>>>>>>>>> engine-specific types to Iceberg types. As a result,
>>>>>>>>>>>>>>>>>>>> generic data types will not be supported in the initial version.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> d) *Python Support*: Incorporating Python as a
>>>>>>>>>>>>>>>>>>>> language for SQL UDFs seems promising, especially given
>>>>>>>>>>>>>>>>>>>> its potential to resolve interoperability challenges. Some
>>>>>>>>>>>>>>>>>>>> engines, however, require platform version and package
>>>>>>>>>>>>>>>>>>>> dependency details to execute Python code; this should be
>>>>>>>>>>>>>>>>>>>> captured in the specification.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *Next Steps*
>>>>>>>>>>>>>>>>>>>> I will update the proposal document with two primary
>>>>>>>>>>>>>>>>>>>> UDF use cases:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    - Policy exchange between engines
>>>>>>>>>>>>>>>>>>>>    - UDTF as a superset of view functionality
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The update will include corresponding syntax examples
>>>>>>>>>>>>>>>>>>>> in both SQL and Python, and detail how each use case is
>>>>>>>>>>>>>>>>>>>> represented in Iceberg metadata.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We also plan to set up regular syncs (open to more
>>>>>>>>>>>>>>>>>>>> interested participants) to continue refining and
>>>>>>>>>>>>>>>>>>>> finalizing the UDF specification.
>>>>>>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I've updated the design document[1] based on the
>>>>>>>>>>>>>>>>>>>>> previous comments. Additionally, I've included the SQL
>>>>>>>>>>>>>>>>>>>>> UDF syntax supported by various vendors, including Dremio,
>>>>>>>>>>>>>>>>>>>>> Snowflake, Databricks, and Trino.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm happy to schedule a separate sync if a deeper
>>>>>>>>>>>>>>>>>>>>> discussion is needed. Let's keep moving forward,
>>>>>>>>>>>>>>>>>>>>> especially with the renewed interest from the community.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> During the last catalog community sync, there was
>>>>>>>>>>>>>>>>>>>>>> significant interest in storing UDFs in Iceberg and
>>>>>>>>>>>>>>>>>>>>>> adding endpoints for UDF handling in the REST catalog spec.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I recently discussed this with Yufei to better
>>>>>>>>>>>>>>>>>>>>>> understand the new requirement of using UDFs for
>>>>>>>>>>>>>>>>>>>>>> fine-grained access control policies. This expands the use
>>>>>>>>>>>>>>>>>>>>>> cases beyond just versioned and interoperable UDFs.
>>>>>>>>>>>>>>>>>>>>>> Additionally, I learnt that many vendors are interested
>>>>>>>>>>>>>>>>>>>>>> in this feature.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Given the strong community interest and support, I’d
>>>>>>>>>>>>>>>>>>>>>> like to take ownership of this effort and revive the
>>>>>>>>>>>>>>>>>>>>>> work. I'll be revisiting the document I proposed a while
>>>>>>>>>>>>>>>>>>>>>> back and will share an updated proposal by next week.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Looking forward to storing UDFs in Iceberg!
>>>>>>>>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov
>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The UDF spec does not require representations to be
>>>>>>>>>>>>>>>>>>>>>>> SQL. It merely does not specify (in this revision) how
>>>>>>>>>>>>>>>>>>>>>>> other representations are to be written.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> This seems like an easy extension (adding a new type
>>>>>>>>>>>>>>>>>>>>>>> in the "Representations" section).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue
>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Right now, SQL is an explicit requirement of the
>>>>>>>>>>>>>>>>>>>>>>>> spec. It leaves a way for future versions to add
>>>>>>>>>>>>>>>>>>>>>>>> different representations later, but only SQL is supported.
>>>>>>>>>>>>>>>>>>>>>>>> That was also the feedback to my initial skepticism about
>>>>>>>>>>>>>>>>>>>>>>>> how it would work to add functions.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov
>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I do not think the spec is meant to allow only SQL
>>>>>>>>>>>>>>>>>>>>>>>>> representations, although it is certainly favouring
>>>>>>>>>>>>>>>>>>>>>>>>> SQL in examples... It would be nice to add a non-SQL
>>>>>>>>>>>>>>>>>>>>>>>>> example, indeed.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <
>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Coming from PyIceberg, I have concerns as this
>>>>>>>>>>>>>>>>>>>>>>>>>> proposal focuses on SQL-based engines, while
>>>>>>>>>>>>>>>>>>>>>>>>>> Python-based systems often work with data frames. Adding
>>>>>>>>>>>>>>>>>>>>>>>>>> imperative languages like Python would make this proposal
>>>>>>>>>>>>>>>>>>>>>>>>>> more inclusive.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 8 Aug 2024 at 10:27, Piotr Findeisen
>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Walaa, thanks for asking!
>>>>>>>>>>>>>>>>>>>>>>>>>>> In the design doc linked before  in this thread
>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] i read
>>>>>>>>>>>>>>>>>>>>>>>>>>> "Without a common standard, the UDFs are hard to
>>>>>>>>>>>>>>>>>>>>>>>>>>> share among different engines."
>>>>>>>>>>>>>>>>>>>>>>>>>>> ("Background and Motivation" section).
>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with this statement. I don't fully
>>>>>>>>>>>>>>>>>>>>>>>>>>> understand yet how the proposed design addresses 
>>>>>>>>>>>>>>>>>>>>>>>>>>> shareability between the
>>>>>>>>>>>>>>>>>>>>>>>>>>> engines though.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I could use some help to understand this better.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotr
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] SQL User-Defined Function Spec
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin
>>>>>>>>>>>>>>>>>>>>>>>>>>> Moustafa <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotr, what do you mean by making user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>> functions shareable
>>>>>>>>>>>>>>>>>>>>>>>>>>>> between engines? Do you mean UDFs written in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative code?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Thank you Ajantha for creating this thread.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The Iceberg UDFs are an interesting idea!
>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Is there a plan to make the user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>> functions sharable between the engines?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> > If so, what would a CREATE FUNCTION statement
>>>>>>>>>>>>>>>>>>>>>>>>>>>> look like in, e.g., Spark or Trino?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Best
>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Piotr
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> I just looked through the proposal and added
>>>>>>>>>>>>>>>>>>>>>>>>>>>> comments. I think it would be helpful to also have 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a design doc that covers
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the choices from the draft spec. For instance, the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> choice to enumerate all
>>>>>>>>>>>>>>>>>>>>>>>>>>>> possible function input structs rather than
>>>>>>>>>>>>>>>>>>>>>>>>>>>> allowing generics and varargs.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Here’s a quick summary of my feedback:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> I think that the choice to enumerate
>>>>>>>>>>>>>>>>>>>>>>>>>>>> function signatures is limiting. It would be nice 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> to see a discussion of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the trade-offs and a rationale for the choice. I 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> think it would also be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> very helpful to have a few representative use 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> cases for this included in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the doc. That way the proposal can demonstrate 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> that it solves those use
>>>>>>>>>>>>>>>>>>>>>>>>>>>> cases with reasonable trade-offs.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> There are a few instances where this is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> inconsistent with conventions in other specs. For 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> example, using string IDs
>>>>>>>>>>>>>>>>>>>>>>>>>>>> rather than an integer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> This uses a very different model for spec
>>>>>>>>>>>>>>>>>>>>>>>>>>>> versioning than the Iceberg view and table specs. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> It requires readers to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> fail if there are any unknown fields, which 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> prevents the spec from adding
>>>>>>>>>>>>>>>>>>>>>>>>>>>> things that are fully backward-compatible. Other 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg specs only require
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a version change to introduce forward-incompatible 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> changes and I think that
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this should do the same to avoid confusion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> It looks like the intent is to allow
>>>>>>>>>>>>>>>>>>>>>>>>>>>> multiple function signatures per version, but it
>>>>>>>>>>>>>>>>>>>>>>>>>>>> is unclear how to encode
>>>>>>>>>>>>>>>>>>>>>>>>>>>> them because a version is associated with a single 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> function signature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> There is no review of SQL syntax for
>>>>>>>>>>>>>>>>>>>>>>>>>>>> creating functions across engines, so this doesn’t 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> show that the metadata
>>>>>>>>>>>>>>>>>>>>>>>>>>>> proposed is sufficient for cross-engine use cases.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> The example for a table-valued function
>>>>>>>>>>>>>>>>>>>>>>>>>>>> shows a SELECT statement and it isn’t clear how 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this is distinct from a view.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> We didn't find any blocker for the spec.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> I will wait for a week and, if there are no
>>>>>>>>>>>>>>>>>>>>>>>>>>>> more review comments, I will raise a PR for the spec
>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition next week.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> If anyone else is interested, please have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> look at the proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> - Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Moustafa <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Hi Ajantha,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> I have left some comments. It is an
>>>>>>>>>>>>>>>>>>>>>>>>>>>> interesting direction, but there might be some 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> details that need to be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> fine-tuned.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> The doc is here [1] for others who might
>>>>>>>>>>>>>>>>>>>>>>>>>>>> be interested. Resharing since I do not think it 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> was directly linked in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Walaa.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't
>>>>>>>>>>>>>>>>>>>>>>>>>>>> get any review on the proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> Initially proposed on June 4.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> We've only received one review so far
>>>>>>>>>>>>>>>>>>>>>>>>>>>> (from Benny).
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Please find the proposal link
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> And thanks to Stephen Lin for working on
>>>>>>>>>>>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity for deciding
>>>>>>>>>>>>>>>>>>>>>>>>>>>> how we want to implement it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eldin Moustafa <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant
>>>>>>>>>>>>>>>>>>>>>>>>>>>> scalar/aggregate/table user defined functions. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here are some examples of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> what I meant in (2):
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Trino user defined functions:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Flink user defined functions:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> variation of (1) where the API is data flow/data 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> pipeline API instead of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> SQL (e.g., Spark Scala). Yes, that is also 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> possible in the very long run :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ye <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative function according to a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Java/Scala/Python API, etc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some
>>>>>>>>>>>>>>>>>>>>>>>>>>>> long term opportunities in this case. Consider you 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> register a Spark temp
>>>>>>>>>>>>>>>>>>>>>>>>>>>> view as some sort of data frame read, then it 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> could still be resolved to a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Spark plan that is representable by an 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate representation. But I
>>>>>>>>>>>>>>>>>>>>>>>>>>>> agree this gets very complicated very soon, and 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> just having the case (1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> covered would already be a huge step forward.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Chow <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> tabular SQL UDF can be used to build a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> parameterized view.  So, there's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> definitely a lot in common between UDFs and views.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Walaa Eldin Moustafa <[email protected]>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about
>>>>>>>>>>>>>>>>>>>>>>>>>>>> what is perceived as a "UDF". There are 2 flavors:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the user whose definition is a composition of 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> other built-in functions/SQL
>>>>>>>>>>>>>>>>>>>>>>>>>>>> expressions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative function according to a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Java/Scala/Python API, etc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> references are pretty much from (1) and I think 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> those have more analogy to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> views due to their SQL nature. Agree (2) is not 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> practical to maintain by
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg, but I think Ajantha's use cases are 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> around (1), and may be worth
>>>>>>>>>>>>>>>>>>>>>>>>>>>> evaluating.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you
>>>>>>>>>>>>>>>>>>>>>>>>>>>> post the proposal, but I think this would be a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> very difficult area to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> tackle across engines, languages, and memory 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> models without having a huge
>>>>>>>>>>>>>>>>>>>>>>>>>>>> performance penalty.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially
>>>>>>>>>>>>>>>>>>>>>>>>>>>> supports SQL representations of UDFs (similar to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> views as shared by the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> reference links above), the complexity involved 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be similar to managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>> views.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for your input.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> draft spec (inspired by the view spec) this week 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> to facilitate further
>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jack Ye <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a common set of functions across engines, I don't 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> see how that is practical
>>>>>>>>>>>>>>>>>>>>>>>>>>>> when those engines are implemented so differently. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Plugging in code -- and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> especially custom user-supplied code -- seems 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> inherently specialized to me
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and should be part of the engines' design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> views? I feel we can say exactly the same thing 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for Iceberg views, but yet
>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have Iceberg multi-dialect views implemented. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe it sounds like we
>>>>>>>>>>>>>>>>>>>>>>>>>>>> are trying to draw a line between SQL vs other 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> programming language as
>>>>>>>>>>>>>>>>>>>>>>>>>>>> "code"? but I think SQL is just another type of 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> code, and we are already
>>>>>>>>>>>>>>>>>>>>>>>>>>>> talking about compiling all these different code 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> dialects to an
>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate representation (using projects like 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Coral, Substrait), which
>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be stored as another type of representation 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> of Iceberg view. I think
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the same functionality can be used for UDFs if 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> developed.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF
>>>>>>>>>>>>>>>>>>>>>>>>>>>> support is a good idea, even just a multi-dialect 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> one like view, and that
>>>>>>>>>>>>>>>>>>>>>>>>>>>> can allow engines to for example parse a view SQL, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and when a function
>>>>>>>>>>>>>>>>>>>>>>>>>>>> referenced cannot be resolved, try to look for a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> multi-dialect UDF
>>>>>>>>>>>>>>>>>>>>>>>>>>>> definition.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when
>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have the actual proposal published.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Robert Stupp <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> portable and "non-centralized" as views are. The 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> same performance concerns
>>>>>>>>>>>>>>>>>>>>>>>>>>>> apply to views as well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common
>>>>>>>>>>>>>>>>>>>>>>>>>>>> base upon which engines can build, so the argument 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> that UDFs aren't
>>>>>>>>>>>>>>>>>>>>>>>>>>>> practical, because engines are different, is 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> probably only a temporary
>>>>>>>>>>>>>>>>>>>>>>>>>>>> concern.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should
>>>>>>>>>>>>>>>>>>>>>>>>>>>> also try to tackle the idea to make views 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> portable, which is conceptually
>>>>>>>>>>>>>>>>>>>>>>>>>>>> not that much different from portable UDFs.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> negative touch to the idea of having UDFs in 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg, especially not in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this early stage.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a good idea to add UDFs tracked by Iceberg 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> catalogs. I think that Iceberg
>>>>>>>>>>>>>>>>>>>>>>>>>>>> primarily deals with things that are centralized, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> like tables of data.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> While it would be great to have a common set of 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> functions across engines, I
>>>>>>>>>>>>>>>>>>>>>>>>>>>> don't see how that is practical when those engines 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> are implemented so
>>>>>>>>>>>>>>>>>>>>>>>>>>>> differently. Plugging in code -- and especially 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> custom user-supplied code
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- seems inherently specialized to me and should 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> be part of the engines'
>>>>>>>>>>>>>>>>>>>>>>>>>>>> design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you
>>>>>>>>>>>>>>>>>>>>>>>>>>>> post the proposal, but I think this would be a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> very difficult area to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> tackle across engines, languages, and memory 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> models without having a huge
>>>>>>>>>>>>>>>>>>>>>>>>>>>> performance penalty.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the community interest in storing the Versioned 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> SQL UDFs in Iceberg.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec
>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition for storing the versioned UDFs in Iceberg 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> (inspired by view spec).
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate
>>>>>>>>>>>>>>>>>>>>>>>>>>>> similarly to views in that they are associated 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> with tables, but they can
>>>>>>>>>>>>>>>>>>>>>>>>>>>> accept arguments and produce return values, or 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> even function as inline
>>>>>>>>>>>>>>>>>>>>>>>>>>>> expressions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many query engines, such as Dremio,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Trino, Snowflake, and Databricks Spark, support SQL
>>>>>>>>>>>>>>>>>>>>>>>>>>>> UDFs at the catalog level [1].
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can
>>>>>>>>>>>>>>>>>>>>>>>>>>>> enable
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> engines. Potentially engines can understand the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> UDFs written by other
>>>>>>>>>>>>>>>>>>>>>>>>>>>> engines (with a translation layer).
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this feature into Iceberg would be a valuable 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition, and we're eager to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> collaborate with the community to develop a UDF 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> specification.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun
>>>>>>>>>>>>>>>>>>>>>>>>>>>> drafting a specification to propose to the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> community.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio -
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino -
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks -
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Ryan Blue
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Databricks
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>>>>>>> Databricks
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
