Hi folks,

Thanks a lot for joining today's UDF sync. Here is the summary:
1. Instead of relying on dynamic inference, the return table’s schema for user-defined table functions (UDTFs) should be explicitly defined.
2. Whether a function is a UDTF should be captured as a dedicated attribute, rather than being inferred indirectly from the return type.
3. The interpretation of a UDF body (whether it is treated as a partial SQL expression or as a full SELECT statement) should be determined by engines. Example: `SELECT x + 1` vs. `x + 1`. Different engines have different takes.
4. User-defined aggregation functions (UDAFs) are out of scope for now.
5. Each overload should include its own current-version field. This avoids relying solely on the global `definition-versions` when querying the current version of one overload.

You can watch the recording here: https://www.youtube.com/watch?v=9t2xev8WfAw

I will update the PR (https://github.com/apache/iceberg/pull/14117) shortly.

Yufei

On Fri, Sep 19, 2025 at 9:42 AM Yufei Gu <[email protected]> wrote:

> Hi folks,
>
> I really appreciate the feedback from you all over the past few months. I've
> filed the initial PR for the UDF spec:
> https://github.com/apache/iceberg/pull/14117. It captures the consensus
> we've built and addresses the write amplification concern raised in our
> last discussion.
>
> Please take a look and share your thoughts. Happy to discuss it further
> during Monday's meeting as well.
>
> Yufei
>
>
> On Mon, Sep 8, 2025 at 6:33 PM Yufei Gu <[email protected]> wrote:
>
>> Hi folks, thanks for joining today’s UDF sync.
>>
>> We covered the UDF metadata structure, captured in this doc:
>> https://docs.google.com/document/d/1khPKL6zvWjYc5Is8HeVau6sff8FD-jNc2eLKXgit3X8/edit?usp=sharing
>> .
>>
>> We also discussed a way to avoid copying every overload into the new
>> metadata JSON when creating a new version. One idea is to introduce a
>> global version array. This is not yet reflected in the doc, but I’ll update
>> it shortly.
>> Other key points:
>>
>> - The latest UDF version will typically be used in most scenarios,
>> but engines retain the flexibility to choose which version to execute.
>> - Keeping the version while referring to a UDF probably isn't a good
>> idea. Users are responsible for updating downstream views if they
>> reference older UDF versions.
>>
>> You can watch the recording here:
>> https://www.youtube.com/watch?v=6ResT-ODelI&ab_channel=ApacheIceberg
>>
>> Yufei
>>
>>
>> On Mon, Aug 25, 2025 at 6:36 PM Yufei Gu <[email protected]> wrote:
>>
>>> Hi folks, thanks for attending today’s UDF sync. In general, we
>>> discussed the UDF metadata structure, captured in this doc (
>>> https://docs.google.com/document/d/1khPKL6zvWjYc5Is8HeVau6sff8FD-jNc2eLKXgit3X8/edit?usp=sharing
>>> ). Here is the detailed summary:
>>>
>>> 1. Each UDF overload has its own return type, e.g., `add(int, int)`
>>> returns `int`, while `add(long, long)` returns `long`.
>>> 2. The return type should be explicitly specified; no implicit or
>>> statement-based return type inference should be allowed.
>>> 3. Adding explicit properties like deterministic and doc at
>>> the overload level.
>>> 4. Adding the property “secure” at the top level.
>>> 5. Introducing a dedicated signature definitions section to
>>> centralize metadata (function parameters, return type, parameter
>>> descriptions). Each overload would reference a signature definition by
>>> ID. This decoupling allows signature-related updates (like modifying
>>> parameter descriptions) without requiring a new UDF version, similar to
>>> how updating a table schema doesn’t create a new snapshot.
>>> 6. Whether to have versioned open properties or not. Versioned
>>> properties can lead to unnecessary copying of a bag of properties into
>>> each version, but they provide a clear history of properties for any
>>> future debugging and understanding of the UDF behavior at a specific
>>> point in time.
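Items 1, 4 and 5 of the list above can be sketched roughly as follows. This is only an illustration in Python: the class and field names (`SignatureDefinition`, `Overload`, `UdfMetadata`, `parameter_docs`) are hypothetical and are not taken from the doc or the spec. It shows how an overload might reference a signature definition by ID, so that a parameter description can change without producing a new UDF version.

```python
from dataclasses import dataclass, field


@dataclass
class SignatureDefinition:
    """Hypothetical signature entry: parameters, return type, parameter docs."""
    signature_id: int
    parameters: list[tuple[str, str]]  # (name, Iceberg type)
    return_type: str                   # explicit, never inferred from the body
    parameter_docs: dict[str, str] = field(default_factory=dict)


@dataclass
class Overload:
    """Hypothetical overload entry that points at a signature by ID."""
    signature_id: int
    body: str
    deterministic: bool = True  # overload-level property
    doc: str = ""


@dataclass
class UdfMetadata:
    name: str
    secure: bool = False  # top-level property, shared by all overloads
    signatures: dict[int, SignatureDefinition] = field(default_factory=dict)
    overloads: list[Overload] = field(default_factory=list)
    version: int = 1

    def return_type_of(self, overload: Overload) -> str:
        # Each overload gets its return type from the signature it references.
        return self.signatures[overload.signature_id].return_type

    def describe_parameter(self, signature_id: int, name: str, text: str) -> None:
        # Signature-only change: no overload is copied, the version is untouched.
        self.signatures[signature_id].parameter_docs[name] = text


add = UdfMetadata(name="add")
add.signatures[1] = SignatureDefinition(1, [("x", "int"), ("y", "int")], "int")
add.signatures[2] = SignatureDefinition(2, [("x", "long"), ("y", "long")], "long")
add.overloads.append(Overload(signature_id=1, body="x + y"))
add.overloads.append(Overload(signature_id=2, body="x + y"))

add.describe_parameter(1, "x", "left operand")
print([add.return_type_of(o) for o in add.overloads], add.version)
```

Here `add(int, int)` and `add(long, long)` keep separate return types, and editing a parameter description leaves `version` unchanged, which is the decoupling item 5 describes.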
>>>
>>> Watch the recording here:
>>> https://www.youtube.com/watch?v=p7CvuGZKLSo&list=PLkifVhhWtccwzc3oRWjy5XiYJl0R6kdQL
>>>
>>> Yufei
>>>
>>>
>>> On Thu, Aug 21, 2025 at 4:18 PM Yufei Gu <[email protected]> wrote:
>>>
>>>> Hi everyone, here’s the summary from our last sync on 8/11. Apologies
>>>> for the delay!
>>>>
>>>> - One UDF entity for all overloads
>>>>    - We agreed to combine overloads with the same name into a
>>>>    single UDF entity, which shares a common metadata.json file.
>>>>    - Listing UDFs will return a list of UDF names, not a list of
>>>>    individual signatures.
>>>>    - Loading a UDF by name will return all of its overloads.
>>>> - Versioning Strategy
>>>>    - A global version number will track changes across the entire
>>>>    UDF entity; it increments monotonically.
>>>>    - Each overload will also maintain its own version (e.g.,
>>>>    updated_at_version) to trace changes specific to that overload.
>>>> - For simplicity, the load API will not support argument-based
>>>> filtering in the initial release. It will always return all overloads
>>>> for a given UDF name; overload-level loading is not supported at this
>>>> stage.
>>>>
>>>> Watch the recording here:
>>>> https://drive.google.com/file/d/10G2HjUH2DaKSjGufEOjMu0bBuNd7sCzO/view
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Fri, Aug 8, 2025 at 3:11 PM Yufei Gu <[email protected]> wrote:
>>>>
>>>>> To recap and add my thoughts, we want to support UDFs with multiple
>>>>> signatures under the same name, which can serve both overload-aware and
>>>>> overload-naive engines.
>>>>>
>>>>> Per my investigation[1], most engines support overloading by arguments
>>>>> and allow implicit conversions like numeric widening (e.g., INT →
>>>>> BIGINT/FLOAT). This resolution approach causes issues like silent
>>>>> behavior change. Here is an example:
>>>>>
>>>>> - Initially, only foo(DOUBLE) exists.
>>>>> - foo(42::INT) widens INT → DOUBLE and runs the expected code.
>>>>> - Later, a malicious user creates foo(BIGINT).
>>>>> - The engine’s best-match resolution now binds the same call to the
>>>>> new overload, changing behavior without modifying the query.
>>>>>
>>>>> To mitigate this issue, we have to choose between these two access
>>>>> control models:
>>>>>
>>>>> 1. Model A – Name-Level ACL: Grants apply to all overloads of a
>>>>> function name.
>>>>> 2. Model B – Signature-Level ACL: Grants are tied to specific
>>>>> signatures.
>>>>>
>>>>> The general recommendation is to adopt *Model A*. It trades some
>>>>> precision for safety and simplicity, while eliminating the silent behavior
>>>>> change problem. More details are in this doc[1].
>>>>>
>>>>> 1.
>>>>> https://docs.google.com/document/d/1E8mR-vInbQ8LDa5Lv3f22i6f8sceHojnEzxEJ6s6cvc/edit?tab=t.0
>>>>>
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Tue, Jul 29, 2025 at 1:07 AM Ajantha Bhat <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks to everyone who joined the sync.
>>>>>> Here is the meeting recording:
>>>>>> https://drive.google.com/file/d/1L5S6nb-C_pzBwFlClwO_sG1AVBA_ROKo/view
>>>>>>
>>>>>> Summary:
>>>>>> We discussed how to define function identifiers (which should also
>>>>>> handle function overloading). Ryan suggested that we check how
>>>>>> Spark does it. We can refer to functions using an identifier and then
>>>>>> bind the different signatures to it, so that access policies can be
>>>>>> applied per identifier. This is also linked to how we want to version
>>>>>> the functions when overloading is supported.
>>>>>>
>>>>>> I will check more about this and update the proposal doc.
>>>>>>
>>>>>> Please check/subscribe to the dev events calendar for the next
>>>>>> meeting link (Aug 11).
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Sun, Jul 27, 2025 at 10:46 PM Kevin Liu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ajantha,
>>>>>>>
>>>>>>> I see that the UDF Sync is scheduled in the "Iceberg Dev Events"
>>>>>>> calendar for tomorrow 7/28 at 9AM PT. I missed the last one, but
>>>>>>> I'll be at this one.
>>>>>>> >>>>>>> Best, >>>>>>> Kevin Liu >>>>>>> >>>>>>> On Mon, Jul 14, 2025 at 9:22 AM Ajantha Bhat <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Hey everyone, >>>>>>>> >>>>>>>> No one joined the sync today. I came to know that Yufei is on >>>>>>>> holiday, and Ryan and others couldn't make it, similar to the last >>>>>>>> sync. It >>>>>>>> seems Yufei might have forgotten to transfer meeting ownership as >>>>>>>> well, as >>>>>>>> new members needed admin approval and couldn't join automatically this >>>>>>>> week. Also, I can understand it is summer holiday season for many. >>>>>>>> >>>>>>>> I've updated the function signature schema and other open points. I >>>>>>>> believe we're very close to the final version of the spec. A meeting is >>>>>>>> indeed necessary to finalize this, but we don't have to wait for it to >>>>>>>> finish the review process. We had many meetings on this in the past >>>>>>>> already. So, please review the document at your earliest convenience. >>>>>>>> If we >>>>>>>> agree on the spec by next week, I can raise a PR. >>>>>>>> >>>>>>>> - Ajantha >>>>>>>> >>>>>>>> On Thu, Jul 3, 2025 at 4:03 AM Yufei Gu <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> I’d propose to move the field `properties` from a top level field >>>>>>>>> to a field inside “version” along with a representation, so that >>>>>>>>> properties >>>>>>>>> are versioned. A property like “deterministic” could change along with >>>>>>>>> representation over time. For example, we need to change >>>>>>>>> “deterministic” >>>>>>>>> from true to false in case of adding a non-deterministic SQL >>>>>>>>> expression/function(e.g., now()) inside an UDF. Otherwise, rollback >>>>>>>>> won't >>>>>>>>> be safe. >>>>>>>>> >>>>>>>>> That said, it's still an open question whether we need any >>>>>>>>> non-versioned properties. We can introduce them later if a use case >>>>>>>>> arises. 
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 2, 2025 at 3:06 PM Yufei Gu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the summary, Ajantha!
>>>>>>>>>>
>>>>>>>>>> I’d prefer to keep the signature list separate from the
>>>>>>>>>> representation history. Here are the reasons:
>>>>>>>>>>
>>>>>>>>>> 1. Each version still enforces a single signature. Although
>>>>>>>>>> the signatures array is global to the UDF, each version
>>>>>>>>>> references just one signature ID. Rollbacks to historical
>>>>>>>>>> versions remain safe.
>>>>>>>>>> 2. We’ve separated the less frequently changing component
>>>>>>>>>> (signatures) from the more dynamic one (representations) to
>>>>>>>>>> reduce metadata file size.
>>>>>>>>>> 3. Since signatures use Iceberg data types, they should
>>>>>>>>>> remain unaffected by multi-dialect representation differences.
>>>>>>>>>>
>>>>>>>>>> Yufei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 30, 2025 at 11:28 AM Ajantha Bhat <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>>> https://drive.google.com/file/d/1FcOSbHo9ZIVeZXdUlmoG42o-chB7Q15P/view?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> Summary:
>>>>>>>>>>> We have discussed the action items from the last sync (*see
>>>>>>>>>>> Appendix C* in the proposal doc)
>>>>>>>>>>>
>>>>>>>>>>> - Function overloading: Supported by a few of the engines and
>>>>>>>>>>> on the roadmaps of many engines. Iceberg will support it. We
>>>>>>>>>>> will maintain the `FunctionIdentifier` (extends `TableIdentifier`
>>>>>>>>>>> but also has a member containing the function argument's type
>>>>>>>>>>> list). All operations like load, rename, list, create and drop
>>>>>>>>>>> are based on `FunctionIdentifier`.
>>>>>>>>>>> - Secure UDF: If we store it as a property in a bag, we need
>>>>>>>>>>> to standardize the property name.
>>>>>>>>>>> Iceberg encryption may be orthogonal to this discussion.
>>>>>>>>>>> - UDFs with multi-statement and procedural bodies are
>>>>>>>>>>> supported by some engines. Iceberg will support them. The body
>>>>>>>>>>> is stored as is when the engine creates the function.
>>>>>>>>>>>
>>>>>>>>>>> New discussions around:
>>>>>>>>>>>
>>>>>>>>>>> - Standardizing the property names (deterministic, secure).
>>>>>>>>>>> - About the rename function.
>>>>>>>>>>> - Replace function. To check up to what level replace is
>>>>>>>>>>> supported (considering function overloading).
>>>>>>>>>>> - Should the signature be associated with the representation?
>>>>>>>>>>>
>>>>>>>>>>> I think we are close on the spec. Please review the proposal
>>>>>>>>>>>
>>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>
>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>>
>>>>>>>>>>> *Monday, July 14 · 9:00 – 10:00am*
>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>
>>>>>>>>>>> - Ajantha
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 30, 2025 at 9:27 PM Ajantha Bhat <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Can it be handled by Iceberg encryption? If the whole metadata
>>>>>>>>>>>> is encrypted, we don't have to worry about just hiding the UDF
>>>>>>>>>>>> body? Let us discuss more on the sync today.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 30, 2025 at 9:22 PM Yufei Gu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, hiding the definition and disabling pushdown are
>>>>>>>>>>>>> required. We will need a named key (e.g., secure) somewhere, no
>>>>>>>>>>>>> matter if it is a top-level property or a key that is part of
>>>>>>>>>>>>> the UDF properties, so that both the UDF creator and consumer
>>>>>>>>>>>>> can recognize it.
>>>>>>>>>>>>> >>>>>>>>>>>>> Yufei >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jun 26, 2025 at 4:27 PM Ryan Blue <[email protected]> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for the extra detail. What do you think the spec would >>>>>>>>>>>>>> require? Would it require hiding the UDF definition from users >>>>>>>>>>>>>> and require >>>>>>>>>>>>>> specific pushdown cases be disabled? The use cases seem valid, >>>>>>>>>>>>>> but I'm >>>>>>>>>>>>>> trying to understand the requirements this places on engines and >>>>>>>>>>>>>> why it >>>>>>>>>>>>>> needs to be part of the spec, rather than part of the properties >>>>>>>>>>>>>> of the UDF. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Jun 20, 2025 at 3:56 PM Yufei Gu < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Here are the main use cases for secure UDFs: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hiding UDF Definitions: This includes concealing the UDF >>>>>>>>>>>>>>> body and details like the list of imports, some of them >>>>>>>>>>>>>>> aren’t applicable >>>>>>>>>>>>>>> to SQL UDFs. >>>>>>>>>>>>>>> 2. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sandboxed Execution: Ensuring the UDF runs in an >>>>>>>>>>>>>>> isolated environment. Again, this typically doesn’t apply to >>>>>>>>>>>>>>> SQL UDFs. >>>>>>>>>>>>>>> 3. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Preventing Data Leakage at Execution Time: For example, >>>>>>>>>>>>>>> secure UDFs may disable certain optimizations—such as >>>>>>>>>>>>>>> predicate pushdown—to >>>>>>>>>>>>>>> avoid exposing sensitive data indirectly. [1] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Given these scenarios, I agree with your point that the >>>>>>>>>>>>>>> secure flag is primarily an instruction to the engine to >>>>>>>>>>>>>>> behave differently. 
While it's largely an engine-side behavior, >>>>>>>>>>>>>>> we still >>>>>>>>>>>>>>> need to include this flag in the UDF definition to indicate >>>>>>>>>>>>>>> whether a UDF >>>>>>>>>>>>>>> is secure, especially considering the perf penalty introduced >>>>>>>>>>>>>>> by scenario >>>>>>>>>>>>>>> #3. We should clearly recommend that users avoid marking UDFs >>>>>>>>>>>>>>> as secure >>>>>>>>>>>>>>> unless it's truly necessary. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown >>>>>>>>>>>>>>> Yufei >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <[email protected]> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yufei, could you make the argument for supporting a >>>>>>>>>>>>>>>> "secure" UDF? What use case are you addressing and what >>>>>>>>>>>>>>>> specifically >>>>>>>>>>>>>>>> changes about how the UDF is handled? If the idea is to hide >>>>>>>>>>>>>>>> the UDF >>>>>>>>>>>>>>>> definition, do we need to include it? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think this would be a signal to a "trusted engine". When >>>>>>>>>>>>>>>> the engine interacts with the catalog it sends authorization >>>>>>>>>>>>>>>> information >>>>>>>>>>>>>>>> about itself in addition to the user that it is acting on >>>>>>>>>>>>>>>> behalf of. That >>>>>>>>>>>>>>>> way the catalog knows that the secure UDF can be sent to the >>>>>>>>>>>>>>>> engine and >>>>>>>>>>>>>>>> won't be shown to the user. The majority of this logic is on >>>>>>>>>>>>>>>> the REST >>>>>>>>>>>>>>>> server side, and the only part that is communicated to the >>>>>>>>>>>>>>>> client is the >>>>>>>>>>>>>>>> request not to show the UDF to the user, right? In that case >>>>>>>>>>>>>>>> should this be >>>>>>>>>>>>>>>> a property rather than part of the definition? 
Even if we >>>>>>>>>>>>>>>> state that the >>>>>>>>>>>>>>>> client "must" suppress the UDF definition, it's really just a >>>>>>>>>>>>>>>> request. Only >>>>>>>>>>>>>>>> trusted engines can be passed the UDF definition, so a spec >>>>>>>>>>>>>>>> requirement to >>>>>>>>>>>>>>>> suppress the definition isn't very meaningful. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks for the summary, Ajantha! >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Multi-statement UDFs are definitely useful, but whether >>>>>>>>>>>>>>>>> those statements run within a single transaction should be >>>>>>>>>>>>>>>>> treated as an >>>>>>>>>>>>>>>>> engine-level concern. The Iceberg UDF spec can spell out the >>>>>>>>>>>>>>>>> expectation, >>>>>>>>>>>>>>>>> yet the actual guarantee still depends on the runtime. Even >>>>>>>>>>>>>>>>> if a UDF >>>>>>>>>>>>>>>>> declares itself transactional, the engine may or may not >>>>>>>>>>>>>>>>> enforce it. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> One more thing: should we also introduce a “secure UDF” >>>>>>>>>>>>>>>>> option supported by some engines[1], so the body and any >>>>>>>>>>>>>>>>> sensitive details >>>>>>>>>>>>>>>>> stay hidden from callers? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/secure-udf-procedure >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yufei >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks to everyone who joined the sync. 
>>>>>>>>>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>>>>>>>>>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing
>>>>>>>>>>>>>>>>>> Summary:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - We have gone through the SQL UDF syntax supported
>>>>>>>>>>>>>>>>>> by different engines (Snowflake, Databricks, Dremio,
>>>>>>>>>>>>>>>>>> Trino, OSS Spark 4.0).
>>>>>>>>>>>>>>>>>> - Each engine uses its own block separator, like $$
>>>>>>>>>>>>>>>>>> or '' or none. Action item was to check whether engines
>>>>>>>>>>>>>>>>>> support multi-statement (transactional) UDF bodies.
>>>>>>>>>>>>>>>>>> - Discussed function overloading. Need to check
>>>>>>>>>>>>>>>>>> whether these engines support function overloading for
>>>>>>>>>>>>>>>>>> SQL UDFs. Postgres supports it! If yes, we need to adapt
>>>>>>>>>>>>>>>>>> the spec to handle it.
>>>>>>>>>>>>>>>>>> - Started online spec review and discussed the
>>>>>>>>>>>>>>>>>> deterministic flag, and concluded that we keep the
>>>>>>>>>>>>>>>>>> independent fields (like deterministic) in the spec only
>>>>>>>>>>>>>>>>>> if the majority of engines support them. Otherwise they
>>>>>>>>>>>>>>>>>> will be passed in a property bag (engine specific), and
>>>>>>>>>>>>>>>>>> it is the engine's responsibility to honor those optional
>>>>>>>>>>>>>>>>>> properties.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Feel free to review the current proposal document here
>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Final spec will be put to review and vote once it is
>>>>>>>>>>>>>>>>>> ready.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Details for next Iceberg UDF sync: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> *Monday, June 30 · 9:00 – 10:00am*Time zone: >>>>>>>>>>>>>>>>>> America/Los_Angeles >>>>>>>>>>>>>>>>>> Google Meet joining info >>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks to everyone who joined the sync. >>>>>>>>>>>>>>>>>>> Here is the meeting recording: >>>>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Summary: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> We discussed including Python support; the majority >>>>>>>>>>>>>>>>>>> agreed *not to* (see recording for details). >>>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> No strong opposition to versioning — it will be >>>>>>>>>>>>>>>>>>> included to support change tracking and similar use >>>>>>>>>>>>>>>>>>> cases. >>>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Suggestions were made to document how each catalog >>>>>>>>>>>>>>>>>>> resolves UDFs, similar to views and tables. >>>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> We agreed not to deviate from the existing >>>>>>>>>>>>>>>>>>> table/view spec — e.g., location will remain >>>>>>>>>>>>>>>>>>> *required* for cross-catalog compatibility. >>>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> We also discussed a bit about view interoperability >>>>>>>>>>>>>>>>>>> as the same things are applicable here. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Feel free to review the proposal document >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0> >>>>>>>>>>>>>>>>>>> here. >>>>>>>>>>>>>>>>>>> With the current scope, it is similar to the view/table >>>>>>>>>>>>>>>>>>> spec now. >>>>>>>>>>>>>>>>>>> Final spec will be put to review and vote once it is >>>>>>>>>>>>>>>>>>> ready. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Details for next Iceberg UDF sync: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> *Monday, June 16 · 9:00 – 10:00am*Time zone: >>>>>>>>>>>>>>>>>>> America/Los_Angeles >>>>>>>>>>>>>>>>>>> Google Meet joining info >>>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi folks, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> We’ve set up a dedicated bi-weekly community sync for >>>>>>>>>>>>>>>>>>>> the UDF project. Everyone’s welcome to drop in and share >>>>>>>>>>>>>>>>>>>> ideas! Here is the >>>>>>>>>>>>>>>>>>>> meeting link: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Iceberg UDF sync >>>>>>>>>>>>>>>>>>>> Monday, June 2 · 9:00 – 10:00am >>>>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles >>>>>>>>>>>>>>>>>>>> Google Meet joining info >>>>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Yufei >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Update on the progress. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I had a meeting today with Yufei and Yun.zou to >>>>>>>>>>>>>>>>>>>>> discuss the UDF proposal. 
We covered several key points, >>>>>>>>>>>>>>>>>>>>> though some are >>>>>>>>>>>>>>>>>>>>> still open for further discussion: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> a) *UDF Versioning*: Do we truly need versioning for >>>>>>>>>>>>>>>>>>>>> UDFs at this stage? We explored the possibility of >>>>>>>>>>>>>>>>>>>>> simplifying the >>>>>>>>>>>>>>>>>>>>> specification by avoiding view replication, and >>>>>>>>>>>>>>>>>>>>> potentially introducing >>>>>>>>>>>>>>>>>>>>> versioning support later. UDTFs, being a superset of >>>>>>>>>>>>>>>>>>>>> views in some ways, >>>>>>>>>>>>>>>>>>>>> may not require versioning initially. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> b) *VarArgs Support*: While some query engines may >>>>>>>>>>>>>>>>>>>>> not support vararg syntax in CREATE FUNCTION, Iceberg >>>>>>>>>>>>>>>>>>>>> UDFs could represent such arguments as lists when >>>>>>>>>>>>>>>>>>>>> supported by the engine. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> c) *Generics in UDFs*: Since Iceberg currently >>>>>>>>>>>>>>>>>>>>> doesn’t support generic types (e.g., object), we can >>>>>>>>>>>>>>>>>>>>> only map engine-specific types to Iceberg types. As a >>>>>>>>>>>>>>>>>>>>> result, generic data >>>>>>>>>>>>>>>>>>>>> types will not be supported in the initial version. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> d) *Python Support*: Incorporating Python as a >>>>>>>>>>>>>>>>>>>>> language for SQL UDFs seems promising, especially given >>>>>>>>>>>>>>>>>>>>> its potential to >>>>>>>>>>>>>>>>>>>>> resolve interoperability challenges. Some engines, >>>>>>>>>>>>>>>>>>>>> however, require >>>>>>>>>>>>>>>>>>>>> platform version and package dependency details to >>>>>>>>>>>>>>>>>>>>> execute Python code—this >>>>>>>>>>>>>>>>>>>>> should be captured in the specification. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> *Next Steps* >>>>>>>>>>>>>>>>>>>>> I will update the proposal document with two primary >>>>>>>>>>>>>>>>>>>>> UDF use cases: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Policy exchange between engines >>>>>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> UDTF as a superset of view functionality >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> The update will include corresponding syntax examples >>>>>>>>>>>>>>>>>>>>> in both SQL and Python, and detail how each use case is >>>>>>>>>>>>>>>>>>>>> represented in >>>>>>>>>>>>>>>>>>>>> Iceberg metadata. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> We also plan to set up regular syncs (open to more >>>>>>>>>>>>>>>>>>>>> interested participants) to continue refining and >>>>>>>>>>>>>>>>>>>>> finalizing the UDF >>>>>>>>>>>>>>>>>>>>> specification. >>>>>>>>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Hi everyone, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> I've updated the design document[1] based on the >>>>>>>>>>>>>>>>>>>>>> previous comments. Additionally, I've included the SQL >>>>>>>>>>>>>>>>>>>>>> UDF syntax supported >>>>>>>>>>>>>>>>>>>>>> by various vendors, including Dremio, Snowflake, >>>>>>>>>>>>>>>>>>>>>> Databricks, and Trino. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> I'm happy to schedule a separate sync if a deeper >>>>>>>>>>>>>>>>>>>>>> discussion is needed. Let's keep moving forward, >>>>>>>>>>>>>>>>>>>>>> especially with the >>>>>>>>>>>>>>>>>>>>>> renewed interest from the community. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Hey everyone, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> During the last catalog community sync, there was >>>>>>>>>>>>>>>>>>>>>>> significant interest in storing UDFs in Iceberg and >>>>>>>>>>>>>>>>>>>>>>> adding endpoints for >>>>>>>>>>>>>>>>>>>>>>> UDF handling in the REST catalog spec. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> I recently discussed this with Yufei to better >>>>>>>>>>>>>>>>>>>>>>> understand the new requirement of using UDFs for >>>>>>>>>>>>>>>>>>>>>>> fine-grained access >>>>>>>>>>>>>>>>>>>>>>> control policies. This expands the use cases beyond >>>>>>>>>>>>>>>>>>>>>>> just versioned and >>>>>>>>>>>>>>>>>>>>>>> interoperable UDFs. Additionally, I learnt that many >>>>>>>>>>>>>>>>>>>>>>> vendors are interested >>>>>>>>>>>>>>>>>>>>>>> in this feature. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Given the strong community interest and support, I’d >>>>>>>>>>>>>>>>>>>>>>> like to take ownership of this effort and revive the >>>>>>>>>>>>>>>>>>>>>>> work. I'll be >>>>>>>>>>>>>>>>>>>>>>> revisiting the document I proposed long back and will >>>>>>>>>>>>>>>>>>>>>>> share an updated >>>>>>>>>>>>>>>>>>>>>>> proposal by next week. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Looking forward to storing UDFs in Iceberg! >>>>>>>>>>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov >>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> The UDF spec does not require representations to be >>>>>>>>>>>>>>>>>>>>>>>> SQL. 
It merely does not specify (in this revision) how >>>>>>>>>>>>>>>>>>>>>>>> other >>>>>>>>>>>>>>>>>>>>>>>> representations are to be written. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> This seems like an easy extension (adding a new >>>>>>>>>>>>>>>>>>>>>>>> type in the "Representations" section). >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>>>>>>>>> Dmitri. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Right now, SQL is an explicit requirement of the >>>>>>>>>>>>>>>>>>>>>>>>> spec. It leaves a way for future versions to add >>>>>>>>>>>>>>>>>>>>>>>>> different representations >>>>>>>>>>>>>>>>>>>>>>>>> later, but only SQL is supported. That was also the >>>>>>>>>>>>>>>>>>>>>>>>> feedback to my initial >>>>>>>>>>>>>>>>>>>>>>>>> skepticism about how it would work to add functions. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri >>>>>>>>>>>>>>>>>>>>>>>>> Bourlatchkov <[email protected]> >>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> I do not think the spec is meant to allow only >>>>>>>>>>>>>>>>>>>>>>>>>> SQL representations, although it is certainly >>>>>>>>>>>>>>>>>>>>>>>>>> faviouring SQL in examples... >>>>>>>>>>>>>>>>>>>>>>>>>> It would be nice to add a non-SQL example, indeed. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>>>>>>>>>>> Dmitri. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong < >>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Coming from PyIceberg, I have concerns as this >>>>>>>>>>>>>>>>>>>>>>>>>>> proposal focuses on SQL-based engines, while >>>>>>>>>>>>>>>>>>>>>>>>>>> Python-based systems often >>>>>>>>>>>>>>>>>>>>>>>>>>> work with data frames. 
Adding imperative languages >>>>>>>>>>>>>>>>>>>>>>>>>>> like Python would make >>>>>>>>>>>>>>>>>>>>>>>>>>> this proposal more inclusive. >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards, >>>>>>>>>>>>>>>>>>>>>>>>>>> Fokko >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 10:27 AM Piotr >>>>>>>>>>>>>>>>>>>>>>>>>>> Findeisen <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Walaa, thanks for asking! >>>>>>>>>>>>>>>>>>>>>>>>>>>> In the design doc linked before in this thread >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] I read >>>>>>>>>>>>>>>>>>>>>>>>>>>> "Without a common standard, the UDFs are hard >>>>>>>>>>>>>>>>>>>>>>>>>>>> to share among different engines." >>>>>>>>>>>>>>>>>>>>>>>>>>>> ("Background and Motivation" section). >>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with this statement. I don't fully >>>>>>>>>>>>>>>>>>>>>>>>>>>> understand yet how the proposed design addresses >>>>>>>>>>>>>>>>>>>>>>>>>>>> shareability between the >>>>>>>>>>>>>>>>>>>>>>>>>>>> engines though. >>>>>>>>>>>>>>>>>>>>>>>>>>>> I could use some help to understand this better. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best >>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotr >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] SQL User-Defined Function Spec >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin >>>>>>>>>>>>>>>>>>>>>>>>>>>> Moustafa <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotr, what do you mean by making user-created >>>>>>>>>>>>>>>>>>>>>>>>>>>>> functions shareable >>>>>>>>>>>>>>>>>>>>>>>>>>>>> between engines? 
Do you mean UDFs written in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative code? >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Hi, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Thank you Ajantha for creating this thread. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> The Iceberg UDFs are an interesting idea! >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Is there a plan to make the user-created >>>>>>>>>>>>>>>>>>>>>>>>>>>>> functions sharable between the engines? >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > If so, what would a CREATE FUNCTION statement >>>>>>>>>>>>>>>>>>>>>>>>>>>>> look like in, e.g., Spark or Trino? >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Best >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Piotr >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> I just looked through the proposal and >>>>>>>>>>>>>>>>>>>>>>>>>>>>> added comments. I think it would be helpful to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> also have a design doc that >>>>>>>>>>>>>>>>>>>>>>>>>>>>> covers the choices from the draft spec. For >>>>>>>>>>>>>>>>>>>>>>>>>>>>> instance, the choice to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> enumerate all possible function input structs >>>>>>>>>>>>>>>>>>>>>>>>>>>>> rather than allowing generics >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and varargs. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Here’s a quick summary of my feedback: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> I think that the choice to enumerate >>>>>>>>>>>>>>>>>>>>>>>>>>>>> function signatures is limiting. 
It would be nice >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to see a discussion of >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the trade-offs and a rationale for the choice. I >>>>>>>>>>>>>>>>>>>>>>>>>>>>> think it would also be >>>>>>>>>>>>>>>>>>>>>>>>>>>>> very helpful to have a few representative use >>>>>>>>>>>>>>>>>>>>>>>>>>>>> cases for this included in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the doc. That way the proposal can demonstrate >>>>>>>>>>>>>>>>>>>>>>>>>>>>> that it solves those use >>>>>>>>>>>>>>>>>>>>>>>>>>>>> cases with reasonable trade-offs. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> There are a few instances where this is >>>>>>>>>>>>>>>>>>>>>>>>>>>>> inconsistent with conventions in other specs. For >>>>>>>>>>>>>>>>>>>>>>>>>>>>> example, using string IDs >>>>>>>>>>>>>>>>>>>>>>>>>>>>> rather than an integer. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> This uses a very different model for spec >>>>>>>>>>>>>>>>>>>>>>>>>>>>> versioning than the Iceberg view and table specs. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> It requires readers to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> fail if there are any unknown fields, which >>>>>>>>>>>>>>>>>>>>>>>>>>>>> prevents the spec from adding >>>>>>>>>>>>>>>>>>>>>>>>>>>>> things that are fully backward-compatible. Other >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg specs only require >>>>>>>>>>>>>>>>>>>>>>>>>>>>> a version change to introduce >>>>>>>>>>>>>>>>>>>>>>>>>>>>> forward-incompatible changes and I think that >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this should do the same to avoid confusion. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> It looks like the intent is to allow >>>>>>>>>>>>>>>>>>>>>>>>>>>>> multiple function signatures per version, but it >>>>>>>>>>>>>>>>>>>>>>>>>>>>> is unclear how to encode >>>>>>>>>>>>>>>>>>>>>>>>>>>>> them because a version is associated with a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> single function signature. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> There is no review of SQL syntax for >>>>>>>>>>>>>>>>>>>>>>>>>>>>> creating functions across engines, so this >>>>>>>>>>>>>>>>>>>>>>>>>>>>> doesn’t show that the metadata >>>>>>>>>>>>>>>>>>>>>>>>>>>>> proposed is sufficient for cross-engine use cases. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> The example for a table-valued function >>>>>>>>>>>>>>>>>>>>>>>>>>>>> shows a SELECT statement and it isn’t clear how >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this is distinct from a view >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> We didn't find any blocker for the spec. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> I will wait for a week and If no more >>>>>>>>>>>>>>>>>>>>>>>>>>>>> review comments, I will raise a PR for spec >>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition next week. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> If anyone else is interested, please have >>>>>>>>>>>>>>>>>>>>>>>>>>>>> a look at the proposal >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eldin Moustafa <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Hi Ajantha, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> I have left some comments. 
It is an >>>>>>>>>>>>>>>>>>>>>>>>>>>>> interesting direction, but there might be some >>>>>>>>>>>>>>>>>>>>>>>>>>>>> details that need to be fine >>>>>>>>>>>>>>>>>>>>>>>>>>>>> tuned. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> The doc is here [1] for others who might >>>>>>>>>>>>>>>>>>>>>>>>>>>>> be interested. Resharing since I do not think it >>>>>>>>>>>>>>>>>>>>>>>>>>>>> was directly linked in the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> [1] >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Walaa. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> Hi, just another reminder since we >>>>>>>>>>>>>>>>>>>>>>>>>>>>> didn't get any review on the proposal. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> Initially proposed on June 4. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> Hi everyone, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> We've only received one review so far >>>>>>>>>>>>>>>>>>>>>>>>>>>>> (from Benny). >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hi All, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Please find the proposal link >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/10432 >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> proposal. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on >>>>>>>>>>>>>>>>>>>>>>>>>>>>> it. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity to take the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> decisions and how we want to implement it. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eldin Moustafa <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant >>>>>>>>>>>>>>>>>>>>>>>>>>>>> scalar/aggregate/table user defined functions. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here are some examples of >>>>>>>>>>>>>>>>>>>>>>>>>>>>> what I meant in (2): >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Trino user defined functions: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/develop/functions.html >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Flink user defined functions: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> variation of (1) where the API is data flow/data >>>>>>>>>>>>>>>>>>>>>>>>>>>>> pipeline API instead of >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SQL (e.g., Spark Scala). Yes, that is also >>>>>>>>>>>>>>>>>>>>>>>>>>>>> possible in the very long run :) >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ye <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative function according to a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Java/Scala/Python API, etc. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some >>>>>>>>>>>>>>>>>>>>>>>>>>>>> long term opportunities in this case. 
Consider >>>>>>>>>>>>>>>>>>>>>>>>>>>>> you register a Spark temp >>>>>>>>>>>>>>>>>>>>>>>>>>>>> view as some sort of data frame read, then it >>>>>>>>>>>>>>>>>>>>>>>>>>>>> could still be resolved to a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Spark plan that is representable by an >>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate representation. But I >>>>>>>>>>>>>>>>>>>>>>>>>>>>> agree this gets very complicated very soon, and >>>>>>>>>>>>>>>>>>>>>>>>>>>>> just having the case (1) >>>>>>>>>>>>>>>>>>>>>>>>>>>>> covered would already be a huge step forward. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> -Jack >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Benny Chow <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> tabular SQL UDF can be used to build a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> parameterized view. So, there's >>>>>>>>>>>>>>>>>>>>>>>>>>>>> definitely a lot in common between UDFs and views. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Walaa Eldin Moustafa <[email protected]> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect >>>>>>>>>>>>>>>>>>>>>>>>>>>>> about what is perceived as a "UDF". 
There are 2 >>>>>>>>>>>>>>>>>>>>>>>>>>>>> flavors: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the user whose definition is a composition of >>>>>>>>>>>>>>>>>>>>>>>>>>>>> other built-in functions/SQL >>>>>>>>>>>>>>>>>>>>>>>>>>>>> expressions. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative function according to a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Java/Scala/Python API, etc. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's >>>>>>>>>>>>>>>>>>>>>>>>>>>>> references are pretty much from (1) and I think >>>>>>>>>>>>>>>>>>>>>>>>>>>>> those have more analogy to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> views due to their SQL nature. Agree (2) is not >>>>>>>>>>>>>>>>>>>>>>>>>>>>> practical to maintain by >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg, but I think Ajantha's use cases are >>>>>>>>>>>>>>>>>>>>>>>>>>>>> around (1), and may be worth >>>>>>>>>>>>>>>>>>>>>>>>>>>>> evaluating. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ajantha Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you >>>>>>>>>>>>>>>>>>>>>>>>>>>>> post the proposal, but I think this would be a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> very difficult area to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> tackle across engines, languages, and memory >>>>>>>>>>>>>>>>>>>>>>>>>>>>> models without having a huge >>>>>>>>>>>>>>>>>>>>>>>>>>>>> performance penalty. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially >>>>>>>>>>>>>>>>>>>>>>>>>>>>> supports SQL representations of UDFs (similar to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> views as shared by the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> reference links above), the complexity involved >>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be similar to managing >>>>>>>>>>>>>>>>>>>>>>>>>>>>> views. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> for your input. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> draft spec (inspired by the view spec) this week >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to facilitate further >>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussions. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jack Ye <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> have a common set of functions across engines, I >>>>>>>>>>>>>>>>>>>>>>>>>>>>> don't see how that is >>>>>>>>>>>>>>>>>>>>>>>>>>>>> practical when those engines are implemented so >>>>>>>>>>>>>>>>>>>>>>>>>>>>> differently. Plugging in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> code -- and especially custom user-supplied code >>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- seems inherently >>>>>>>>>>>>>>>>>>>>>>>>>>>>> specialized to me and should be part of the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> engines' design. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> views? 
I feel we can say exactly the same thing >>>>>>>>>>>>>>>>>>>>>>>>>>>>> for Iceberg views, but yet >>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have Iceberg multi-dialect views implemented. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe it sounds like we >>>>>>>>>>>>>>>>>>>>>>>>>>>>> are trying to draw a line between SQL vs other >>>>>>>>>>>>>>>>>>>>>>>>>>>>> programming language as >>>>>>>>>>>>>>>>>>>>>>>>>>>>> "code"? but I think SQL is just another type of >>>>>>>>>>>>>>>>>>>>>>>>>>>>> code, and we are already >>>>>>>>>>>>>>>>>>>>>>>>>>>>> talking about compiling all these different code >>>>>>>>>>>>>>>>>>>>>>>>>>>>> dialects to an >>>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate representation (using projects like >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Coral, Substrait), which >>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be stored as another type of representation >>>>>>>>>>>>>>>>>>>>>>>>>>>>> of Iceberg view. I think >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the same functionality can be used for UDFs if >>>>>>>>>>>>>>>>>>>>>>>>>>>>> developed. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF >>>>>>>>>>>>>>>>>>>>>>>>>>>>> support is a good idea, even just a multi-dialect >>>>>>>>>>>>>>>>>>>>>>>>>>>>> one like view, and that >>>>>>>>>>>>>>>>>>>>>>>>>>>>> can allow engines to for example parse a view >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SQL, and when a function >>>>>>>>>>>>>>>>>>>>>>>>>>>>> referenced cannot be resolved, try to seek for a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> multi-dialect UDF >>>>>>>>>>>>>>>>>>>>>>>>>>>>> definition. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when >>>>>>>>>>>>>>>>>>>>>>>>>>>>> we have the actual proposal published. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Robert Stupp <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and >>>>>>>>>>>>>>>>>>>>>>>>>>>>> portable and "non-centralized" as views are. The >>>>>>>>>>>>>>>>>>>>>>>>>>>>> same performance concerns >>>>>>>>>>>>>>>>>>>>>>>>>>>>> apply to views as well. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common >>>>>>>>>>>>>>>>>>>>>>>>>>>>> base upon which engines can build, so the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument that UDFs aren't >>>>>>>>>>>>>>>>>>>>>>>>>>>>> practical, because engines are different, is >>>>>>>>>>>>>>>>>>>>>>>>>>>>> probably only a temporary >>>>>>>>>>>>>>>>>>>>>>>>>>>>> concern. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg >>>>>>>>>>>>>>>>>>>>>>>>>>>>> should also try to tackle the idea to make views >>>>>>>>>>>>>>>>>>>>>>>>>>>>> portable, which is >>>>>>>>>>>>>>>>>>>>>>>>>>>>> conceptually not that much different from >>>>>>>>>>>>>>>>>>>>>>>>>>>>> portable UDFs. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a >>>>>>>>>>>>>>>>>>>>>>>>>>>>> negative touch to the idea of having UDFs in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg, especially not in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this early stage. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether >>>>>>>>>>>>>>>>>>>>>>>>>>>>> it's a good idea to add UDFs tracked by Iceberg >>>>>>>>>>>>>>>>>>>>>>>>>>>>> catalogs. I think that >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg primarily deals with things that are >>>>>>>>>>>>>>>>>>>>>>>>>>>>> centralized, like tables of >>>>>>>>>>>>>>>>>>>>>>>>>>>>> data. While it would be great to have a common >>>>>>>>>>>>>>>>>>>>>>>>>>>>> set of functions across >>>>>>>>>>>>>>>>>>>>>>>>>>>>> engines, I don't see how that is practical when >>>>>>>>>>>>>>>>>>>>>>>>>>>>> those engines are >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implemented so differently. Plugging in code -- >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and especially custom >>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-supplied code -- seems inherently >>>>>>>>>>>>>>>>>>>>>>>>>>>>> specialized to me and should be part >>>>>>>>>>>>>>>>>>>>>>>>>>>>> of the engines' design. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when >>>>>>>>>>>>>>>>>>>>>>>>>>>>> you post the proposal, but I think this would be >>>>>>>>>>>>>>>>>>>>>>>>>>>>> a very difficult area to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> tackle across engines, languages, and memory >>>>>>>>>>>>>>>>>>>>>>>>>>>>> models without having a huge >>>>>>>>>>>>>>>>>>>>>>>>>>>>> performance penalty. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ajantha Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the community interest in storing the Versioned >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SQL UDFs in Iceberg. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec >>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition for storing the versioned UDFs in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg (inspired by view spec). >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate >>>>>>>>>>>>>>>>>>>>>>>>>>>>> similarly to views in that they are associated >>>>>>>>>>>>>>>>>>>>>>>>>>>>> with tables, but they can >>>>>>>>>>>>>>>>>>>>>>>>>>>>> accept arguments and produce return values, or >>>>>>>>>>>>>>>>>>>>>>>>>>>>> even function as inline >>>>>>>>>>>>>>>>>>>>>>>>>>>>> expressions. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many Query engines like >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dremio, Trino, Snowflake, Databricks Spark >>>>>>>>>>>>>>>>>>>>>>>>>>>>> supports SQL UDFs at catalog >>>>>>>>>>>>>>>>>>>>>>>>>>>>> level [1]. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg >>>>>>>>>>>>>>>>>>>>>>>>>>>>> can enable >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> engines. Potentially engines can understand the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> UDFs written by other >>>>>>>>>>>>>>>>>>>>>>>>>>>>> engines (with the translate layer). 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this feature into Iceberg would be a valuable >>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition, and we're eager to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> collaborate with the community to develop a UDF >>>>>>>>>>>>>>>>>>>>>>>>>>>>> specification. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun >>>>>>>>>>>>>>>>>>>>>>>>>>>>> drafting a specification to propose to the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> community. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio - >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino - >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake - >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks - >>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Robert Stupp >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> -- >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Databricks >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>> Databricks >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>
