Hi folks,

I really appreciate the feedback from all of you over the past few months. I've filed the initial PR for the UDF spec: https://github.com/apache/iceberg/pull/14117. It captures the consensus we've built and addresses the write amplification concern raised in our last discussion.
Please take a look and share your thoughts. Happy to discuss it further during Monday's meeting as well.

Yufei

On Mon, Sep 8, 2025 at 6:33 PM Yufei Gu <[email protected]> wrote:

> Hi folks, thanks for joining today’s UDF sync.
>
> We covered the UDF metadata structure, captured in this doc:
> https://docs.google.com/document/d/1khPKL6zvWjYc5Is8HeVau6sff8FD-jNc2eLKXgit3X8/edit?usp=sharing
>
> We also discussed a way to avoid copying every overload into the new metadata JSON when creating a new version. One idea is to introduce a global version array. This is not yet reflected in the doc, but I’ll update it shortly. Other key points:
>
> - The latest UDF version will typically be used in most scenarios, but engines retain the flexibility to choose which version to execute.
> - Keeping the version when referring to a UDF probably isn't a good idea. Users are responsible for updating downstream views if they reference older UDF versions.
>
> You can watch the recording here:
> https://www.youtube.com/watch?v=6ResT-ODelI&ab_channel=ApacheIceberg
>
> Yufei
>
> On Mon, Aug 25, 2025 at 6:36 PM Yufei Gu <[email protected]> wrote:
>
>> Hi folks, thanks for attending today’s UDF sync. In general, we discussed the UDF metadata structure, captured in this doc (https://docs.google.com/document/d/1khPKL6zvWjYc5Is8HeVau6sff8FD-jNc2eLKXgit3X8/edit?usp=sharing). Here is the detailed summary:
>>
>> 1. Each UDF overload has its own return type, e.g., `add(int, int)` returns `int`, while `add(long, long)` returns `long`.
>> 2. The return type should be explicitly specified; no implicit or statement-based return type inference should be allowed.
>> 3. Adding explicit properties such as deterministic and doc at the overload level.
>> 4. Adding the property “secure” at the top level.
>> 5. Introducing a dedicated signature definitions section to centralize metadata (function parameters, return type, parameter descriptions).
>>    Each overload would reference a signature definition by ID. This decoupling allows signature-related updates (like modifying parameter descriptions) without requiring a new UDF version, similar to how updating a table schema doesn’t create a new snapshot.
>> 6. Whether to have versioned open properties or not. Versioned properties can lead to unnecessary copying of a bag of properties into each version, but they provide a clear history of properties for future debugging and for understanding the UDF's behavior at a specific point in time.
>>
>> Watch the recording here:
>> https://www.youtube.com/watch?v=p7CvuGZKLSo&list=PLkifVhhWtccwzc3oRWjy5XiYJl0R6kdQL
>>
>> Yufei
>>
>> On Thu, Aug 21, 2025 at 4:18 PM Yufei Gu <[email protected]> wrote:
>>
>>> Hi everyone, here’s the summary from our last sync on 8/11. Apologies for the delay!
>>>
>>> - One UDF entity for all overloads
>>>   - We agreed to combine overloads with the same name into a single UDF entity, which shares a common metadata.json file.
>>>   - Listing UDFs will return a list of UDF names, not a list of individual signatures.
>>>   - Loading a UDF by name will return all of its overloads.
>>> - Versioning Strategy
>>>   - A global version number will track changes across the entire UDF entity; it increments monotonically.
>>>   - Each overload will also maintain its own version (e.g., updated_at_version) to trace changes specific to that overload.
>>> - For simplicity, the load API will not support argument-based filtering in the initial release. It will always return all overloads for a given UDF name; overload-level loading is not supported at this stage.
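
The versioning strategy summarized in the 8/11 notes can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `UdfEntity`, `Overload`, `current_version`, `put_overload`, and `load` are hypothetical and are not taken from the spec or the PR; only `updated_at_version` appears in the notes above.

```python
from dataclasses import dataclass, field


@dataclass
class Overload:
    # Hypothetical shape: argument types, an explicit return type, and a body.
    arg_types: tuple
    return_type: str
    body: str
    updated_at_version: int  # global version at which this overload last changed


@dataclass
class UdfEntity:
    # One entity per function name, holding every overload.
    name: str
    current_version: int = 0  # global version, only ever increases
    overloads: dict = field(default_factory=dict)  # keyed by argument types

    def put_overload(self, arg_types, return_type, body):
        # Any change bumps the single global version for the whole entity
        # and stamps the touched overload with that version.
        self.current_version += 1
        key = tuple(arg_types)
        self.overloads[key] = Overload(key, return_type, body, self.current_version)

    def load(self):
        # No argument-based filtering: always return every overload.
        return list(self.overloads.values())


udf = UdfEntity("add")
udf.put_overload(["int", "int"], "int", "x + y")
udf.put_overload(["long", "long"], "long", "x + y")
udf.put_overload(["int", "int"], "int", "y + x")  # replaces the first overload
```

In this sketch the global version ends at 3, while the `(long, long)` overload still records version 2 as its last change, which is the kind of per-overload tracing the notes describe.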
>>>
>>> Watch the recording here:
>>> https://drive.google.com/file/d/10G2HjUH2DaKSjGufEOjMu0bBuNd7sCzO/view
>>>
>>> Yufei
>>>
>>> On Fri, Aug 8, 2025 at 3:11 PM Yufei Gu <[email protected]> wrote:
>>>
>>>> To recap and add my thoughts, we want to support UDFs with multiple signatures under the same name, which can serve both overload-aware and overload-naive engines.
>>>>
>>>> Per my investigation[1], most engines support overloading by arguments and allow implicit conversions like numeric widening (e.g., INT → BIGINT/FLOAT). This resolution approach can cause issues such as silent behavior changes. Here is an example:
>>>>
>>>> - Initially, only foo(DOUBLE) exists.
>>>> - foo(42::INT) widens INT → DOUBLE and runs the expected code.
>>>> - Later, a malicious user creates foo(BIGINT).
>>>> - The engine’s best-match resolution now binds the same call to the new overload, changing behavior without modifying the query.
>>>>
>>>> To mitigate this issue, we have to choose between these two access control models:
>>>>
>>>> 1. Model A – Name-Level ACL: Grants apply to all overloads of a function name.
>>>> 2. Model B – Signature-Level ACL: Grants are tied to specific signatures.
>>>>
>>>> The general recommendation is to adopt *Model A*. It trades some precision for safety and simplicity, while eliminating the silent behavior change problem. More details are in this doc[1].
>>>>
>>>> [1] https://docs.google.com/document/d/1E8mR-vInbQ8LDa5Lv3f22i6f8sceHojnEzxEJ6s6cvc/edit?tab=t.0
>>>>
>>>> Yufei
>>>>
>>>> On Tue, Jul 29, 2025 at 1:07 AM Ajantha Bhat <[email protected]> wrote:
>>>>
>>>>> Thanks to everyone who joined the sync.
>>>>> Here is the meeting recording:
>>>>> https://drive.google.com/file/d/1L5S6nb-C_pzBwFlClwO_sG1AVBA_ROKo/view
>>>>>
>>>>> Summary:
>>>>> We have discussed how to define function identifiers (should also handle function overloading).
>>>>> Ryan suggested that we should check how Spark does it. We can refer to functions using an identifier and then bind the different signatures to it, so that access policies can be applied per identifier. This is also linked to how we want to version the functions when overloading is supported.
>>>>>
>>>>> I will look into this further and update the proposal doc.
>>>>>
>>>>> Please check/subscribe to the dev events calendar for the next meeting link (Aug 11).
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Sun, Jul 27, 2025 at 10:46 PM Kevin Liu <[email protected]> wrote:
>>>>>
>>>>>> Hi Ajantha,
>>>>>>
>>>>>> I see that the UDF Sync is scheduled in the "Iceberg Dev Events" calendar for tomorrow 7/28 at 9AM PT. I missed the last one, but I'll be at this one.
>>>>>>
>>>>>> Best,
>>>>>> Kevin Liu
>>>>>>
>>>>>> On Mon, Jul 14, 2025 at 9:22 AM Ajantha Bhat <[email protected]> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> No one joined the sync today. I learned that Yufei is on holiday, and Ryan and others couldn't make it, similar to the last sync. It seems Yufei might have forgotten to transfer meeting ownership as well, as new members needed admin approval and couldn't join automatically this week. Also, I understand it is summer holiday season for many.
>>>>>>>
>>>>>>> I've updated the function signature schema and other open points. I believe we're very close to the final version of the spec. A meeting is indeed necessary to finalize this, but we don't have to wait for it to finish the review process. We have already had many meetings on this in the past. So, please review the document at your earliest convenience. If we agree on the spec by next week, I can raise a PR.
>>>>>>>
>>>>>>> - Ajantha
>>>>>>>
>>>>>>> On Thu, Jul 3, 2025 at 4:03 AM Yufei Gu <[email protected]> wrote:
>>>>>>>
>>>>>>>> I’d propose moving the field `properties` from a top-level field to a field inside “version”, alongside the representation, so that properties are versioned. A property like “deterministic” could change along with the representation over time. For example, we need to change “deterministic” from true to false when adding a non-deterministic SQL expression/function (e.g., now()) inside a UDF. Otherwise, rollback won't be safe.
>>>>>>>>
>>>>>>>> That said, it's still an open question whether we need any non-versioned properties. We can introduce them later if a use case arises.
>>>>>>>>
>>>>>>>> Yufei
>>>>>>>>
>>>>>>>> On Wed, Jul 2, 2025 at 3:06 PM Yufei Gu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the summary, Ajantha!
>>>>>>>>>
>>>>>>>>> I’d prefer to keep the signature list separate from the representation history. Here are the reasons:
>>>>>>>>>
>>>>>>>>> 1. Each version still enforces a single signature. Although the signatures array is global to the UDF, each version references just one signature ID. Rollbacks to historical versions remain safe.
>>>>>>>>> 2. We’ve separated the less frequently changing component (signatures) from the more dynamic one (representations) to reduce metadata file size.
>>>>>>>>> 3. Since signatures use Iceberg data types, they should remain unaffected by multi-dialect representation differences.
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>> On Mon, Jun 30, 2025 at 11:28 AM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>> https://drive.google.com/file/d/1FcOSbHo9ZIVeZXdUlmoG42o-chB7Q15P/view?usp=sharing
>>>>>>>>>>
>>>>>>>>>> Summary:
>>>>>>>>>> We discussed the action items from the last sync (*see Appendix C* in the proposal doc):
>>>>>>>>>>
>>>>>>>>>> - Function overloading: Supported by a few engines and on the roadmaps of many others. Iceberg will support it. We will maintain a `FunctionIdentifier` (it extends `TableIdentifier` but also has a member containing the list of function argument types). All operations such as load, rename, list, create, and drop are based on `FunctionIdentifier`.
>>>>>>>>>> - Secure UDF: If we store it as a property in a bag, we need to standardize the property name. Iceberg encryption may be orthogonal to this discussion.
>>>>>>>>>> - UDFs with multi-statement and procedural bodies are supported by some engines. Iceberg will support them. The body is stored as is when the engine creates the function.
>>>>>>>>>>
>>>>>>>>>> New discussions around:
>>>>>>>>>>
>>>>>>>>>> - Standardizing the property names (deterministic, secure).
>>>>>>>>>> - The rename function.
>>>>>>>>>> - Replace function: check up to what level replace is supported (considering function overloading).
>>>>>>>>>> - Should the signature be associated with the representation?
>>>>>>>>>>
>>>>>>>>>> I think we are close on the spec. Please review the proposal <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>>>>>>>>
>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>
>>>>>>>>>> *Monday, July 14 · 9:00 – 10:00am*
>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>> Google Meet joining info
>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 30, 2025 at 9:27 PM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Can it be handled by Iceberg encryption? If the whole metadata is encrypted, we don't have to worry about just hiding the UDF body? Let us discuss this further in the sync today.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 30, 2025 at 9:22 PM Yufei Gu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, hiding the definition and disabling pushdown are required. We will need a named key (e.g., secure) somewhere, whether it is a top-level property or a key within the UDF properties, so that both the UDF creator and the consumer can recognize it.
>>>>>>>>>>>>
>>>>>>>>>>>> Yufei
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 26, 2025 at 4:27 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the extra detail. What do you think the spec would require? Would it require hiding the UDF definition from users and require specific pushdown cases be disabled? The use cases seem valid, but I'm trying to understand the requirements this places on engines and why it needs to be part of the spec, rather than part of the properties of the UDF.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 20, 2025 at 3:56 PM Yufei Gu <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are the main use cases for secure UDFs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Hiding UDF Definitions: This includes concealing the UDF body and details like the list of imports, some of which aren’t applicable to SQL UDFs.
>>>>>>>>>>>>>> 2. Sandboxed Execution: Ensuring the UDF runs in an isolated environment. Again, this typically doesn’t apply to SQL UDFs.
>>>>>>>>>>>>>> 3. Preventing Data Leakage at Execution Time: For example, secure UDFs may disable certain optimizations, such as predicate pushdown, to avoid exposing sensitive data indirectly. [1]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given these scenarios, I agree with your point that the secure flag is primarily an instruction to the engine to behave differently. While it's largely an engine-side behavior, we still need to include this flag in the UDF definition to indicate whether a UDF is secure, especially considering the perf penalty introduced by scenario #3. We should clearly recommend that users avoid marking UDFs as secure unless it's truly necessary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yufei, could you make the argument for supporting a "secure" UDF? What use case are you addressing and what specifically changes about how the UDF is handled? If the idea is to hide the UDF definition, do we need to include it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this would be a signal to a "trusted engine". When the engine interacts with the catalog it sends authorization information about itself in addition to the user that it is acting on behalf of. That way the catalog knows that the secure UDF can be sent to the engine and won't be shown to the user. The majority of this logic is on the REST server side, and the only part that is communicated to the client is the request not to show the UDF to the user, right? In that case should this be a property rather than part of the definition? Even if we state that the client "must" suppress the UDF definition, it's really just a request. Only trusted engines can be passed the UDF definition, so a spec requirement to suppress the definition isn't very meaningful.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the summary, Ajantha!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Multi-statement UDFs are definitely useful, but whether those statements run within a single transaction should be treated as an engine-level concern. The Iceberg UDF spec can spell out the expectation, yet the actual guarantee still depends on the runtime. Even if a UDF declares itself transactional, the engine may or may not enforce it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One more thing: should we also introduce a “secure UDF” option supported by some engines[1], so the body and any sensitive details stay hidden from callers?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] https://docs.snowflake.com/en/developer-guide/secure-udf-procedure
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>>>>>>>>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing
>>>>>>>>>>>>>>>>> Summary:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - We went through the SQL UDF syntax supported by different engines (Snowflake, Databricks, Dremio, Trino, OSS Spark 4.0).
>>>>>>>>>>>>>>>>> - Each engine uses its own block separator, like $$ or '' or none. The action item was to check whether engines support multi-statement (transactional) UDF bodies.
>>>>>>>>>>>>>>>>> - Discussed function overloading. We need to check whether these engines support function overloading for SQL UDFs. Postgres supports it! If so, we need to adapt the spec to handle it.
>>>>>>>>>>>>>>>>> - Started the online spec review and discussed the deterministic flag. We concluded that we keep independent fields (like deterministic) in the spec only if the majority of engines support them. Otherwise, they will be passed in a property bag (engine specific), and it is the engine's responsibility to honor those optional properties.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Feel free to review the current proposal document here <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The final spec will be put up for review and a vote once it is ready.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Monday, June 30 · 9:00 – 10:00am*
>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Summary:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - We discussed including Python support; the majority agreed *not to* (see recording for details).
>>>>>>>>>>>>>>>>>> - No strong opposition to versioning; it will be included to support change tracking and similar use cases.
>>>>>>>>>>>>>>>>>> - Suggestions were made to document how each catalog resolves UDFs, similar to views and tables.
>>>>>>>>>>>>>>>>>> - We agreed not to deviate from the existing table/view spec, e.g., location will remain *required* for cross-catalog compatibility.
>>>>>>>>>>>>>>>>>> - We also briefly discussed view interoperability, as the same considerations apply here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Feel free to review the proposal document <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0> here.
>>>>>>>>>>>>>>>>>> With the current scope, it is similar to the view/table spec now.
>>>>>>>>>>>>>>>>>> The final spec will be put up for review and a vote once it is ready.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Monday, June 16 · 9:00 – 10:00am*
>>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We’ve set up a dedicated bi-weekly community sync for the UDF project. Everyone’s welcome to drop in and share ideas! Here is the meeting link:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Iceberg UDF sync
>>>>>>>>>>>>>>>>>>> Monday, June 2 · 9:00 – 10:00am
>>>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Update on the progress.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I had a meeting today with Yufei and Yun.zou to discuss the UDF proposal. We covered several key points, though some are still open for further discussion:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> a) *UDF Versioning*: Do we truly need versioning for UDFs at this stage? We explored the possibility of simplifying the specification by avoiding view replication, and potentially introducing versioning support later. UDTFs, being a superset of views in some ways, may not require versioning initially.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> b) *VarArgs Support*: While some query engines may not support vararg syntax in CREATE FUNCTION, Iceberg UDFs could represent such arguments as lists when supported by the engine.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t support generic types (e.g., object), we can only map engine-specific types to Iceberg types. As a result, generic data types will not be supported in the initial version.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> d) *Python Support*: Incorporating Python as a language for SQL UDFs seems promising, especially given its potential to resolve interoperability challenges. Some engines, however, require platform version and package dependency details to execute Python code; this should be captured in the specification.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *Next Steps*
>>>>>>>>>>>>>>>>>>>> I will update the proposal document with two primary UDF use cases:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Policy exchange between engines
>>>>>>>>>>>>>>>>>>>> - UDTF as a superset of view functionality
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The update will include corresponding syntax examples in both SQL and Python, and detail how each use case is represented in Iceberg metadata.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We also plan to set up regular syncs (open to more interested participants) to continue refining and finalizing the UDF specification.
>>>>>>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I've updated the design document[1] based on the previous comments. Additionally, I've included the SQL UDF syntax supported by various vendors, including Dremio, Snowflake, Databricks, and Trino.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm happy to schedule a separate sync if a deeper discussion is needed. Let's keep moving forward, especially with the renewed interest from the community.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1] https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> During the last catalog community sync, there was significant interest in storing UDFs in Iceberg and adding endpoints for UDF handling in the REST catalog spec.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I recently discussed this with Yufei to better understand the new requirement of using UDFs for fine-grained access control policies. This expands the use cases beyond just versioned and interoperable UDFs. Additionally, I learned that many vendors are interested in this feature.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Given the strong community interest and support, I’d like to take ownership of this effort and revive the work. I'll be revisiting the document I proposed a while back and will share an updated proposal by next week.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Looking forward to storing UDFs in Iceberg!
>>>>>>>>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The UDF spec does not require representations to be SQL.
>>>>>>>>>>>>>>>>>>>>>>> It merely does not specify (in this revision) how other representations are to be written.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> This seems like an easy extension (adding a new type in the "Representations" section).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Right now, SQL is an explicit requirement of the spec. It leaves a way for future versions to add different representations later, but only SQL is supported. That was also the feedback to my initial skepticism about how it would work to add functions.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I do not think the spec is meant to allow only SQL representations, although it is certainly favouring SQL in examples... It would be nice to add a non-SQL example, indeed.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Coming from PyIceberg, I have concerns as this proposal focuses on SQL-based engines, while Python-based systems often work with data frames.
>>>>>>>>>>>>>>>>>>>>>>>>>> Adding imperative languages like Python would make this proposal more inclusive.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 8 Aug 2024 at 10:27, Piotr Findeisen <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Walaa, thanks for asking!
>>>>>>>>>>>>>>>>>>>>>>>>>>> In the design doc linked earlier in this thread [1] I read
>>>>>>>>>>>>>>>>>>>>>>>>>>> "Without a common standard, the UDFs are hard to share among different engines."
>>>>>>>>>>>>>>>>>>>>>>>>>>> ("Background and Motivation" section).
>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with this statement. I don't fully understand yet how the proposed design addresses shareability between the engines though.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I could use some help to understand this better.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotr
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] SQL User-Defined Function Spec
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Piotr, what do you mean by making user-created functions shareable between engines?
Do you mean UDFs written in >>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative code? >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen >>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > Hi, >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > Thank you Ajantha for creating this thread. >>>>>>>>>>>>>>>>>>>>>>>>>>>> The Iceberg UDFs are an interesting idea! >>>>>>>>>>>>>>>>>>>>>>>>>>>> > Is there a plan to make the user-created >>>>>>>>>>>>>>>>>>>>>>>>>>>> functions sharable between the engines? >>>>>>>>>>>>>>>>>>>>>>>>>>>> > If so, what would a CREATE FUNCTION statement >>>>>>>>>>>>>>>>>>>>>>>>>>>> look like in e.g. Spark or Trino? >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc. >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > Best >>>>>>>>>>>>>>>>>>>>>>>>>>>> > Piotr >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> I just looked through the proposal and added >>>>>>>>>>>>>>>>>>>>>>>>>>>> comments. I think it would be helpful to also have >>>>>>>>>>>>>>>>>>>>>>>>>>>> a design doc that covers >>>>>>>>>>>>>>>>>>>>>>>>>>>> the choices from the draft spec. For instance, the >>>>>>>>>>>>>>>>>>>>>>>>>>>> choice to enumerate all >>>>>>>>>>>>>>>>>>>>>>>>>>>> possible function input structs rather than >>>>>>>>>>>>>>>>>>>>>>>>>>>> allowing generics and varargs. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Here’s a quick summary of my feedback: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> I think that the choice to enumerate >>>>>>>>>>>>>>>>>>>>>>>>>>>> function signatures is limiting.
It would be nice >>>>>>>>>>>>>>>>>>>>>>>>>>>> to see a discussion of >>>>>>>>>>>>>>>>>>>>>>>>>>>> the trade-offs and a rationale for the choice. I >>>>>>>>>>>>>>>>>>>>>>>>>>>> think it would also be >>>>>>>>>>>>>>>>>>>>>>>>>>>> very helpful to have a few representative use >>>>>>>>>>>>>>>>>>>>>>>>>>>> cases for this included in >>>>>>>>>>>>>>>>>>>>>>>>>>>> the doc. That way the proposal can demonstrate >>>>>>>>>>>>>>>>>>>>>>>>>>>> that it solves those use >>>>>>>>>>>>>>>>>>>>>>>>>>>> cases with reasonable trade-offs. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> There are a few instances where this is >>>>>>>>>>>>>>>>>>>>>>>>>>>> inconsistent with conventions in other specs. For >>>>>>>>>>>>>>>>>>>>>>>>>>>> example, using string IDs >>>>>>>>>>>>>>>>>>>>>>>>>>>> rather than an integer. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> This uses a very different model for spec >>>>>>>>>>>>>>>>>>>>>>>>>>>> versioning than the Iceberg view and table specs. >>>>>>>>>>>>>>>>>>>>>>>>>>>> It requires readers to >>>>>>>>>>>>>>>>>>>>>>>>>>>> fail if there are any unknown fields, which >>>>>>>>>>>>>>>>>>>>>>>>>>>> prevents the spec from adding >>>>>>>>>>>>>>>>>>>>>>>>>>>> things that are fully backward-compatible. Other >>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg specs only require >>>>>>>>>>>>>>>>>>>>>>>>>>>> a version change to introduce forward-incompatible >>>>>>>>>>>>>>>>>>>>>>>>>>>> changes and I think that >>>>>>>>>>>>>>>>>>>>>>>>>>>> this should do the same to avoid confusion. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> It looks like the intent is to allow >>>>>>>>>>>>>>>>>>>>>>>>>>>> multiple function signatures per version, but it >>>>>>>>>>>>>>>>>>>>>>>>>>>> is unclear how to encode >>>>>>>>>>>>>>>>>>>>>>>>>>>> them because a version is associated with a single >>>>>>>>>>>>>>>>>>>>>>>>>>>> function signature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >> There is no review of SQL syntax for >>>>>>>>>>>>>>>>>>>>>>>>>>>> creating functions across engines, so this doesn’t >>>>>>>>>>>>>>>>>>>>>>>>>>>> show that the metadata >>>>>>>>>>>>>>>>>>>>>>>>>>>> proposed is sufficient for cross-engine use cases. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> The example for a table-valued function >>>>>>>>>>>>>>>>>>>>>>>>>>>> shows a SELECT statement and it isn’t clear how >>>>>>>>>>>>>>>>>>>>>>>>>>>> this is distinct from a view >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat < >>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on >>>>>>>>>>>>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> We didn't find any blocker for the spec. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> I will wait for a week and If no more >>>>>>>>>>>>>>>>>>>>>>>>>>>> review comments, I will raise a PR for spec >>>>>>>>>>>>>>>>>>>>>>>>>>>> addition next week. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> If anyone else is interested, please have a >>>>>>>>>>>>>>>>>>>>>>>>>>>> look at the proposal >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin >>>>>>>>>>>>>>>>>>>>>>>>>>>> Moustafa <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Hi Ajantha, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> I have left some comments. 
It is an >>>>>>>>>>>>>>>>>>>>>>>>>>>> interesting direction, but there might be some >>>>>>>>>>>>>>>>>>>>>>>>>>>> details that need to be fine >>>>>>>>>>>>>>>>>>>>>>>>>>>> tuned. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> The doc is here [1] for others who might >>>>>>>>>>>>>>>>>>>>>>>>>>>> be interested. Resharing since I do not think it >>>>>>>>>>>>>>>>>>>>>>>>>>>> was directly linked in the >>>>>>>>>>>>>>>>>>>>>>>>>>>> thread. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> [1] >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> Walaa. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't >>>>>>>>>>>>>>>>>>>>>>>>>>>> get any review on the proposal. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> Initially proposed on June 4. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> Hi everyone, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> We've only received one review so far >>>>>>>>>>>>>>>>>>>>>>>>>>>> (from Benny). >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hi All, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Please find the proposal link >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/10432 >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the >>>>>>>>>>>>>>>>>>>>>>>>>>>> proposal. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on >>>>>>>>>>>>>>>>>>>>>>>>>>>> it. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity to take the >>>>>>>>>>>>>>>>>>>>>>>>>>>> decisions and how we want to implement it. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa >>>>>>>>>>>>>>>>>>>>>>>>>>>> Eldin Moustafa <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant >>>>>>>>>>>>>>>>>>>>>>>>>>>> scalar/aggregate/table user defined functions. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here are some examples of >>>>>>>>>>>>>>>>>>>>>>>>>>>> what I meant in (2): >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF: >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Trino user defined functions: >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/develop/functions.html >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Flink user defined functions: >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a >>>>>>>>>>>>>>>>>>>>>>>>>>>> variation of (1) where the API is data flow/data >>>>>>>>>>>>>>>>>>>>>>>>>>>> pipeline API instead of >>>>>>>>>>>>>>>>>>>>>>>>>>>> SQL (e.g., Spark Scala). Yes, that is also >>>>>>>>>>>>>>>>>>>>>>>>>>>> possible in the very long run :) >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack >>>>>>>>>>>>>>>>>>>>>>>>>>>> Ye <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in >>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative function according to a >>>>>>>>>>>>>>>>>>>>>>>>>>>> Java/Scala/Python API, etc. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some >>>>>>>>>>>>>>>>>>>>>>>>>>>> long term opportunities in this case. 
Consider you >>>>>>>>>>>>>>>>>>>>>>>>>>>> register a Spark temp >>>>>>>>>>>>>>>>>>>>>>>>>>>> view as some sort of data frame read, then it >>>>>>>>>>>>>>>>>>>>>>>>>>>> could still be resolved to a >>>>>>>>>>>>>>>>>>>>>>>>>>>> Spark plan that is representable by an >>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate representation. But I >>>>>>>>>>>>>>>>>>>>>>>>>>>> agree this gets very complicated very soon, and >>>>>>>>>>>>>>>>>>>>>>>>>>>> just having the case (1) >>>>>>>>>>>>>>>>>>>>>>>>>>>> covered would already be a huge step forward. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> -Jack >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny >>>>>>>>>>>>>>>>>>>>>>>>>>>> Chow <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a >>>>>>>>>>>>>>>>>>>>>>>>>>>> tabular SQL UDF can be used to build a >>>>>>>>>>>>>>>>>>>>>>>>>>>> parameterized view. So, there's >>>>>>>>>>>>>>>>>>>>>>>>>>>> definitely a lot in common between UDFs and views. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>> Walaa Eldin Moustafa <[email protected]> >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about >>>>>>>>>>>>>>>>>>>>>>>>>>>> what is perceived as a "UDF". There are 2 flavors: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by >>>>>>>>>>>>>>>>>>>>>>>>>>>> the user whose definition is a composition of >>>>>>>>>>>>>>>>>>>>>>>>>>>> other built-in functions/SQL >>>>>>>>>>>>>>>>>>>>>>>>>>>> expressions. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in >>>>>>>>>>>>>>>>>>>>>>>>>>>> imperative function according to a >>>>>>>>>>>>>>>>>>>>>>>>>>>> Java/Scala/Python API, etc. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's >>>>>>>>>>>>>>>>>>>>>>>>>>>> references are pretty much from (1) and I think >>>>>>>>>>>>>>>>>>>>>>>>>>>> those have more analogy to >>>>>>>>>>>>>>>>>>>>>>>>>>>> views due to their SQL nature. Agree (2) is not >>>>>>>>>>>>>>>>>>>>>>>>>>>> practical to maintain by >>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg, but I think Ajantha's use cases are >>>>>>>>>>>>>>>>>>>>>>>>>>>> around (1), and may be worth >>>>>>>>>>>>>>>>>>>>>>>>>>>> evaluating. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>> Ajantha Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you >>>>>>>>>>>>>>>>>>>>>>>>>>>> post the proposal, but I think this would be a >>>>>>>>>>>>>>>>>>>>>>>>>>>> very difficult area to >>>>>>>>>>>>>>>>>>>>>>>>>>>> tackle across engines, languages, and memory >>>>>>>>>>>>>>>>>>>>>>>>>>>> models without having a huge >>>>>>>>>>>>>>>>>>>>>>>>>>>> performance penalty. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially >>>>>>>>>>>>>>>>>>>>>>>>>>>> supports SQL representations of UDFs (similar to >>>>>>>>>>>>>>>>>>>>>>>>>>>> views as shared by the >>>>>>>>>>>>>>>>>>>>>>>>>>>> reference links above), the complexity involved >>>>>>>>>>>>>>>>>>>>>>>>>>>> will be similar to managing >>>>>>>>>>>>>>>>>>>>>>>>>>>> views. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, >>>>>>>>>>>>>>>>>>>>>>>>>>>> for your input. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the >>>>>>>>>>>>>>>>>>>>>>>>>>>> draft spec (inspired by the view spec) this week >>>>>>>>>>>>>>>>>>>>>>>>>>>> to facilitate further >>>>>>>>>>>>>>>>>>>>>>>>>>>> discussions. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM >>>>>>>>>>>>>>>>>>>>>>>>>>>> Jack Ye <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have >>>>>>>>>>>>>>>>>>>>>>>>>>>> a common set of functions across engines, I don't >>>>>>>>>>>>>>>>>>>>>>>>>>>> see how that is practical >>>>>>>>>>>>>>>>>>>>>>>>>>>> when those engines are implemented so differently. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Plugging in code -- and >>>>>>>>>>>>>>>>>>>>>>>>>>>> especially custom user-supplied code -- seems >>>>>>>>>>>>>>>>>>>>>>>>>>>> inherently specialized to me >>>>>>>>>>>>>>>>>>>>>>>>>>>> and should be part of the engines' design. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the >>>>>>>>>>>>>>>>>>>>>>>>>>>> views? I feel we can say exactly the same thing >>>>>>>>>>>>>>>>>>>>>>>>>>>> for Iceberg views, but yet >>>>>>>>>>>>>>>>>>>>>>>>>>>> we have Iceberg multi-dialect views implemented. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe it sounds like we >>>>>>>>>>>>>>>>>>>>>>>>>>>> are trying to draw a line between SQL vs other >>>>>>>>>>>>>>>>>>>>>>>>>>>> programming language as >>>>>>>>>>>>>>>>>>>>>>>>>>>> "code"? 
but I think SQL is just another type of >>>>>>>>>>>>>>>>>>>>>>>>>>>> code, and we are already >>>>>>>>>>>>>>>>>>>>>>>>>>>> talking about compiling all these different code >>>>>>>>>>>>>>>>>>>>>>>>>>>> dialects to an >>>>>>>>>>>>>>>>>>>>>>>>>>>> intermediate representation (using projects like >>>>>>>>>>>>>>>>>>>>>>>>>>>> Coral, Substrait), which >>>>>>>>>>>>>>>>>>>>>>>>>>>> will be stored as another type of representation >>>>>>>>>>>>>>>>>>>>>>>>>>>> of Iceberg view. I think >>>>>>>>>>>>>>>>>>>>>>>>>>>> the same functionality can be used for UDFs if >>>>>>>>>>>>>>>>>>>>>>>>>>>> developed. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF >>>>>>>>>>>>>>>>>>>>>>>>>>>> support is a good idea, even just a multi-dialect >>>>>>>>>>>>>>>>>>>>>>>>>>>> one like view, and that >>>>>>>>>>>>>>>>>>>>>>>>>>>> can allow engines to for example parse a view SQL, >>>>>>>>>>>>>>>>>>>>>>>>>>>> and when a function >>>>>>>>>>>>>>>>>>>>>>>>>>>> referenced cannot be resolved, try to seek for a >>>>>>>>>>>>>>>>>>>>>>>>>>>> multi-dialect UDF >>>>>>>>>>>>>>>>>>>>>>>>>>>> definition. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when >>>>>>>>>>>>>>>>>>>>>>>>>>>> we have the actual proposal published. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>> Robert Stupp <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and >>>>>>>>>>>>>>>>>>>>>>>>>>>> portable and "non-centralized" as views are.
The >>>>>>>>>>>>>>>>>>>>>>>>>>>> same performance concerns >>>>>>>>>>>>>>>>>>>>>>>>>>>> apply to views as well. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common >>>>>>>>>>>>>>>>>>>>>>>>>>>> base upon which engines can build, so the argument >>>>>>>>>>>>>>>>>>>>>>>>>>>> that UDFs aren't >>>>>>>>>>>>>>>>>>>>>>>>>>>> practical, because engines are different, is >>>>>>>>>>>>>>>>>>>>>>>>>>>> probably only a temporary >>>>>>>>>>>>>>>>>>>>>>>>>>>> concern. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should >>>>>>>>>>>>>>>>>>>>>>>>>>>> also try to tackle the idea to make views >>>>>>>>>>>>>>>>>>>>>>>>>>>> portable, which is conceptually >>>>>>>>>>>>>>>>>>>>>>>>>>>> not that much different from portable UDFs. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a >>>>>>>>>>>>>>>>>>>>>>>>>>>> negative touch to the idea of having UDFs in >>>>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg, especially not in >>>>>>>>>>>>>>>>>>>>>>>>>>>> this early stage. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's >>>>>>>>>>>>>>>>>>>>>>>>>>>> a good idea to add UDFs tracked by Iceberg >>>>>>>>>>>>>>>>>>>>>>>>>>>> catalogs. I think that Iceberg >>>>>>>>>>>>>>>>>>>>>>>>>>>> primarily deals with things that are centralized, >>>>>>>>>>>>>>>>>>>>>>>>>>>> like tables of data. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> While it would be great to have a common set of >>>>>>>>>>>>>>>>>>>>>>>>>>>> functions across engines, I >>>>>>>>>>>>>>>>>>>>>>>>>>>> don't see how that is practical when those engines >>>>>>>>>>>>>>>>>>>>>>>>>>>> are implemented so >>>>>>>>>>>>>>>>>>>>>>>>>>>> differently. Plugging in code -- and especially >>>>>>>>>>>>>>>>>>>>>>>>>>>> custom user-supplied code >>>>>>>>>>>>>>>>>>>>>>>>>>>> -- seems inherently specialized to me and should >>>>>>>>>>>>>>>>>>>>>>>>>>>> be part of the engines' >>>>>>>>>>>>>>>>>>>>>>>>>>>> design. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you >>>>>>>>>>>>>>>>>>>>>>>>>>>> post the proposal, but I think this would be a >>>>>>>>>>>>>>>>>>>>>>>>>>>> very difficult area to >>>>>>>>>>>>>>>>>>>>>>>>>>>> tackle across engines, languages, and memory >>>>>>>>>>>>>>>>>>>>>>>>>>>> models without having a huge >>>>>>>>>>>>>>>>>>>>>>>>>>>> performance penalty. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>> Ajantha Bhat <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge >>>>>>>>>>>>>>>>>>>>>>>>>>>> the community interest in storing the Versioned >>>>>>>>>>>>>>>>>>>>>>>>>>>> SQL UDFs in Iceberg. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec >>>>>>>>>>>>>>>>>>>>>>>>>>>> addition for storing the versioned UDFs in Iceberg >>>>>>>>>>>>>>>>>>>>>>>>>>>> (inspired by view spec). 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate >>>>>>>>>>>>>>>>>>>>>>>>>>>> similarly to views in that they are associated >>>>>>>>>>>>>>>>>>>>>>>>>>>> with tables, but they can >>>>>>>>>>>>>>>>>>>>>>>>>>>> accept arguments and produce return values, or >>>>>>>>>>>>>>>>>>>>>>>>>>>> even function as inline >>>>>>>>>>>>>>>>>>>>>>>>>>>> expressions. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many query engines like Dremio, >>>>>>>>>>>>>>>>>>>>>>>>>>>> Trino, Snowflake, Databricks Spark support SQL >>>>>>>>>>>>>>>>>>>>>>>>>>>> UDFs at catalog level [1]. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can >>>>>>>>>>>>>>>>>>>>>>>>>>>> enable >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the >>>>>>>>>>>>>>>>>>>>>>>>>>>> engines. Potentially engines can understand the >>>>>>>>>>>>>>>>>>>>>>>>>>>> UDFs written by other >>>>>>>>>>>>>>>>>>>>>>>>>>>> engines (with a translation layer). >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating >>>>>>>>>>>>>>>>>>>>>>>>>>>> this feature into Iceberg would be a valuable >>>>>>>>>>>>>>>>>>>>>>>>>>>> addition, and we're eager to >>>>>>>>>>>>>>>>>>>>>>>>>>>> collaborate with the community to develop a UDF >>>>>>>>>>>>>>>>>>>>>>>>>>>> specification. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun >>>>>>>>>>>>>>>>>>>>>>>>>>>> drafting a specification to propose to the >>>>>>>>>>>>>>>>>>>>>>>>>>>> community. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on >>>>>>>>>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio - >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino - >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake - >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks - >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> -- >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> Databricks >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>>>>>>>>>>>> Databricks >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>
