Hi Ryan,

Here are the main use cases for secure UDFs:

1. Hiding UDF Definitions: This includes concealing the UDF body and details such as the list of imports. Some of these details aren’t applicable to SQL UDFs.
2. Sandboxed Execution: Ensuring the UDF runs in an isolated environment. Again, this typically doesn’t apply to SQL UDFs.
3. Preventing Data Leakage at Execution Time: For example, secure UDFs may disable certain optimizations, such as predicate pushdown, to avoid exposing sensitive data indirectly. [1]

Given these scenarios, I agree with your point that the secure flag is primarily an instruction to the engine to behave differently. While it's largely an engine-side behavior, we still need to include this flag in the UDF definition to indicate whether a UDF is secure, especially considering the performance penalty introduced by scenario #3. We should clearly recommend that users avoid marking UDFs as secure unless it's truly necessary.

[1] https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown

Yufei

On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <rdb...@gmail.com> wrote: > Yufei, could you make the argument for supporting a "secure" UDF? What use > case are you addressing and what specifically changes about how the UDF is > handled? If the idea is to hide the UDF definition, do we need to include > it? > > I think this would be a signal to a "trusted engine". When the engine > interacts with the catalog it sends authorization information about itself > in addition to the user that it is acting on behalf of. That way the > catalog knows that the secure UDF can be sent to the engine and won't be > shown to the user. The majority of this logic is on the REST server side, > and the only part that is communicated to the client is the request not to > show the UDF to the user, right? In that case should this be a property > rather than part of the definition? Even if we state that the client "must" > suppress the UDF definition, it's really just a request.
Only trusted > engines can be passed the UDF definition, so a spec requirement to suppress > the definition isn't very meaningful. > > On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu <flyrain...@gmail.com> wrote: > >> Thanks for the summary, Ajantha! >> >> Multi-statement UDFs are definitely useful, but whether those statements >> run within a single transaction should be treated as an engine-level >> concern. The Iceberg UDF spec can spell out the expectation, yet the actual >> guarantee still depends on the runtime. Even if a UDF declares itself >> transactional, the engine may or may not enforce it. >> >> One more thing: should we also introduce a “secure UDF” option supported >> by some engines[1], so the body and any sensitive details stay hidden from >> callers? >> >> [1] https://docs.snowflake.com/en/developer-guide/secure-udf-procedure >> >> Yufei >> >> >> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat <ajanthab...@gmail.com> >> wrote: >> >>> Thanks to everyone who joined the sync. >>> Here is the meeting recording: >>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing >>> Summary: >>> >>> - We have gone through the SQL UDF syntax supported by different >>> engines (Snowflake, Databricks, Dremio, Trino, OSS Spark 4.0). >>> - Each engine uses its own block separator, like $$ or '' or none. >>> Action item was to check whether engines support multi-statement >>> (transactional) UDF bodies. >>> - Discussed function overloading. Need to check whether these >>> engines support function overloading for SQL UDFs. Postgres supports it! >>> If >>> yes, need to adapt the spec to handle it. >>> - Started online spec review and discussed the deterministic flag >>> and concluded that we keep the independent fields (like deterministic) in >>> spec only if the majority of engines support it. Else it will be passed >>> in >>> a property bag (engine specific). And it is the engine's responsibility >>> to >>> honor those optional properties.
>>> >>> Feel free to review the current proposal document here >>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>. >>> >>> Final spec will be put to review and vote once it is ready. >>> >>> Details for next Iceberg UDF sync: >>> >>> *Monday, June 30 · 9:00 – 10:00am*Time zone: America/Los_Angeles >>> Google Meet joining info >>> Video call link: https://meet.google.com/aui-czix-nbh >>> >>> - Ajantha >>> >>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat <ajanthab...@gmail.com> >>> wrote: >>> >>>> Thanks to everyone who joined the sync. >>>> Here is the meeting recording: >>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing >>>> >>>> Summary: >>>> >>>> - >>>> >>>> We discussed including Python support; the majority agreed *not to* >>>> (see recording for details). >>>> - >>>> >>>> No strong opposition to versioning — it will be included to support >>>> change tracking and similar use cases. >>>> - >>>> >>>> Suggestions were made to document how each catalog resolves UDFs, >>>> similar to views and tables. >>>> - >>>> >>>> We agreed not to deviate from the existing table/view spec — e.g., >>>> location will remain *required* for cross-catalog compatibility. >>>> - >>>> >>>> We also discussed a bit about view interoperability as the same >>>> things are applicable here. >>>> >>>> Feel free to review the proposal document >>>> >>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0> >>>> here. >>>> With the current scope, it is similar to the view/table spec now. >>>> Final spec will be put to review and vote once it is ready. 
>>>> >>>> Details for next Iceberg UDF sync: >>>> >>>> *Monday, June 16 · 9:00 – 10:00am*Time zone: America/Los_Angeles >>>> Google Meet joining info >>>> Video call link: https://meet.google.com/aui-czix-nbh >>>> >>>> - Ajantha >>>> >>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <flyrain...@gmail.com> wrote: >>>> >>>>> Hi folks, >>>>> >>>>> We’ve set up a dedicated bi-weekly community sync for the UDF project. >>>>> Everyone’s welcome to drop in and share ideas! Here is the meeting link: >>>>> >>>>> Iceberg UDF sync >>>>> Monday, June 2 · 9:00 – 10:00am >>>>> Time zone: America/Los_Angeles >>>>> Google Meet joining info >>>>> Video call link: https://meet.google.com/aui-czix-nbh >>>>> >>>>> Yufei >>>>> >>>>> >>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat <ajanthab...@gmail.com> >>>>> wrote: >>>>> >>>>>> Update on the progress. >>>>>> >>>>>> I had a meeting today with Yufei and Yun.zou to discuss the UDF >>>>>> proposal. We covered several key points, though some are still open for >>>>>> further discussion: >>>>>> >>>>>> a) *UDF Versioning*: Do we truly need versioning for UDFs at this >>>>>> stage? We explored the possibility of simplifying the specification by >>>>>> avoiding view replication, and potentially introducing versioning support >>>>>> later. UDTFs, being a superset of views in some ways, may not require >>>>>> versioning initially. >>>>>> >>>>>> b) *VarArgs Support*: While some query engines may not support >>>>>> vararg syntax in CREATE FUNCTION, Iceberg UDFs could represent such >>>>>> arguments as lists when supported by the engine. >>>>>> >>>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t support >>>>>> generic types (e.g., object), we can only map engine-specific types >>>>>> to Iceberg types. As a result, generic data types will not be supported >>>>>> in >>>>>> the initial version. 
>>>>>> >>>>>> d) *Python Support*: Incorporating Python as a language for SQL UDFs >>>>>> seems promising, especially given its potential to resolve >>>>>> interoperability >>>>>> challenges. Some engines, however, require platform version and package >>>>>> dependency details to execute Python code—this should be captured in the >>>>>> specification. >>>>>> >>>>>> *Next Steps* >>>>>> I will update the proposal document with two primary UDF use cases: >>>>>> >>>>>> - >>>>>> >>>>>> Policy exchange between engines >>>>>> - >>>>>> >>>>>> UDTF as a superset of view functionality >>>>>> >>>>>> The update will include corresponding syntax examples in both SQL and >>>>>> Python, and detail how each use case is represented in Iceberg metadata. >>>>>> >>>>>> We also plan to set up regular syncs (open to more interested >>>>>> participants) to continue refining and finalizing the UDF specification. >>>>>> - Ajantha >>>>>> >>>>>> >>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat <ajanthab...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi everyone, >>>>>>> >>>>>>> I've updated the design document[1] based on the previous comments. >>>>>>> Additionally, I've included the SQL UDF syntax supported by various >>>>>>> vendors, including Dremio, Snowflake, Databricks, and Trino. >>>>>>> >>>>>>> I'm happy to schedule a separate sync if a deeper discussion is >>>>>>> needed. Let's keep moving forward, especially with the renewed interest >>>>>>> from the community. >>>>>>> >>>>>>> [1] >>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing >>>>>>> >>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat <ajanthab...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hey everyone, >>>>>>>> >>>>>>>> During the last catalog community sync, there was significant >>>>>>>> interest in storing UDFs in Iceberg and adding endpoints for UDF >>>>>>>> handling >>>>>>>> in the REST catalog spec. 
>>>>>>>> >>>>>>>> I recently discussed this with Yufei to better understand the new >>>>>>>> requirement of using UDFs for fine-grained access control policies. >>>>>>>> This >>>>>>>> expands the use cases beyond just versioned and interoperable UDFs. >>>>>>>> Additionally, I learnt that many vendors are interested in this >>>>>>>> feature. >>>>>>>> >>>>>>>> Given the strong community interest and support, I’d like to take >>>>>>>> ownership of this effort and revive the work. I'll be revisiting the >>>>>>>> document I proposed long back and will share an updated proposal by >>>>>>>> next >>>>>>>> week. >>>>>>>> >>>>>>>> Looking forward to storing UDFs in Iceberg! >>>>>>>> - Ajantha >>>>>>>> >>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov >>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote: >>>>>>>> >>>>>>>>> The UDF spec does not require representations to be SQL. It merely >>>>>>>>> does not specify (in this revision) how other representations are to >>>>>>>>> be >>>>>>>>> written. >>>>>>>>> >>>>>>>>> This seems like an easy extension (adding a new type in the >>>>>>>>> "Representations" section). >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Dmitri. >>>>>>>>> >>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue >>>>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>>>> >>>>>>>>>> Right now, SQL is an explicit requirement of the spec. It leaves >>>>>>>>>> a way for future versions to add different representations later, >>>>>>>>>> but only >>>>>>>>>> SQL is supported. That was also the feedback to my initial >>>>>>>>>> skepticism about >>>>>>>>>> how it would work to add functions. >>>>>>>>>> >>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov >>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote: >>>>>>>>>> >>>>>>>>>>> I do not think the spec is meant to allow only SQL >>>>>>>>>>> representations, although it is certainly favouring SQL in >>>>>>>>>>> examples... It >>>>>>>>>>> would be nice to add a non-SQL example, indeed.
>>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> Dmitri. >>>>>>>>>>> >>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong < >>>>>>>>>>> fo...@apache.org> wrote: >>>>>>>>>>> >>>>>>>>>>>> Coming from PyIceberg, I have concerns as this proposal focuses >>>>>>>>>>>> on SQL-based engines, while Python-based systems often work with >>>>>>>>>>>> data >>>>>>>>>>>> frames. Adding imperative languages like Python would make this >>>>>>>>>>>> proposal >>>>>>>>>>>> more inclusive. >>>>>>>>>>>> >>>>>>>>>>>> Kind regards, >>>>>>>>>>>> Fokko >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Op do 8 aug 2024 om 10:27 schreef Piotr Findeisen < >>>>>>>>>>>> piotr.findei...@gmail.com>: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> Walaa, thanks for asking! >>>>>>>>>>>>> In the design doc linked before in this thread [1] i read >>>>>>>>>>>>> "Without a common standard, the UDFs are hard to share among >>>>>>>>>>>>> different engines." >>>>>>>>>>>>> ("Background and Motivation" section). >>>>>>>>>>>>> I agree with this statement. I don't fully understand yet how >>>>>>>>>>>>> the proposed design addresses shareability between the engines >>>>>>>>>>>>> though. >>>>>>>>>>>>> I would use some help to understand this better. >>>>>>>>>>>>> >>>>>>>>>>>>> Best >>>>>>>>>>>>> Piotr >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> [1] SQL User-Defined Function Spec >>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa < >>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Piotr, what do you mean by making user-created functions >>>>>>>>>>>>>> shareable >>>>>>>>>>>>>> between engines? Do you mean UDFs written in imperative code? 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen >>>>>>>>>>>>>> <piotr.findei...@gmail.com> wrote: >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > Hi, >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > Thank you Ajantha for creating this thread. The Iceberg >>>>>>>>>>>>>> UDFs are an interesting idea! >>>>>>>>>>>>>> > Is there a plan to make the user-created functions sharable >>>>>>>>>>>>>> between the engines? >>>>>>>>>>>>>> > If so, how would a CREATE FUNCTION statement look like in >>>>>>>>>>>>>> e.g. Spark or Trino? >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc. >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > Best >>>>>>>>>>>>>> > Piotr >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue >>>>>>>>>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> I just looked through the proposal and added comments. I >>>>>>>>>>>>>> think it would be helpful to also have a design doc that covers >>>>>>>>>>>>>> the choices >>>>>>>>>>>>>> from the draft spec. For instance, the choice to enumerate all >>>>>>>>>>>>>> possible >>>>>>>>>>>>>> function input structs rather than allowing generics and varargs. >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> Here’s a quick summary of my feedback: >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> I think that the choice to enumerate function signatures >>>>>>>>>>>>>> is limiting. It would be nice to see a discussion of the >>>>>>>>>>>>>> trade-offs and a >>>>>>>>>>>>>> rationale for the choice. I think it would also be very helpful >>>>>>>>>>>>>> to have a >>>>>>>>>>>>>> few representative use cases for this included in the doc. That >>>>>>>>>>>>>> way the >>>>>>>>>>>>>> proposal can demonstrate that it solves those use cases with >>>>>>>>>>>>>> reasonable >>>>>>>>>>>>>> trade-offs. >>>>>>>>>>>>>> >> There are a few instances where this is inconsistent with >>>>>>>>>>>>>> conventions in other specs. For example, using string IDs rather >>>>>>>>>>>>>> than an >>>>>>>>>>>>>> integer.
>>>>>>>>>>>>>> >> This uses a very different model for spec versioning than >>>>>>>>>>>>>> the Iceberg view and table specs. It requires readers to fail if >>>>>>>>>>>>>> there are >>>>>>>>>>>>>> any unknown fields, which prevents the spec from adding things >>>>>>>>>>>>>> that are >>>>>>>>>>>>>> fully backward-compatible. Other Iceberg specs only require a >>>>>>>>>>>>>> version >>>>>>>>>>>>>> change to introduce forward-incompatible changes and I think >>>>>>>>>>>>>> that this >>>>>>>>>>>>>> should do the same to avoid confusion. >>>>>>>>>>>>>> >> It looks like the intent is to allow multiple function >>>>>>>>>>>>>> signatures per version, but it is unclear how to encode them >>>>>>>>>>>>>> because a >>>>>>>>>>>>>> version is associated with a single function signature. >>>>>>>>>>>>>> >> There is no review of SQL syntax for creating functions >>>>>>>>>>>>>> across engines, so this doesn’t show that the metadata proposed >>>>>>>>>>>>>> is >>>>>>>>>>>>>> sufficient for cross-engine use cases. >>>>>>>>>>>>>> >> The example for a table-valued function shows a SELECT >>>>>>>>>>>>>> statement and it isn’t clear how this is distinct from a view >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat < >>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on this. >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> We didn't find any blocker for the spec. >>>>>>>>>>>>>> >>> I will wait for a week and If no more review comments, I >>>>>>>>>>>>>> will raise a PR for spec addition next week.
>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> If anyone else is interested, please have a look at the >>>>>>>>>>>>>> proposal >>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> - Ajantha >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> >>>> Hi Ajantha, >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> >>>> I have left some comments. It is an interesting >>>>>>>>>>>>>> direction, but there might be some details that need to be fine >>>>>>>>>>>>>> tuned. >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> >>>> The doc is here [1] for others who might be interested. >>>>>>>>>>>>>> Resharing since I do not think it was directly linked in the >>>>>>>>>>>>>> thread. >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> >>>> [1] >>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> >>>> Thanks, >>>>>>>>>>>>>> >>>> Walaa. >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat < >>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't get any >>>>>>>>>>>>>> review on the proposal. >>>>>>>>>>>>>> >>>>> Initially proposed on June 4. >>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>> >>>>> - Ajantha >>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat < >>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>> >>>>>> Hi everyone, >>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>> >>>>>> We've only received one review so far (from Benny). >>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this. 
>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>> >>>>>> - Ajantha >>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat < >>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> Hi All, >>>>>>>>>>>>>> >>>>>>> Please find the proposal link >>>>>>>>>>>>>> >>>>>>> https://github.com/apache/iceberg/issues/10432 >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the proposal. >>>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on it. >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity to take the decisions and >>>>>>>>>>>>>> how we want to implement it. >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> - Ajantha >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa < >>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table >>>>>>>>>>>>>> user defined functions. Here are some examples of what I meant >>>>>>>>>>>>>> in (2): >>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF: >>>>>>>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >>>>>>>>>>>>>> >>>>>>>> Trino user defined functions: >>>>>>>>>>>>>> https://trino.io/docs/current/develop/functions.html >>>>>>>>>>>>>> >>>>>>>> Flink user defined functions: >>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a variation of (1) >>>>>>>>>>>>>> where the API is data flow/data pipeline API instead of SQL >>>>>>>>>>>>>> (e.g., Spark >>>>>>>>>>>>>> Scala). Yes, that is also possible in the very long run :) >>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>> >>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>> Walaa. 
>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye < >>>>>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in imperative function >>>>>>>>>>>>>> according to a Java/Scala/Python API, etc. >>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some long term >>>>>>>>>>>>>> opportunities in this case. Consider you register a Spark temp >>>>>>>>>>>>>> view as some >>>>>>>>>>>>>> sort of data frame read, then it could still be resolved to a >>>>>>>>>>>>>> Spark plan >>>>>>>>>>>>>> that is representable by an intermediate representation. But I >>>>>>>>>>>>>> agree this >>>>>>>>>>>>>> gets very complicated very soon, and just having the case (1) >>>>>>>>>>>>>> covered would >>>>>>>>>>>>>> already be a huge step forward. >>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>> -Jack >>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow < >>>>>>>>>>>>>> btc...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF >>>>>>>>>>>>>> can be used to build a parameterized view. So, there's >>>>>>>>>>>>>> definitely a lot in >>>>>>>>>>>>>> common between UDFs and views. >>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin >>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about what is >>>>>>>>>>>>>> perceived as a "UDF". There are 2 flavors: >>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by the user whose >>>>>>>>>>>>>> definition is a composition of other built-in functions/SQL >>>>>>>>>>>>>> expressions. 
>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in imperative function >>>>>>>>>>>>>> according to a Java/Scala/Python API, etc. >>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's references are >>>>>>>>>>>>>> pretty much from (1) and I think those have more analogy to >>>>>>>>>>>>>> views due to >>>>>>>>>>>>>> their SQL nature. Agree (2) is not practical to maintain by >>>>>>>>>>>>>> Iceberg, but I >>>>>>>>>>>>>> think Ajantha's use cases are around (1), and may be worth >>>>>>>>>>>>>> evaluating. >>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>> Walaa. >>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat < >>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post the >>>>>>>>>>>>>> proposal, but I think this would be a very difficult area to >>>>>>>>>>>>>> tackle across >>>>>>>>>>>>>> engines, languages, and memory models without having a huge >>>>>>>>>>>>>> performance >>>>>>>>>>>>>> penalty. >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL >>>>>>>>>>>>>> representations of UDFs (similar to views as shared by the >>>>>>>>>>>>>> reference links >>>>>>>>>>>>>> above), the complexity involved will be similar to managing >>>>>>>>>>>>>> views. >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input. >>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the draft spec >>>>>>>>>>>>>> (inspired by the view spec) this week to facilitate further >>>>>>>>>>>>>> discussions. 
>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye < >>>>>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have a common set >>>>>>>>>>>>>> of functions across engines, I don't see how that is practical >>>>>>>>>>>>>> when those >>>>>>>>>>>>>> engines are implemented so differently. Plugging in code -- and >>>>>>>>>>>>>> especially >>>>>>>>>>>>>> custom user-supplied code -- seems inherently specialized to me >>>>>>>>>>>>>> and should >>>>>>>>>>>>>> be part of the engines' design. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the views? I feel we >>>>>>>>>>>>>> can say exactly the same thing for Iceberg views, but yet we >>>>>>>>>>>>>> have Iceberg >>>>>>>>>>>>>> multi-dialect views implemented. Maybe it sounds like we are >>>>>>>>>>>>>> trying to draw >>>>>>>>>>>>>> a line between SQL vs other programming language as "code"? but >>>>>>>>>>>>>> I think SQL >>>>>>>>>>>>>> is just another type of code, and we are already talking about >>>>>>>>>>>>>> compiling >>>>>>>>>>>>>> all these different code dialects to an intermediate >>>>>>>>>>>>>> representation (using >>>>>>>>>>>>>> projects like Coral, Substrait), which will be stored as another >>>>>>>>>>>>>> type of >>>>>>>>>>>>>> representation of Iceberg view. I think the same functionality >>>>>>>>>>>>>> can be used >>>>>>>>>>>>>> for UDFs if developed. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF support is a good >>>>>>>>>>>>>> idea, even just a multi-dialect one like view, and that can >>>>>>>>>>>>>> allow engines >>>>>>>>>>>>>> to for example parse a view SQL, and when a function referenced >>>>>>>>>>>>>> cannot be >>>>>>>>>>>>>> resolved, try to seek for a multi-dialect UDF definition.
>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we have the >>>>>>>>>>>>>> actual proposal published. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp < >>>>>>>>>>>>>> sn...@snazy.de> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and >>>>>>>>>>>>>> "non-centralized" as views are. The same performance concerns >>>>>>>>>>>>>> apply to >>>>>>>>>>>>>> views as well. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base upon which >>>>>>>>>>>>>> engines can build, so the argument that UDFs aren't practical, >>>>>>>>>>>>>> because >>>>>>>>>>>>>> engines are different, is probably only a temporary concern. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to >>>>>>>>>>>>>> tackle the idea to make views portable, which is conceptually >>>>>>>>>>>>>> not that much >>>>>>>>>>>>>> different from portable UDFs. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch >>>>>>>>>>>>>> to the idea of having UDFs in Iceberg, especially not in this >>>>>>>>>>>>>> early stage. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea >>>>>>>>>>>>>> to add UDFs tracked by Iceberg catalogs. I think that Iceberg >>>>>>>>>>>>>> primarily >>>>>>>>>>>>>> deals with things that are centralized, like tables of data. 
>>>>>>>>>>>>>> While it would >>>>>>>>>>>>>> be great to have a common set of functions across engines, I >>>>>>>>>>>>>> don't see how >>>>>>>>>>>>>> that is practical when those engines are implemented so >>>>>>>>>>>>>> differently. >>>>>>>>>>>>>> Plugging in code -- and especially custom user-supplied code -- >>>>>>>>>>>>>> seems >>>>>>>>>>>>>> inherently specialized to me and should be part of the engines' >>>>>>>>>>>>>> design. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post the >>>>>>>>>>>>>> proposal, but I think this would be a very difficult area to >>>>>>>>>>>>>> tackle across >>>>>>>>>>>>>> engines, languages, and memory models without having a huge >>>>>>>>>>>>>> performance >>>>>>>>>>>>>> penalty. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat < >>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the community >>>>>>>>>>>>>> interest in storing the Versioned SQL UDFs in Iceberg. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec addition for >>>>>>>>>>>>>> storing the versioned UDFs in Iceberg (inspired by view spec). >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in >>>>>>>>>>>>>> that they are associated with tables, but they can accept >>>>>>>>>>>>>> arguments and >>>>>>>>>>>>>> produce return values, or even function as inline expressions. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many Query engines like Dremio, Trino, >>>>>>>>>>>>>> Snowflake, Databricks Spark supports SQL UDFs at catalog level >>>>>>>>>>>>>> [1]. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable >>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the engines. >>>>>>>>>>>>>> Potentially engines can understand the UDFs written by other >>>>>>>>>>>>>> engines (with >>>>>>>>>>>>>> the translate layer). >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating this feature into >>>>>>>>>>>>>> Iceberg would be a valuable addition, and we're eager to >>>>>>>>>>>>>> collaborate with >>>>>>>>>>>>>> the community to develop a UDF specification. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a >>>>>>>>>>>>>> specification to propose to the community. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio - >>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino - >>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake - >>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks - >>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp >>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> -- >>>>>>>>>>>>>> >> Ryan Blue >>>>>>>>>>>>>> >> Databricks >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Ryan Blue >>>>>>>>>> Databricks >>>>>>>>>> >>>>>>>>>