Yes, hiding the definition and disabling pushdown are required. We will need a named key (e.g., secure) somewhere, whether it is a top-level property or a key within the UDF properties, so that both the UDF creator and the consumer can recognize it.
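To make the two placements concrete, here is a minimal sketch of how a consumer might look up such a key. Everything in it is an assumption for illustration: the key name `secure` comes from the "e.g., secure" suggestion above, while the `properties` field name, the string-valued property bag, and the helper names are hypothetical and not taken from the proposal.

```python
from typing import Any, Mapping

# Hypothetical key name, following the "e.g., secure" suggestion above.
SECURE_KEY = "secure"


def _as_bool(value: Any) -> bool:
    """Interpret a flag that may be stored as a bool or as a string."""
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def is_secure(udf_metadata: Mapping[str, Any]) -> bool:
    """Return True if the UDF is marked secure in either location.

    Option A: a top-level field in the UDF metadata.
    Option B: a key inside a free-form properties map (assumed to be
    named "properties" here).
    If both are present, the top-level field wins in this sketch.
    """
    top_level = udf_metadata.get(SECURE_KEY)
    if top_level is not None:
        return _as_bool(top_level)
    props = udf_metadata.get("properties") or {}
    return _as_bool(props.get(SECURE_KEY, False))
```

Either placement works for a consumer as long as the key name is agreed on; the sketch only shows that the lookup is the same amount of work in both cases.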
Yufei

On Thu, Jun 26, 2025 at 4:27 PM Ryan Blue <rdb...@gmail.com> wrote:

> Thanks for the extra detail. What do you think the spec would require?
> Would it require hiding the UDF definition from users and require specific
> pushdown cases be disabled? The use cases seem valid, but I'm trying to
> understand the requirements this places on engines and why it needs to be
> part of the spec, rather than part of the properties of the UDF.
>
> On Fri, Jun 20, 2025 at 3:56 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> Here are the main use cases for secure UDFs:
>>
>> 1. Hiding UDF Definitions: This includes concealing the UDF body and
>>    details like the list of imports, some of them aren’t applicable to SQL
>>    UDFs.
>> 2. Sandboxed Execution: Ensuring the UDF runs in an isolated
>>    environment. Again, this typically doesn’t apply to SQL UDFs.
>> 3. Preventing Data Leakage at Execution Time: For example, secure UDFs
>>    may disable certain optimizations—such as predicate pushdown—to avoid
>>    exposing sensitive data indirectly. [1]
>>
>> Given these scenarios, I agree with your point that the secure flag is
>> primarily an instruction to the engine to behave differently. While it's
>> largely an engine-side behavior, we still need to include this flag in the
>> UDF definition to indicate whether a UDF is secure, especially considering
>> the perf penalty introduced by scenario #3. We should clearly recommend
>> that users avoid marking UDFs as secure unless it's truly necessary.
>>
>> [1]
>> https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown
>>
>> Yufei
>>
>>
>> On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> Yufei, could you make the argument for supporting a "secure" UDF? What
>>> use case are you addressing and what specifically changes about how the UDF
>>> is handled? If the idea is to hide the UDF definition, do we need to
>>> include it?
>>>
>>> I think this would be a signal to a "trusted engine". When the engine
>>> interacts with the catalog it sends authorization information about itself
>>> in addition to the user that it is acting on behalf of. That way the
>>> catalog knows that the secure UDF can be sent to the engine and won't be
>>> shown to the user. The majority of this logic is on the REST server side,
>>> and the only part that is communicated to the client is the request not to
>>> show the UDF to the user, right? In that case should this be a property
>>> rather than part of the definition? Even if we state that the client "must"
>>> suppress the UDF definition, it's really just a request. Only trusted
>>> engines can be passed the UDF definition, so a spec requirement to suppress
>>> the definition isn't very meaningful.
>>>
>>> On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>
>>>> Thanks for the summary, Ajantha!
>>>>
>>>> Multi-statement UDFs are definitely useful, but whether those
>>>> statements run within a single transaction should be treated as an
>>>> engine-level concern. The Iceberg UDF spec can spell out the expectation,
>>>> yet the actual guarantee still depends on the runtime. Even if a UDF
>>>> declares itself transactional, the engine may or may not enforce it.
>>>>
>>>> One more thing: should we also introduce a “secure UDF” option
>>>> supported by some engines[1], so the body and any sensitive details stay
>>>> hidden from callers?
>>>>
>>>> [1] https://docs.snowflake.com/en/developer-guide/secure-udf-procedure
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks to everyone who joined the sync.
>>>>> Here is the meeting recording:
>>>>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing
>>>>>
>>>>> Summary:
>>>>>
>>>>> - We have gone through the SQL UDF syntax supported by different
>>>>>   engines (Snowflake, databricks, Dremio, Trino, OSS spark 4.0).
>>>>> - Each engine uses its own block separator, like $$ or '' or none.
>>>>>   Action item was to check whether engines support multi-statement
>>>>>   (transactional) UDF bodies.
>>>>> - Discussed about function overloading. Need to check whether
>>>>>   these engines support function overloading for SQL UDFs. Postgres
>>>>>   supports it! If yes, need to adopt the spec to handle it.
>>>>> - Started online spec review and discussed the deterministic flag
>>>>>   and concluded that we keep the independent fields (like deterministic)
>>>>>   in spec only if the majority of engines supports it. Else it will be
>>>>>   passed in a property bag (engine specific). And it is the engine's
>>>>>   responsibility to honor those optional properties.
>>>>>
>>>>> Feel free to review the current proposal document here
>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>>>
>>>>> Final spec will be put to review and vote once it is ready.
>>>>>
>>>>> Details for next Iceberg UDF sync:
>>>>>
>>>>> *Monday, June 30 · 9:00 – 10:00am*
>>>>> Time zone: America/Los_Angeles
>>>>> Google Meet joining info
>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks to everyone who joined the sync.
>>>>>> Here is the meeting recording:
>>>>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing
>>>>>>
>>>>>> Summary:
>>>>>>
>>>>>> - We discussed including Python support; the majority agreed *not
>>>>>>   to* (see recording for details).
>>>>>> - No strong opposition to versioning — it will be included to
>>>>>>   support change tracking and similar use cases.
>>>>>> - Suggestions were made to document how each catalog resolves UDFs,
>>>>>>   similar to views and tables.
>>>>>> - We agreed not to deviate from the existing table/view spec —
>>>>>>   e.g., location will remain *required* for cross-catalog
>>>>>>   compatibility.
>>>>>> - We also discussed a bit about view interoperability as the same
>>>>>>   things are applicable here.
>>>>>>
>>>>>> Feel free to review the proposal document
>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0>
>>>>>> here.
>>>>>> With the current scope, it is similar to the view/table spec now.
>>>>>> Final spec will be put to review and vote once it is ready.
>>>>>>
>>>>>> Details for next Iceberg UDF sync:
>>>>>>
>>>>>> *Monday, June 16 · 9:00 – 10:00am*
>>>>>> Time zone: America/Los_Angeles
>>>>>> Google Meet joining info
>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <flyrain...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> We’ve set up a dedicated bi-weekly community sync for the UDF
>>>>>>> project. Everyone’s welcome to drop in and share ideas! Here is the
>>>>>>> meeting link:
>>>>>>>
>>>>>>> Iceberg UDF sync
>>>>>>> Monday, June 2 · 9:00 – 10:00am
>>>>>>> Time zone: America/Los_Angeles
>>>>>>> Google Meet joining info
>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Update on the progress.
>>>>>>>>
>>>>>>>> I had a meeting today with Yufei and Yun.zou to discuss the UDF
>>>>>>>> proposal.
>>>>>>>> We covered several key points, though some are still open for
>>>>>>>> further discussion:
>>>>>>>>
>>>>>>>> a) *UDF Versioning*: Do we truly need versioning for UDFs at this
>>>>>>>> stage? We explored the possibility of simplifying the specification by
>>>>>>>> avoiding view replication, and potentially introducing versioning
>>>>>>>> support later. UDTFs, being a superset of views in some ways, may not
>>>>>>>> require versioning initially.
>>>>>>>>
>>>>>>>> b) *VarArgs Support*: While some query engines may not support
>>>>>>>> vararg syntax in CREATE FUNCTION, Iceberg UDFs could represent
>>>>>>>> such arguments as lists when supported by the engine.
>>>>>>>>
>>>>>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t support
>>>>>>>> generic types (e.g., object), we can only map engine-specific
>>>>>>>> types to Iceberg types. As a result, generic data types will not be
>>>>>>>> supported in the initial version.
>>>>>>>>
>>>>>>>> d) *Python Support*: Incorporating Python as a language for SQL
>>>>>>>> UDFs seems promising, especially given its potential to resolve
>>>>>>>> interoperability challenges. Some engines, however, require platform
>>>>>>>> version and package dependency details to execute Python code—this
>>>>>>>> should be captured in the specification.
>>>>>>>>
>>>>>>>> *Next Steps*
>>>>>>>> I will update the proposal document with two primary UDF use cases:
>>>>>>>>
>>>>>>>> - Policy exchange between engines
>>>>>>>> - UDTF as a superset of view functionality
>>>>>>>>
>>>>>>>> The update will include corresponding syntax examples in both SQL
>>>>>>>> and Python, and detail how each use case is represented in Iceberg
>>>>>>>> metadata.
>>>>>>>>
>>>>>>>> We also plan to set up regular syncs (open to more interested
>>>>>>>> participants) to continue refining and finalizing the UDF
>>>>>>>> specification.
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I've updated the design document[1] based on the previous
>>>>>>>>> comments. Additionally, I've included the SQL UDF syntax supported by
>>>>>>>>> various vendors, including Dremio, Snowflake, Databricks, and Trino.
>>>>>>>>>
>>>>>>>>> I'm happy to schedule a separate sync if a deeper discussion is
>>>>>>>>> needed. Let's keep moving forward, especially with the renewed
>>>>>>>>> interest from the community.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> During the last catalog community sync, there was significant
>>>>>>>>>> interest in storing UDFs in Iceberg and adding endpoints for UDF
>>>>>>>>>> handling in the REST catalog spec.
>>>>>>>>>>
>>>>>>>>>> I recently discussed this with Yufei to better understand the new
>>>>>>>>>> requirement of using UDFs for fine-grained access control policies.
>>>>>>>>>> This expands the use cases beyond just versioned and interoperable
>>>>>>>>>> UDFs. Additionally, I learnt that many vendors are interested in this
>>>>>>>>>> feature.
>>>>>>>>>>
>>>>>>>>>> Given the strong community interest and support, I’d like to take
>>>>>>>>>> ownership of this effort and revive the work. I'll be revisiting the
>>>>>>>>>> document I proposed long back and will share an updated proposal by
>>>>>>>>>> next week.
>>>>>>>>>>
>>>>>>>>>> Looking forward to storing UDFs in Iceberg!
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov
>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> The UDF spec does not require representations to be SQL. It
>>>>>>>>>>> merely does not specify (in this revision) how other
>>>>>>>>>>> representations are to be written.
>>>>>>>>>>>
>>>>>>>>>>> This seems like an easy extension (adding a new type in the
>>>>>>>>>>> "Representations" section).
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Dmitri.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue
>>>>>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Right now, SQL is an explicit requirement of the spec. It
>>>>>>>>>>>> leaves a way for future versions to add different representations
>>>>>>>>>>>> later, but only SQL is supported. That was also the feedback to my
>>>>>>>>>>>> initial skepticism about how it would work to add functions.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov
>>>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I do not think the spec is meant to allow only SQL
>>>>>>>>>>>>> representations, although it is certainly favouring SQL in
>>>>>>>>>>>>> examples... It would be nice to add a non-SQL example, indeed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <fo...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Coming from PyIceberg, I have concerns as this proposal
>>>>>>>>>>>>>> focuses on SQL-based engines, while Python-based systems often
>>>>>>>>>>>>>> work with data frames. Adding imperative languages like Python
>>>>>>>>>>>>>> would make this proposal more inclusive.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 10:27 AM Piotr Findeisen
>>>>>>>>>>>>>> <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Walaa, thanks for asking!
>>>>>>>>>>>>>>> In the design doc linked before in this thread [1] I read
>>>>>>>>>>>>>>> "Without a common standard, the UDFs are hard to share among
>>>>>>>>>>>>>>> different engines."
>>>>>>>>>>>>>>> ("Background and Motivation" section).
>>>>>>>>>>>>>>> I agree with this statement. I don't fully understand yet
>>>>>>>>>>>>>>> how the proposed design addresses shareability between the
>>>>>>>>>>>>>>> engines though.
>>>>>>>>>>>>>>> I would use some help to understand this better.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>> Piotr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] SQL User-Defined Function Spec
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa
>>>>>>>>>>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Piotr, what do you mean by making user-created functions
>>>>>>>>>>>>>>>> shareable between engines? Do you mean UDFs written in
>>>>>>>>>>>>>>>> imperative code?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>>>>>>>>>>>>>>>> <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Thank you Ajantha for creating this thread. The Iceberg
>>>>>>>>>>>>>>>> > UDFs are an interesting idea!
>>>>>>>>>>>>>>>> > Is there a plan to make the user-created functions
>>>>>>>>>>>>>>>> > sharable between the engines?
>>>>>>>>>>>>>>>> > If so, what would a CREATE FUNCTION statement look like in
>>>>>>>>>>>>>>>> > e.g. Spark or Trino?
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Best
>>>>>>>>>>>>>>>> > Piotr
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue
>>>>>>>>>>>>>>>> > <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> I just looked through the proposal and added comments. I
>>>>>>>>>>>>>>>> >> think it would be helpful to also have a design doc that
>>>>>>>>>>>>>>>> >> covers the choices from the draft spec. For instance, the
>>>>>>>>>>>>>>>> >> choice to enumerate all possible function input structs
>>>>>>>>>>>>>>>> >> rather than allowing generics and varargs.
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Here’s a quick summary of my feedback:
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> I think that the choice to enumerate function signatures
>>>>>>>>>>>>>>>> >> is limiting. It would be nice to see a discussion of the
>>>>>>>>>>>>>>>> >> trade-offs and a rationale for the choice. I think it would
>>>>>>>>>>>>>>>> >> also be very helpful to have a few representative use cases
>>>>>>>>>>>>>>>> >> for this included in the doc. That way the proposal can
>>>>>>>>>>>>>>>> >> demonstrate that it solves those use cases with reasonable
>>>>>>>>>>>>>>>> >> trade-offs.
>>>>>>>>>>>>>>>> >> There are a few instances where this is inconsistent
>>>>>>>>>>>>>>>> >> with conventions in other specs. For example, using string
>>>>>>>>>>>>>>>> >> IDs rather than an integer.
>>>>>>>>>>>>>>>> >> This uses a very different model for spec versioning
>>>>>>>>>>>>>>>> >> than the Iceberg view and table specs. It requires readers to
>>>>>>>>>>>>>>>> >> fail if there are any unknown fields, which prevents the spec
>>>>>>>>>>>>>>>> >> from adding things that are fully backward-compatible.
>>>>>>>>>>>>>>>> >> Other Iceberg specs only require a version change to
>>>>>>>>>>>>>>>> >> introduce forward-incompatible changes and I think that this
>>>>>>>>>>>>>>>> >> should do the same to avoid confusion.
>>>>>>>>>>>>>>>> >> It looks like the intent is to allow multiple function
>>>>>>>>>>>>>>>> >> signatures per version, but it is unclear how to encode them
>>>>>>>>>>>>>>>> >> because a version is associated with a single function
>>>>>>>>>>>>>>>> >> signature.
>>>>>>>>>>>>>>>> >> There is no review of SQL syntax for creating functions
>>>>>>>>>>>>>>>> >> across engines, so this doesn’t show that the metadata
>>>>>>>>>>>>>>>> >> proposed is sufficient for cross-engine use cases.
>>>>>>>>>>>>>>>> >> The example for a table-valued function shows a SELECT
>>>>>>>>>>>>>>>> >> statement and it isn’t clear how this is distinct from a view
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat
>>>>>>>>>>>>>>>> >> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on this.
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> We didn't find any blocker for the spec.
>>>>>>>>>>>>>>>> >>> I will wait for a week and If no more review comments,
>>>>>>>>>>>>>>>> >>> I will raise a PR for spec addition next week.
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> If anyone else is interested, please have a look at the
>>>>>>>>>>>>>>>> >>> proposal
>>>>>>>>>>>>>>>> >>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> - Ajantha
>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa
>>>>>>>>>>>>>>>> >>> <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> Hi Ajantha,
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> I have left some comments.
>>>>>>>>>>>>>>>> >>>> It is an interesting direction, but there might be some
>>>>>>>>>>>>>>>> >>>> details that need to be fine tuned.
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> The doc is here [1] for others who might be
>>>>>>>>>>>>>>>> >>>> interested. Resharing since I do not think it was directly
>>>>>>>>>>>>>>>> >>>> linked in the thread.
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> [1]
>>>>>>>>>>>>>>>> >>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> Thanks,
>>>>>>>>>>>>>>>> >>>> Walaa.
>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat
>>>>>>>>>>>>>>>> >>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't get any
>>>>>>>>>>>>>>>> >>>>> review on the proposal.
>>>>>>>>>>>>>>>> >>>>> Initially proposed on June 4.
>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>> >>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat
>>>>>>>>>>>>>>>> >>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> Hi everyone,
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> We've only received one review so far (from Benny).
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this.
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat
>>>>>>>>>>>>>>>> >>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> Hi All,
>>>>>>>>>>>>>>>> >>>>>>> Please find the proposal link
>>>>>>>>>>>>>>>> >>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the proposal.
>>>>>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on it.
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity to take the decisions
>>>>>>>>>>>>>>>> >>>>>>> and how we want to implement it.
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin
>>>>>>>>>>>>>>>> >>>>>>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant
>>>>>>>>>>>>>>>> >>>>>>>> scalar/aggregate/table user defined functions. Here are
>>>>>>>>>>>>>>>> >>>>>>>> some examples of what I meant in (2):
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF:
>>>>>>>>>>>>>>>> >>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>>>>>>>>>>> >>>>>>>> Trino user defined functions:
>>>>>>>>>>>>>>>> >>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>>>>>>>>>>> >>>>>>>> Flink user defined functions:
>>>>>>>>>>>>>>>> >>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a variation of
>>>>>>>>>>>>>>>> >>>>>>>> (1) where the API is data flow/data pipeline API instead
>>>>>>>>>>>>>>>> >>>>>>>> of SQL (e.g., Spark Scala). Yes, that is also possible in
>>>>>>>>>>>>>>>> >>>>>>>> the very long run :)
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye
>>>>>>>>>>>>>>>> >>>>>>>> <yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in imperative function
>>>>>>>>>>>>>>>> >>>>>>>>> > according to a Java/Scala/Python API, etc.
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some long term
>>>>>>>>>>>>>>>> >>>>>>>>> opportunities in this case.
>>>>>>>>>>>>>>>> >>>>>>>>> Consider you register a Spark temp view as some
>>>>>>>>>>>>>>>> >>>>>>>>> sort of data frame read, then it could still be resolved
>>>>>>>>>>>>>>>> >>>>>>>>> to a Spark plan that is representable by an intermediate
>>>>>>>>>>>>>>>> >>>>>>>>> representation. But I agree this gets very complicated
>>>>>>>>>>>>>>>> >>>>>>>>> very soon, and just having the case (1) covered would
>>>>>>>>>>>>>>>> >>>>>>>>> already be a huge step forward.
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>> -Jack
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow
>>>>>>>>>>>>>>>> >>>>>>>>> <btc...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF
>>>>>>>>>>>>>>>> >>>>>>>>>> can be used to build a parameterized view. So, there's
>>>>>>>>>>>>>>>> >>>>>>>>>> definitely a lot in common between UDFs and views.
>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin
>>>>>>>>>>>>>>>> >>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about what is
>>>>>>>>>>>>>>>> >>>>>>>>>>> perceived as a "UDF". There are 2 flavors:
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by the user
>>>>>>>>>>>>>>>> >>>>>>>>>>> whose definition is a composition of other built-in
>>>>>>>>>>>>>>>> >>>>>>>>>>> functions/SQL expressions.
>>>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in imperative function
>>>>>>>>>>>>>>>> >>>>>>>>>>> according to a Java/Scala/Python API, etc.
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's references are
>>>>>>>>>>>>>>>> >>>>>>>>>>> pretty much from (1) and I think those have more analogy
>>>>>>>>>>>>>>>> >>>>>>>>>>> to views due to their SQL nature.
>>>>>>>>>>>>>>>> >>>>>>>>>>> Agree (2) is not practical to maintain by
>>>>>>>>>>>>>>>> >>>>>>>>>>> Iceberg, but I think Ajantha's use cases are around (1),
>>>>>>>>>>>>>>>> >>>>>>>>>>> and may be worth evaluating.
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> >>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat
>>>>>>>>>>>>>>>> >>>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post the
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> proposal, but I think this would be a very difficult
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> area to tackle across engines, languages, and memory
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> models without having a huge performance penalty.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL
>>>>>>>>>>>>>>>> >>>>>>>>>>>> representations of UDFs (similar to views as shared by
>>>>>>>>>>>>>>>> >>>>>>>>>>>> the reference links above), the complexity involved will
>>>>>>>>>>>>>>>> >>>>>>>>>>>> be similar to managing views.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the draft spec
>>>>>>>>>>>>>>>> >>>>>>>>>>>> (inspired by the view spec) this week to facilitate
>>>>>>>>>>>>>>>> >>>>>>>>>>>> further discussions.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye
>>>>>>>>>>>>>>>> >>>>>>>>>>>> <yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have a common
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > set of functions across engines, I don't see how that
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > is practical when those engines are implemented so
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > differently.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > Plugging in code -- and especially custom
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > user-supplied code -- seems inherently specialized to
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > me and should be part of the engines' design.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the views? I feel
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> we can say exactly the same thing for Iceberg views, but
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> yet we have Iceberg multi-dialect views implemented.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Maybe it sounds like we are trying to draw a line
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> between SQL vs other programming language as "code"? but
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I think SQL is just another type of code, and we are
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> already talking about compiling all these different code
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> dialects to an intermediate representation (using
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> projects like Coral, Substrait), which will be stored as
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> another type of representation of Iceberg view. I think
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> the same functionality can be used for UDFs if developed.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF support is a good
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> idea, even just a multi-dialect one like view, and that
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> can allow engines to for example parse a view SQL, and
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> when a function referenced cannot be resolved, try to
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> seek for a multi-dialect UDF definition.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we have the
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> actual proposal published.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp
>>>>>>>>>>>>>>>> >>>>>>>>>>>>> <sn...@snazy.de> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> "non-centralized" as views are. The same performance
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> concerns apply to views as well.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base upon
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> which engines can build, so the argument that UDFs
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> aren't practical, because engines are different, is
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> probably only a temporary concern.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> tackle the idea to make views portable, which is
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> conceptually not that much different from portable UDFs.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> to the idea of having UDFs in Iceberg, especially not in
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> this early stage.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> to add UDFs tracked by Iceberg catalogs. I think that
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg primarily deals with things that are
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> centralized, like tables of data.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> While it would be great to have a common set
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> of functions across engines, I don't see how that is
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> practical when those engines are implemented so
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> differently. Plugging in code -- and especially custom
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> user-supplied code -- seems inherently specialized to me
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> and should be part of the engines' design.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post the
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> proposal, but I think this would be a very difficult
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> area to tackle across engines, languages, and memory
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> models without having a huge performance penalty.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the community
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> interest in storing the Versioned SQL UDFs in Iceberg.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec addition for
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> storing the versioned UDFs in Iceberg (inspired by view
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> spec).
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> in that they are associated with tables, but they can
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> accept arguments and produce return values, or even
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> function as inline expressions.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many Query engines like Dremio, Trino,
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake, Databricks Spark supports SQL UDFs at
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> catalog level [1].
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the engines.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>   Potentially engines can understand the UDFs written by
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>   other engines (with the translate layer).
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating this feature
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> into Iceberg would be a valuable addition, and we're
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> eager to collaborate with the community to develop a
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> UDF specification.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio -
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino -
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks -
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> --
>>>>>>>>>>>>>>>> >> Ryan Blue
>>>>>>>>>>>>>>>> >> Databricks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Databricks
>>>>>>>>>>>>
>>>>>>>>>>>