Hey everyone,

No one joined the sync today. I learned that Yufei is on holiday, and
Ryan and others couldn't make it, as with the last sync. It seems Yufei
may also have forgotten to transfer meeting ownership, since new members
needed admin approval and couldn't join automatically this week. I
understand that it is summer holiday season for many.
I've updated the function signature schema and the other open points. I
believe we're very close to the final version of the spec. A meeting is
still needed to finalize it, but we don't have to wait for one to
complete the review; we have had many meetings on this already. So
please review the document at your earliest convenience. If we agree on
the spec by next week, I can raise a PR.

- Ajantha

On Thu, Jul 3, 2025 at 4:03 AM Yufei Gu <flyrain...@gmail.com> wrote:

> I’d propose to move the field `properties` from a top level field to a
> field inside “version” along with a representation, so that properties
> are versioned. A property like “deterministic” could change along with
> representation over time. For example, we need to change
> “deterministic” from true to false in case of adding a
> non-deterministic SQL expression/function (e.g., now()) inside a UDF.
> Otherwise, rollback won't be safe.
>
> That said, it's still an open question whether we need any
> non-versioned properties. We can introduce them later if a use case
> arises.
>
> Yufei
>
>
> On Wed, Jul 2, 2025 at 3:06 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Thanks for the summary, Ajantha!
>>
>> I’d prefer to keep the signature list separate from the representation
>> history. Here are the reasons:
>>
>> 1. Each version still enforces a single signature. Although the
>>    signatures array is global to the UDF, each version references
>>    just one signature ID. Rollbacks to historical versions remain
>>    safe.
>> 2. We’ve separated the less frequently changing component
>>    (signatures) from the more dynamic one (representations) to reduce
>>    metadata file size.
>> 3. Since signatures use Iceberg data types, they should remain
>>    unaffected by multi-dialect representation differences.
>>
>> Yufei
>>
>>
>> On Mon, Jun 30, 2025 at 11:28 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> Thanks to everyone who joined the sync.
>>> Here is the meeting recording:
>>> https://drive.google.com/file/d/1FcOSbHo9ZIVeZXdUlmoG42o-chB7Q15P/view?usp=sharing
>>>
>>> Summary:
>>> We discussed the action items from the last sync (*see Appendix C* in
>>> the proposal doc):
>>>
>>> - Function overloading: Supported by a few engines and on the
>>>   roadmaps of many others. Iceberg will support it. We will maintain
>>>   a `FunctionIdentifier` (extends `TableIdentifier` but also has a
>>>   member containing the function arguments' type list). All
>>>   operations like load, rename, list, create and drop are based on
>>>   `FunctionIdentifier`.
>>> - Secure UDF: If we store it as a property in a bag, we need to
>>>   standardize the property name. Iceberg encryption may be orthogonal
>>>   to this discussion.
>>> - UDFs with multi-statement and procedural bodies are supported by
>>>   some engines. Iceberg will support them. The body is stored as is
>>>   when the engine creates the function.
>>>
>>> New discussions:
>>>
>>> - Standardizing the property names (deterministic, secure).
>>> - The rename function.
>>> - Replace function: check up to what level replace is supported
>>>   (considering function overloading).
>>> - Should the signature be associated with the representation?
>>>
>>> I think we are close on the spec. Please review the proposal
>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>
>>> Details for next Iceberg UDF sync:
>>>
>>> *Monday, July 14 · 9:00 – 10:00am*
>>> Time zone: America/Los_Angeles
>>> Google Meet joining info
>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>
>>> - Ajantha
>>>
>>> On Mon, Jun 30, 2025 at 9:27 PM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Can it be handled by Iceberg encryption? If the whole metadata is
>>>> encrypted, we don't have to worry about just hiding the UDF body?
>>>> Let us discuss more on the sync today.
>>>>
>>>> On Mon, Jun 30, 2025 at 9:22 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>>> Yes, hiding the definition and disabling pushdown are required. We
>>>>> will need a named key (e.g., secure) somewhere, no matter if it is
>>>>> a top level property or a key as a part of the UDF properties, so
>>>>> that both UDF creator and consumer can recognize it.
>>>>>
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Thu, Jun 26, 2025 at 4:27 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the extra detail. What do you think the spec would
>>>>>> require? Would it require hiding the UDF definition from users and
>>>>>> require specific pushdown cases be disabled? The use cases seem
>>>>>> valid, but I'm trying to understand the requirements this places
>>>>>> on engines and why it needs to be part of the spec, rather than
>>>>>> part of the properties of the UDF.
>>>>>>
>>>>>> On Fri, Jun 20, 2025 at 3:56 PM Yufei Gu <flyrain...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> Here are the main use cases for secure UDFs:
>>>>>>>
>>>>>>> 1. Hiding UDF Definitions: This includes concealing the UDF body
>>>>>>>    and details like the list of imports, some of which aren’t
>>>>>>>    applicable to SQL UDFs.
>>>>>>> 2. Sandboxed Execution: Ensuring the UDF runs in an isolated
>>>>>>>    environment. Again, this typically doesn’t apply to SQL UDFs.
>>>>>>> 3. Preventing Data Leakage at Execution Time: For example, secure
>>>>>>>    UDFs may disable certain optimizations, such as predicate
>>>>>>>    pushdown, to avoid exposing sensitive data indirectly. [1]
>>>>>>>
>>>>>>> Given these scenarios, I agree with your point that the secure flag
>>>>>>> is primarily an instruction to the engine to behave differently.
>>>>>>> While it's largely an engine-side behavior, we still need to
>>>>>>> include this flag in the UDF definition to indicate whether a UDF
>>>>>>> is secure, especially considering the perf penalty introduced by
>>>>>>> scenario #3. We should clearly recommend that users avoid marking
>>>>>>> UDFs as secure unless it's truly necessary.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown
>>>>>>> Yufei
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yufei, could you make the argument for supporting a "secure" UDF?
>>>>>>>> What use case are you addressing and what specifically changes
>>>>>>>> about how the UDF is handled? If the idea is to hide the UDF
>>>>>>>> definition, do we need to include it?
>>>>>>>>
>>>>>>>> I think this would be a signal to a "trusted engine". When the
>>>>>>>> engine interacts with the catalog it sends authorization
>>>>>>>> information about itself in addition to the user that it is
>>>>>>>> acting on behalf of. That way the catalog knows that the secure
>>>>>>>> UDF can be sent to the engine and won't be shown to the user. The
>>>>>>>> majority of this logic is on the REST server side, and the only
>>>>>>>> part that is communicated to the client is the request not to
>>>>>>>> show the UDF to the user, right? In that case should this be a
>>>>>>>> property rather than part of the definition? Even if we state
>>>>>>>> that the client "must" suppress the UDF definition, it's really
>>>>>>>> just a request. Only trusted engines can be passed the UDF
>>>>>>>> definition, so a spec requirement to suppress the definition
>>>>>>>> isn't very meaningful.
>>>>>>>>
>>>>>>>> On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu <flyrain...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the summary, Ajantha!
>>>>>>>>>
>>>>>>>>> Multi-statement UDFs are definitely useful, but whether those
>>>>>>>>> statements run within a single transaction should be treated as
>>>>>>>>> an engine-level concern. The Iceberg UDF spec can spell out the
>>>>>>>>> expectation, yet the actual guarantee still depends on the
>>>>>>>>> runtime. Even if a UDF declares itself transactional, the engine
>>>>>>>>> may or may not enforce it.
>>>>>>>>>
>>>>>>>>> One more thing: should we also introduce a “secure UDF” option
>>>>>>>>> supported by some engines [1], so the body and any sensitive
>>>>>>>>> details stay hidden from callers?
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.snowflake.com/en/developer-guide/secure-udf-procedure
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat
>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing
>>>>>>>>>> Summary:
>>>>>>>>>>
>>>>>>>>>> - We went through the SQL UDF syntax supported by different
>>>>>>>>>>   engines (Snowflake, Databricks, Dremio, Trino, OSS Spark 4.0).
>>>>>>>>>> - Each engine uses its own block separator, like $$ or '' or
>>>>>>>>>>   none. The action item was to check whether engines support
>>>>>>>>>>   multi-statement (transactional) UDF bodies.
>>>>>>>>>> - Discussed function overloading. We need to check whether
>>>>>>>>>>   these engines support function overloading for SQL UDFs.
>>>>>>>>>>   Postgres supports it! If they do, we need to adapt the spec
>>>>>>>>>>   to handle it.
>>>>>>>>>> - Started the online spec review and discussed the
>>>>>>>>>>   deterministic flag. We concluded that we keep independent
>>>>>>>>>>   fields (like deterministic) in the spec only if the majority
>>>>>>>>>>   of engines support them.
>>>>>>>>>>   Otherwise, they will be passed in a property bag (engine
>>>>>>>>>>   specific), and it is the engine's responsibility to honor
>>>>>>>>>>   those optional properties.
>>>>>>>>>>
>>>>>>>>>> Feel free to review the current proposal document here
>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>.
>>>>>>>>>>
>>>>>>>>>> Final spec will be put to review and vote once it is ready.
>>>>>>>>>>
>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>
>>>>>>>>>> *Monday, June 30 · 9:00 – 10:00am*
>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>> Google Meet joining info
>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat
>>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks to everyone who joined the sync.
>>>>>>>>>>> Here is the meeting recording:
>>>>>>>>>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> Summary:
>>>>>>>>>>>
>>>>>>>>>>> - We discussed including Python support; the majority agreed
>>>>>>>>>>>   *not to* (see recording for details).
>>>>>>>>>>> - No strong opposition to versioning; it will be included to
>>>>>>>>>>>   support change tracking and similar use cases.
>>>>>>>>>>> - Suggestions were made to document how each catalog resolves
>>>>>>>>>>>   UDFs, similar to views and tables.
>>>>>>>>>>> - We agreed not to deviate from the existing table/view spec;
>>>>>>>>>>>   e.g., location will remain *required* for cross-catalog
>>>>>>>>>>>   compatibility.
>>>>>>>>>>> - We also discussed a bit about view interoperability as the
>>>>>>>>>>>   same things are applicable here.
>>>>>>>>>>>
>>>>>>>>>>> Feel free to review the proposal document
>>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0>
>>>>>>>>>>> here.
>>>>>>>>>>> With the current scope, it is similar to the view/table spec now.
>>>>>>>>>>> Final spec will be put to review and vote once it is ready.
>>>>>>>>>>>
>>>>>>>>>>> Details for next Iceberg UDF sync:
>>>>>>>>>>>
>>>>>>>>>>> *Monday, June 16 · 9:00 – 10:00am*
>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>
>>>>>>>>>>> - Ajantha
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <flyrain...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>
>>>>>>>>>>>> We’ve set up a dedicated bi-weekly community sync for the UDF
>>>>>>>>>>>> project. Everyone’s welcome to drop in and share ideas! Here
>>>>>>>>>>>> is the meeting link:
>>>>>>>>>>>>
>>>>>>>>>>>> Iceberg UDF sync
>>>>>>>>>>>> Monday, June 2 · 9:00 – 10:00am
>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh
>>>>>>>>>>>>
>>>>>>>>>>>> Yufei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat
>>>>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Update on the progress.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I had a meeting today with Yufei and Yun.zou to discuss the
>>>>>>>>>>>>> UDF proposal. We covered several key points, though some are
>>>>>>>>>>>>> still open for further discussion:
>>>>>>>>>>>>>
>>>>>>>>>>>>> a) *UDF Versioning*: Do we truly need versioning for UDFs at
>>>>>>>>>>>>> this stage? We explored the possibility of simplifying the
>>>>>>>>>>>>> specification by avoiding view replication, and potentially
>>>>>>>>>>>>> introducing versioning support later.
>>>>>>>>>>>>> UDTFs, being a superset of views in some ways, may not
>>>>>>>>>>>>> require versioning initially.
>>>>>>>>>>>>>
>>>>>>>>>>>>> b) *VarArgs Support*: While some query engines may not
>>>>>>>>>>>>> support vararg syntax in CREATE FUNCTION, Iceberg UDFs could
>>>>>>>>>>>>> represent such arguments as lists when supported by the engine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t
>>>>>>>>>>>>> support generic types (e.g., object), we can only map
>>>>>>>>>>>>> engine-specific types to Iceberg types. As a result, generic
>>>>>>>>>>>>> data types will not be supported in the initial version.
>>>>>>>>>>>>>
>>>>>>>>>>>>> d) *Python Support*: Incorporating Python as a language for
>>>>>>>>>>>>> SQL UDFs seems promising, especially given its potential to
>>>>>>>>>>>>> resolve interoperability challenges. Some engines, however,
>>>>>>>>>>>>> require platform version and package dependency details to
>>>>>>>>>>>>> execute Python code; this should be captured in the
>>>>>>>>>>>>> specification.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Next Steps*
>>>>>>>>>>>>> I will update the proposal document with two primary UDF use
>>>>>>>>>>>>> cases:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Policy exchange between engines
>>>>>>>>>>>>> - UDTF as a superset of view functionality
>>>>>>>>>>>>>
>>>>>>>>>>>>> The update will include corresponding syntax examples in both
>>>>>>>>>>>>> SQL and Python, and detail how each use case is represented
>>>>>>>>>>>>> in Iceberg metadata.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We also plan to set up regular syncs (open to more interested
>>>>>>>>>>>>> participants) to continue refining and finalizing the UDF
>>>>>>>>>>>>> specification.
>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat
>>>>>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've updated the design document [1] based on the previous
>>>>>>>>>>>>>> comments. Additionally, I've included the SQL UDF syntax
>>>>>>>>>>>>>> supported by various vendors, including Dremio, Snowflake,
>>>>>>>>>>>>>> Databricks, and Trino.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm happy to schedule a separate sync if a deeper discussion
>>>>>>>>>>>>>> is needed. Let's keep moving forward, especially with the
>>>>>>>>>>>>>> renewed interest from the community.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat
>>>>>>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> During the last catalog community sync, there was
>>>>>>>>>>>>>>> significant interest in storing UDFs in Iceberg and adding
>>>>>>>>>>>>>>> endpoints for UDF handling in the REST catalog spec.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I recently discussed this with Yufei to better understand
>>>>>>>>>>>>>>> the new requirement of using UDFs for fine-grained access
>>>>>>>>>>>>>>> control policies. This expands the use cases beyond just
>>>>>>>>>>>>>>> versioned and interoperable UDFs. Additionally, I learnt
>>>>>>>>>>>>>>> that many vendors are interested in this feature.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Given the strong community interest and support, I’d like to
>>>>>>>>>>>>>>> take ownership of this effort and revive the work.
>>>>>>>>>>>>>>> I'll be revisiting the document I proposed long back and
>>>>>>>>>>>>>>> will share an updated proposal by next week.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking forward to storing UDFs in Iceberg!
>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov
>>>>>>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The UDF spec does not require representations to be SQL.
>>>>>>>>>>>>>>>> It merely does not specify (in this revision) how other
>>>>>>>>>>>>>>>> representations are to be written.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This seems like an easy extension (adding a new type in the
>>>>>>>>>>>>>>>> "Representations" section).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue
>>>>>>>>>>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Right now, SQL is an explicit requirement of the spec. It
>>>>>>>>>>>>>>>>> leaves a way for future versions to add different
>>>>>>>>>>>>>>>>> representations later, but only SQL is supported. That
>>>>>>>>>>>>>>>>> was also the feedback to my initial skepticism about how
>>>>>>>>>>>>>>>>> it would work to add functions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov
>>>>>>>>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I do not think the spec is meant to allow only SQL
>>>>>>>>>>>>>>>>>> representations, although it is certainly favouring SQL
>>>>>>>>>>>>>>>>>> in examples... It would be nice to add a non-SQL
>>>>>>>>>>>>>>>>>> example, indeed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong
>>>>>>>>>>>>>>>>>> <fo...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Coming from PyIceberg, I have concerns as this proposal
>>>>>>>>>>>>>>>>>>> focuses on SQL-based engines, while Python-based systems
>>>>>>>>>>>>>>>>>>> often work with data frames. Adding imperative languages
>>>>>>>>>>>>>>>>>>> like Python would make this proposal more inclusive.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 10:27, Piotr Findeisen
>>>>>>>>>>>>>>>>>>> <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Walaa, thanks for asking!
>>>>>>>>>>>>>>>>>>>> In the design doc linked before in this thread [1] I
>>>>>>>>>>>>>>>>>>>> read
>>>>>>>>>>>>>>>>>>>> "Without a common standard, the UDFs are hard to share
>>>>>>>>>>>>>>>>>>>> among different engines."
>>>>>>>>>>>>>>>>>>>> ("Background and Motivation" section).
>>>>>>>>>>>>>>>>>>>> I agree with this statement. I don't fully understand
>>>>>>>>>>>>>>>>>>>> yet how the proposed design addresses shareability
>>>>>>>>>>>>>>>>>>>> between the engines though.
>>>>>>>>>>>>>>>>>>>> I could use some help to understand this better.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>>>>>>> Piotr
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1] SQL User-Defined Function Spec
>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa
>>>>>>>>>>>>>>>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Piotr, what do you mean by making user-created
>>>>>>>>>>>>>>>>>>>>> functions shareable between engines? Do you mean UDFs
>>>>>>>>>>>>>>>>>>>>> written in imperative code?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>>>>>>>>>>>>>>>>>>>>> <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>> > Thank you Ajantha for creating this thread. The
>>>>>>>>>>>>>>>>>>>>> > Iceberg UDFs are an interesting idea!
>>>>>>>>>>>>>>>>>>>>> > Is there a plan to make the user-created functions
>>>>>>>>>>>>>>>>>>>>> > shareable between the engines?
>>>>>>>>>>>>>>>>>>>>> > If so, what would a CREATE FUNCTION statement look
>>>>>>>>>>>>>>>>>>>>> > like in e.g. Spark or Trino?
>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc.
>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>> > Best
>>>>>>>>>>>>>>>>>>>>> > Piotr
>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue
>>>>>>>>>>>>>>>>>>>>> > <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> I just looked through the proposal and added
>>>>>>>>>>>>>>>>>>>>> >> comments. I think it would be helpful to also have
>>>>>>>>>>>>>>>>>>>>> >> a design doc that covers the choices from the draft
>>>>>>>>>>>>>>>>>>>>> >> spec.
>>>>>>>>>>>>>>>>>>>>> >> For instance, the choice to enumerate all possible
>>>>>>>>>>>>>>>>>>>>> >> function input structs rather than allowing
>>>>>>>>>>>>>>>>>>>>> >> generics and varargs.
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> Here’s a quick summary of my feedback:
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> - I think that the choice to enumerate function
>>>>>>>>>>>>>>>>>>>>> >>   signatures is limiting. It would be nice to see a
>>>>>>>>>>>>>>>>>>>>> >>   discussion of the trade-offs and a rationale for
>>>>>>>>>>>>>>>>>>>>> >>   the choice. I think it would also be very helpful
>>>>>>>>>>>>>>>>>>>>> >>   to have a few representative use cases for this
>>>>>>>>>>>>>>>>>>>>> >>   included in the doc. That way the proposal can
>>>>>>>>>>>>>>>>>>>>> >>   demonstrate that it solves those use cases with
>>>>>>>>>>>>>>>>>>>>> >>   reasonable trade-offs.
>>>>>>>>>>>>>>>>>>>>> >> - There are a few instances where this is
>>>>>>>>>>>>>>>>>>>>> >>   inconsistent with conventions in other specs. For
>>>>>>>>>>>>>>>>>>>>> >>   example, using string IDs rather than an integer.
>>>>>>>>>>>>>>>>>>>>> >> - This uses a very different model for spec
>>>>>>>>>>>>>>>>>>>>> >>   versioning than the Iceberg view and table specs.
>>>>>>>>>>>>>>>>>>>>> >>   It requires readers to fail if there are any
>>>>>>>>>>>>>>>>>>>>> >>   unknown fields, which prevents the spec from
>>>>>>>>>>>>>>>>>>>>> >>   adding things that are fully backward-compatible.
>>>>>>>>>>>>>>>>>>>>> >>   Other Iceberg specs only require a version change
>>>>>>>>>>>>>>>>>>>>> >>   to introduce forward-incompatible changes and I
>>>>>>>>>>>>>>>>>>>>> >>   think that this should do the same to avoid
>>>>>>>>>>>>>>>>>>>>> >>   confusion.
>>>>>>>>>>>>>>>>>>>>> >> - It looks like the intent is to allow multiple
>>>>>>>>>>>>>>>>>>>>> >>   function signatures per version, but it is unclear
>>>>>>>>>>>>>>>>>>>>> >>   how to encode them because a version is associated
>>>>>>>>>>>>>>>>>>>>> >>   with a single function signature.
>>>>>>>>>>>>>>>>>>>>> >> - There is no review of SQL syntax for creating
>>>>>>>>>>>>>>>>>>>>> >>   functions across engines, so this doesn’t show
>>>>>>>>>>>>>>>>>>>>> >>   that the metadata proposed is sufficient for
>>>>>>>>>>>>>>>>>>>>> >>   cross-engine use cases.
>>>>>>>>>>>>>>>>>>>>> >> - The example for a table-valued function shows a
>>>>>>>>>>>>>>>>>>>>> >>   SELECT statement and it isn’t clear how this is
>>>>>>>>>>>>>>>>>>>>> >>   distinct from a view.
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat
>>>>>>>>>>>>>>>>>>>>> >> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on this.
>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>> >>> We didn't find any blocker for the spec.
>>>>>>>>>>>>>>>>>>>>> >>> I will wait for a week and, if there are no more
>>>>>>>>>>>>>>>>>>>>> >>> review comments, I will raise a PR for the spec
>>>>>>>>>>>>>>>>>>>>> >>> addition next week.
>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>> >>> If anyone else is interested, please have a look
>>>>>>>>>>>>>>>>>>>>> >>> at the proposal
>>>>>>>>>>>>>>>>>>>>> >>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>> >>> - Ajantha
>>>>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin
>>>>>>>>>>>>>>>>>>>>> >>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>> >>>> Hi Ajantha,
>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>> >>>> I have left some comments. It is an interesting
>>>>>>>>>>>>>>>>>>>>> >>>> direction, but there might be some details that
>>>>>>>>>>>>>>>>>>>>> >>>> need to be fine tuned.
>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>> >>>> The doc is here [1] for others who might be
>>>>>>>>>>>>>>>>>>>>> >>>> interested. Resharing since I do not think it was
>>>>>>>>>>>>>>>>>>>>> >>>> directly linked in the thread.
>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>> >>>> [1]
>>>>>>>>>>>>>>>>>>>>> >>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>> >>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> >>>> Walaa.
>>>>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat
>>>>>>>>>>>>>>>>>>>>> >>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't get
>>>>>>>>>>>>>>>>>>>>> >>>>> any review on the proposal.
>>>>>>>>>>>>>>>>>>>>> >>>>> Initially proposed on June 4.
>>>>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat
>>>>>>>>>>>>>>>>>>>>> >>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>> We've only received one review so far (from
>>>>>>>>>>>>>>>>>>>>> >>>>>> Benny).
>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this.
>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat
>>>>>>>>>>>>>>>>>>>>> >>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>>> >>>>>>> Please find the proposal link
>>>>>>>>>>>>>>>>>>>>> >>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the proposal.
>>>>>>>>>>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on it.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity to take the
>>>>>>>>>>>>>>>>>>>>> >>>>>>> decisions and how we want to implement it.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>> - Ajantha
>>>>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin
>>>>>>>>>>>>>>>>>>>>> >>>>>>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> scalar/aggregate/table user defined
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> functions. Here are some examples of what I
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> meant in (2):
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Trino user defined functions:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Flink user defined functions:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a variation
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> of (1) where the API is data flow/data
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> pipeline API instead of SQL (e.g., Spark
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Scala). Yes, that is also possible in the
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> very long run :)
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye
>>>>>>>>>>>>>>>>>>>>> >>>>>>>> <yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in imperative
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> > function according to a Java/Scala/Python
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> > API, etc.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some long
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> term opportunities in this case. Consider
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> you register a Spark temp view as some sort
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> of data frame read, then it could still be
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> resolved to a Spark plan that is
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> representable by an intermediate
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> representation. But I agree this gets very
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> complicated very soon, and just having the
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> case (1) covered would already be a huge
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> step forward.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> <btc...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a tabular SQL
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> UDF can be used to build a parameterized
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> view. So, there's definitely a lot in
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> common between UDFs and views.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about what
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> is perceived as a "UDF". There are 2
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> flavors:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by the user
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> whose definition is a composition of other
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> built-in functions/SQL expressions.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in imperative
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> function according to a Java/Scala/Python
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> API, etc.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's references
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> are pretty much from (1) and I think those
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> have more analogy to views due to their
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> SQL nature. Agree (2) is not practical to
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> maintain by Iceberg, but I think Ajantha's
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> use cases are around (1), and may be worth
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> evaluating.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> the proposal, but I think this would be
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> a very difficult area to tackle across
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> engines, languages, and memory models
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> without having a huge performance
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> penalty.
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL >>>>>>>>>>>>>>>>>>>>> representations of UDFs (similar to views as shared by >>>>>>>>>>>>>>>>>>>>> the reference links >>>>>>>>>>>>>>>>>>>>> above), the complexity involved will be similar to >>>>>>>>>>>>>>>>>>>>> managing views. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your >>>>>>>>>>>>>>>>>>>>> input. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the draft spec >>>>>>>>>>>>>>>>>>>>> (inspired by the view spec) this week to facilitate >>>>>>>>>>>>>>>>>>>>> further discussions. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye < >>>>>>>>>>>>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have a >>>>>>>>>>>>>>>>>>>>> common set of functions across engines, I don't see how >>>>>>>>>>>>>>>>>>>>> that is practical >>>>>>>>>>>>>>>>>>>>> when those engines are implemented so differently. >>>>>>>>>>>>>>>>>>>>> Plugging in code -- and >>>>>>>>>>>>>>>>>>>>> especially custom user-supplied code -- seems inherently >>>>>>>>>>>>>>>>>>>>> specialized to me >>>>>>>>>>>>>>>>>>>>> and should be part of the engines' design. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the views? I >>>>>>>>>>>>>>>>>>>>> feel we can say exactly the same thing for Iceberg views, >>>>>>>>>>>>>>>>>>>>> but yet we have >>>>>>>>>>>>>>>>>>>>> Iceberg multi-dialect views implemented. Maybe it sounds >>>>>>>>>>>>>>>>>>>>> like we are trying >>>>>>>>>>>>>>>>>>>>> to draw a line between SQL vs other programming language >>>>>>>>>>>>>>>>>>>>> as "code"? 
But I >>>>>>>>>>>>>>>>>>>>> think SQL is just another type of code, and we are >>>>>>>>>>>>>>>>>>>>> already talking about >>>>>>>>>>>>>>>>>>>>> compiling all these different code dialects to an >>>>>>>>>>>>>>>>>>>>> intermediate >>>>>>>>>>>>>>>>>>>>> representation (using projects like Coral, Substrait), >>>>>>>>>>>>>>>>>>>>> which will be stored >>>>>>>>>>>>>>>>>>>>> as another type of representation of Iceberg view. I >>>>>>>>>>>>>>>>>>>>> think the same >>>>>>>>>>>>>>>>>>>>> functionality can be used for UDFs if developed. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF support is a >>>>>>>>>>>>>>>>>>>>> good idea, even just a multi-dialect one like view, and >>>>>>>>>>>>>>>>>>>>> that can allow >>>>>>>>>>>>>>>>>>>>> engines to for example parse a view SQL, and when a >>>>>>>>>>>>>>>>>>>>> function referenced >>>>>>>>>>>>>>>>>>>>> cannot be resolved, try to seek for a multi-dialect UDF >>>>>>>>>>>>>>>>>>>>> definition. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we have >>>>>>>>>>>>>>>>>>>>> the actual proposal published. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert >>>>>>>>>>>>>>>>>>>>> Stupp <sn...@snazy.de> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and >>>>>>>>>>>>>>>>>>>>> portable and "non-centralized" as views are. The same >>>>>>>>>>>>>>>>>>>>> performance concerns >>>>>>>>>>>>>>>>>>>>> apply to views as well. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base >>>>>>>>>>>>>>>>>>>>> upon which engines can build, so the argument that UDFs >>>>>>>>>>>>>>>>>>>>> aren't practical, >>>>>>>>>>>>>>>>>>>>> because engines are different, is probably only a >>>>>>>>>>>>>>>>>>>>> temporary concern. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also >>>>>>>>>>>>>>>>>>>>> try to tackle the idea to make views portable, which is >>>>>>>>>>>>>>>>>>>>> conceptually not >>>>>>>>>>>>>>>>>>>>> that much different from portable UDFs. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative >>>>>>>>>>>>>>>>>>>>> touch to the idea of having UDFs in Iceberg, especially >>>>>>>>>>>>>>>>>>>>> not in this early >>>>>>>>>>>>>>>>>>>>> stage. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good >>>>>>>>>>>>>>>>>>>>> idea to add UDFs tracked by Iceberg catalogs. I think >>>>>>>>>>>>>>>>>>>>> that Iceberg >>>>>>>>>>>>>>>>>>>>> primarily deals with things that are centralized, like >>>>>>>>>>>>>>>>>>>>> tables of data. >>>>>>>>>>>>>>>>>>>>> While it would be great to have a common set of functions >>>>>>>>>>>>>>>>>>>>> across engines, I >>>>>>>>>>>>>>>>>>>>> don't see how that is practical when those engines are >>>>>>>>>>>>>>>>>>>>> implemented so >>>>>>>>>>>>>>>>>>>>> differently. 
Plugging in code -- and especially custom >>>>>>>>>>>>>>>>>>>>> user-supplied code >>>>>>>>>>>>>>>>>>>>> -- seems inherently specialized to me and should be part >>>>>>>>>>>>>>>>>>>>> of the engines' >>>>>>>>>>>>>>>>>>>>> design. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post >>>>>>>>>>>>>>>>>>>>> the proposal, but I think this would be a very difficult >>>>>>>>>>>>>>>>>>>>> area to tackle >>>>>>>>>>>>>>>>>>>>> across engines, languages, and memory models without >>>>>>>>>>>>>>>>>>>>> having a huge >>>>>>>>>>>>>>>>>>>>> performance penalty. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha >>>>>>>>>>>>>>>>>>>>> Bhat <ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the >>>>>>>>>>>>>>>>>>>>> community interest in storing the Versioned SQL UDFs in >>>>>>>>>>>>>>>>>>>>> Iceberg. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec addition >>>>>>>>>>>>>>>>>>>>> for storing the versioned UDFs in Iceberg (inspired by >>>>>>>>>>>>>>>>>>>>> view spec). >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to >>>>>>>>>>>>>>>>>>>>> views in that they are associated with tables, but they >>>>>>>>>>>>>>>>>>>>> can accept >>>>>>>>>>>>>>>>>>>>> arguments and produce return values, or even function as >>>>>>>>>>>>>>>>>>>>> inline expressions. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many query engines, such as Dremio, Trino, >>>>>>>>>>>>>>>>>>>>> Snowflake, and Databricks Spark, support SQL UDFs at the catalog >>>>>>>>>>>>>>>>>>>>> level [1]. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the >>>>>>>>>>>>>>>>>>>>> engines. Potentially engines can understand the UDFs >>>>>>>>>>>>>>>>>>>>> written by other >>>>>>>>>>>>>>>>>>>>> engines (with the translate layer). >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating this >>>>>>>>>>>>>>>>>>>>> feature into Iceberg would be a valuable addition, and >>>>>>>>>>>>>>>>>>>>> we're eager to >>>>>>>>>>>>>>>>>>>>> collaborate with the community to develop a UDF >>>>>>>>>>>>>>>>>>>>> specification. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a >>>>>>>>>>>>>>>>>>>>> specification to propose to the community. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio - >>>>>>>>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino - >>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake - >>>>>>>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks - >>>>>>>>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular >>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> >> -- >>>>>>>>>>>>>>>>>>>>> >> Ryan Blue >>>>>>>>>>>>>>>>>>>>> >> Databricks >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>>>>> Databricks >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>