Thanks Jack. I actually meant scalar/aggregate/table user defined functions. Here are some examples of what I meant in (2):
Hive GenericUDF: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
Trino user defined functions: https://trino.io/docs/current/develop/functions.html
Flink user defined functions: https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/

Probably what you referred to is a variation of (1) where the API is data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that is also possible in the very long run :)

Thanks,
Walaa.

On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:

> > (2) Custom code written in imperative function according to a
> > Java/Scala/Python API, etc.
>
> I think we could still explore some long term opportunities in this case.
> Consider you register a Spark temp view as some sort of data frame read,
> then it could still be resolved to a Spark plan that is representable by an
> intermediate representation. But I agree this gets very complicated very
> soon, and just having the case (1) covered would already be a huge step
> forward.
>
> -Jack
>
>
> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
>
>> It's interesting to note that a tabular SQL UDF can be used to build a
>> *parameterized* view. So, there's definitely a lot in common between
>> UDFs and views.
>>
>> Thanks
>>
>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa
>> <wa.moust...@gmail.com> wrote:
>>
>>> I think there is a disconnect about what is perceived as a "UDF". There
>>> are 2 flavors:
>>>
>>> (1) Functions that are defined by the user whose definition is a
>>> composition of other built-in functions/SQL expressions.
>>> (2) Custom code written in imperative function according to a
>>> Java/Scala/Python API, etc.
>>>
>>> All the examples in Ajantha's references are pretty much from (1) and I
>>> think those have more analogy to views due to their SQL nature.
>>> Agree (2) is not practical to maintain by Iceberg, but I think Ajantha's
>>> use cases are around (1), and may be worth evaluating.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>> memory models without having a huge performance penalty.
>>>>
>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>> (similar to views as shared by the reference links above), the complexity
>>>> involved will be similar to managing views.
>>>>
>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>> We will work on publishing the draft spec (inspired by the view spec)
>>>> this week to facilitate further discussions.
>>>>
>>>> - Ajantha
>>>>
>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> > While it would be great to have a common set of functions across
>>>>> > engines, I don't see how that is practical when those engines are
>>>>> > implemented so differently. Plugging in code -- and especially custom
>>>>> > user-supplied code -- seems inherently specialized to me and should be
>>>>> > part of the engines' design.
>>>>>
>>>>> How is this different from the views? I feel we can say exactly the
>>>>> same thing for Iceberg views, but yet we have Iceberg multi-dialect views
>>>>> implemented. Maybe it sounds like we are trying to draw a line between SQL
>>>>> vs other programming language as "code"? but I think SQL is just another
>>>>> type of code, and we are already talking about compiling all these
>>>>> different code dialects to an intermediate representation (using projects
>>>>> like Coral, Substrait), which will be stored as another type of
>>>>> representation of Iceberg view. I think the same functionality can be used
>>>>> for UDFs if developed.
>>>>>
>>>>> I actually think adding UDF support is a good idea, even just a
>>>>> multi-dialect one like view, and that can allow engines to for example
>>>>> parse a view SQL, and when a function referenced cannot be resolved, try
>>>>> to seek for a multi-dialect UDF definition.
>>>>>
>>>>> I guess we can discuss more when we have the actual proposal published.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>>
>>>>>> UDFs are as engine specific and portable and "non-centralized" as
>>>>>> views are. The same performance concerns apply to views as well.
>>>>>> Iceberg should define a common base upon which engines can build, so
>>>>>> the argument that UDFs aren't practical, because engines are different,
>>>>>> is probably only a temporary concern.
>>>>>>
>>>>>> In the long term, Iceberg should also try to tackle the idea to make
>>>>>> views portable, which is conceptually not that much different from
>>>>>> portable UDFs.
>>>>>>
>>>>>>
>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having
>>>>>> UDFs in Iceberg, especially not in this early stage.
>>>>>>
>>>>>>
>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>
>>>>>> Thanks, Ajantha.
>>>>>>
>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by
>>>>>> Iceberg catalogs. I think that Iceberg primarily deals with things that
>>>>>> are centralized, like tables of data. While it would be great to have a
>>>>>> common set of functions across engines, I don't see how that is
>>>>>> practical when those engines are implemented so differently. Plugging
>>>>>> in code -- and especially custom user-supplied code -- seems inherently
>>>>>> specialized to me and should be part of the engines' design.
>>>>>>
>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>> memory models without having a huge performance penalty.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> This is a discussion to gauge the community interest in storing
>>>>>>> versioned SQL UDFs in Iceberg.
>>>>>>> We want to propose a spec addition for storing versioned UDFs
>>>>>>> in Iceberg (inspired by the view spec).
>>>>>>>
>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>> associated with tables, but they can accept arguments and produce return
>>>>>>> values, or even function as inline expressions.
>>>>>>> Many query engines, such as Dremio, Trino, Snowflake, and Databricks
>>>>>>> Spark, support SQL UDFs at the catalog level [1].
>>>>>>> But storing them in Iceberg can enable:
>>>>>>> - Versioning of these UDFs.
>>>>>>> - Interoperability between the engines. Potentially engines can
>>>>>>> understand the UDFs written by other engines (with a translation layer).
>>>>>>>
>>>>>>> We believe that integrating this feature into Iceberg would be a
>>>>>>> valuable addition, and we're eager to collaborate with the community to
>>>>>>> develop a UDF specification.
>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a
>>>>>>> specification to propose to the community.
>>>>>>>
>>>>>>> Let us know your thoughts on this.
>>>>>>>
>>>>>>> [1]
>>>>>>> Dremio - https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>> Snowflake - https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>> Databricks - https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>
>>>>>>> - Ajantha
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>> --
>>>>>> Robert Stupp
>>>>>> @snazy
>>>>>>