Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Walaa Eldin Moustafa Tue, 28 May 2024 15:31:55 -0700

Thanks Jack. I actually meant scalar/aggregate/table user-defined
functions. Here are some examples of what I meant in (2):


Hive GenericUDF:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
Trino user-defined functions:
https://trino.io/docs/current/develop/functions.html
Flink user-defined functions:
https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/

Probably what you referred to is a variation of (1) where the API is data
flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that is
also possible in the very long run :)
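
To make the two flavors concrete, here is a minimal Python sketch using the
standard-library sqlite3 module as a stand-in engine. Everything named here
(SQL_UDF, inline_sql_udf, initials, the people table) is hypothetical and
only for illustration; it is not a proposed Iceberg format or API. Flavor (1)
is just SQL text that can be expanded inline, while flavor (2) is opaque host
code that has to be registered through an engine-specific API.

```python
import re
import sqlite3

# Flavor (1): the definition is a composition of built-in SQL expressions.
# Hypothetical representation: a name, a parameter list, and a SQL body.
SQL_UDF = {
    "name": "full_name",
    "params": ["first", "last"],
    "body": "upper(first) || ' ' || upper(last)",
}


def inline_sql_udf(udf, args):
    """Expand a SQL-defined function by replacing parameters with arguments."""
    mapping = dict(zip(udf["params"], args))
    pattern = r"\b(" + "|".join(re.escape(p) for p in mapping) + r")\b"
    # Single pass, so an argument is never re-substituted by a later parameter.
    return "(" + re.sub(pattern, lambda m: mapping[m.group(1)], udf["body"]) + ")"


# Flavor (2): imperative code written against an engine API (here, Python).
def initials(first, last):
    return f"{first[0]}.{last[0]}."


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO people VALUES ('ada', 'lovelace')")

# Flavor (2) must be registered with this specific engine before use.
conn.create_function("initials", 2, initials)

# Flavor (1) needs no registration: it becomes plain SQL after expansion.
expr = inline_sql_udf(SQL_UDF, ["first_name", "last_name"])
row = conn.execute(
    f"SELECT {expr}, initials(first_name, last_name) FROM people"
).fetchone()
print(expr)
print(row)
```

The expanded expression in (1) is ordinary SQL, which is why it resembles a
view definition; the function in (2) only exists inside the process that
registered it.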

Thanks,
Walaa.




On Tue, May 28, 2024 at 2:57 PM Jack Ye <[email protected]> wrote:

> > (2) Custom code written as an imperative function against a
> Java/Scala/Python API, etc.
>
> I think we could still explore some long-term opportunities in this case.
> Suppose you register a Spark temp view as some sort of data frame read;
> it could still be resolved to a Spark plan that is representable by an
> intermediate representation. But I agree this gets very complicated very
> quickly, and just having case (1) covered would already be a huge step
> forward.
>
> -Jack
>
>
> On Tue, May 28, 2024 at 1:40 PM Benny Chow <[email protected]> wrote:
>
>> It's interesting to note that a tabular SQL UDF can be used to build a
>> *parameterized view*.  So, there's definitely a lot in common between UDFs
>> and views.
>>
>> Thanks
>>
>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>> [email protected]> wrote:
>>
>>> I think there is a disconnect about what is perceived as a "UDF". There
>>> are 2 flavors:
>>>
>>> (1) Functions that are defined by the user whose definition is a
>>> composition of other built-in functions/SQL expressions.
>>> (2) Custom code written as an imperative function against a
>>> Java/Scala/Python API, etc.
>>>
>>> All the examples in Ajantha's references are pretty much from (1), and I
>>> think those are more analogous to views due to their SQL nature. I agree
>>> that (2) is not practical for Iceberg to maintain, but I think Ajantha's
>>> use cases are around (1), and may be worth evaluating.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <[email protected]>
>>> wrote:
>>>
>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>> memory models without having a huge performance penalty.
>>>>
>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>> (similar to views as shared by the reference links above), the complexity
>>>> involved will be similar to managing views.
>>>>
>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>> We will work on publishing the draft spec (inspired by the view spec)
>>>> this week to facilitate further discussions.
>>>>
>>>> - Ajantha
>>>>
>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <[email protected]> wrote:
>>>>
>>>>> > While it would be great to have a common set of functions across
>>>>> engines, I don't see how that is practical when those engines are
>>>>> implemented so differently. Plugging in code -- and especially custom
>>>>> user-supplied code -- seems inherently specialized to me and should be
>>>>> part of the engines' design.
>>>>>
>>>>> How is this different from views? I feel we can say exactly the
>>>>> same thing about Iceberg views, yet we have Iceberg multi-dialect views
>>>>> implemented. Maybe it sounds like we are trying to draw a line between
>>>>> SQL and other programming languages as "code"? But I think SQL is just
>>>>> another type of code, and we are already talking about compiling all
>>>>> these different code dialects to an intermediate representation (using
>>>>> projects like Coral, Substrait), which will be stored as another type of
>>>>> representation of an Iceberg view. I think the same functionality can be
>>>>> used for UDFs if developed.
>>>>>
>>>>> I actually think adding UDF support is a good idea, even just a
>>>>> multi-dialect one like views. That would allow engines to, for example,
>>>>> parse a view's SQL and, when a referenced function cannot be resolved,
>>>>> look up a multi-dialect UDF definition.
>>>>>
>>>>> I guess we can discuss more when we have the actual proposal published.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <[email protected]> wrote:
>>>>>
>>>>>> UDFs are as engine-specific and portable and "non-centralized" as
>>>>>> views are. The same performance concerns apply to views as well.
>>>>>> Iceberg should define a common base upon which engines can build, so
>>>>>> the argument that UDFs aren't practical because engines are different
>>>>>> is probably only a temporary concern.
>>>>>>
>>>>>> In the long term, Iceberg should also try to tackle making views
>>>>>> portable, which is conceptually not that much different from portable
>>>>>> UDFs.
>>>>>>
>>>>>>
>>>>>> PS: I'm not a fan of casting the idea of having UDFs in Iceberg in a
>>>>>> negative light, especially not at this early stage.
>>>>>>
>>>>>>
>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>
>>>>>> Thanks, Ajantha.
>>>>>>
>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by
>>>>>> Iceberg catalogs. I think that Iceberg primarily deals with things that
>>>>>> are centralized, like tables of data. While it would be great to have a
>>>>>> common set of functions across engines, I don't see how that is
>>>>>> practical when those engines are implemented so differently. Plugging in
>>>>>> code -- and especially custom user-supplied code -- seems inherently
>>>>>> specialized to me and should be part of the engines' design.
>>>>>>
>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>> memory models without having a huge performance penalty.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> This is a discussion to gauge community interest in storing
>>>>>>> versioned SQL UDFs in Iceberg.
>>>>>>> We want to propose a spec addition for storing versioned UDFs
>>>>>>> in Iceberg (inspired by the view spec).
>>>>>>>
>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>> associated with tables, but they can accept arguments and produce return
>>>>>>> values, or even function as inline expressions.
>>>>>>> Many query engines, such as Dremio, Trino, Snowflake, and Databricks
>>>>>>> Spark, support SQL UDFs at the catalog level [1].
>>>>>>> But storing them in Iceberg can enable:
>>>>>>> - Versioning of these UDFs.
>>>>>>> - Interoperability between the engines. Potentially, engines can
>>>>>>> understand the UDFs written by other engines (with a translation layer).
>>>>>>> We believe that integrating this feature into Iceberg would be a
>>>>>>> valuable addition, and we're eager to collaborate with the community to
>>>>>>> develop a UDF specification.
>>>>>>> Stephen <[email protected]> has already begun drafting a
>>>>>>> specification to propose to the community.
>>>>>>>
>>>>>>> Let us know your thoughts on this.
>>>>>>>
>>>>>>> [1]
>>>>>>> Dremio -
>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>> Snowflake -
>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>> Databricks -
>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>
>>>>>>> - Ajantha
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>> --
>>>>>> Robert Stupp
>>>>>> @snazy
>>>>>>
>>>>>>
