Re: [DISCUSS] SPIP: FunctionCatalog

Andrew Melo Tue, 16 Feb 2021 09:00:52 -0800

Hello Ryan,

This proposal looks very interesting. Would future goals for this
functionality include both support for aggregation functions, as well
as support for processing ColumnBatch-es (instead of Row/InternalRow)?


Thanks
Andrew

On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> Thanks for the positive feedback, everyone. It sounds like there is a clear 
> path forward for calling functions. Even without a prototype, the `invoke` 
> plans show that Wenchen's suggested optimization can be done, and 
> incorporating it as an optional extension to this proposal solves many of the 
> unknowns.
>
> With that area now understood, is there any discussion about other parts of 
> the proposal, besides the function call interface?
>
> On Fri, Feb 12, 2021 at 10:40 PM Chao Sun <sunc...@apache.org> wrote:
>>
>> This is an important feature which can unblock several other projects 
>> including bucket join support for DataSource v2, complete support for 
>> enforcing DataSource v2 distribution requirements on the write path, etc. I 
>> like Ryan's proposals which look simple and elegant, with nice support on 
>> function overloading and variadic arguments. On the other hand, I think 
>> Wenchen made a very good point about performance. Overall, I'm excited to 
>> see active discussions on this topic and believe the community will come to 
>> a proposal with the best of both sides.
>>
>> Chao
>>
>> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>> +1 for Liang-chi's.
>>>
>>> Thanks Ryan and Wenchen for leading this.
>>>
>>>
>>> 2021년 2월 13일 (토) 오후 12:18, Liang-Chi Hsieh <vii...@gmail.com>님이 작성:
>>>>
>>>> Basically I think the proposal makes sense to me and I'd like to support 
>>>> the
>>>> SPIP as it looks like we have strong need for the important feature.
>>>>
>>>> Thanks Ryan for working on this and I do also look forward to Wenchen's
>>>> implementation. Thanks for the discussion too.
>>>>
>>>> Actually I think the SupportsInvoke proposed by Ryan looks a good
>>>> alternative to me. Besides Wenchen's alternative implementation, is there a
>>>> chance we also have the SupportsInvoke for comparison?
>>>>
>>>>
>>>> John Zhuge wrote
>>>> > Excited to see our Spark community rallying behind this important 
>>>> > feature!
>>>> >
>>>> > The proposal lays a solid foundation of minimal feature set with careful
>>>> > considerations for future optimizations and extensions. Can't wait to see
>>>> > it leading to more advanced functionalities like views with shared custom
>>>> > functions, function pushdown, lambda, etc. It has already borne fruit 
>>>> > from
>>>> > the constructive collaborations in this thread. Looking forward to
>>>> > Wenchen's prototype and further discussions including the SupportsInvoke
>>>> > extension proposed by Ryan.
>>>> >
>>>> >
>>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley &lt;
>>>>
>>>> > owen.omalley@
>>>>
>>>> > &gt;
>>>> > wrote:
>>>> >
>>>> >> I think this proposal is a very good thing giving Spark a standard way 
>>>> >> of
>>>> >> getting to and calling UDFs.
>>>> >>
>>>> >> I like having the ScalarFunction as the API to call the UDFs. It is
>>>> >> simple, yet covers all of the polymorphic type cases well. I think it
>>>> >> would
>>>> >> also simplify using the functions in other contexts like pushing down
>>>> >> filters into the ORC & Parquet readers although there are a lot of
>>>> >> details
>>>> >> that would need to be considered there.
>>>> >>
>>>> >> .. Owen
>>>> >>
>>>> >>
>>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen &lt;
>>>>
>>>> > ekrogen@.com
>>>>
>>>> > &gt;
>>>> >> wrote:
>>>> >>
>>>> >>> I agree that there is a strong need for a FunctionCatalog within Spark
>>>> >>> to
>>>> >>> provide support for shareable UDFs, as well as make movement towards
>>>> >>> more
>>>> >>> advanced functionality like views which themselves depend on UDFs, so I
>>>> >>> support this SPIP wholeheartedly.
>>>> >>>
>>>> >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>>>> >>> and
>>>> >>> extensible. I generally think Wenchen's proposal is easier for a user 
>>>> >>> to
>>>> >>> work with in the common case, but has greater potential for confusing
>>>> >>> and
>>>> >>> hard-to-debug behavior due to use of reflective method signature
>>>> >>> searches.
>>>> >>> The merits on both sides can hopefully be more properly examined with
>>>> >>> code,
>>>> >>> so I look forward to seeing an implementation of Wenchen's ideas to
>>>> >>> provide
>>>> >>> a more concrete comparison. I am optimistic that we will not let the
>>>> >>> debate
>>>> >>> over this point unreasonably stall the SPIP from making progress.
>>>> >>>
>>>> >>> Thank you to both Wenchen and Ryan for your detailed consideration and
>>>> >>> evaluation of these ideas!
>>>> >>> ------------------------------
>>>> >>> *From:* Dongjoon Hyun &lt;
>>>>
>>>> > dongjoon.hyun@
>>>>
>>>> > &gt;
>>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>>> >>> *To:* Ryan Blue &lt;
>>>>
>>>> > blue@
>>>>
>>>> > &gt;
>>>> >>> *Cc:* Holden Karau &lt;
>>>>
>>>> > holden@
>>>>
>>>> > &gt;; Hyukjin Kwon <
>>>> >>>
>>>>
>>>> > gurwls223@
>>>>
>>>> >>; Spark Dev List &lt;
>>>>
>>>> > dev@.apache
>>>>
>>>> > &gt;; Wenchen Fan
>>>> >>> &lt;
>>>>
>>>> > cloud0fan@
>>>>
>>>> > &gt;
>>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>> >>>
>>>> >>> BTW, I forgot to add my opinion explicitly in this thread because I was
>>>> >>> on the PR before this thread.
>>>> >>>
>>>> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>>>> >>> there for almost two years.
>>>> >>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>>>> >>> the latest updated design docs and AS-IS PR.
>>>> >>>
>>>> >>> And, the rest of the progress in this thread is also very satisfying to
>>>> >>> me.
>>>> >>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>>>> >>>
>>>> >>> To All:
>>>> >>> Please take a look at the design doc and the PR, and give us some
>>>> >>> opinions.
>>>> >>> We really need your participation in order to make DSv2 more complete.
>>>> >>> This will unblock other DSv2 features, too.
>>>> >>>
>>>> >>> Bests,
>>>> >>> Dongjoon.
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun &lt;
>>>>
>>>> > dongjoon.hyun@
>>>>
>>>> > &gt;
>>>> >>> wrote:
>>>> >>>
>>>> >>> Hi, Ryan.
>>>> >>>
>>>> >>> We didn't move past anything (both yours and Wenchen's). What Wenchen
>>>> >>> suggested is double-checking the alternatives with the implementation 
>>>> >>> to
>>>> >>> give more momentum to our discussion.
>>>> >>>
>>>> >>> Your new suggestion about optional extention also sounds like a new
>>>> >>> reasonable alternative to me.
>>>> >>>
>>>> >>> We are still discussing this topic together and I hope we can make a
>>>> >>> conclude at this time (for Apache Spark 3.2) without being stucked like
>>>> >>> last time.
>>>> >>>
>>>> >>> I really appreciate your leadership in this dicsussion and the moving
>>>> >>> direction of this discussion looks constructive to me. Let's give some
>>>> >>> time
>>>> >>> to the alternatives.
>>>> >>>
>>>> >>> Bests,
>>>> >>> Dongjoon.
>>>> >>>
>>>> >>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue &lt;
>>>>
>>>> > blue@
>>>>
>>>> > &gt; wrote:
>>>> >>>
>>>> >>> I don’t think we should so quickly move past the drawbacks of this
>>>> >>> approach. The problems are significant enough that using invoke is not
>>>> >>> sufficient on its own. But, I think we can add it as an optional
>>>> >>> extension
>>>> >>> to shore up the weaknesses.
>>>> >>>
>>>> >>> Here’s a summary of the drawbacks:
>>>> >>>
>>>> >>>    - Magic function signatures are error-prone
>>>> >>>    - Spark would need considerable code to help users find what went
>>>> >>>    wrong
>>>> >>>    - Spark would likely need to coerce arguments (e.g., String,
>>>> >>>    Option[Int]) for usability
>>>> >>>    - It is unclear how Spark will find the Java Method to call
>>>> >>>    - Use cases that require varargs fall back to casting; users will
>>>> >>>    also get this wrong (cast to String instead of UTF8String)
>>>> >>>    - The non-codegen path is significantly slower
>>>> >>>
>>>> >>> The benefit of invoke is to avoid moving data into a row, like this:
>>>> >>>
>>>> >>> -- using invoke
>>>> >>> int result = udfFunction(x, y)
>>>> >>>
>>>> >>> -- using row
>>>> >>> udfRow.update(0, x); -- actual: values[0] = x;
>>>> >>> udfRow.update(1, y);
>>>> >>> int result = udfFunction(udfRow);
>>>> >>>
>>>> >>> And, again, that won’t actually help much in cases that require 
>>>> >>> varargs.
>>>> >>>
>>>> >>> I suggest we add a new marker trait for BoundMethod called
>>>> >>> SupportsInvoke.
>>>> >>> If that interface is implemented, then Spark will look for a method 
>>>> >>> that
>>>> >>> matches the expected signature based on the bound input type. If it
>>>> >>> isn’t
>>>> >>> found, Spark can print a warning and fall back to the InternalRow call:
>>>> >>> “Cannot find udfFunction(int, int)”.
>>>> >>>
>>>> >>> This approach allows the invoke optimization, but solves many of the
>>>> >>> problems:
>>>> >>>
>>>> >>>    - The method to invoke is found using the proposed load and bind
>>>> >>>    approach
>>>> >>>    - Magic function signatures are optional and do not cause runtime
>>>> >>>    failures
>>>> >>>    - Because this is an optional optimization, Spark can be more strict
>>>> >>>    about types
>>>> >>>    - Varargs cases can still use rows
>>>> >>>    - Non-codegen can use an evaluation method rather than falling back
>>>> >>>    to slow Java reflection
>>>> >>>
>>>> >>> This seems like a good extension to me; this provides a plan for
>>>> >>> optimizing the UDF call to avoid building a row, while the existing
>>>> >>> proposal covers the other cases well and addresses how to locate these
>>>> >>> function calls.
>>>> >>>
>>>> >>> This also highlights that the approach used in DSv2 and this proposal 
>>>> >>> is
>>>> >>> working: start small and use extensions to layer on more complex
>>>> >>> support.
>>>> >>>
>>>> >>> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun &lt;
>>>>
>>>> > dongjoon.hyun@
>>>>
>>>> > &gt;
>>>> >>> wrote:
>>>> >>>
>>>> >>> Thank you all for making a giant move forward for Apache Spark 3.2.0.
>>>> >>> I'm really looking forward to seeing Wenchen's implementation.
>>>> >>> That would be greatly helpful to make a decision!
>>>> >>>
>>>> >>> > I'll implement my idea after the holiday and then we can have
>>>> >>> more effective discussions. We can also do benchmarks and get some real
>>>> >>> numbers.
>>>> >>> > FYI: the Presto UDF API
>>>> >>> &lt;https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fprestodb.io%2Fdocs%2Fcurrent%2Fdevelop%2Ffunctions.html&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067978066%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=iMWmHqqXPcT7EK%2Bovyzhy%2BZpU6Llih%2BwdZD53wvobmc%3D&amp;reserved=0&gt;
>>>> >>> also
>>>> >>> takes individual parameters instead of the row parameter. I think this
>>>> >>> direction at least worth a try so that we can see the performance
>>>> >>> difference. It's also mentioned in the design doc as an alternative
>>>> >>> (Trino).
>>>> >>>
>>>> >>> Bests,
>>>> >>> Dongjoon.
>>>> >>>
>>>> >>>
>>>> >>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan &lt;
>>>>
>>>> > cloud0fan@
>>>>
>>>> > &gt; wrote:
>>>> >>>
>>>> >>> FYI: the Presto UDF API
>>>> >>> &lt;https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fprestodb.io%2Fdocs%2Fcurrent%2Fdevelop%2Ffunctions.html&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067988024%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ZSBCR7yx3PpwL4KY9V73JG42Z02ZodqkjxC0LweHt1g%3D&amp;reserved=0&gt;
>>>> >>> also takes individual parameters instead of the row parameter. I think
>>>> >>> this
>>>> >>> direction at least worth a try so that we can see the performance
>>>> >>> difference. It's also mentioned in the design doc as an alternative
>>>> >>> (Trino).
>>>> >>>
>>>> >>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan &lt;
>>>>
>>>> > cloud0fan@
>>>>
>>>> > &gt; wrote:
>>>> >>>
>>>> >>> Hi Holden,
>>>> >>>
>>>> >>> As Hyukjin said, following existing designs is not the principle of DS
>>>> >>> v2
>>>> >>> API design. We should make sure the DS v2 API makes sense. AFAIK we
>>>> >>> didn't
>>>> >>> fully follow the catalog API design from Hive and I believe Ryan also
>>>> >>> agrees with it.
>>>> >>>
>>>> >>> I think the problem here is we were discussing some very detailed 
>>>> >>> things
>>>> >>> without actual code. I'll implement my idea after the holiday and then
>>>> >>> we
>>>> >>> can have more effective discussions. We can also do benchmarks and get
>>>> >>> some
>>>> >>> real numbers.
>>>> >>>
>>>> >>> In the meantime, we can continue to discuss other parts of this
>>>> >>> proposal,
>>>> >>> and make a prototype if possible. Spark SQL has many active
>>>> >>> contributors/committers and this thread doesn't get much attention yet.
>>>> >>>
>>>> >>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon &lt;
>>>>
>>>> > gurwls223@
>>>>
>>>> > &gt; wrote:
>>>> >>>
>>>> >>> Just dropping a few lines. I remember that one of the goals in DSv2 is
>>>> >>> to
>>>> >>> correct the mistakes we made in the current Spark codes.
>>>> >>> It would not have much point if we will happen to just follow and mimic
>>>> >>> what Spark currently does. It might just end up with another copy of
>>>> >>> Spark
>>>> >>> APIs, e.g. Expression (internal) APIs. I sincerely would like to avoid
>>>> >>> this
>>>> >>> I do believe we have been stuck mainly due to trying to come up with a
>>>> >>> better design. We already have an ugly picture of the current Spark 
>>>> >>> APIs
>>>> >>> to
>>>> >>> draw a better bigger picture.
>>>> >>>
>>>> >>>
>>>> >>> 2021년 2월 10일 (수) 오전 3:28, Holden Karau &lt;
>>>>
>>>> > holden@
>>>>
>>>> > &gt;님이 작성:
>>>> >>>
>>>> >>> I think this proposal is a good set of trade-offs and has existed in 
>>>> >>> the
>>>> >>> community for a long period of time. I especially appreciate how the
>>>> >>> design
>>>> >>> is focused on a minimal useful component, with future optimizations
>>>> >>> considered from a point of view of making sure it's flexible, but 
>>>> >>> actual
>>>> >>> concrete decisions left for the future once we see how this API is 
>>>> >>> used.
>>>> >>> I
>>>> >>> think if we try and optimize everything right out of the gate, we'll
>>>> >>> quickly get stuck (again) and not make any progress.
>>>> >>>
>>>> >>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue &lt;
>>>>
>>>> > blue@
>>>>
>>>> > &gt; wrote:
>>>> >>>
>>>> >>> Hi everyone,
>>>> >>>
>>>> >>> I'd like to start a discussion for adding a FunctionCatalog interface 
>>>> >>> to
>>>> >>> catalog plugins. This will allow catalogs to expose functions to Spark,
>>>> >>> similar to how the TableCatalog interface allows a catalog to expose
>>>> >>> tables. The proposal doc is available here:
>>>> >>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>> >>> &lt;https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U%2Fedit&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067988024%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Kyth8%2FhNUZ6GXG2FsgcknZ7t7s0%2BpxnDMPyxvsxLLqE%3D&amp;reserved=0&gt;
>>>> >>>
>>>> >>> Here's a high-level summary of some of the main design choices:
>>>> >>> * Adds the ability to list and load functions, not to create or modify
>>>> >>> them in an external catalog
>>>> >>> * Supports scalar, aggregate, and partial aggregate functions
>>>> >>> * Uses load and bind steps for better error messages and simpler
>>>> >>> implementations
>>>> >>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>>>> >>> data
>>>> >>> * Can be extended using mix-in interfaces to add vectorization, 
>>>> >>> codegen,
>>>> >>> and other future features
>>>> >>>
>>>> >>> There is also a PR with the proposed API:
>>>> >>> https://github.com/apache/spark/pull/24559/files
>>>> >>> &lt;https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fspark%2Fpull%2F24559%2Ffiles&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067988024%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=t3ZCqffdsrmCY3X%2FT8x1oMjMcNUiQ0wQNk%2ByAXQx1Io%3D&amp;reserved=0&gt;
>>>> >>>
>>>> >>> Let's discuss the proposal here rather than on that PR, to get better
>>>> >>> visibility. Also, please take the time to read the proposal first. That
>>>> >>> really helps clear up misconceptions.
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Ryan Blue
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Twitter: https://twitter.com/holdenkarau
>>>> >>> &lt;https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2Fholdenkarau&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067997978%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=fVfSPIyazuUYv8VLfNu%2BUIHdc3ePM1AAKKH%2BlnIicF8%3D&amp;reserved=0&gt;
>>>> >>> Books (Learning Spark, High Performance Spark, etc.):
>>>> >>> https://amzn.to/2MaRAG9
>>>> >>> &lt;https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Famzn.to%2F2MaRAG9&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067997978%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=NbRl9kK%2B6Wy0jWmDnztYp3JCPNLuJvmFsLHUrXzEhlk%3D&amp;reserved=0&gt;
>>>> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> >>> &lt;https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.youtube.com%2Fuser%2Fholdenkarau&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060068007935%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=OWXOBELzO3hBa2JI%2FOSBZ3oNyLq0yr%2FGXMkNn7bqYDM%3D&amp;reserved=0&gt;
>>>> >>>
>>>> >>> --
>>>> >>> Ryan Blue
>>>> >>>
>>>> >>>
>>>> >
>>>> > --
>>>> > John Zhuge
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [DISCUSS] SPIP: FunctionCatalog

Reply via email to