Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

Andrew Lamb Mon, 01 Jul 2024 04:32:34 -0700

I have been thinking about this project more, and the more I think about it
the more I like it.


For example of the kind of leverage a library like this might bring, we
might consider changing the implementation of Arrow UDF to re-use the
underlying buffers when possible (e.g. via unary_mut[1]). This would likely
provide an across the board efficiency improvement for no costs to
downstream crates.

Andrew

[1]:
https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut

On Sat, Jun 29, 2024 at 1:47 AM Xuanwo <xua...@apache.org> wrote:

> > That said, wherever it ends up, there should be the agreement of
> > individuals to accept maintenance of it. Since it's in rust, that would
> > generally fall to the arrow-rs contributors and/or the DataFusion
> > contributors IMO.
> >
> > It would be good for it to be part of the community, but only if it's not
> > going to end up just bitrotting somewhere.
>
> Thanks Matt. This concern does make sense.
>
> Arrow UDF is extensively used within RisingWave and Databend. We, the
> initial
> committers from both RisingWave and Databend, are eager to take
> responsibility
> for maintaining these crates.
>
> Additionally, some of us are involved in other Apache Projects, so we
> understand
> how the Apache Way functions. We will focus on community growth to ensure
> this
> project remains active.
>
> On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote:
> >> This UDF implementation doesn’t depend on DataFusion. It can work with
> > any data in the arrow format.
> >
> > Given this I'm in agreement with Antoine that it would be weird for it to
> > be maintained within the DataFusion repo as opposed to it's own repo (as
> > we've done in the past for things like nanoarrow and arrow-experiments).
> >
> > That said, wherever it ends up, there should be the agreement of
> > individuals to accept maintenance of it. Since it's in rust, that would
> > generally fall to the arrow-rs contributors and/or the DataFusion
> > contributors IMO.
> >
> > It would be good for it to be part of the community, but only if it's not
> > going to end up just bitrotting somewhere.
> >
> > --Matt
> >
> > On Fri, Jun 28, 2024, 8:49 PM Xuanwo <xua...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> This UDF implementation doesn’t depend on DataFusion. It can work with
> any
> >> data in the arrow format.
> >>
> >> It has the potential power to make users write ONE UDF function that
> works
> >> for different query engines as we showed up in databend and risingwave.
> >>
> >> So I personally think it should be part of arrow community.
> >>
> >> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote:
> >> > Is this UDF implementation based on DataFusion? If so, it makes sense
> >> > for it to be part of the DataFusion project.
> >> >
> >> > OTOH, if it can work with any data in the Arrow format, then it would
> >> > sound weird to maintain it in the DataFusion repo IMHO.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >> >
> >> >
> >> > Le 28/06/2024 à 21:52, Andrew Lamb a écrit :
> >> >> To be clear, if the arrow community thinks this would be better
> >> organized /
> >> >> administered in the Apache DataFusion project (especially if it is
> >> aligned
> >> >> with Rust) I think it would be good to discuss donating there
> >> >>
> >> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com>
> >> wrote:
> >> >>
> >> >>> I think there are two aspects:
> >> >>> 1. The actual mechanics of implementing functions
> >> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc)
> >> >>>
> >> >>> I agree 2 is not something that belongs naturally in the arrow
> project
> >> and
> >> >>> is better aligned with query engines
> >> >>>
> >> >>> However I think 1 is worth considering.
> >> >>>
> >> >>> As I understand it, the problem arrow_udf solves is avoiding some of
> >> the
> >> >>> boilerplate  required to make vectorized udfs. So instead of
> writing a
> >> >>> special eval_gcd function like this
> >> >>>
> >> >>> ```
> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> >> >>>   // do gcd calculation
> >> >>> }
> >> >>>
> >> >>> // implement vectorized version
> >> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> ArrayRef {
> >> >>>    let left = left.as_primitive<Int64Type>();
> >> >>>    let right = right.as_primitive<Int64Type>();
> >> >>>    res = binary(left, right, |l, r| gcd(l, r));
> >> >>>    Arc::new(res)
> >> >>> }
> >> >>> ```
> >> >>>
> >> >>> The user simply annotates the scalar function and have the library
> code
> >> >>> gen the array version
> >> >>> ```
> >> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> >> >>>   // do gcd calculation
> >> >>> }
> >> >>> ```
> >> >>>
> >> >>> We have a lot of boilerplate / non idea macro stuff in DataFusion
> that
> >> I
> >> >>> think this would help a lot.
> >> >>>
> >> >>> Andrew
> >> >>>
> >> >>>
> >> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
> >> >>> <r.taylordav...@googlemail.com.invalid> wrote:
> >> >>>
> >> >>>> I wonder if the DataFusion project might be a more natural home for
> >> this
> >> >>>> functionality? UDFs are more of a query engine concept, whereas
> >> arrow-rs is
> >> >>>> more focused on purely physical execution?
> >> >>>>
> >> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com>
> >> wrote:
> >> >>>>> Hi Felipe,
> >> >>>>>
> >> >>>>> Vectorization will be applied whenever possible. When all input
> and
> >> >>>> output types of a function are primitive (int16, int32, int64,
> >> float32,
> >> >>>> float64) and do not involve any Option or Result, the macro will
> >> >>>> automatically generate code based on unary <
> >> >>>> https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or
> binary <
> >> >>>> https://docs.rs/arrow/latest/arrow/compute/fn.binary.html>
> kernels,
> >> >>>> which potentially allows for vectorization.
> >> >>>>>
> >> >>>>> Both examples you showed are not vectorized. The `div` function is
> >> due
> >> >>>> to the Result output, while `gcd` is due to the loop in its
> >> implementation.
> >> >>>> However, if the function is simple enough, like an `add` function:
> >> >>>>>
> >> >>>>> #[function("add(int, int) -> int")]
> >> >>>>> fn add(a: i32, b: i32) -> i32 {
> >> >>>>>     a + b
> >> >>>>> }
> >> >>>>>
> >> >>>>> It can be auto-vectorized by llvm.
> >> >>>>>
> >> >>>>> Runji
> >> >>>>>
> >> >>>>>
> >> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
> >> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <
> al...@influxdata.com>
> >> >>>> wrote:
> >> >>>>>>>
> >> >>>>>>> Hi Xuanwo,
> >> >>>>>>>
> >> >>>>>>> Sorry for the delay in responding. I think  the ability to
> easily
> >> >>>> write
> >> >>>>>>> functions that "feel" like native functions in whatever language
> >> and
> >> >>>> be
> >> >>>>>>> able to generate arrow / vectorized versions of them is quite
> >> >>>> valuable.
> >> >>>>>>> This is my understanding of what this proposal is about.
> >> >>>>>>
> >> >>>>>> My understanding is that it's not vectorized. From the examples
> in
> >> >>>>>> risingwavelabs/arrow-udf, <
> >> https://github.com/risingwavelabs/arrow-udf>
> >> >>>> it
> >> >>>>>> looks like the macros generate code that gathers values from
> columns
> >> >>>> into
> >> >>>>>> local scalars that are passed as scalar parameters to user
> >> functions.
> >> >>>> Is
> >> >>>>>> the hope here that rustc/llvm will auto-vectorize the code?
> >> >>>>>>
> >> >>>>>> #[function("gcd(int, int) -> int")]
> >> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 {
> >> >>>>>>      while b != 0 {
> >> >>>>>>          (a, b) = (b, a % b);
> >> >>>>>>      }
> >> >>>>>>      a
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>> #[function("div(int, int) -> int")]
> >> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
> >> >>>>>>      if y == 0 {
> >> >>>>>>          return Err("division by zero");
> >> >>>>>>      }
> >> >>>>>>      Ok(x / y)
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>>> I left some additional comments on the markdown.
> >> >>>>>>>
> >> >>>>>>> One thing that might be worth doing is articulate some other
> >> >>>> potential
> >> >>>>>>> locations for where the code might go. One option, as I think
> you
> >> >>>> propose,
> >> >>>>>>> is to make its own repository.  Another option could be to
> donate
> >> >>>> the code
> >> >>>>>>> and put the various language bindings in the same repo as the
> arrow
> >> >>>>>>> language implementations (e.g arrow-rs, arrow for python, etc)
> >> which
> >> >>>> would
> >> >>>>>>> likely make it easier to maintain and discover.
> >> >>>>>>>
> >> >>>>>>> I am curious about what other devs / users feel about this?
> >> >>>>>>>
> >> >>>>>>> Andrew
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org>
> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hello, everyone.
> >> >>>>>>>>
> >> >>>>>>>> I start this thread to disscuss the donation of a User-Defined
> >> >>>> Function
> >> >>>>>>>> Framework for Apache Arrow.
> >> >>>>>>>>
> >> >>>>>>>> Feel free to review and leave your comments here. For live
> review,
> >> >>>>>> please
> >> >>>>>>>> visit:
> >> >>>>>>>>
> >> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf
> >> >>>>>>>>
> >> >>>>>>>> The original content also pasted here for a quick reading:
> >> >>>>>>>>
> >> >>>>>>>> ------
> >> >>>>>>>>
> >> >>>>>>>> ## Abstract
> >> >>>>>>>>
> >> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache
> Arrow.
> >> >>>>>>>>
> >> >>>>>>>> ## Proposal
> >> >>>>>>>>
> >> >>>>>>>> Arrow UDF allows user to easily create and run user-defined
> >> >>>> functions
> >> >>>>>>>> (UDF) in Rust, Python, Java or JavaScript based on Apache
> Arrow.
> >> >>>> The
> >> >>>>>>>> functions can be executed natively, or in WebAssembly, or in a
> >> >>>> remote
> >> >>>>>>>> server via Arrow Flight.
> >> >>>>>>>>
> >> >>>>>>>> Arrow UDF was originally designed to be used by the RisingWave
> >> >>>> project
> >> >>>>>> but
> >> >>>>>>>> is now being used by Databend and several database startups.
> >> >>>>>>>>
> >> >>>>>>>> We believe that the Arrow UDF project will provide diversity
> value
> >> >>>> to
> >> >>>>>> the
> >> >>>>>>>> entire Arrow community.
> >> >>>>>>>>
> >> >>>>>>>> ## Background
> >> >>>>>>>>
> >> >>>>>>>> Arrow UDF is being developed by an open-source community from
> day
> >> >>>> one
> >> >>>>>> and
> >> >>>>>>>> is owned by RisingWaveLabs. The project has been launched in
> >> >>>> December
> >> >>>>>> 2023.
> >> >>>>>>>>
> >> >>>>>>>> ## Initial Goals
> >> >>>>>>>>
> >> >>>>>>>> By transferring ownership of the project to the Apache Arrow,
> >> >>>> Arrow UDF
> >> >>>>>>>> expects to ensure its neutrality and further encourage and
> >> >>>> facilitate
> >> >>>>>> the
> >> >>>>>>>> adoption of Arrow UDF by the community.
> >> >>>>>>>>
> >> >>>>>>>> ## Current Status
> >> >>>>>>>>
> >> >>>>>>>> Contributors: 5
> >> >>>>>>>>
> >> >>>>>>>> Users:
> >> >>>>>>>>
> >> >>>>>>>> -   [RisingWave]: A Distributed SQL Database for Stream
> >> Processing.
> >> >>>>>>>> -   [Databend]: An open-source cloud data warehouse that
> serves as
> >> >>>> a
> >> >>>>>>>> cost-effective alternative to Snowflake.
> >> >>>>>>>>
> >> >>>>>>>> ## Documentation
> >> >>>>>>>>
> >> >>>>>>>> The document of Arrow UDF is hosted at
> >> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
> >> >>>>>>>>
> >> >>>>>>>> ## Initial Source
> >> >>>>>>>>
> >> >>>>>>>> The project currently holds a GitHub repository and multiple
> >> >>>> packages:
> >> >>>>>>>>
> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> >> >>>>>>>>
> >> >>>>>>>> Rust:
> >> >>>>>>>>
> >> >>>>>>>> - https://crates.io/arrow-udf/
> >> >>>>>>>> - https://crates.io/arrow-udf-python/
> >> >>>>>>>> - https://crates.io/arrow-udf-js/
> >> >>>>>>>> - https://crates.io/arrow-udf-js-deno/
> >> >>>>>>>> - https://crates.io/arrow-udf-wasm/
> >> >>>>>>>>
> >> >>>>>>>> Python:
> >> >>>>>>>>
> >> >>>>>>>> - https://pypi.org/project/arrow-udf/
> >> >>>>>>>>
> >> >>>>>>>> Those packge will retain its name, while the repository will be
> >> >>>> moved to
> >> >>>>>>>> apache org.
> >> >>>>>>>>
> >> >>>>>>>> ## Required Resources
> >> >>>>>>>>
> >> >>>>>>>> ### Mailing Lists
> >> >>>>>>>>
> >> >>>>>>>> We can reuse the existing mailing lists that arrow have.
> >> >>>>>>>>
> >> >>>>>>>> ### Git Repositories
> >> >>>>>>>>
> >> >>>>>>>> From
> >> >>>>>>>>
> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> >> >>>>>>>>
> >> >>>>>>>> To
> >> >>>>>>>>
> >> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf
> >> >>>>>>>> - https://github.com/apache/arrow-udf
> >> >>>>>>>>
> >> >>>>>>>> ### Issue Tracking
> >> >>>>>>>>
> >> >>>>>>>> The project would like to continue using GitHub Issues.
> >> >>>>>>>>
> >> >>>>>>>> ### Other Resources
> >> >>>>>>>>
> >> >>>>>>>> The project has already chosen GitHub actions as continuous
> >> >>>> integration
> >> >>>>>>>> tools.
> >> >>>>>>>>
> >> >>>>>>>> ## Initial Committers
> >> >>>>>>>>
> >> >>>>>>>> - Runji Wang wangrunji0...@163.com
> >> >>>>>>>> - Giovanny Gutiérrez
> >> >>>>>>>> - sundy-li sund...@apache.org
> >> >>>>>>>> - Xuanwo xua...@apache.org
> >> >>>>>>>> - Max Justus Spransy maxjus...@gmail.com
> >> >>>>>>>>
> >> >>>>>>>> [RisingWave]: https://github.com/risingwavelabs/risingwave
> >> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend
> >> >>>>>>>>
> >> >>>>>>>> Xuanwo
> >> >>>>>>>>
> >> >>>>>>
> >> >>>
> >> >>>
> >> >>
> >>
> >> --
> >> Xuanwo
> >>
>
> --
> Xuanwo
>

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

Reply via email to