I have been thinking about this project more, and the more I think about it the more I like it.
As an example of the kind of leverage a library like this might bring, we
might consider changing the implementation of Arrow UDF to reuse the
underlying buffers when possible (e.g. via unary_mut[1]). This would likely
provide an across-the-board efficiency improvement at no cost to downstream
crates.

Andrew

[1]: https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut

On Sat, Jun 29, 2024 at 1:47 AM Xuanwo <xua...@apache.org> wrote:

> > That said, wherever it ends up, there should be the agreement of
> > individuals to accept maintenance of it. Since it's in rust, that would
> > generally fall to the arrow-rs contributors and/or the DataFusion
> > contributors IMO.
> >
> > It would be good for it to be part of the community, but only if it's not
> > going to end up just bitrotting somewhere.
>
> Thanks Matt. This concern does make sense.
>
> Arrow UDF is extensively used within RisingWave and Databend. We, the initial
> committers from both RisingWave and Databend, are eager to take responsibility
> for maintaining these crates.
>
> Additionally, some of us are involved in other Apache Projects, so we understand
> how the Apache Way functions. We will focus on community growth to ensure this
> project remains active.
>
> On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote:
> >> This UDF implementation doesn’t depend on DataFusion. It can work with
> >> any data in the arrow format.
> >
> > Given this I'm in agreement with Antoine that it would be weird for it to
> > be maintained within the DataFusion repo as opposed to its own repo (as
> > we've done in the past for things like nanoarrow and arrow-experiments).
> >
> > That said, wherever it ends up, there should be the agreement of
> > individuals to accept maintenance of it. Since it's in rust, that would
> > generally fall to the arrow-rs contributors and/or the DataFusion
> > contributors IMO.
> >
> > It would be good for it to be part of the community, but only if it's not
> > going to end up just bitrotting somewhere.
> >
> > --Matt
> >
> > On Fri, Jun 28, 2024, 8:49 PM Xuanwo <xua...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> This UDF implementation doesn’t depend on DataFusion. It can work with any
> >> data in the arrow format.
> >>
> >> It has the potential power to make users write ONE UDF function that works
> >> for different query engines, as we showed in databend and risingwave.
> >>
> >> So I personally think it should be part of the arrow community.
> >>
> >> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote:
> >> > Is this UDF implementation based on DataFusion? If so, it makes sense
> >> > for it to be part of the DataFusion project.
> >> >
> >> > OTOH, if it can work with any data in the Arrow format, then it would
> >> > sound weird to maintain it in the DataFusion repo IMHO.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >> >
> >> >
> >> > On 28/06/2024 at 21:52, Andrew Lamb wrote:
> >> >> To be clear, if the arrow community thinks this would be better organized /
> >> >> administered in the Apache DataFusion project (especially if it is aligned
> >> >> with Rust) I think it would be good to discuss donating there
> >> >>
> >> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com> wrote:
> >> >>
> >> >>> I think there are two aspects:
> >> >>> 1. The actual mechanics of implementing functions
> >> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc)
> >> >>>
> >> >>> I agree 2 is not something that belongs naturally in the arrow project and
> >> >>> is better aligned with query engines
> >> >>>
> >> >>> However I think 1 is worth considering.
> >> >>>
> >> >>> As I understand it, the problem arrow_udf solves is avoiding some of the
> >> >>> boilerplate required to make vectorized udfs.
> >> >>> So instead of writing a special eval_gcd function like this
> >> >>>
> >> >>> ```
> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> >> >>>     // do gcd calculation
> >> >>> }
> >> >>>
> >> >>> // implement vectorized version
> >> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> ArrayRef {
> >> >>>     let left = left.as_primitive::<Int64Type>();
> >> >>>     let right = right.as_primitive::<Int64Type>();
> >> >>>     // binary returns a Result, so the error case has to be handled
> >> >>>     let res: Int64Array = binary(left, right, |l, r| gcd(l, r)).unwrap();
> >> >>>     Arc::new(res)
> >> >>> }
> >> >>> ```
> >> >>>
> >> >>> The user simply annotates the scalar function and has the library
> >> >>> codegen the array version
> >> >>> ```
> >> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> >> >>>     // do gcd calculation
> >> >>> }
> >> >>> ```
> >> >>>
> >> >>> We have a lot of boilerplate / non-ideal macro stuff in DataFusion that I
> >> >>> think this would help a lot.
> >> >>>
> >> >>> Andrew
> >> >>>
> >> >>>
> >> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
> >> >>> <r.taylordav...@googlemail.com.invalid> wrote:
> >> >>>
> >> >>>> I wonder if the DataFusion project might be a more natural home for this
> >> >>>> functionality? UDFs are more of a query engine concept, whereas arrow-rs is
> >> >>>> more focused on purely physical execution?
> >> >>>>
> >> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com> wrote:
> >> >>>>> Hi Felipe,
> >> >>>>>
> >> >>>>> Vectorization will be applied whenever possible. When all input and
> >> >>>>> output types of a function are primitive (int16, int32, int64, float32,
> >> >>>>> float64) and do not involve any Option or Result, the macro will
> >> >>>>> automatically generate code based on unary
> >> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary
> >> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
> >> >>>>> which potentially allows for vectorization.
> >> >>>>>
> >> >>>>> Both examples you showed are not vectorized: `div` because of its
> >> >>>>> Result output, and `gcd` because of the loop in its implementation.
> >> >>>>> However, if the function is simple enough, like an `add` function:
> >> >>>>>
> >> >>>>> #[function("add(int, int) -> int")]
> >> >>>>> fn add(a: i32, b: i32) -> i32 {
> >> >>>>>     a + b
> >> >>>>> }
> >> >>>>>
> >> >>>>> It can be auto-vectorized by llvm.
> >> >>>>>
> >> >>>>> Runji
> >> >>>>>
> >> >>>>>
> >> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
> >> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <al...@influxdata.com> wrote:
> >> >>>>>>>
> >> >>>>>>> Hi Xuanwo,
> >> >>>>>>>
> >> >>>>>>> Sorry for the delay in responding. I think the ability to easily write
> >> >>>>>>> functions that "feel" like native functions in whatever language and be
> >> >>>>>>> able to generate arrow / vectorized versions of them is quite valuable.
> >> >>>>>>> This is my understanding of what this proposal is about.
> >> >>>>>>
> >> >>>>>> My understanding is that it's not vectorized. From the examples in
> >> >>>>>> risingwavelabs/arrow-udf, <https://github.com/risingwavelabs/arrow-udf> it
> >> >>>>>> looks like the macros generate code that gathers values from columns into
> >> >>>>>> local scalars that are passed as scalar parameters to user functions. Is
> >> >>>>>> the hope here that rustc/llvm will auto-vectorize the code?
> >> >>>>>>
> >> >>>>>> #[function("gcd(int, int) -> int")]
> >> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 {
> >> >>>>>>     while b != 0 {
> >> >>>>>>         (a, b) = (b, a % b);
> >> >>>>>>     }
> >> >>>>>>     a
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>> #[function("div(int, int) -> int")]
> >> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
> >> >>>>>>     if y == 0 {
> >> >>>>>>         return Err("division by zero");
> >> >>>>>>     }
> >> >>>>>>     Ok(x / y)
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>>> I left some additional comments on the markdown.
> >> >>>>>>>
> >> >>>>>>> One thing that might be worth doing is to articulate some other potential
> >> >>>>>>> locations for where the code might go. One option, as I think you propose,
> >> >>>>>>> is to make its own repository. Another option could be to donate the code
> >> >>>>>>> and put the various language bindings in the same repo as the arrow
> >> >>>>>>> language implementations (e.g. arrow-rs, arrow for python, etc) which would
> >> >>>>>>> likely make it easier to maintain and discover.
> >> >>>>>>>
> >> >>>>>>> I am curious about what other devs / users feel about this?
> >> >>>>>>>
> >> >>>>>>> Andrew
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hello, everyone.
> >> >>>>>>>>
> >> >>>>>>>> I am starting this thread to discuss the donation of a User-Defined Function
> >> >>>>>>>> Framework for Apache Arrow.
> >> >>>>>>>>
> >> >>>>>>>> Feel free to review and leave your comments here. For live review, please
> >> >>>>>>>> visit:
> >> >>>>>>>>
> >> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf
> >> >>>>>>>>
> >> >>>>>>>> The original content is also pasted here for quick reading:
> >> >>>>>>>>
> >> >>>>>>>> ------
> >> >>>>>>>>
> >> >>>>>>>> ## Abstract
> >> >>>>>>>>
> >> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache Arrow.
> >> >>>>>>>>
> >> >>>>>>>> ## Proposal
> >> >>>>>>>>
> >> >>>>>>>> Arrow UDF allows users to easily create and run user-defined functions
> >> >>>>>>>> (UDF) in Rust, Python, Java or JavaScript based on Apache Arrow. The
> >> >>>>>>>> functions can be executed natively, or in WebAssembly, or in a remote
> >> >>>>>>>> server via Arrow Flight.
> >> >>>>>>>>
> >> >>>>>>>> Arrow UDF was originally designed to be used by the RisingWave project but
> >> >>>>>>>> is now being used by Databend and several database startups.
> >> >>>>>>>>
> >> >>>>>>>> We believe that the Arrow UDF project will provide diversity value to the
> >> >>>>>>>> entire Arrow community.
> >> >>>>>>>>
> >> >>>>>>>> ## Background
> >> >>>>>>>>
> >> >>>>>>>> Arrow UDF has been developed by an open-source community from day one and
> >> >>>>>>>> is owned by RisingWaveLabs. The project was launched in December 2023.
> >> >>>>>>>>
> >> >>>>>>>> ## Initial Goals
> >> >>>>>>>>
> >> >>>>>>>> By transferring ownership of the project to Apache Arrow, Arrow UDF
> >> >>>>>>>> expects to ensure its neutrality and further encourage and facilitate the
> >> >>>>>>>> adoption of Arrow UDF by the community.
> >> >>>>>>>>
> >> >>>>>>>> ## Current Status
> >> >>>>>>>>
> >> >>>>>>>> Contributors: 5
> >> >>>>>>>>
> >> >>>>>>>> Users:
> >> >>>>>>>>
> >> >>>>>>>> - [RisingWave]: A Distributed SQL Database for Stream Processing.
> >> >>>>>>>> - [Databend]: An open-source cloud data warehouse that serves as a
> >> >>>>>>>>   cost-effective alternative to Snowflake.
> >> >>>>>>>>
> >> >>>>>>>> ## Documentation
> >> >>>>>>>>
> >> >>>>>>>> The documentation for Arrow UDF is hosted at
> >> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
> >> >>>>>>>>
> >> >>>>>>>> ## Initial Source
> >> >>>>>>>>
> >> >>>>>>>> The project currently holds a GitHub repository and multiple packages:
> >> >>>>>>>>
> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> >> >>>>>>>>
> >> >>>>>>>> Rust:
> >> >>>>>>>>
> >> >>>>>>>> - https://crates.io/arrow-udf/
> >> >>>>>>>> - https://crates.io/arrow-udf-python/
> >> >>>>>>>> - https://crates.io/arrow-udf-js/
> >> >>>>>>>> - https://crates.io/arrow-udf-js-deno/
> >> >>>>>>>> - https://crates.io/arrow-udf-wasm/
> >> >>>>>>>>
> >> >>>>>>>> Python:
> >> >>>>>>>>
> >> >>>>>>>> - https://pypi.org/project/arrow-udf/
> >> >>>>>>>>
> >> >>>>>>>> Those packages will retain their names, while the repository will be moved to
> >> >>>>>>>> the apache org.
> >> >>>>>>>>
> >> >>>>>>>> ## Required Resources
> >> >>>>>>>>
> >> >>>>>>>> ### Mailing Lists
> >> >>>>>>>>
> >> >>>>>>>> We can reuse the existing mailing lists that arrow has.
> >> >>>>>>>>
> >> >>>>>>>> ### Git Repositories
> >> >>>>>>>>
> >> >>>>>>>> From
> >> >>>>>>>>
> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> >> >>>>>>>>
> >> >>>>>>>> To
> >> >>>>>>>>
> >> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf
> >> >>>>>>>> - https://github.com/apache/arrow-udf
> >> >>>>>>>>
> >> >>>>>>>> ### Issue Tracking
> >> >>>>>>>>
> >> >>>>>>>> The project would like to continue using GitHub Issues.
> >> >>>>>>>>
> >> >>>>>>>> ### Other Resources
> >> >>>>>>>>
> >> >>>>>>>> The project has already chosen GitHub Actions as its continuous integration
> >> >>>>>>>> tool.
> >> >>>>>>>>
> >> >>>>>>>> ## Initial Committers
> >> >>>>>>>>
> >> >>>>>>>> - Runji Wang wangrunji0...@163.com
> >> >>>>>>>> - Giovanny Gutiérrez
> >> >>>>>>>> - sundy-li sund...@apache.org
> >> >>>>>>>> - Xuanwo xua...@apache.org
> >> >>>>>>>> - Max Justus Spransy maxjus...@gmail.com
> >> >>>>>>>>
> >> >>>>>>>> [RisingWave]: https://github.com/risingwavelabs/risingwave
> >> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend
> >> >>>>>>>>
> >> >>>>>>>> Xuanwo
> >> >>>>>>>>
> >> >>>>>>
> >> >>>
> >> >>>
> >> >>
> >>
> >> --
> >> Xuanwo
> >>
>
> --
> Xuanwo
>
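
A note on the buffer-reuse idea at the top of this message: the sketch below
is a minimal, standard-library-only illustration of the "mutate in place if
uniquely owned, otherwise allocate" pattern. It does not use arrow-rs.
Int64Column, its unary / unary_mut methods, and apply are hypothetical names
made up for this example; they only mimic the general shape of the approach,
ignore null bitmaps and slicing, and the real unary_mut signature should be
checked against [1].

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow primitive array: a shared,
/// reference-counted buffer of i64 values (no null bitmap, for brevity).
#[derive(Clone, Debug, PartialEq)]
struct Int64Column {
    values: Arc<Vec<i64>>,
}

impl Int64Column {
    fn new(values: Vec<i64>) -> Self {
        Self { values: Arc::new(values) }
    }

    /// Always allocates a fresh output buffer.
    fn unary(&self, op: impl Fn(i64) -> i64) -> Self {
        Self::new(self.values.iter().map(|&v| op(v)).collect())
    }

    /// Mutates the buffer in place when this column is its only owner;
    /// otherwise hands the column back untouched so the caller can fall back.
    fn unary_mut(self, op: impl Fn(i64) -> i64) -> Result<Self, Self> {
        match Arc::try_unwrap(self.values) {
            Ok(mut owned) => {
                for v in owned.iter_mut() {
                    *v = op(*v);
                }
                Ok(Self::new(owned))
            }
            Err(shared) => Err(Self { values: shared }),
        }
    }
}

/// Reuse the input buffer when possible, allocate a new one otherwise.
fn apply(col: Int64Column, op: impl Fn(i64) -> i64 + Copy) -> Int64Column {
    col.unary_mut(op).unwrap_or_else(|shared| shared.unary(op))
}
```

For example, apply(Int64Column::new(vec![1, 2, 3]), |v| v + 1) takes the
in-place path because nothing else holds the buffer, while calling apply on a
column that has been cloned takes the allocating path and leaves the clone
untouched. Generated code could hide this choice from UDF authors, which is
the leverage described above.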