To be clear, if the Arrow community thinks this would be better organized / administered in the Apache DataFusion project (especially if it is aligned with Rust), I think it would be good to discuss donating there.
On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com> wrote:

> I think there are two aspects:
> 1. The actual mechanics of implementing functions
> 2. The actual library of UDF functions (e.g. sin, cos, nullif, etc.)
>
> I agree 2 is not something that belongs naturally in the arrow project and
> is better aligned with query engines.
>
> However, I think 1 is worth considering.
>
> As I understand it, the problem arrow_udf solves is avoiding some of the
> boilerplate required to make vectorized UDFs. So instead of writing a
> special eval_gcd function like this
>
> ```
> use std::sync::Arc;
> use arrow::array::{ArrayRef, AsArray, Int64Array};
> use arrow::compute::binary;
> use arrow::datatypes::Int64Type;
>
> fn gcd(l: i64, r: i64) -> i64 {
>   // do gcd calculation
> }
>
> // implement vectorized version
> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> ArrayRef {
>   let left = left.as_primitive::<Int64Type>();
>   let right = right.as_primitive::<Int64Type>();
>   // `binary` returns a Result; it fails if the array lengths differ
>   let res: Int64Array = binary(left, right, |l, r| gcd(l, r)).unwrap();
>   Arc::new(res)
> }
> ```
>
> the user simply annotates the scalar function and has the library
> generate the array version:
> ```
> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
> fn gcd(l: i64, r: i64) -> i64 {
>   // do gcd calculation
> }
> ```
>
> We have a lot of boilerplate / non-ideal macro code in DataFusion that I
> think this would help with a lot.
>
> Andrew
>
>
> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
>> I wonder if the DataFusion project might be a more natural home for this
>> functionality? UDFs are more of a query engine concept, whereas arrow-rs
>> is more focused on purely physical execution?
>>
>> On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com> wrote:
>> >Hi Felipe,
>> >
>> >Vectorization will be applied whenever possible.
>> >When all input and output types of a function are primitive (int16,
>> >int32, int64, float32, float64) and do not involve any Option or
>> >Result, the macro will automatically generate code based on the unary
>> ><https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary
>> ><https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
>> >which potentially allows for vectorization.
>> >
>> >Neither of the examples you showed is vectorized. The `div` function is
>> >not vectorized because of its Result output, and `gcd` is not because
>> >of the loop in its implementation. However, if the function is simple
>> >enough, like an `add` function:
>> >
>> >#[function("add(int, int) -> int")]
>> >fn add(a: i32, b: i32) -> i32 {
>> >    a + b
>> >}
>> >
>> >it can be auto-vectorized by LLVM.
>> >
>> >Runji
>> >
>> >
>> >On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
>> >> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <al...@influxdata.com>
>> >> wrote:
>> >> >
>> >> > Hi Xuanwo,
>> >> >
>> >> > Sorry for the delay in responding. I think the ability to easily
>> >> > write functions that "feel" like native functions in whatever
>> >> > language and be able to generate arrow / vectorized versions of
>> >> > them is quite valuable. This is my understanding of what this
>> >> > proposal is about.
>> >>
>> >> My understanding is that it's not vectorized. From the examples in
>> >> risingwavelabs/arrow-udf <https://github.com/risingwavelabs/arrow-udf>,
>> >> it looks like the macros generate code that gathers values from
>> >> columns into local scalars that are passed as scalar parameters to
>> >> user functions. Is the hope here that rustc/llvm will auto-vectorize
>> >> the code?
>> >>
>> >> #[function("gcd(int, int) -> int")]
>> >> fn gcd(mut a: i32, mut b: i32) -> i32 {
>> >>     while b != 0 {
>> >>         (a, b) = (b, a % b);
>> >>     }
>> >>     a
>> >> }
>> >>
>> >> #[function("div(int, int) -> int")]
>> >> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
>> >>     if y == 0 {
>> >>         return Err("division by zero");
>> >>     }
>> >>     Ok(x / y)
>> >> }
>> >>
>> >> > I left some additional comments on the markdown.
>> >> >
>> >> > One thing that might be worth doing is to articulate some other
>> >> > potential locations for where the code might go. One option, as I
>> >> > think you propose, is to make its own repository. Another option
>> >> > could be to donate the code and put the various language bindings
>> >> > in the same repo as the arrow language implementations (e.g.
>> >> > arrow-rs, arrow for python, etc.), which would likely make it
>> >> > easier to maintain and discover.
>> >> >
>> >> > I am curious about what other devs / users feel about this?
>> >> >
>> >> > Andrew
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org> wrote:
>> >> >
>> >> > > Hello, everyone.
>> >> > >
>> >> > > I am starting this thread to discuss the donation of a
>> >> > > User-Defined Function Framework for Apache Arrow.
>> >> > >
>> >> > > Feel free to review and leave your comments here. For live
>> >> > > review, please visit:
>> >> > >
>> >> > > https://hackmd.io/@xuanwo/apache-arrow-udf
>> >> > >
>> >> > > The original content is also pasted here for quick reading:
>> >> > >
>> >> > > ------
>> >> > >
>> >> > > ## Abstract
>> >> > >
>> >> > > Arrow UDF is a User-Defined Function Framework for Apache Arrow.
>> >> > >
>> >> > > ## Proposal
>> >> > >
>> >> > > Arrow UDF allows users to easily create and run user-defined
>> >> > > functions (UDF) in Rust, Python, Java or JavaScript based on
>> >> > > Apache Arrow. The functions can be executed natively, or in
>> >> > > WebAssembly, or in a remote server via Arrow Flight.
>> >> > >
>> >> > > Arrow UDF was originally designed to be used by the RisingWave
>> >> > > project but is now being used by Databend and several database
>> >> > > startups.
>> >> > >
>> >> > > We believe that the Arrow UDF project will provide diversity
>> >> > > value to the entire Arrow community.
>> >> > >
>> >> > > ## Background
>> >> > >
>> >> > > Arrow UDF has been developed by an open-source community since
>> >> > > day one and is owned by RisingWaveLabs. The project was launched
>> >> > > in December 2023.
>> >> > >
>> >> > > ## Initial Goals
>> >> > >
>> >> > > By transferring ownership of the project to Apache Arrow, Arrow
>> >> > > UDF expects to ensure its neutrality and further encourage and
>> >> > > facilitate the adoption of Arrow UDF by the community.
>> >> > >
>> >> > > ## Current Status
>> >> > >
>> >> > > Contributors: 5
>> >> > >
>> >> > > Users:
>> >> > >
>> >> > > - [RisingWave]: A Distributed SQL Database for Stream Processing.
>> >> > > - [Databend]: An open-source cloud data warehouse that serves as
>> >> > >   a cost-effective alternative to Snowflake.
>> >> > >
>> >> > > ## Documentation
>> >> > >
>> >> > > The documentation of Arrow UDF is hosted at
>> >> > > https://docs.rs/arrow-udf/latest/arrow_udf/.
>> >> > >
>> >> > > ## Initial Source
>> >> > >
>> >> > > The project currently holds a GitHub repository and multiple
>> >> > > packages:
>> >> > >
>> >> > > - https://github.com/risingwavelabs/arrow-udf
>> >> > >
>> >> > > Rust:
>> >> > >
>> >> > > - https://crates.io/arrow-udf/
>> >> > > - https://crates.io/arrow-udf-python/
>> >> > > - https://crates.io/arrow-udf-js/
>> >> > > - https://crates.io/arrow-udf-js-deno/
>> >> > > - https://crates.io/arrow-udf-wasm/
>> >> > >
>> >> > > Python:
>> >> > >
>> >> > > - https://pypi.org/project/arrow-udf/
>> >> > >
>> >> > > Those packages will retain their names, while the repository
>> >> > > will be moved to the apache org.
>> >> > >
>> >> > > ## Required Resources
>> >> > >
>> >> > > ### Mailing Lists
>> >> > >
>> >> > > We can reuse the existing mailing lists that Arrow has.
>> >> > >
>> >> > > ### Git Repositories
>> >> > >
>> >> > > From
>> >> > >
>> >> > > - https://github.com/risingwavelabs/arrow-udf
>> >> > >
>> >> > > To
>> >> > >
>> >> > > - https://gitbox.apache.org/asf/repos/arrow-udf
>> >> > > - https://github.com/apache/arrow-udf
>> >> > >
>> >> > > ### Issue Tracking
>> >> > >
>> >> > > The project would like to continue using GitHub Issues.
>> >> > >
>> >> > > ### Other Resources
>> >> > >
>> >> > > The project has already chosen GitHub Actions as its continuous
>> >> > > integration tool.
>> >> > >
>> >> > > ## Initial Committers
>> >> > >
>> >> > > - Runji Wang wangrunji0...@163.com
>> >> > > - Giovanny Gutiérrez
>> >> > > - sundy-li sund...@apache.org
>> >> > > - Xuanwo xua...@apache.org
>> >> > > - Max Justus Spransy maxjus...@gmail.com
>> >> > >
>> >> > > [RisingWave]: https://github.com/risingwavelabs/risingwave
>> >> > > [Databend]: https://github.com/datafuselabs/databend
>> >> > >
>> >> > > Xuanwo
>> >> > >
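
For readers following the thread, the scalar-to-column wrapping discussed above can be sketched without any Arrow dependency. This is only a rough illustration under stated assumptions, not the code arrow-udf generates and not the arrow-rs API: `Column` and `binary_rows` are hypothetical names invented here, with a plain `Vec<Option<i64>>` standing in for an Arrow array and `None` standing in for a null.

```rust
/// Scalar function as a user would write it (Euclid's algorithm).
fn gcd(mut a: i64, mut b: i64) -> i64 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Hypothetical stand-in for an Arrow column: one entry per row, None = null.
type Column = Vec<Option<i64>>;

/// Hypothetical helper playing the role of a `binary` kernel: it applies a
/// scalar function row by row and propagates nulls from either input.
fn binary_rows<F>(left: &Column, right: &Column, op: F) -> Result<Column, String>
where
    F: Fn(i64, i64) -> i64,
{
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    Ok(left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => Some(op(*l, *r)),
            _ => None, // a null on either side yields a null result
        })
        .collect())
}

/// The per-function wrapper that would otherwise be written by hand and
/// that a macro could generate from the scalar signature.
fn eval_gcd(left: &Column, right: &Column) -> Result<Column, String> {
    binary_rows(left, right, gcd)
}
```

Calling `eval_gcd` on two columns of equal length returns one result per row, and a length mismatch comes back as an `Err` rather than a panic. Whether the compiler can auto-vectorize the row loop depends on the scalar body, as noted earlier in the thread: a loop such as the one in `gcd` generally prevents it.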