To be clear, if the Arrow community thinks this would be better organized /
administered in the Apache DataFusion project (especially if it is aligned
with Rust), I think it would be good to discuss donating it there.

On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com> wrote:

> I think there are two aspects:
> 1. The actual mechanics of implementing functions
> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc)
>
> I agree 2 is not something that belongs naturally in the arrow project and
> is better aligned with query engines
>
> However I think 1 is worth considering.
>
> As I understand it, the problem arrow_udf solves is avoiding some of the
> boilerplate  required to make vectorized udfs. So instead of writing a
> special eval_gcd function like this
>
> ```
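> use std::sync::Arc;
> use arrow::array::{ArrayRef, AsArray, Int64Array};
> use arrow::{compute::binary, datatypes::Int64Type, error::ArrowError};
>
> fn gcd(l: i64, r: i64) -> i64 {
>  // do gcd calculation
> }
>
> // implement vectorized version; `binary` fails if the array lengths differ
> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef, ArrowError> {
>   let left = left.as_primitive::<Int64Type>();
>   let right = right.as_primitive::<Int64Type>();
>   let res: Int64Array = binary(left, right, |l, r| gcd(l, r))?;
>   Ok(Arc::new(res))
> }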
> ```
>
> The user simply annotates the scalar function and has the library
> generate the array version:
> ```
> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
> fn gcd(l: i64, r: i64) -> i64 {
>  // do gcd calculation
> }
> ```
>
> We have a lot of boilerplate / non-ideal macro stuff in DataFusion that I
> think this would help with a lot.
>
> Andrew
>
>
> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
>> I wonder if the DataFusion project might be a more natural home for this
>> functionality? UDFs are more of a query engine concept, whereas arrow-rs is
>> more focused on purely physical execution?
>>
>> On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com> wrote:
>> >Hi Felipe,
>> >
>> >Vectorization will be applied whenever possible. When all input and
>> output types of a function are primitive (int16, int32, int64, float32,
>> float64) and do not involve any Option or Result, the macro will
>> automatically generate code based on unary <
>> https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary <
>> https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
>> which potentially allows for vectorization.
>> >
>> >Both examples you showed are not vectorized. The `div` function is due
>> to the Result output, while `gcd` is due to the loop in its implementation.
>> However, if the function is simple enough, like an `add` function:
>> >
>> >#[function("add(int, int) -> int")]
>> >fn add(a: i32, b: i32) -> i32 {
>> >    a + b
>> >}
>> >
>> >It can be auto-vectorized by llvm.
>> >
>> >Runji
>> >
>> >
>> >On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
>> >> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <al...@influxdata.com>
>> wrote:
>> >> >
>> >> > Hi Xuanwo,
>> >> >
>> >> > Sorry for the delay in responding. I think  the ability to easily
>> write
>> >> > functions that "feel" like native functions in whatever language and
>> be
>> >> > able to generate arrow / vectorized versions of them is quite
>> valuable.
>> >> > This is my understanding of what this proposal is about.
>> >>
>> >> My understanding is that it's not vectorized. From the examples in
>> >> risingwavelabs/arrow-udf, <https://github.com/risingwavelabs/arrow-udf>
>> it
>> >> looks like the macros generate code that gathers values from columns
>> into
>> >> local scalars that are passed as scalar parameters to user functions.
>> Is
>> >> the hope here that rustc/llvm will auto-vectorize the code?
>> >>
>> >> #[function("gcd(int, int) -> int")]
>> >> fn gcd(mut a: i32, mut b: i32) -> i32 {
>> >>     while b != 0 {
>> >>         (a, b) = (b, a % b);
>> >>     }
>> >>     a
>> >> }
>> >>
>> >> #[function("div(int, int) -> int")]
>> >> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
>> >>     if y == 0 {
>> >>         return Err("division by zero");
>> >>     }
>> >>     Ok(x / y)
>> >> }
>> >>
>> >> > I left some additional comments on the markdown.
>> >> >
>> >> > One thing that might be worth doing is articulate some other
>> potential
>> >> > locations for where the code might go. One option, as I think you
>> propose,
>> >> > is to make its own repository.  Another option could be to donate
>> the code
>> >> > and put the various language bindings in the same repo as the arrow
>> >> > language implementations (e.g arrow-rs, arrow for python, etc) which
>> would
>> >> > likely make it easier to maintain and discover.
>> >> >
>> >> > I am curious about what other devs / users feel about this?
>> >> >
>> >> > Andrew
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org> wrote:
>> >> >
>> >> > > Hello, everyone.
>> >> > >
>> >> > > I am starting this thread to discuss the donation of a User-Defined
>> Function
>> >> > > Framework for Apache Arrow.
>> >> > >
>> >> > > Feel free to review and leave your comments here. For live review,
>> >> please
>> >> > > visit:
>> >> > >
>> >> > > https://hackmd.io/@xuanwo/apache-arrow-udf
>> >> > >
>> >> > > The original content is also pasted here for quick reading:
>> >> > >
>> >> > > ------
>> >> > >
>> >> > > ## Abstract
>> >> > >
>> >> > > Arrow UDF is a User-Defined Function Framework for Apache Arrow.
>> >> > >
>> >> > > ## Proposal
>> >> > >
>> >> > > Arrow UDF allows users to easily create and run user-defined
>> functions
>> >> > > (UDF) in Rust, Python, Java or JavaScript based on Apache Arrow.
>> The
>> >> > > functions can be executed natively, or in WebAssembly, or in a
>> remote
>> >> > > server via Arrow Flight.
>> >> > >
>> >> > > Arrow UDF was originally designed to be used by the RisingWave
>> project
>> >> but
>> >> > > is now being used by Databend and several database startups.
>> >> > >
>> >> > > We believe that the Arrow UDF project will provide diversity value
>> to
>> >> the
>> >> > > entire Arrow community.
>> >> > >
>> >> > > ## Background
>> >> > >
>> >> > > Arrow UDF has been developed by an open-source community since day
>> one
>> >> and
>> >> > > is owned by RisingWaveLabs. The project was launched in
>> December
>> >> 2023.
>> >> > >
>> >> > > ## Initial Goals
>> >> > >
>> >> > > By transferring ownership of the project to Apache Arrow,
>> Arrow UDF
>> >> > > expects to ensure its neutrality and further encourage and
>> facilitate
>> >> the
>> >> > > adoption of Arrow UDF by the community.
>> >> > >
>> >> > > ## Current Status
>> >> > >
>> >> > > Contributors: 5
>> >> > >
>> >> > > Users:
>> >> > >
>> >> > > -   [RisingWave]: A Distributed SQL Database for Stream Processing.
>> >> > > -   [Databend]: An open-source cloud data warehouse that serves as
>> a
>> >> > > cost-effective alternative to Snowflake.
>> >> > >
>> >> > > ## Documentation
>> >> > >
>> >> > > The documentation for Arrow UDF is hosted at
>> >> > > https://docs.rs/arrow-udf/latest/arrow_udf/.
>> >> > >
>> >> > > ## Initial Source
>> >> > >
>> >> > > The project currently holds a GitHub repository and multiple
>> packages:
>> >> > >
>> >> > > - https://github.com/risingwavelabs/arrow-udf
>> >> > >
>> >> > > Rust:
>> >> > >
>> >> > > - https://crates.io/arrow-udf/
>> >> > > - https://crates.io/arrow-udf-python/
>> >> > > - https://crates.io/arrow-udf-js/
>> >> > > - https://crates.io/arrow-udf-js-deno/
>> >> > > - https://crates.io/arrow-udf-wasm/
>> >> > >
>> >> > > Python:
>> >> > >
>> >> > > - https://pypi.org/project/arrow-udf/
>> >> > >
>> >> > > Those packages will retain their names, while the repository will be
>> moved to
>> >> > > apache org.
>> >> > >
>> >> > > ## Required Resources
>> >> > >
>> >> > > ### Mailing Lists
>> >> > >
>> >> > > We can reuse the existing mailing lists that Arrow has.
>> >> > >
>> >> > > ### Git Repositories
>> >> > >
>> >> > > From
>> >> > >
>> >> > > - https://github.com/risingwavelabs/arrow-udf
>> >> > >
>> >> > > To
>> >> > >
>> >> > > - https://gitbox.apache.org/asf/repos/arrow-udf
>> >> > > - https://github.com/apache/arrow-udf
>> >> > >
>> >> > > ### Issue Tracking
>> >> > >
>> >> > > The project would like to continue using GitHub Issues.
>> >> > >
>> >> > > ### Other Resources
>> >> > >
>> >> > > The project has already chosen GitHub Actions as its continuous
>> integration
>> >> > > tools.
>> >> > >
>> >> > > ## Initial Committers
>> >> > >
>> >> > > - Runji Wang wangrunji0...@163.com
>> >> > > - Giovanny Gutiérrez
>> >> > > - sundy-li sund...@apache.org
>> >> > > - Xuanwo xua...@apache.org
>> >> > > - Max Justus Spransy maxjus...@gmail.com
>> >> > >
>> >> > > [RisingWave]: https://github.com/risingwavelabs/risingwave
>> >> > > [Databend]: https://github.com/datafuselabs/databend
>> >> > >
>> >> > > Xuanwo
>> >> > >
>> >>
>
>
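
For anyone skimming the thread above, here is a rough, dependency-free sketch of the "mechanics" under discussion: a scalar function plus the thin array-level wrapper that a macro could generate around it. It is not the code arrow-udf actually emits. Plain `i64` slices stand in for Arrow arrays, and `binary_kernel` is a hypothetical stand-in for a `binary`-style kernel, so treat every name other than `gcd` and `eval_gcd` as an assumption made for illustration.

```rust
/// Scalar function, written the way a UDF author would write it.
fn gcd(mut a: i64, mut b: i64) -> i64 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Hypothetical stand-in for a `binary`-style kernel: applies an
/// infallible scalar function element-wise over two equal-length columns.
/// (Assumption: plain slices replace Arrow arrays; nulls are not modelled.)
fn binary_kernel<F>(left: &[i64], right: &[i64], op: F) -> Result<Vec<i64>, String>
where
    F: Fn(i64, i64) -> i64,
{
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    Ok(left.iter().zip(right).map(|(&l, &r)| op(l, r)).collect())
}

/// The kind of wrapper a macro could generate from the annotated scalar
/// function, so that nobody has to write it by hand.
fn eval_gcd(left: &[i64], right: &[i64]) -> Result<Vec<i64>, String> {
    binary_kernel(left, right, gcd)
}

fn main() {
    let left = [12, 7, 0];
    let right = [18, 5, 9];
    let out = eval_gcd(&left, &right).expect("columns have equal length");
    println!("{:?}", out);
}
```

Because the closure passed to the kernel is infallible and works on primitive values, the compiler is free to optimise the element-wise loop. Whether it really auto-vectorizes depends on the function body, as noted in the quoted replies: a simple `a + b` is a good candidate, while the loop inside `gcd` is not.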
