crepererum opened a new issue, #9326:
URL: https://github.com/apache/arrow-datafusion/issues/9326

   ### Is your feature request related to a problem or challenge?
   
   Databases / ETL solutions built on top of DataFusion can use UDFs (in their 
various forms) to extend the functionality of DataFusion, e.g. to add new 
scalar or aggregation functions. This extensibility however does NOT 
automatically extend to their users, since they cannot and (for security 
reasons) should not add code to the running system. So the U in UDF currently 
stands for "DataFusion API User", not for "End-User".
   
   [WASM](https://webassembly.org/) provides a way to run user code in a secure 
sandbox under an "unknown" host (i.e. the user does not need to know about the 
operating system or CPU architecture). A DataFusion-based solution can use that 
to implement UDFs. However, since the calling convention from the WASM payload 
to the UDF is solution-defined, the end user is likely to have a hard time 
with it, and the ecosystem of tooling for developing UDFs is likely to be small 
or non-existent.
   
   Defining the UDF WASM interface in DataFusion -- potentially in 
collaboration with the Arrow project (since we need to get Arrow data across 
the WASM memory boundary) -- would likely facilitate a wider ecosystem and a 
more streamlined solution. Prior art for this is Arrow Flight, which is now 
being integrated into more and more server and client implementations.
   
   ### Describe the solution you'd like
   
   1. Define/find a way to pass Arrow data into and out of a WASM payload.
   2. Define the WASM calling convention for the different types of UDFs 
(scalar, aggregation, window functions, ...). Make sure to version that 
interface so we can advance it later (e.g. by using new WASM features).
   3. Implement UDFs using 
[wasmtime](https://github.com/bytecodealliance/wasmtime/) in DataFusion.
   4. Offer an easy blueprint / framework for developing UDFs in at least two 
languages.
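
To make steps 1 and 2 more concrete, here is a minimal sketch of what a versioned, pointer-based calling convention could look like. It is only an illustration under stated assumptions: `LinearMemory`, `GuestScalarUdf`, `AddOne`, `invoke_scalar` and `HOST_ABI_VERSION` are hypothetical names, the "guest" is a plain Rust struct rather than a real WASM module, the memory is a `Vec<u8>` rather than wasmtime linear memory, and the payload is a bare little-endian `i64` column rather than a real Arrow array.

```rust
/// Hypothetical interface version the host understands (step 2: versioning).
const HOST_ABI_VERSION: u32 = 1;

/// Stand-in for a WASM instance's linear memory with a trivial bump allocator.
struct LinearMemory {
    bytes: Vec<u8>,
    next_free: usize,
}

impl LinearMemory {
    fn new(size: usize) -> Self {
        LinearMemory { bytes: vec![0; size], next_free: 0 }
    }

    /// Reserve `len` bytes and return the offset, or None if memory is exhausted.
    fn alloc(&mut self, len: usize) -> Option<u32> {
        let start = self.next_free;
        let end = start.checked_add(len)?;
        if end > self.bytes.len() {
            return None;
        }
        self.next_free = end;
        Some(start as u32)
    }

    /// Copy a column of i64 values into memory as little-endian bytes.
    fn write_i64s(&mut self, ptr: u32, values: &[i64]) {
        for (i, v) in values.iter().enumerate() {
            let at = ptr as usize + i * 8;
            self.bytes[at..at + 8].copy_from_slice(&v.to_le_bytes());
        }
    }

    /// Read `count` little-endian i64 values starting at `ptr`.
    fn read_i64s(&self, ptr: u32, count: usize) -> Vec<i64> {
        (0..count)
            .map(|i| {
                let at = ptr as usize + i * 8;
                let mut buf = [0u8; 8];
                buf.copy_from_slice(&self.bytes[at..at + 8]);
                i64::from_le_bytes(buf)
            })
            .collect()
    }
}

/// Hypothetical guest-side contract for a scalar UDF.
trait GuestScalarUdf {
    fn abi_version(&self) -> u32;
    /// Takes (pointer, row count) for the input and returns the output pointer.
    fn call(&self, mem: &mut LinearMemory, in_ptr: u32, rows: u32) -> Option<u32>;
}

/// Example guest: adds one to every value.
struct AddOne;

impl GuestScalarUdf for AddOne {
    fn abi_version(&self) -> u32 {
        1
    }

    fn call(&self, mem: &mut LinearMemory, in_ptr: u32, rows: u32) -> Option<u32> {
        let input = mem.read_i64s(in_ptr, rows as usize);
        let output: Vec<i64> = input.iter().map(|v| v.wrapping_add(1)).collect();
        let out_ptr = mem.alloc(output.len() * 8)?;
        mem.write_i64s(out_ptr, &output);
        Some(out_ptr)
    }
}

/// Host side: check the version, copy data in, call the guest, copy data out.
fn invoke_scalar(udf: &dyn GuestScalarUdf, input: &[i64]) -> Result<Vec<i64>, String> {
    if udf.abi_version() != HOST_ABI_VERSION {
        return Err(format!("unsupported UDF ABI version {}", udf.abi_version()));
    }
    let mut mem = LinearMemory::new(64 * 1024);
    let in_ptr = mem.alloc(input.len() * 8).ok_or("out of guest memory")?;
    mem.write_i64s(in_ptr, input);
    let out_ptr = udf
        .call(&mut mem, in_ptr, input.len() as u32)
        .ok_or("guest call failed")?;
    Ok(mem.read_i64s(out_ptr, input.len()))
}

fn main() {
    // prints Ok([2, 3, 4])
    println!("{:?}", invoke_scalar(&AddOne, &[1, 2, 3]));
}
```

In a real implementation the host would presumably instantiate the module through wasmtime and exchange Arrow buffers (for example via the Arrow C data interface or IPC) instead of raw integers, but the same three pieces would still need to be specified: how memory is allocated in the guest, how data is laid out there, and how the interface version is negotiated.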
   
   ### Describe alternatives you've considered
   
   - Doing this as part of the DataFusion-based solution (i.e. downstream). See 
the drawbacks described in the intro.
   - Using other UDF interface types like Arrow IPC & a Python payload. That 
clearly has security issues and is harder to deploy/manage.
   
   ### Additional context
   
   Projects that might be helpful:
   
   - [WASM Arrow](https://github.com/domoritz/arrow-wasm) (not an official 
Arrow project, yet)
   - [wasmtime](https://github.com/bytecodealliance/wasmtime/)

