[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

GitBox Tue, 01 Mar 2022 18:44:10 -0800


gaojun2048 commented on pull request #1881:
URL: 
https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1056084581



   Yes, the UDF plugin is designed for those who use Ballista as a computing 
engine but do not want to modify the Ballista source code. We use Ballista in 
production and we need it to be able to use our custom UDFs. As a user of 
Ballista, I am reluctant to modify its source code directly, because that means 
I need to recompile Ballista myself, and whenever I want to upgrade to the 
latest community version, I have to do additional merge work. If I use the UDF 
plugin, I only need to maintain the custom UDF code. When I upgrade Ballista, I 
only need to change the version number of the DataFusion dependency in that 
code and then recompile the UDF dynamic libraries. I believe this is a 
friendlier approach for those who actually use Ballista as a computing engine.
   
   In my opinion, people who use DataFusion and people who use Ballista are 
different groups of users, and the UDF plugin is better suited to Ballista than 
to DataFusion.
   1. People who use DataFusion generally build their own computing engines on 
top of it. In this case, they often do not need UDF plugins. They only need to 
put the UDF code into their own computing engine and decide for themselves when 
to call register_udf to register the UDF with DataFusion. If needed, they can 
handle the serialization and deserialization of custom UDFs in their own 
computing engine to achieve distributed scheduling.
   2. People who use Ballista generally use it only as a computing engine. They 
often do not have a deep understanding of the DataFusion source code, so 
directly modifying the source code of Ballista and DataFusion is very difficult 
for them. They may update Ballista frequently, and modifying the source code of 
Ballista and its DataFusion dependency means that each upgrade requires merging 
code and recompiling, which is a very big burden. In particular, it should be 
pointed out that there is currently no way for UDFs to work in Ballista, because 
serializing and deserializing a UDF requires knowing its specific 
implementation, which cannot be achieved without modifying the source code of 
Ballista and DataFusion. The value of the UDF plugin in this case is clear: 
users only need to maintain their own UDF code and do not need to pay attention 
to code changes in Ballista or its DataFusion dependency.
   3. I don't think scalar_functions and aggregate_functions in 
ExecutionContext need to be modified, as these are for those who use DataFusion 
but not Ballista. So I think I should change the code and move the plugin mod 
into the Ballista crate instead of keeping it in DataFusion.
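   
   To illustrate the serialization problem in point 2, here is a minimal sketch 
in plain Rust. It is not code from this PR and does not use the DataFusion or 
Ballista APIs. The names UdfRegistry, ScalarFn, serialize_udf_call and 
deserialize_udf_call are hypothetical and exist only for this example. The idea 
it shows: only the UDF name is sent over the wire, so the receiving side can 
rebuild the call only if the implementation was already registered locally, 
which is the job a plugin would do at startup.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a scalar UDF implementation.
type ScalarFn = fn(&[f64]) -> f64;

/// Hypothetical registry that an executor would fill at startup,
/// for example from UDF plugin libraries (assumption, not the PR's design).
#[derive(Default)]
struct UdfRegistry {
    scalar_functions: HashMap<String, ScalarFn>,
}

impl UdfRegistry {
    /// Register an implementation under a name.
    fn register(&mut self, name: &str, f: ScalarFn) {
        self.scalar_functions.insert(name.to_string(), f);
    }

    /// Look up an implementation by name; fails if nothing was registered.
    fn resolve(&self, name: &str) -> Result<ScalarFn, String> {
        self.scalar_functions
            .get(name)
            .copied()
            .ok_or_else(|| format!("no UDF named '{}' is registered here", name))
    }
}

/// "Serialization": only the UDF name crosses the wire, not the code.
fn serialize_udf_call(name: &str) -> Vec<u8> {
    name.as_bytes().to_vec()
}

/// "Deserialization": the receiver must map the name back to a local
/// implementation, so it needs a registry that already contains it.
fn deserialize_udf_call(bytes: &[u8], registry: &UdfRegistry) -> Result<ScalarFn, String> {
    let name = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    registry.resolve(name)
}

/// Example custom UDF that a user would maintain in their own crate.
fn sum_of_squares(args: &[f64]) -> f64 {
    args.iter().map(|x| x * x).sum()
}
```

   In this sketch, an executor that has called 
registry.register("sum_of_squares", sum_of_squares) can deserialize and run the 
call, while an executor with an empty registry gets an error for the same 
bytes. That is the situation described above when the UDF is not compiled into 
Ballista.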
   
   Thanks a lot. Could you give me more advice on these points?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
