rtpsw commented on PR #13500: URL: https://github.com/apache/arrow/pull/13500#issuecomment-1176817686
> The approach, if I'm understanding correctly, is to use C++ to make two passes through the plan (or maybe it's one pass). The first pass gets all the UDFs out of the plan. Pyarrow then unpickles and registers those UDFs. The second actually consumes the plan, using a registry that contains those unpickled functions.

This is a fair description. For the purpose of alignment with my corresponding Substrait proposal, could you confirm that the data associated with each UDF is appropriate/acceptable? If so, I'll ensure it gets expressed in the Substrait plan, even if it ends up being organized differently there.

> This wouldn't be my first approach. I think I'd prefer adding another callback like the consumer_factory for UDF handling. This would make it easier to handle situations where there are alternative UDF handlers. Or, for example, a C++ or R user that still wants to be able to run Python UDFs. However, I'm not opposed to this approach. The end pyarrow interface to the user is still just "substrait in->data out" so if we wanted to move to a different approach in the future that would be fine.

The current approach does not block using a UDF handler. I think the only real difference is that in my approach the data for all UDFs is packed together and crosses the C++/Python boundary once. Given this data, one can write a loop that calls any UDF handler on any of the UDF records, with optional record filtering and other such enhancements if needed. This would be an alternative to the current behavior you described as "Pyarrow then unpickles and registers those UDFs"; I don't think this needs to be implemented right away, but I'm open to arguments in favor.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
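
The loop over packed UDF records mentioned in the comment above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `UdfRecord` fields, the `handle_udfs` helper, and the plain dict standing in for a function registry are hypothetical names, not taken from the PR, from pyarrow, or from the Substrait proposal.

```python
import pickle
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class UdfRecord:
    # Hypothetical record layout; the actual data per UDF may differ.
    name: str
    pickled_function: bytes


def handle_udfs(
    records: Iterable[UdfRecord],
    handler: Callable[[UdfRecord], None],
    keep: Optional[Callable[[UdfRecord], bool]] = None,
) -> int:
    """Call `handler` once per record, skipping records rejected by `keep`.

    Returns the number of records that were passed to the handler.
    """
    handled = 0
    for record in records:
        if keep is not None and not keep(record):
            continue
        handler(record)
        handled += 1
    return handled


# A plain dict stands in for a function registry (assumption for illustration).
registry = {}


def unpickle_and_register(record: UdfRecord) -> None:
    # One possible handler: unpickle the function and store it by name.
    registry[record.name] = pickle.loads(record.pickled_function)


# All UDF records arrive together, as if they had crossed the boundary once.
records = [
    UdfRecord("abs", pickle.dumps(abs)),
    UdfRecord("len", pickle.dumps(len)),
]

# Optional record filtering: here the "len" record is skipped.
count = handle_udfs(records, unpickle_and_register, keep=lambda r: r.name != "len")
```

Because the handler is just a callable, a different handler (for example one that does not unpickle at all) could be passed to the same loop without changing how the records are gathered.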