adriangb commented on issue #173:
URL: https://github.com/apache/arrow-ballista/issues/173#issuecomment-1387435057

   > I think being able to run Python UDFs is a must, almost not even worth 
having Python UDF support if dependencies can't be used. This is just my 
opinion and not a fact.
   
   Agreed that if you want to allow custom Python code to run you need to allow 
3rd party dependencies. But 3rd party dependencies are a whole big can of worms 
in Python, to the point where I would avoid opening it if you can. Hence the 
suggestion for HTTP UDFs, which also have other benefits/use cases.
   
   I think a key aspect of this is to allow users to stick to the workflows 
they know instead of having to build a new one. For example, as a data 
engineer/backend dev I manage a large project with multiple deployable 
artifacts that get bundled into containers and deployed on k8s. I have the 
knowledge and infrastructure in place to handle all of the complexities 
involved in this (e.g. locking dependencies across deployables à la Cargo 
workspaces). Any sort of new dependency management paradigm that does not fit 
in with this is extra work and possibly a source of bugs. That includes the 
"SSH into a node" model and Airflow's "same dependencies everywhere" model 🙃. I think a 
good model would be a sort of "UDF executor web framework" that does some 
hand-holding but ultimately leaves packaging and dependencies up to users:
   
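   ```python
   from typing import Iterable
   
   # NOTE: `UDFExecutorApp` and `udf` are a made-up API, for illustration only.
   from ballista import UDFExecutorApp, udf
   from pyarrow import RecordBatch, RecordBatchReader
   
   @udf.aggregator(name="override_the_name")
   def some_aggregation_function(reader: RecordBatchReader) -> Iterable[RecordBatch]:
       ...
   
   def main() -> None:
       # Packaging and dependencies stay with the user; the app only serves the UDFs.
       app = UDFExecutorApp([some_aggregation_function])
       app.serve(scheduler_host=...)
   ```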
   
   I'm just making something up here but the point is to keep a relatively 
familiar Flask-style API but abstract away the fiddly bits.
   
   This would register itself with the scheduler as being able to execute 
`"override_the_name"`.
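
To make the name-based registration a bit more concrete, here is a rough, stdlib-only sketch of how a decorator like `udf.aggregator` could collect functions under an overridable name. Everything in it is an assumption for illustration: `UDFRegistry`, `registration_payload` and the payload shape are hypothetical and not part of any existing Ballista API, and plain lists of integers stand in for Arrow record batches.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class UDFRegistry:
    """Hypothetical registry of UDFs keyed by name (not a real Ballista API)."""

    functions: Dict[str, Callable] = field(default_factory=dict)

    def aggregator(self, name: Optional[str] = None) -> Callable[[Callable], Callable]:
        """Register a function under `name`, falling back to its own name."""

        def decorator(func: Callable) -> Callable:
            key = name or func.__name__
            if key in self.functions:
                raise ValueError(f"UDF {key!r} is already registered")
            self.functions[key] = func
            return func  # the function itself is returned unchanged

        return decorator

    def registration_payload(self) -> dict:
        """Made-up shape of what an executor might send to the scheduler."""
        return {"udfs": sorted(self.functions)}


udf = UDFRegistry()


@udf.aggregator(name="override_the_name")
def some_aggregation_function(batches: Iterable[List[int]]) -> Iterable[List[int]]:
    # Stand-in for record batches: reduce each batch to a one-element batch.
    for batch in batches:
        yield [sum(batch)]


payload = udf.registration_payload()
```

The scheduler would then only need the names in `payload` to route work; how the function is packaged and which dependencies it pulls in stays entirely on the user's side.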


