adriangb commented on issue #173:
URL: https://github.com/apache/arrow-ballista/issues/173#issuecomment-1387435057
> I think being able to run Python UDFs is a must, almost not even worth
> having Python UDF support if dependencies can't be used. This is just my
> opinion and not a fact.
Agreed that if you want to allow custom Python code to run you need to allow
3rd party dependencies. But 3rd party dependencies are a whole big can of worms
in Python, to the point where I would avoid opening it if you can. Hence the
suggestion for HTTP UDFs which also have other benefits/use cases.
I think a key aspect of this is to allow users to stick to the workflows
they know instead of having to build a new one. For example, as a data
engineer/backend dev I manage a large project with multiple deployable
artifacts that get bundled into containers and deployed on k8s. I have the
knowledge and infrastructure in place to handle all of the complexities
involved in this (e.g. locking dependencies across deployables à la Cargo
workspaces). Any sort of new dependency management paradigm that does not fit
in with this is extra work and possibly a source of bugs. That includes the
"SSH into a node" model and Airflow's "same dependencies everywhere" model 🙃. I think a
good model would be a sort of "UDF executor web framework" that does some
hand-holding but ultimately leaves packaging and dependencies up to users:
```python
from typing import Iterable

from ballista import UDFExecutorApp, udf  # made-up API, does not exist (yet)
from pyarrow import RecordBatch, RecordBatchReader


@udf.aggregator(name="override_the_name")
def some_aggregation_function(reader: RecordBatchReader) -> Iterable[RecordBatch]:
    ...


def main() -> None:
    app = UDFExecutorApp([some_aggregation_function])
    app.serve(scheduler_host=...)
```
I'm just making something up here, but the point is to keep a relatively
familiar Flask-style API while abstracting away the fiddly bits.
The app would register itself with the scheduler as being able to execute
`"override_the_name"`.
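To make the registration idea a bit more concrete, here is a minimal stdlib-only sketch of what such a framework might do internally. Everything in it is an assumption for illustration: `RegisteredUDF`, `aggregator`, `registration_payload` and `execute` are hypothetical names, plain lists of ints stand in for `RecordBatch`, and no real Ballista API or network call is involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional

# Stand-in for pyarrow.RecordBatch so the sketch runs without dependencies.
Batch = List[int]
UDFFunc = Callable[[Iterable[Batch]], Iterable[Batch]]


@dataclass(frozen=True)
class RegisteredUDF:
    """Hypothetical wrapper pairing a user function with its public name."""
    name: str
    kind: str
    func: UDFFunc


def aggregator(name: Optional[str] = None) -> Callable[[UDFFunc], RegisteredUDF]:
    """Hypothetical decorator; falls back to the function name if none is given."""
    def decorator(func: UDFFunc) -> RegisteredUDF:
        return RegisteredUDF(name or func.__name__, "aggregator", func)
    return decorator


class UDFExecutorApp:
    """Hypothetical app object that knows which UDFs this process can run."""

    def __init__(self, udfs: Iterable[RegisteredUDF]) -> None:
        self._udfs: Dict[str, RegisteredUDF] = {u.name: u for u in udfs}

    def registration_payload(self) -> Dict[str, List[str]]:
        # What the app might send to the scheduler to advertise its UDFs.
        return {"udfs": sorted(self._udfs)}

    def execute(self, name: str, batches: Iterable[Batch]) -> List[Batch]:
        # What the app might do when the scheduler routes work to it.
        return list(self._udfs[name].func(batches))


@aggregator(name="override_the_name")
def some_aggregation_function(batches: Iterable[Batch]) -> Iterable[Batch]:
    # Toy aggregation: sum every value across all incoming batches.
    yield [sum(sum(batch) for batch in batches)]


app = UDFExecutorApp([some_aggregation_function])
print(app.registration_payload())
print(app.execute("override_the_name", [[1, 2], [3]]))
```

The user only writes the decorated function and builds the app inside whatever container or packaging setup they already have; the name-to-function mapping and the scheduler handshake would stay inside the framework.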