[ https://issues.apache.org/jira/browse/ARROW-15635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vibhatha Lakmal Abeykoon updated ARROW-15635:
---------------------------------------------
    Description:
The objective is to list the tasks required to provide UDF support for the Apache Arrow streaming execution engine. In the first iteration we will focus on supporting Python-based UDFs, i.e. UDFs defined as Python functions.
The UDF integration will be carried out as a series of sub-tasks covering development and PoCs. Note that this is the first iteration of UDF integration and has a limited scope. This ticket will cover the following topics:
# PoC for UDF integration: The objective is to evaluate the existing components in the source and identify the modifications and new building blocks required to integrate UDFs.
# The languages will be limited to C++/Python. Users can register a Python function as a UDF and use it with an `apply` method on Arrow Tables, or provide a computation API endpoint via the arrow::compute API. Note that the C++ API already provides a way to register custom functions via the function registry API. At the moment this is not exposed to Python.
# Planned features for this ticket are:
## Scalar UDFs: UDFs executed per value (per row)
## Vector UDFs: UDFs executed per batch (a full array or partial array)
## Aggregate UDFs: UDFs associated with an aggregation operation
# Integration limitations:
## Custom data types that are not compatible with NumPy or Pandas are not supported
## Complex processing with parallelism within UDFs is not supported
## Parallel UDFs are not supported in the initial version, although we are documenting what is required, along with a rough sketch, for the next phase.

  was:
The objective is to list the tasks required to provide UDF support for the Apache Arrow streaming execution engine. In the first iteration we will focus on supporting Python-based UDFs, i.e. UDFs defined as Python functions.
The UDF integration will be carried out as a series of sub-tasks covering development and PoCs. Note that this is the first iteration of UDF integration and has a limited scope. This ticket will cover the following topics:
# PoC for UDF integration: The objective is to evaluate the existing components in the source and identify the modifications and new building blocks required to integrate UDFs.
# The languages will be limited to C++/Python. Users can register a Python function as a UDF and use it with an `apply` method on Arrow Tables, or provide a computation API endpoint via the arrow::compute API. Note that the C++ API already provides a way to register custom functions via the function registry API. At the moment this is not exposed to Python.
# Planned features for this ticket are:
## Scalar UDFs: UDFs executed per value (per row)
## Vector UDFs: UDFs executed per batch (a full array or partial array)
## Aggregate UDFs: UDFs associated with an aggregation operation
# Integration limitations:
## Custom data types that are not compatible with NumPy or Pandas are not supported
## Complex processing with parallelism within UDFs is not supported


> [C++][Python] UDF Integration
> ------------------------------
>
>                 Key: ARROW-15635
>                 URL: https://issues.apache.org/jira/browse/ARROW-15635
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++, Python
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> The objective is to list the tasks required to provide UDF support for the
> Apache Arrow streaming execution engine. In the first iteration we will
> focus on supporting Python-based UDFs, i.e. UDFs defined as Python
> functions.
> The UDF integration will be carried out as a series of sub-tasks covering
> development and PoCs. Note that this is the first iteration of UDF
> integration and has a limited scope.
> This ticket will cover the following topics:
> # PoC for UDF integration: The objective is to evaluate the existing
> components in the source and identify the modifications and new building
> blocks required to integrate UDFs.
> # The languages will be limited to C++/Python. Users can register a Python
> function as a UDF and use it with an `apply` method on Arrow Tables, or
> provide a computation API endpoint via the arrow::compute API. Note that
> the C++ API already provides a way to register custom functions via the
> function registry API. At the moment this is not exposed to Python.
> # Planned features for this ticket are:
> ## Scalar UDFs: UDFs executed per value (per row)
> ## Vector UDFs: UDFs executed per batch (a full array or partial array)
> ## Aggregate UDFs: UDFs associated with an aggregation operation
> # Integration limitations:
> ## Custom data types that are not compatible with NumPy or Pandas are not
> supported
> ## Complex processing with parallelism within UDFs is not supported
> ## Parallel UDFs are not supported in the initial version, although we are
> documenting what is required, along with a rough sketch, for the next phase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)