[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497082#comment-17497082 ]
Vibhatha Lakmal Abeykoon edited comment on ARROW-15765 at 2/24/22, 1:46 AM:
----------------------------------------------------------------------------

As [~westonpace] explained, we are working on a UDF PoC. At the moment, a function is registered as follows:

{code:python}
import pyarrow as pa
from pyarrow import compute as pc
from pyarrow.compute import call_function, register_pyfunction
from pyarrow.compute import Arity, InputType

# Documentation metadata attached to the registered function
func_doc = {}
func_doc["summary"] = "summary"
func_doc["description"] = "desc"
func_doc["arg_names"] = ["number"]
func_doc["options_class"] = "SomeOptions"
func_doc["options_required"] = False

# Arity, input types and output type are all stated explicitly
arity = Arity.unary()
func_name = "python_udf"
in_types = [InputType.array(pa.int64())]
out_type = pa.int64()

def simple_function(arrow_array):
    return call_function("add", [arrow_array, 1])

callback = simple_function
register_pyfunction(func_name, arity, func_doc, in_types, out_type, callback)

# Look up and call the registered UDF by name
func1 = pc.get_function(func_name)
a1 = pc.call_function(func_name, [pa.array([20])])
{code}

When registering the function, the user has to state the arity and the input and output types of the UDF explicitly. We can ease this by taking all of that information from the type hints. This is purely a usability improvement. For instance, the user would write the function like this:

{code:python}
def simple_function(arrow_array: pa.Int32Array) -> pa.Int32Array:
    return call_function("add", [arrow_array, 1])
{code}

When registering, the user would only write:

{code:python}
register_pyfunction(func_name, simple_function)
{code}

We will extract the docs from comments or let the user pass them (optional), and derive the arity and the input and output types by inspecting the function signature. Spark already provides this kind of support. If we go this route, all the information comes from the UDF signature. At the moment I am using the inspect API to extract this information.
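As a rough illustration of the inspect-based step, here is a minimal sketch of how the arity and the annotated input and output types could be read from a UDF signature. The helper name `describe_udf` and the stand-in class `Int32Array` are hypothetical and exist only so the sketch runs without pyarrow; they are not part of the PoC or of any pyarrow API.

```python
import inspect


class Int32Array:
    """Hypothetical stand-in for pa.Int32Array (keeps the sketch pyarrow-free)."""


def simple_function(arrow_array: Int32Array) -> Int32Array:
    # The body is irrelevant here; only the signature is inspected.
    return arrow_array


def describe_udf(func):
    """Return (arity, input type hints, output type hint) for func.

    Hypothetical helper: raises TypeError when a hint is missing, because
    nothing can be inferred for an unannotated parameter or return value.
    """
    sig = inspect.signature(func)
    params = list(sig.parameters.values())
    for param in params:
        if param.annotation is inspect.Parameter.empty:
            raise TypeError(f"parameter {param.name!r} has no type hint")
    if sig.return_annotation is inspect.Signature.empty:
        raise TypeError("function has no return type hint")
    in_hints = [param.annotation for param in params]
    return len(params), in_hints, sig.return_annotation


arity, in_hints, out_hint = describe_udf(simple_function)
print(arity, [hint.__name__ for hint in in_hints], out_hint.__name__)
# prints: 1 ['Int32Array'] Int32Array
```

This only recovers the annotation classes themselves. Turning a class such as `pa.Int32Array` into the data type `pa.int32()` is a separate step that this sketch does not attempt.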
The next step is to determine from the type hint `pa.Int32Array` that this is a `pa.Array` of type `pa.int32()`. This is the objective of this exercise.

[~apitrou] does this clear things up? Do you need more information about why we need this feature?

> [Python] Extracting Type information from Python Objects
> --------------------------------------------------------
>
>                 Key: ARROW-15765
>                 URL: https://issues.apache.org/jira/browse/ARROW-15765
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> When creating user-defined functions, or in similar exercises where we want to
> extract the Arrow data types from the type hints, the existing Python API
> has some limitations.
> An example case is as follows:
> {code:python}
> def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
>     return pc.call_function("add", [array1, array2])
> {code}
> We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`.
> At the moment there is no straightforward way to do this.
> So the idea is to expose this feature to Python.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)