[ 
https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497082#comment-17497082
 ] 

Vibhatha Lakmal Abeykoon edited comment on ARROW-15765 at 2/24/22, 12:33 AM:
-----------------------------------------------------------------------------

As [~westonpace] explained, we are working on a UDF PoC. At the moment, a 
function is registered as follows: 
{code:java}
import pyarrow as pa 
from pyarrow import compute as pc 
from pyarrow.compute import call_function, register_pyfunction
from pyarrow.compute import Arity, InputType 

func_doc = {} 
func_doc["summary"] = "summary" 
func_doc["description"] = "desc" 
func_doc["arg_names"] = ["number"] 
func_doc["options_class"] = "SomeOptions" 
func_doc["options_required"] = False 
arity = Arity.unary() 
func_name = "python_udf" 
in_types = [InputType.array(pa.int64())]  # assumed int64, matching the array passed below
out_type = pa.int64() 

def py_function(arrow_array): 
    return call_function("add", [arrow_array, 1])  

callback = py_function

register_pyfunction(func_name, arity, func_doc, in_types, out_type, callback) 

func1 = pc.get_function(func_name)

a1 = pc.call_function(func_name, [pa.array([20])]){code}
 
When registering the function, the user has to state the arity and the input 
and output types of the UDF explicitly. We can ease this by taking all of that 
information from the type hints themselves. This is purely a usability 
improvement.


For instance, the user would write the function like this:

 
{code:java}
def py_function(arrow_array: pa.Int32Array) -> pa.Int32Array: 
    return call_function("add", [arrow_array, 1])  {code}
When registering, the user would only write:
{code:java}
register_pyfunction(func_name, callback)  {code}
We will extract the docs from comments or let the user pass them (optional), 
and obtain the arity and the input and output types by inspecting the function 
signature. 

Spark already provides this kind of support. If we go this route, we will 
extract all the information from the UDF signature. At the moment I am using 
the inspect API to extract that information. 
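
A minimal sketch of this kind of signature inspection, using only `inspect` and `typing.get_type_hints` from the standard library. The helper `describe_udf` and the `Int32Array` stand-in class are hypothetical names chosen for illustration. The stand-in replaces `pa.Int32Array` so the snippet runs without pyarrow. Reading the summary from the docstring is also an assumption, not a description of what the PoC does.

```python
import inspect
from typing import get_type_hints


class Int32Array:
    """Hypothetical stand-in for pa.Int32Array (keeps the sketch pyarrow-free)."""


def describe_udf(func):
    """Collect arity, argument names, type hints and docstring of a UDF."""
    signature = inspect.signature(func)
    hints = get_type_hints(func)
    arg_names = list(signature.parameters)
    return {
        "arity": len(arg_names),
        "arg_names": arg_names,
        # Hints are returned as the annotated classes; None if not annotated.
        "in_types": [hints.get(name) for name in arg_names],
        "out_type": hints.get("return"),
        # Assumption: the summary comes from the docstring.
        "summary": inspect.getdoc(func) or "",
    }


def py_function(arrow_array: Int32Array) -> Int32Array:
    """Add one to every element."""
    return arrow_array  # body is irrelevant here; only the signature is inspected


info = describe_udf(py_function)
```

With the real type hints, `info["in_types"]` would hold `pa.Int32Array` itself, which still has to be translated into an Arrow data type before registration.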
 
The next step is to derive from the type hint `pa.Int32Array` that the 
argument is a `pa.Array` of type `pa.int32()`. This is the objective of this 
exercise.
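
One possible workaround until such a feature exists is a hand-written lookup table, sketched below under stated assumptions. The table `FACTORY_BY_ARRAY_CLASS` and the helper `data_type_from_hint` are hypothetical. The pyarrow names they refer to (`pa.Int32Array` and `pa.int32()`, `pa.FloatArray` and `pa.float32()`, and so on) are real, but the snippet uses stand-ins so that it runs without pyarrow installed. In real use the `namespace` argument would be the imported `pyarrow` module.

```python
from types import SimpleNamespace

# Hypothetical table: array class name -> name of the pyarrow factory function
# that builds the matching DataType (e.g. pa.Int32Array -> pa.int32()).
FACTORY_BY_ARRAY_CLASS = {
    "Int32Array": "int32",
    "Int64Array": "int64",
    "FloatArray": "float32",
    "DoubleArray": "float64",
    "StringArray": "string",
    "BooleanArray": "bool_",
}


def data_type_from_hint(hint, namespace):
    """Return the data type matching an array type hint.

    `namespace` supplies the factory functions; with pyarrow installed
    this would be the pyarrow module itself.
    """
    try:
        factory_name = FACTORY_BY_ARRAY_CLASS[hint.__name__]
    except KeyError:
        raise TypeError(f"unsupported type hint: {hint!r}") from None
    return getattr(namespace, factory_name)()


# Stand-ins so the sketch runs without pyarrow.
class Int32Array:
    pass


class UnknownArray:
    pass


fake_pa = SimpleNamespace(int32=lambda: "int32")

resolved = data_type_from_hint(Int32Array, fake_pa)
```

A table like this has to be maintained by hand for every array class, which is the limitation the issue asks to remove.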
 
[~apitrou] does this clear things up? Do you need more information about why 
we need this feature?  

 



> [Python] Extracting Type information from Python Objects
> --------------------------------------------------------
>
>                 Key: ARROW-15765
>                 URL: https://issues.apache.org/jira/browse/ARROW-15765
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> When creating user-defined functions, or in similar exercises where we want to 
> extract the Arrow data types from type hints, the existing Python API 
> has some limitations. 
> An example case is as follows:
> {code:java}
> def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
>     return pc.call_function("add", [array1, array2])
>   {code}
> We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`. 
> At the moment there is no straightforward way to do this. 
> So the idea is to expose this feature to Python. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
