[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500828#comment-17500828 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

Sure. Pasting code into the code block was a pain; it didn't preserve the newlines. I will follow this method next time. Thanks [~apitrou] :)

> [Python] Extracting Type information from Python Objects
> ---------------------------------------------------------
>
>                 Key: ARROW-15765
>                 URL: https://issues.apache.org/jira/browse/ARROW-15765
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> When creating user-defined functions or similar exercises where we want to
> extract the Arrow data types from the type hints, the existing Python API
> has some limitations.
> An example case is as follows:
> {code:python}
> def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
>     return pc.call_function("add", [array1, array2])
> {code}
> We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`.
> At the moment there is no straightforward way to do this.
> So the idea is to expose this feature to Python.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500821#comment-17500821 ]

Antoine Pitrou commented on ARROW-15765:
----------------------------------------

[~vibhatha] You can post your code on e.g. https://gist.github.com/ instead of editing the same comment several times :-) This will produce fewer notifications for people who watch this issue.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500816#comment-17500816 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

[~apitrou] [~jorisvandenbossche] [~westonpace]

This is a very naive attempt at extracting types from a class that extends {{typing.Generic}}. The following example shows how the types can be extracted from such a class.

{code:python}
import inspect
from typing import Any, Generic, List, TypeVar

T = TypeVar('T')


class Array(object):
    def __init__(self, data: List[Any]):
        self._data = data

    @property
    def data(self):
        return self._data

    def __repr__(self):
        str1 = ""
        for datum in self.data:
            str1 += str(datum) + ", "
        return str1


class ArrayLike(Array, Generic[T]):
    def __init__(self, data: List[Any]):
        # Delegate to Array.__init__ so that self._data is set.
        super().__init__(data)


class DataType(object):
    def __init__(self):
        pass


class Int32Type(DataType):
    def __init__(self):
        self._type_id = "int32"

    @property
    def id(self):
        return self._type_id


# test
data: List[int] = [10, 20, 30]
b: ArrayLike[Int32Type] = ArrayLike(data)
print(b)


# define a function with the generics
def sample_udf(array: ArrayLike[Int32Type]) -> ArrayLike[Int32Type]:
    return array


sig = inspect.signature(sample_udf)
input_types = sig.parameters.values()
annotations = [val.annotation for val in input_types]
annotation = annotations[0]

# ArrayLike[Int32Type] carries its parts in __origin__ and __args__.
inner_type = annotation.__args__[0]
outer_type = annotation.__origin__

assert inner_type == Int32Type
assert outer_type == ArrayLike
{code}
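For readers following the experiment above: the same information can also be read through the public helpers in the standard `typing` module (`get_type_hints`, `get_origin`, `get_args`, available from Python 3.8) rather than the `__origin__` and `__args__` attributes. The following is only a minimal sketch under that assumption; `ArrayLike`, `Int32Type` and `describe_annotations` are hypothetical stand-ins and are not part of pyarrow.

```python
from typing import Generic, TypeVar, get_args, get_origin, get_type_hints

T = TypeVar("T")


class Int32Type:
    """Hypothetical stand-in for an Arrow data type."""


class ArrayLike(Generic[T]):
    """Hypothetical stand-in for a subscriptable array annotation."""


def sample_udf(array: ArrayLike[Int32Type]) -> ArrayLike[Int32Type]:
    return array


def describe_annotations(func):
    """Map each annotated name (including 'return') to (container, value type)."""
    described = {}
    for name, hint in get_type_hints(func).items():
        args = get_args(hint)
        # Unsubscripted hints have no origin and no args; record None for both.
        described[name] = (get_origin(hint), args[0] if args else None)
    return described


described = describe_annotations(sample_udf)
print(described)
```

The helper returns the return annotation under the key `"return"`, so input and output types come out of a single pass over the signature.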
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498231#comment-17498231 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

Sure, I will give it a try and post what I find out.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498093#comment-17498093 ]

Antoine Pitrou commented on ARROW-15765:
----------------------------------------

Someone could experiment with the typing generic approach indeed and see if it works.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497856#comment-17497856 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

[~jorisvandenbossche] The new typing generics look interesting. Is it practical to adopt them now, given the Python versions we currently support? And would it be wise to use them in the UDF integration instead of doing what I am suggesting in this Jira?

[~apitrou] The Numba JIT approach is nice, and it looks like an advanced feature for UDFs someday. I will keep this in mind.

As [~westonpace] suggested, one of our main motivations is to support the user and provide user-friendly options when we write TPCx-BB queries and similar applications. If the suggestion from [~jorisvandenbossche] to use advanced typing is feasible and solves our underlying problem, would it be wise to use that instead of making this change?
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497838#comment-17497838 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

I want to clarify a point, in case I have not clearly stated earlier in the thread why the typing information is needed. If I am not mistaken, the main issue here is not what the UDF does internally with the data. We just need to register it in the function registry without asking the user for the input and output types explicitly. It is a nice-to-have feature that could improve presentability and usability with newer Python versions.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497733#comment-17497733 ]

Antoine Pitrou commented on ARROW-15765:
----------------------------------------

> Another dimension to consider is whether a UDF would care if an array were
> dictionary encoded or not? We probably want a way to express that too.

If you want a UDF to have different implementations based on the parameter types, you can't do that using type annotations. What you could do is use a two-step approach like in Numba's {{generated_jit}}: https://numba.pydata.org/numba-doc/dev/user/generated-jit.html
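To make the two-step idea above concrete, here is a plain-Python analogy. It is not Numba's API and not anything that exists in pyarrow; `generated_udf` and `describe` are hypothetical names, and type names are passed as strings purely to keep the sketch short. In step one the decorated function receives the argument types and returns an implementation; in step two that implementation runs on the actual values.

```python
from typing import Callable, Dict, Tuple


def generated_udf(generator: Callable[..., Callable]) -> Callable:
    """Hypothetical decorator: `generator` maps argument type names to an implementation."""
    cache: Dict[Tuple[str, ...], Callable] = {}

    def dispatcher(*args):
        key = tuple(type(arg).__name__ for arg in args)
        if key not in cache:
            # Step 1: ask the generator for an implementation for these types.
            cache[key] = generator(*key)
        # Step 2: run the chosen implementation on the actual values.
        return cache[key](*args)

    return dispatcher


@generated_udf
def describe(type_name):
    # Called with a type name, not a value; returns the function to run.
    if type_name == "str":
        return lambda value: f"text of length {len(value)}"
    return lambda value: f"number {value}"


print(describe("abc"))
print(describe(7))
```

Because the implementation is chosen from the types at call time, this style can express things such as "dictionary encoded or not" that a static annotation on a single function body cannot.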
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497720#comment-17497720 ]

Weston Pace commented on ARROW-15765:
-------------------------------------

For a concrete use case, consider a user who wants to integrate some kind of Arrow-native geojson library. They would have extension types for geojson data types and custom functions that can do things like normalize coordinates to a different reference or format coordinates in a particular way. In this case the UDFs would take in extension arrays for custom data types, which I think would have their own typing-based considerations.

Another possible example, which comes from the TPCx-BB benchmark, is sentiment analysis on strings (is this user comment positive or negative?). If we had an Arrow-native natural language processing library, we could hook in an extract_sentiment operation which takes in strings and returns ? (maybe doubles?).

As far as I know the type information itself is only used for validation and casting purposes.

Another dimension to consider is whether a UDF would care if an array were dictionary encoded or not? We probably want a way to express that too.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497696#comment-17497696 ]

Joris Van den Bossche commented on ARROW-15765:
-----------------------------------------------

In the context of a full query plan, I think it is important to know the output types given the input types, to be able to resolve the types in your full query?

I am wondering if we could make use of some of the newer typing features, which would allow doing something like

{code:python}
def simple_function(arrow_array: pa.Array[pa.int32()]) -> pa.Array[pa.int32()]:
    return call_function("add", [arrow_array, 1])
{code}

I think such an object with which you can use [] is called a "generic" in typing terminology (https://docs.python.org/3.11/library/typing.html#generics), and it would make it easier to get the type of the values in the container. On the other hand, it creates a bit of a separate typing syntax ({{pa.Array}} is not actually itself a useful class; it's always subclasses you get in practice).
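One way the subscript syntax above could be made to work at runtime is `__class_getitem__` combined with `types.GenericAlias` (Python 3.9+), which accepts arbitrary objects, including data type instances, as parameters. This is only a sketch of the mechanism: `Array`, `DataType` and `int32` below are hypothetical stand-ins, and pyarrow's `pa.Array` is not assumed to support subscripting.

```python
import inspect
import types
from dataclasses import dataclass
from typing import get_args, get_origin


@dataclass(frozen=True)
class DataType:
    """Hypothetical stand-in for a data type instance such as pa.int32()."""
    name: str


def int32() -> DataType:
    """Hypothetical stand-in for pa.int32()."""
    return DataType("int32")


class Array:
    """Hypothetical stand-in for a subscriptable array class."""

    def __class_getitem__(cls, value_type):
        # Record the value type in a generic alias instead of creating a subclass.
        return types.GenericAlias(cls, (value_type,))


def simple_function(arrow_array: Array[int32()]) -> Array[int32()]:
    return arrow_array


annotation = inspect.signature(simple_function).parameters["arrow_array"].annotation
container = get_origin(annotation)
(value_type,) = get_args(annotation)
print(container, value_type)
```

Since the parameter is an instance rather than a class, parametrized types (for example a list of int32, or a timestamp with a unit) could be carried the same way.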
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497567#comment-17497567 ]

Antoine Pitrou commented on ARROW-15765:
----------------------------------------

Of course, another question is: do you need to know the types at all? Without some concrete use cases it's hard to tell.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497566#comment-17497566 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

Should we design this feature, or, as [~jorisvandenbossche] and [~westonpace] suggested, use the inverse option to get the type from the Array type and not expose this to the user? At the moment this issue focuses mainly on the UDF usability piece rather than on improving core functionality of the Arrow Python API. It could be useful, but beyond the scope of this use case it is not very clear to me how useful it would be to the user. What do you think?
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497545#comment-17497545 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

[~apitrou] I see your point. There are pitfalls and limitations to this approach. This is mainly a usability piece. I also have a doubt: is it worth investing time in it if its applications turn out to be niche? But it feels like a nice-to-have feature to at least support some widely used UDF function signatures.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497523#comment-17497523 ]

Antoine Pitrou commented on ARROW-15765:
----------------------------------------

Note that this approach limits the expressivity of the type annotations. For example, if you write:

{code:python}
def compute_func(a: pa.ListArray) -> pa.ListArray:
    ...
{code}

... you are not able to tell what the value type of the list type is. Similarly with parametrized types such as timestamps or decimals.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497094#comment-17497094 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

There can be limitations in places where a user just wants to use a lambda in a group-by where we expose UDFs. That needs to be handled internally for group-by ops.
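One plausible reading of the lambda limitation, assuming it refers to type hints: Python's lambda syntax has no place for annotations, so nothing can be extracted from the signature and the types would have to come from elsewhere. A small sketch using only the standard library (`add_one` is an illustrative name):

```python
import inspect

# A lambda of the kind a user might pass to a group-by; it cannot carry type hints.
add_one = lambda values: [v + 1 for v in values]

param = inspect.signature(add_one).parameters["values"]
has_hint = param.annotation is not inspect.Parameter.empty
print(has_hint, add_one.__annotations__)
```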
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497082#comment-17497082 ]

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--------------------------------------------------

As [~westonpace] explained, we are working on a UDF PoC. At the moment, a function can be registered as follows:

{code:python}
import pyarrow as pa
from pyarrow import compute as pc
from pyarrow.compute import call_function, register_pyfunction
from pyarrow.compute import Arity, InputType

func_doc = {}
func_doc["summary"] = "summary"
func_doc["description"] = "desc"
func_doc["arg_names"] = ["number"]
func_doc["options_class"] = "SomeOptions"
func_doc["options_required"] = False

arity = Arity.unary()
func_name = "python_udf"
in_types = [InputType.array(pa.x())]
out_type = pa.int64()


def py_function(arrow_array):
    p_new_array = call_function("add", [arrow_array, 1])
    return p_new_array


# Register the function defined above as the UDF callback.
callback = py_function
register_pyfunction(func_name, arity, func_doc, in_types, out_type, callback)

func1 = pc.get_function(func_name)
a1 = pc.call_function(func_name, [pa.array([20])])
{code}

When registering the function, the user has to state explicitly the arity and the input and output types of the UDF. We can ease this by taking all of that information from the type hints themselves. This is only to improve usability; Spark already provides that support.

If we go this route, we will extract all the information from the UDF signature. At the moment I am using the inspect API to extract that information. The next step is to derive from a type hint such as `pa.Int32Array` that this is a `pa.Array` of type `pa.int32()`. This is the objective of this exercise.

[~apitrou] does that clear things up? Do you need more information on why we need this feature?
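As a rough illustration of the signature-driven registration described above, the arity and the annotated classes can be read with the standard `inspect` module. This is a sketch only: `Int64Array` is a local stand-in so the example runs without pyarrow, `signature_info` is a hypothetical helper, and mapping an array class to an Arrow data type is deliberately left out, since that is the open question of this issue.

```python
import inspect


class Int64Array:
    """Hypothetical stand-in for pa.Int64Array."""


def function(array1: Int64Array, array2: Int64Array) -> Int64Array:
    return array1


def signature_info(func):
    """Return (arity, input annotations, output annotation) from the type hints."""
    sig = inspect.signature(func)
    in_types = [param.annotation for param in sig.parameters.values()]
    missing_input = any(a is inspect.Parameter.empty for a in in_types)
    missing_output = sig.return_annotation is inspect.Signature.empty
    if missing_input or missing_output:
        # Without complete hints the user would still have to pass types explicitly.
        raise TypeError(f"{func.__name__} must annotate every argument and its return value")
    return len(in_types), in_types, sig.return_annotation


arity, in_types, out_type = signature_info(function)
print(arity, in_types, out_type)
```

Rejecting partially annotated functions up front keeps the failure at registration time rather than at execution time.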
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497011#comment-17497011 ]

Weston Pace commented on ARROW-15765:
-------------------------------------

This is indeed about user-defined functions. Vibhatha has been working on an implementation. You can see the current progress here: https://github.com/apache/arrow/compare/master...vibhatha:test-udf-vibhatha

I suspect the need has to do with registering a function like:

{code}
def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
    return pc.call_function("add", [array1, array2])
{code}

with our function registry (which will want to know the arity and types of each argument). Vibhatha can probably give a more complete answer.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496945#comment-17496945 ]

Antoine Pitrou commented on ARROW-15765:
----------------------------------------

Let's step back a bit. Is this about user-defined functions? If so, perhaps someone should actually start working on these and explain what the actual need is?
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496942#comment-17496942 ]

Weston Pace commented on ARROW-15765:
-------------------------------------

In that case extending classes is not going to help. There isn't really anything that makes sense to extend from C++, as the object {{Int32Array}} has no runtime equivalent in C++ (i.e. there is no such thing as reflection in C++). I think Joris' suggestion from zulip is simplest. Let's invert {{_array_classes}} and {{_scalar_classes}}.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496881#comment-17496881 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: [~westonpace] Exactly, a reflection task. We need to extract the types before the data gets in here.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496879#comment-17496879 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: Getting the type from the data, that's totally correct [~westonpace], I agree. The thing is, we need the type information extracted from the function signature, not from the data. So can we use this approach? Did I get it wrong?
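The difference between reading types from the signature and reading them from data can be shown with the standard library alone. A minimal sketch, assuming a stand-in `Int64Array` class in place of the pyarrow one and a hypothetical function `add` that is never called:

```python
import typing


class Int64Array:
    """Stand-in for an Arrow array class; not the real pyarrow class."""


def add(array1: Int64Array, array2: Int64Array) -> Int64Array:
    # The body is irrelevant here: the function is never called.
    raise NotImplementedError


# get_type_hints reads the annotations from the function object,
# so no array data has to exist at this point.
hints = typing.get_type_hints(add)

for name, annotation in hints.items():
    print(name, annotation.__name__)
```

What comes back is the annotated class for each parameter and for the return value, which is exactly why a class-to-data-type lookup is still needed afterwards.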
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496877#comment-17496877 ] Weston Pace commented on ARROW-15765: Ah, from Zulip I see the goal is to go from Int32Array to DataType(int32) without instantiating an instance. So is this more of a metaprogramming / reflection task?
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496876#comment-17496876 ] Weston Pace commented on ARROW-15765:
Perhaps I am missing some key piece but don't all arrays (Int32Array, Int64Array, etc.) extend Array which already has a type field?
{noformat}
cdef class Array(_PandasConvertible):
    cdef:
        shared_ptr[CArray] sp_array
        CArray* ap

    cdef readonly:
        DataType type
{noformat}
So couldn't you do `array1.type`?
{noformat}
>>> x = pa.array([1, 2, 3])
>>> x.type
DataType(int64)
{noformat}
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496806#comment-17496806 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: Yes, the id would be kind of hard for the user to grasp. One intention is actually to expose this to the user, so that it could help in their development activities for advanced use cases similar to UDFs. I cannot say exactly what such cases are, but it could be useful.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496801#comment-17496801 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: You're right, we can use that. Just tried to show the expected outcome more clearly.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496797#comment-17496797 ] Alessandro Molina commented on ARROW-15765: [~vibhatha] I'm not sure why you want `CDataType` itself. `Array.type` already provides the `DataType`, which is the Python equivalent of `CDataType`.
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496792#comment-17496792 ] Alessandro Molina commented on ARROW-15765: Probably exposing `name` in `CDataType` would be a starting point. For nested types like lists, structs, etc., we will have to dive into the sub_fields, but for basic types the `name` might already make it easy to identify the type. (We could already use `id`, but I guess that's less immediately understandable.)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496789#comment-17496789 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: I am referring to the classes here: https://github.com/apache/arrow/blob/0363df1b44274707228af7274102bbe50cdb68be/python/pyarrow/lib.pxd#L317
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496784#comment-17496784 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765:
Ah yes, you're right. I mixed it up with the C++ naming. To expose this one, I guess we have to extend the classes
{code:java}
cdef class Int32Array(IntegerArray):
    pass
{code}
to something like
{code:java}
cdef class Int32Array(IntegerArray):
    cdef shared_ptr[CDataType] get_type()
{code}
and expose it as a property to Python? Or is there a better approach for this?
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496760#comment-17496760 ] Antoine Pitrou commented on ARROW-15765:
Well, {{pa.Int32Type}} doesn't exist:
{code:python}
>>> ty = pa.int32()
>>> ty.__class__
pyarrow.lib.DataType
{code}
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496755#comment-17496755 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: [~apitrou] [~westonpace] [~jorisvandenbossche] [~amol-] Your thoughts on this?