jorgecarleitao opened a new pull request #8080: URL: https://github.com/apache/arrow/pull/8080
@alamb and @andygrove , I was able to split #8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in #7967. It offers a solution that covers the three main cases: * single return type, such as `sqrt -> f64` * finite set of return types, such as `concat` (utf8 and largeUTF8) * potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type) I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large. --------------- This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in #8032 and #7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below). Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function. The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure). This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao: 1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk)) 2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](https://github.com/apache/arrow/pull/7967#issuecomment-682832105)) 3. we would like users to plan built-in functions without accessing the registry (see [here](https://github.com/apache/arrow/pull/8032#issuecomment-679327189) and mailing list) 4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`). 5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized. These points are incompatible because: * 1. and 2. requires access to built-in function's return type function during planning * 4. and 5. requires access the registry to know the built-in's return type * 3. forbids us from accessing the registry during planning This PR solves this incompatibility by leveraging the following: * builtin functions have a well declared return type during planning, since they are part of the source code * builtin functions do not need to be in our function's registry The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry. The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs: 1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files 2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files 3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
