jorgecarleitao opened a new pull request #8080:
URL: https://github.com/apache/arrow/pull/8080


   @alamb and @andygrove, I was able to split #8032 in two, so that each PR 
addresses a different problem. This PR is specific to the problem that we have 
been discussing in #7967. It offers a solution that covers the three main cases:
   
   * single return type, such as `sqrt -> f64`
   * finite set of return types, such as `concat` (`Utf8` and `LargeUtf8`)
   * potentially infinite set of return types, such as `array` (Array of any 
primitive or non-primitive type)
   
   I believe that this implementation is closest to option 1 that @alamb 
enumerated here. So far I have been unable to offer an implementation of 
option 3: functions such as `array` have an arbitrary return type (any valid 
type, primitive or non-primitive), so we can't write them as `array_TYPE`, 
since the number of cases is potentially large.
   
   ---------------
   
   This PR applies only to *built-in functions* with a variable return type; it 
does not cover UDFs. It addresses a limitation of our current logical planning 
that has been thoroughly discussed in #8032 and #7967: logical planning needs 
to fix a single return type when planning the use of UDFs and built-in 
functions (details below).
   
   Notation: `return type function`: a function mapping a function's argument 
types to its return type. E.g. `(Utf8) -> Utf8; (LargeUtf8) -> LargeUtf8;` is 
the signature of a typical one-argument string function.
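
   As a minimal sketch of the notation, the three cases listed at the top 
could be written as plain Rust functions. The `DataType` enum below is a 
simplified stand-in for Arrow's type enum, the function names are hypothetical, 
and using `List` as the return type of `array` is an assumption made for 
illustration, not necessarily what this PR does.

```rust
/// Simplified stand-in for Arrow's `DataType` (assumption: only the
/// variants needed for this illustration).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Float64,
    Utf8,
    LargeUtf8,
    List(Box<DataType>),
}

/// Single return type: `sqrt` always returns Float64.
fn sqrt_return_type(_arg_types: &[DataType]) -> Result<DataType, String> {
    Ok(DataType::Float64)
}

/// Finite set of return types: `concat` returns the string type it was given.
fn concat_return_type(arg_types: &[DataType]) -> Result<DataType, String> {
    let first = arg_types
        .first()
        .ok_or_else(|| "concat requires at least one argument".to_string())?;
    // All arguments must share the type of the first one.
    if arg_types.iter().any(|t| t != first) {
        return Err("concat requires arguments of a uniform type".to_string());
    }
    match first {
        DataType::Utf8 => Ok(DataType::Utf8),
        DataType::LargeUtf8 => Ok(DataType::LargeUtf8),
        other => Err(format!("concat does not support {:?}", other)),
    }
}

/// Potentially infinite set of return types: `array` wraps whatever
/// (uniform) argument type it receives, so the cases cannot be enumerated.
fn array_return_type(arg_types: &[DataType]) -> Result<DataType, String> {
    let first = arg_types
        .first()
        .ok_or_else(|| "array requires at least one argument".to_string())?;
    if arg_types.iter().any(|t| t != first) {
        return Err("array requires arguments of a uniform type".to_string());
    }
    Ok(DataType::List(Box::new(first.clone())))
}
```

   Because `array_return_type` is computed from its input rather than looked up 
in a table, it also covers nested types such as a list of lists.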
   
   The primary difference between built-ins and UDFs is that a built-in's 
return type function is always known (hard-coded), while the return type 
function of a UDF is only known by accessing the registry in which it is 
registered (it is a non-static closure).
   
   This PR is required to address an incompatibility among the following 
requirements that I gathered from discussions between @alamb, @andygrove and 
@jorgecarleitao:
   
   1. we want to have typing information during logical planning (see 
[here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk))
   2. we want to have functions whose return type depends on their input 
types. Examples include `array` (any type to any other type) and 
`concatenate` (`Utf8 -> Utf8`, `LargeUtf8 -> LargeUtf8`), and many others (see 
[here](https://github.com/apache/arrow/pull/7967#issuecomment-682832105))
   3. we would like users to plan built-in functions without accessing the 
registry (see 
[here](https://github.com/apache/arrow/pull/8032#issuecomment-679327189) and 
mailing list)
   4. a UDF's return type function needs to be retrieved from the registry 
(`ExecutionContextState`).
   5. currently, all our built-in functions are declared as UDFs and registered 
in the registry when the context is initialized.
   
   These points are incompatible because:
   
   * 1. and 2. require access to the built-in function's return type function 
during planning
   * 4. and 5. require access to the registry to know the built-in's return type
   * 3. forbids us from accessing the registry during planning
   
   This PR solves this incompatibility by leveraging the following:
   
   * built-in functions have a well-defined return type during planning, since 
they are part of the source code
   * built-in functions do not need to be in our function registry
   
   The first commit in this PR makes the existing logical node 
`Expr::ScalarFunction` exclusive to built-in functions, and moves our UDF 
planning logic to a new node named `Expr::ScalarUDF`. It also means that 
planning built-in functions no longer requires access to the registry.
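
   One way to picture this split is the sketch below. It is not the code in 
this PR: apart from the node names `Expr::ScalarFunction` and 
`Expr::ScalarUDF`, every type, field and function here (`BuiltinFunction`, 
`Registry`, `plan_udf`, `get_type`, the `Column` variant) is a hypothetical 
simplification, and `DataType` is a stand-in for Arrow's type enum. The 
built-in node derives its type from hard-coded rules, while the UDF node can 
only be built by consulting the registry.

```rust
use std::collections::HashMap;

/// Simplified stand-in for Arrow's `DataType` (assumption).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Float64,
    Utf8,
    LargeUtf8,
}

/// Hypothetical enum of built-ins; their return types live in source code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BuiltinFunction {
    Sqrt,
    Concat,
}

impl BuiltinFunction {
    /// Hard-coded return type function: no registry involved.
    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType, String> {
        match self {
            BuiltinFunction::Sqrt => Ok(DataType::Float64),
            BuiltinFunction::Concat => match arg_types.first() {
                Some(DataType::Utf8) => Ok(DataType::Utf8),
                Some(DataType::LargeUtf8) => Ok(DataType::LargeUtf8),
                _ => Err("concat expects string arguments".to_string()),
            },
        }
    }
}

/// A UDF's return type function is a closure stored at registration time.
type ReturnTypeFunction = Box<dyn Fn(&[DataType]) -> Result<DataType, String>>;

/// Hypothetical registry, standing in for `ExecutionContextState`.
#[derive(Default)]
struct Registry {
    udfs: HashMap<String, ReturnTypeFunction>,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A column of known type (simplified leaf node).
    Column { name: String, data_type: DataType },
    /// Built-in functions only.
    ScalarFunction { fun: BuiltinFunction, args: Vec<Expr> },
    /// UDFs; the return type is resolved from the registry when planned.
    ScalarUDF { name: String, args: Vec<Expr>, return_type: DataType },
}

impl Expr {
    /// Typing information during logical planning, without a registry.
    fn get_type(&self) -> Result<DataType, String> {
        match self {
            Expr::Column { data_type, .. } => Ok(data_type.clone()),
            Expr::ScalarFunction { fun, args } => {
                let arg_types = args
                    .iter()
                    .map(Expr::get_type)
                    .collect::<Result<Vec<_>, _>>()?;
                fun.return_type(arg_types.as_slice())
            }
            Expr::ScalarUDF { return_type, .. } => Ok(return_type.clone()),
        }
    }
}

/// Planning a UDF is the only step that needs the registry.
fn plan_udf(registry: &Registry, name: &str, args: Vec<Expr>) -> Result<Expr, String> {
    let return_type_fn = registry
        .udfs
        .get(name)
        .ok_or_else(|| format!("UDF {} is not registered", name))?;
    let arg_types = args
        .iter()
        .map(Expr::get_type)
        .collect::<Result<Vec<_>, _>>()?;
    let return_type = return_type_fn(arg_types.as_slice())?;
    Ok(Expr::ScalarUDF { name: name.to_string(), args, return_type })
}
```

   In this sketch a user can construct `Expr::ScalarFunction` directly and ask 
for its type, whereas `Expr::ScalarUDF` is produced by `plan_udf`, which fails 
if the name is not registered.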
   
   The second commit in this PR introduces the necessary functionality for 
built-in functions to support all types of complex signatures. Examples of 
usage of this functionality are in the following PRs:
   
   1. add support for math functions that accept f32: 
https://github.com/jorgecarleitao/arrow/pull/4/files
   2. add `concat`, of an arbitrary number of arguments of type utf8: 
https://github.com/jorgecarleitao/arrow/pull/5/files
   3. add `array` function, supporting an arbitrary number of arguments with 
uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files
   

