jorgecarleitao commented on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-678739758


   This is currently failing for an interesting reason, and I need some help in 
decision making.
   
   TL;DR options:
   
   1. wait for #8024 and stop using the type coercer, since it breaks schema 
invariance under optimization
   2. short-term copy-paste some of the type coercion on the optimizer to the 
SQL planner
   
   I am more inclined to wait for the typing to be debt free before merging 
this, but I am willing to hack something up here to make have this in, if there 
is any sense of urgency on it.
   
   -------
   
   To recap, the main goal of this PR is to allow scalar UDFs to accept 
multiple input types during planning (still returning the same output type). 
This generally speeds up calculations as smaller types are faster than larger 
types. E.g. numerically, `sqrt(float32) -> float64` is faster than 
`sqrt(CAST(float32 AS float64)) -> float64`.
   
   Due to how the `get_supertype` is implemented and how the sql planner uses 
it, the statement 
   
   ```SELECT sqrt(c11) FROM ...``` 
   
   is (in master) converted to a logical plan already with a coercion: the 
column is named `"sqrt(CAST(c11 AS float64))"` when `c11` is not a float64.
   
   This PR removed the coercion from the sql planner as I was trying to not 
have to copy-paste code the new logic from the type coercer into the SQL 
planner. However, because our type coercer currently breaks schema invariance 
under optimization, this now fails the test of that invariant: the plan's 
schema has a field named `"sqrt(c11)"`, but the type coercer changes it to 
`"sqrt(CAST(c11 AS float64))"`.
   
   Note that this is independent of the changes that this PR proposes - even 
with a single allowed datatype, we still need to perform numerical coercion. 
What changed here was that the type coercion rules for multiple data types are 
a bit more complex, and I was trying to not have to copy-paste code from the 
type coercer into the SQL planner (in other words, there is some coupling 
between the two via the `get_supertype`). IMO the root cause is the type 
coercer optimizer, that breaks schema invariance because it operates on the 
logical plane.
   
   One option is to stop using the type coercer optimizer at the logical level, 
and instead apply cast operations at the physical level, as #8024 moves towards 
that.
   
   Another option is to copy-paste [the type coercion that this PR proposes on 
type coercer] to the SQL planner, thus guaranteeing that the coercion is 
consistent between the planner and the type coercer.
   
   @andygrove and @alamb , what do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to