Re: "Death of Schema-on-Read"

Ted Dunning Sun, 08 Apr 2018 10:58:43 -0700

I have been thinking about this email and I still don't understand some of
the comments.

On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha <amansi...@apache.org> wrote:

> On the subject of CAST pushdown to Scans, there are potential drawbacks
> ...
>
>    - In general, the planner will see a Scan-Project where the Project has
>    CAST functions.  But the Project can have arbitrary expressions,  e.g
>    CAST(a as INT) * 5  or a combination of 2 CAST functions or non-CAST
>    functions etc.   It would be quite expensive to examine each expression
>    (there could be hundreds) to determine whether it is eligible to be
>    pushed to the Scan.
>

How is this different from filter and project pushdown? There could be
hundreds of those too, and it could be just as difficult for Calcite to find
appropriate pushdowns. But I have never heard of that being a problem.

The reasons that I think that cast pushdown would be much easier include:

- for a first approximation, no type inference would be needed.

- because of the first point, only the roots of arithmetic expressions
would need to be examined. If they have casts, then pushdown should be
tried. If not, don't do it.

- cast pushdown is always a win if supported so there is no large increase
in the complexity of the cost-based optimization search space.

- the traversal of all expressions is already required and already done in
order to find the set of columns that are being extracted. As such, cast
pushdown can be done in the same motions as project pushdown.
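
To make the last two points concrete, here is a rough sketch of how the
cast check could ride along with the column-collection walk. It is not
Drill or Calcite code: the Column, Literal, Cast and Call classes and the
plan_pushdown function are hypothetical stand-ins for the planner's
expression nodes. It also assumes one reading of "roots", namely that only
a CAST applied directly to a column is a candidate, and it adds a
conservative rule of my own: a column is skipped if it is also used
without that cast or with a different one.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple, Union


# Hypothetical expression nodes; a real planner has its own classes.
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Cast:
    operand: "Expr"
    target_type: str


@dataclass(frozen=True)
class Call:
    op: str
    args: Tuple["Expr", ...]


Expr = Union[Column, Literal, Cast, Call]


def _walk(expr: Expr, columns: Set[str], casts: Dict[str, Set[str]],
          bare: Set[str]) -> None:
    """One traversal that records referenced columns and direct casts."""
    if isinstance(expr, Cast) and isinstance(expr.operand, Column):
        # CAST sits directly on a column: remember it as a candidate.
        columns.add(expr.operand.name)
        casts.setdefault(expr.operand.name, set()).add(expr.target_type)
    elif isinstance(expr, Column):
        # Column used without a direct cast.
        columns.add(expr.name)
        bare.add(expr.name)
    elif isinstance(expr, Cast):
        _walk(expr.operand, columns, casts, bare)
    elif isinstance(expr, Call):
        for arg in expr.args:
            _walk(arg, columns, casts, bare)
    # Literals reference no columns, so there is nothing to record.


def plan_pushdown(project_exprs: List[Expr]) -> Tuple[Set[str], Dict[str, str]]:
    """Return (columns to project, casts that look safe to push to the scan)."""
    columns: Set[str] = set()
    casts: Dict[str, Set[str]] = {}
    bare: Set[str] = set()
    for expr in project_exprs:
        _walk(expr, columns, casts, bare)
    # Conservative rule (an assumption of this sketch): push a cast only if
    # the column is always cast, and always to the same type.
    pushable = {
        name: next(iter(types))
        for name, types in casts.items()
        if len(types) == 1 and name not in bare
    }
    return columns, pushable
```

For CAST(a as INT) * 5 the walk reports column a with a pushable INT cast,
and the multiplication stays in the Project. No type inference happens
anywhere in the walk.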

>    - Expressing Nullability is not possible with CAST.   If a column should
>    be tagged as  (not)nullable, CAST syntax does not allow that.
>

This may be true. But nullability crosses the cast cleanly. Thus, filter
expressions like [x is not NULL] can be used to constrain nullability, and
there is no requirement that the two constraints (the cast and the
nullability) be near each other syntactically. Furthermore, if the query
does not specify nullability, then the scanner is free to decide it.
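
As a rough illustration of the two constraints being gathered
independently, the sketch below merges cast targets (taken from the
Project) with columns named in IS NOT NULL predicates (taken from the
Filter) into one per-column hint for the scanner. ScanHint and merge_hints
are hypothetical names, not Drill APIs, and the sketch assumes the
IS NOT NULL filter is either pushed down with the hint or still applied
above the scan.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScanHint:
    # Target type from a pushed-down CAST, or None if the query gave none.
    sql_type: Optional[str] = None
    # False if the query says IS NOT NULL; None means the query is silent
    # and the scanner is free to choose.
    nullable: Optional[bool] = None


def merge_hints(casts: Dict[str, str],
                not_null_columns: List[str]) -> Dict[str, ScanHint]:
    """Combine cast and nullability constraints found in separate places."""
    hints: Dict[str, ScanHint] = {}
    for column, sql_type in casts.items():
        hints.setdefault(column, ScanHint()).sql_type = sql_type
    for column in not_null_columns:
        hints.setdefault(column, ScanHint()).nullable = False
    return hints
```

A column that appears in only one of the two inputs still gets a hint; the
missing half is simply left as None.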

>    - Drill currently supports CASTing to a SQL data type, but not to the
>    complex types such as arrays and maps.  We would have to add support for
>    that from a language perspective as well as the run-time.  This would be
>    non-trivial effort.
>

Well, there is a trivial subset of this effort in that casting a.b.c is
easy to express. Anything more complex is hard for current scanners to use
anyway.

So deferring most of the work on complex types is a fine solution. It isn't
like SQL has nice syntax for casting of anything.
