Re: "Death of Schema-on-Read"

Paul Rogers Fri, 06 Apr 2018 19:39:07 -0700

Hi Aman,

As we get into details, I suggested to Hanu that we move the discussion into a 
JIRA ticket.


 >On the subject of CAST pushdown to Scans, there are potential drawbacks

 >  - In general, the planner will see a Scan-Project where the Project has  
CAST functions.  But the Project can have arbitrary expressions,  e.g  CAST(a 
as INT) * 5

Suggestion: push the CAST(a AS INT) down to the scan, do the a * 5 in the 
Project operator.

>  or a combination of 2 CAST functions 

If the user does a two-stage cast, CAST(CAST(a AS INT) AS BIGINT), then one 
simple rule is to push only the innermost cast downwards.

> or non-CAST functions etc.

Just keep it in Project.

 >    It would be quite expensive to examine each expression (there could be 
hundreds) to determine whether it is eligible to be pushed to the Scan.

Just push CAST(<column_ref> AS <type>). Even that would be a huge win. Note, 
for CSV, it might have to be CAST(columns[2] AS INT), since "columns" is 
special for CSV.

>   - Expressing Nullability is not possible with CAST.  If a column should be 
> tagged as  (not)nullable, CAST syntax does not allow that.

Can we just add keywords: CAST(a AS INT NULL), CAST(b AS VARCHAR NOT NULL) ?

 >  - Drill currently supports CASTing to a SQL data type, but not to the 
complex types such as arrays and maps.  We would have to add support for that 
from a language perspective as well as the run-time.  This would be non-trivial 
effort.

The term "complex type" is always confusing. Consider a map. The rules would 
apply recursively to the members of the map. (Problem: today, if I reference a 
map member, Drill pulls it to the top level: SELECT m.a creates a new top-level 
field, it does not select "a" within "m". We need to fix that anyway.  So, 
CAST(m.a AS INT) should imply the type of column "a" within map "m".

For arrays, the problem is more complex. Perhaps more syntax: CAST(a[] AS INT) 
to force array elements to INT. Maybe use CAST(a[][] AS INT) for a repeated 
list (2D array).

Unions don't need a solution as they are their own solution (they can hold 
multiple types.) Same for (non-repeated) lists. 

To resolve runs of nulls, maybe allow CAST(m AS MAP). Or we can imply that "m" 
is a Map from the expression CAST(m.a AS INT). For arrays, the previously 
suggested CAST(a[] AS INT). If columns "a" or "m" turn out to be a non-null 
scalar, then we have no good answer.

CAST cannot solve the nasty cases of JSON in which some fields are complex, 
some scalar. E.g. {a: 10} {a: [20]} or {m: "foo"} {m: {value: "foo"}}. I 
suppose no solution is perfect...

I'm sure that, if someone gets a chance to desig this feature, they'll find 
lots more issues. Maybe cast push-down is only a partial solution. But, it 
seems to solve so many of the JSON and CSV cases that I've seen that it seems 
too good to pass up.

Thanks,


- Paul

Re: "Death of Schema-on-Read"

Reply via email to