[ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430875#comment-16430875
 ] 

Paul Rogers commented on DRILL-6312:
------------------------------------

See the e-mail thread which makes the case that the core underlying issue is 
the need for an ability to specify schema. Inferring the schema from the query 
is nice, but neither necessary nor sufficient. Schema is a property of the 
*data*, not of the *query*. The data schema must be specified a priori and be 
available for use in queries.

Think of it this way. A user will explore data when they first see it, and 
schema-free is helpful for that. Once that exploration is done, the user wants 
to capture the learnings in the form of a schema so that the next 100 queries 
can make use of that information. Said another way, Tableau users don't want to 
rediscover the same schema over and over; discovery happens once, and getting 
work done with the schema happens many times thereafter.

If the schema hint mechanism is provided, then it is handy to populate it from 
type information in the query, just as it is handy to populate it from the Hive 
metastore (if available) or some other external system.

The key need is the hint mechanism with an API. It is a second-order feature to 
provide implementations of that API based on schema type inference, data 
sampling, etc.
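
To make the split between mechanism and implementations concrete, here is a 
rough sketch of what such a hint API could look like. Everything in it is 
hypothetical: SchemaHintProvider, MapHintProvider, ChainedHintProvider and the 
use of plain strings for type names are illustrative assumptions, not existing 
Drill classes or a proposed design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical API (not part of Drill): a source of column type hints.
interface SchemaHintProvider {
    // Returns the hinted type name for a column, or empty if this source has no hint.
    Optional<String> typeOf(String table, String column);
}

// One possible implementation: hints held in memory. It could be populated
// from casts found in a query, from a saved schema, or from an external
// metastore; where the entries come from is outside this sketch.
final class MapHintProvider implements SchemaHintProvider {
    private final Map<String, String> hints = new HashMap<>();

    MapHintProvider hint(String table, String column, String type) {
        hints.put(table + "." + column, type);
        return this;
    }

    @Override
    public Optional<String> typeOf(String table, String column) {
        return Optional.ofNullable(hints.get(table + "." + column));
    }
}

// Consults several providers in order; the first one with a hint wins.
final class ChainedHintProvider implements SchemaHintProvider {
    private final List<SchemaHintProvider> providers;

    private ChainedHintProvider(List<SchemaHintProvider> providers) {
        this.providers = new ArrayList<>(providers);
    }

    static ChainedHintProvider of(SchemaHintProvider... providers) {
        return new ChainedHintProvider(Arrays.asList(providers));
    }

    @Override
    public Optional<String> typeOf(String table, String column) {
        for (SchemaHintProvider provider : providers) {
            Optional<String> type = provider.typeOf(table, column);
            if (type.isPresent()) {
                return type;
            }
        }
        return Optional.empty();
    }
}
```

With this shape, a scanner would only ever ask the interface for a column's 
type. Whether the answer came from a cast in the query, a user-saved schema, or 
sampling is an implementation detail behind it, and putting a saved schema 
first in the chain lets it take precedence over hints inferred from the query.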

> Enable pushing of cast expressions to the scanner for better schema discovery.
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-6312
>                 URL: https://issues.apache.org/jira/browse/DRILL-6312
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Execution - Relational Operators, Query Planning & 
> Optimization
>    Affects Versions: 1.13.0
>            Reporter: Hanumath Rao Maduri
>            Priority: Major
>
> Drill is a schema-less engine which tries to infer the schema from disparate 
> sources at read time. Currently the scanners infer the schema for each 
> batch depending upon the data for that column in the corresponding batch. 
> This solves many use cases but can error out when the data differs too much 
> between batches, for example int versus array[int] (there are other cases as 
> well; this is just one example).
> There is also a mechanism to create a view by casting the columns to the 
> appropriate types. This solves the issue in some cases but fails in many 
> others, because the cast expression is not pushed down to the scanner but 
> stays in the project, filter, or other operators higher up the query plan.
> This JIRA is to fix this by propagating the type information embedded in the 
> cast function to the scanners so that scanners can cast the incoming data 
> appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
