[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

Paul Rogers (JIRA) Wed, 02 May 2018 22:19:05 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461943#comment-16461943
 ]


Paul Rogers commented on DRILL-6383:
------------------------------------

All the above would seem to be academic, but it affects user-visible behavior.

 
{noformat}
DESCRIBE myView5;
+--------------+------------+--------------+
| COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
+--------------+------------+--------------+
| custId       | INTEGER    | YES          |
+--------------+------------+--------------+

SELECT modeOf(custId) FROM myView5 LIMIT 1;
+-----------+
|  EXPR$0   |
+-----------+
| NOT NULL  |
+-----------+
{noformat}

And:

{noformat}
DESCRIBE myView4;
+--------------+------------+--------------+
| COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
+--------------+------------+--------------+
| columns      | ANY        | YES          |
+--------------+------------+--------------+

SELECT typeOf(columns), modeOf(columns) FROM myView4 LIMIT 1;
+----------+---------+
|  EXPR$0  | EXPR$1  |
+----------+---------+
| VARCHAR  | ARRAY   |
+----------+---------+
{noformat}

Since the types are wrong, the `DESCRIBE` is useless for an application to 
understand the types of data types it will need when running the query.

> View column types, modes are plan-time guesses, not actual types
> ----------------------------------------------------------------
>
>                 Key: DRILL-6383
>                 URL: https://issues.apache.org/jira/browse/DRILL-6383
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Create a view views and look at the list of columns within the view. You'll 
> see that they are often wrong in name, type and mode.
> Consider a very simple CSV file with headers:
> {noformat}
> custId,name,balance,status
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> 128,Barney,1.23,PAST DUE,30
> {noformat}
> Define the simplest possible view:
> {noformat}
> CREATE VIEW myView2 AS SELECT * FROM `csvh/cust.csvh`;
> {noformat}
> Then look at the view file:
> {noformat}
> {
>   "name" : "myView2",
>   "sql" : "SELECT *\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "**",
>     "type" : "DYNAMIC_STAR",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> It is clear that the view simply captured the plan-time list of the new 
> double-star for the wildcard. Since this is not a true type, it should not 
> have an `isNullable` attribute.
> OK, we have to spell out the columns:
> {noformat}
> CREATE VIEW myView3 AS SELECT custId  FROM `csvh/cust.csvh`;
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView3",
>   "sql" : "SELECT `custId`\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> The name is correct. The type is `ANY`, which is wrong. Since this is a CSV 
> file, the column type is `VARCHAR`. Further, because this is a CSV file which 
> headers, the mode is REQUIRED, but is listed as nullable. To verify:
> {noformat}
> SELECT sqlTypeOf(custId), modeOf(custId) FROM myView3 LIMIT 1;
> +--------------------+-----------+
> |       EXPR$0       |  EXPR$1   |
> +--------------------+-----------+
> | CHARACTER VARYING  | NOT NULL  |
> +--------------------+-----------+
> {noformat}
> Now, let's try a CSV file without headers:
> {noformat}
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> {noformat}
> {noformat}
> CREATE VIEW myView4 AS SELECT columns FROM `csv/cust.csv`;
> SELECT * FROM myView4;
> +--------------------------------+
> |            columns             |
> +--------------------------------+
> | ["123","Fred","456.78"]        |
> | ["125","Betty","98.76","VIP"]  |
> +--------------------------------+
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView4",
>   "sql" : "SELECT `columns`\nFROM `csv/cust.csv`",
>   "fields" : [ {
>     "name" : "columns",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> This is almost non-sensical. `columns` is reported as type `ANY` and 
> nullable. But, `columns` is Repeated `VARCHAR` and repeated types cannot be 
> nullable.
> The conclusion is that the type information is virtually worthless and the 
> `isNullable` information is worse than worthless: it is plain wrong.
> The type information is valid only if the planner can inver types:
> {noformat}
> CREATE VIEW myView5 AS
>   SELECT CAST(custId AS INTEGER) AS custId FROM `csvh/cust.csvh`;
> {noformat}
> The view file:
> {noformat}
> {
>   "name" : "myView5",
>   "sql" : "SELECT CAST(`custId` AS INTEGER) AS `custId`\nFROM 
> `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "INTEGER",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> Note that the `type` is inferred from the cast, but `isNullable` is wrong 
> because the underlying column is non-nullable:
> {noformat}
> SELECT modeOf(custId) FROM myView5 LIMIT 1;
> +-----------+
> |  EXPR$0   |
> +-----------+
> | NOT NULL  |
> +-----------+
> {noformat}
> Expected that Drill would run the underlying query as a `LIMIT 0` query to 
> extract the actual column types, and use that in the view.
> Or, expected that Drill would simply omit the column list from the view if 
> the data is meaningless.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

Reply via email to