[jira] [Commented] (DRILL-4211) Column aliases not pushed down to JDBC stores in some cases when Drill expects aliased columns to be returned.

Paul Rogers (JIRA) Thu, 10 Aug 2017 18:42:09 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122661#comment-16122661
 ]


Paul Rogers commented on DRILL-4211:
------------------------------------

As we also discussed, Drill has a gap when handling ambiguous columns. At run 
time, Drill retains only the column name with none of the additional 
information used at planning time. That information includes:

_storage_plugin_ . _schema_ . _table_ . _column_

Consider the following queries:

{code}
SELECT `dfs.myDir`.`customers.csv`.col, `dfs.yourDir`.`customers.csv`.col
FROM `dfs.myDir`.`customers.csv`, `dfs.yourDir`.`customers.csv`
WHERE `dfs.myDir`.`customers.csv`.col = `dfs.yourDir`.`customers.csv`.col

SELECT `customers1.csv`.col, `customers2.csv`.col
FROM `customers1.csv`, `customers2.csv`
WHERE `customers1.csv`.col = `customers2.csv`.col

SELECT `dfs.myDir`.`customers.csv`.col, *
WHERE col = 10
{code}

(This assumes that Drill allows fully-qualfied names in the WHERE clause...)

In each case, the columns are unambiguous to a human. In the first, the fully 
qualified path is unique. In the second, the works space is implied, but the 
table.column name is unique. In the third, "col" will appear twice, but it is 
clear that the "col" in the WHERE clause refers to the explicit "col", even if 
the wildcard expands to "col, foo, bar".

At run time, however, Drill uses only the tail column name. In fact, another 
JIRA asks that dots in column names be treated as part of the name. (This 
allows a JSON field name of the form "customers.col"...)

The most general solution would be to modify Drill's runtime naming system to 
provide fully qualified names:

* The full plugin, workspace, table, column name, represented as a 
{{SchemaPath}}.
* A "hint" in a name that says that it came from a wildcard expansion rather 
than explicit.
* When names are ambiguous (and so we want to add a "0" to the name, prefer to 
add the suffix to names that come from wildcard expansion.

There are probably other rules needed to ensure that all semantically valid 
references in a query are translated into unique references at run time.

Note that, because Drill is schema-on-read, some of the name resolution 
normally done by the planner (which considers fully qualified names) must be 
pushed into the execution framework in Drill.

> Column aliases not pushed down to JDBC stores in some cases when Drill 
> expects aliased columns to be returned.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4211
>                 URL: https://issues.apache.org/jira/browse/DRILL-4211
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.3.0, 1.11.0
>         Environment: Postgres db storage
>            Reporter: Robert Hamilton-Smith
>            Assignee: Timothy Farkas
>              Labels: newbie
>
> When making an sql statement that incorporates a join to a table and then a 
> self join to that table to get a parent value , Drill brings back 
> inconsistent results. 
> Here is the sql in postgres with correct output:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from transactions trx
> join categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join categories w1 on (cat.categoryparentguid = w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL;
> {code}
> Output:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|food&Dining|
> |id1|restaurants|food&Dining|
> |id2|Coffee Shops|food&Dining|
> |id2|Coffee Shops|food&Dining|
> When run in Drill with correct storage prefix:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from db.schema.transactions trx
> join db.schema.categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join db.schema.wpfm_categories w1 on (cat.categoryparentguid = 
> w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL
> {code}
> Results are:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|null|
> |id1|restaurants|null|
> |id2|Coffee Shops|null|
> |id2|Coffee Shops|null|
> Physical plan is:
> {code:sql}
> 00-00    Screen : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) 
> categoryname, VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = 
> {110.0 rows, 110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64293
> 00-01      Project(categoryguid=[$0], categoryname=[$1], parentcat=[$2]) : 
> rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, 
> VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, 
> 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64292
> 00-02        Project(categoryguid=[$9], categoryname=[$41], parentcat=[$47]) 
> : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, 
> VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, 
> 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64291
> 00-03          Jdbc(sql=[SELECT *
> FROM "public"."transactions"
> INNER JOIN (SELECT *
> FROM "public"."categories"
> WHERE "categoryparentguid" IS NOT NULL) AS "t" ON 
> "transactions"."categoryguid" = "t"."categoryguid"
> INNER JOIN "public"."categories" AS "categories0" ON "t"."categoryparentguid" 
> = "categories0"."categoryguid"]) : rowType = RecordType(VARCHAR(255) 
> transactionguid, VARCHAR(255) relatedtransactionguid, VARCHAR(255) 
> transactioncode, DECIMAL(1, 0) transactionpending, VARCHAR(50) 
> transactionrefobjecttype, VARCHAR(255) transactionrefobjectguid, 
> VARCHAR(1024) transactionrefobjectvalue, TIMESTAMP(6) transactiondate, 
> VARCHAR(256) transactiondescription, VARCHAR(50) categoryguid, VARCHAR(3) 
> transactioncurrency, DECIMAL(15, 3) transactionoldbalance, DECIMAL(13, 3) 
> transactionamount, DECIMAL(15, 3) transactionnewbalance, VARCHAR(512) 
> transactionnotes, DECIMAL(2, 0) transactioninstrumenttype, VARCHAR(20) 
> transactioninstrumentsubtype, VARCHAR(20) transactioninstrumentcode, 
> VARCHAR(50) transactionorigpartyguid, VARCHAR(255) 
> transactionorigaccountguid, VARCHAR(50) transactionrecpartyguid, VARCHAR(255) 
> transactionrecaccountguid, VARCHAR(256) transactionstatementdesc, DECIMAL(1, 
> 0) transactionsplit, DECIMAL(1, 0) transactionduplicated, DECIMAL(1, 0) 
> transactionrecategorized, TIMESTAMP(6) transactioncreatedat, TIMESTAMP(6) 
> transactionupdatedat, VARCHAR(50) transactionmatrulerefobjtype, VARCHAR(50) 
> transactionmatrulerefobjguid, VARCHAR(50) transactionmatrulerefobjvalue, 
> VARCHAR(50) transactionuserruleguid, DECIMAL(2, 0) transactionsplitorder, 
> TIMESTAMP(6) transactionprocessedat, TIMESTAMP(6) 
> transactioncategoryassignat, VARCHAR(50) transactionsystemcategoryguid, 
> VARCHAR(50) transactionorigmandateid, VARCHAR(100) fingerprint, VARCHAR(50) 
> categoryguid0, VARCHAR(50) categoryparentguid, DECIMAL(3, 0) categorytype, 
> VARCHAR(50) categoryname, VARCHAR(50) categorydescription, VARCHAR(50) 
> partyguid, VARCHAR(50) categoryguid1, VARCHAR(50) categoryparentguid0, 
> DECIMAL(3, 0) categorytype0, VARCHAR(50) categoryname0, VARCHAR(50) 
> categorydescription0, VARCHAR(50) partyguid0): rowcount = 100.0, cumulative 
> cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64259
> {code}
> I worked around it by creating a view on postgres but not ideal. Thanks in 
> advance.
> First Drill Jira Bug.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (DRILL-4211) Column aliases not pushed down to JDBC stores in some cases when Drill expects aliased columns to be returned.

Reply via email to