[ https://issues.apache.org/jira/browse/DRILL-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122661#comment-16122661 ]
Paul Rogers commented on DRILL-4211: ------------------------------------ As we also discussed, Drill has a gap when handling ambiguous columns. At run time, Drill retains only the column name with none of the additional information used at planning time. That information includes: _storage_plugin_ . _schema_ . _table_ . _column_ Consider the following queries: {code} SELECT `dfs.myDir`.`customers.csv`.col, `dfs.yourDir`.`customers.csv`.col FROM `dfs.myDir`.`customers.csv`, `dfs.yourDir`.`customers.csv` WHERE `dfs.myDir`.`customers.csv`.col = `dfs.yourDir`.`customers.csv`.col SELECT `customers1.csv`.col, `customers2.csv`.col FROM `customers1.csv`, `customers2.csv` WHERE `customers1.csv`.col = `customers2.csv`.col SELECT `dfs.myDir`.`customers.csv`.col, * WHERE col = 10 {code} (This assumes that Drill allows fully-qualfied names in the WHERE clause...) In each case, the columns are unambiguous to a human. In the first, the fully qualified path is unique. In the second, the works space is implied, but the table.column name is unique. In the third, "col" will appear twice, but it is clear that the "col" in the WHERE clause refers to the explicit "col", even if the wildcard expands to "col, foo, bar". At run time, however, Drill uses only the tail column name. In fact, another JIRA asks that dots in column names be treated as part of the name. (This allows a JSON field name of the form "customers.col"...) The most general solution would be to modify Drill's runtime naming system to provide fully qualified names: * The full plugin, workspace, table, column name, represented as a {{SchemaPath}}. * A "hint" in a name that says that it came from a wildcard expansion rather than explicit. * When names are ambiguous (and so we want to add a "0" to the name, prefer to add the suffix to names that come from wildcard expansion. There are probably other rules needed to ensure that all semantically valid references in a query are translated into unique references at run time. Note that, because Drill is schema-on-read, some of the name resolution normally done by the planner (which considers fully qualified names) must be pushed into the execution framework in Drill. > Column aliases not pushed down to JDBC stores in some cases when Drill > expects aliased columns to be returned. > -------------------------------------------------------------------------------------------------------------- > > Key: DRILL-4211 > URL: https://issues.apache.org/jira/browse/DRILL-4211 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators > Affects Versions: 1.3.0, 1.11.0 > Environment: Postgres db storage > Reporter: Robert Hamilton-Smith > Assignee: Timothy Farkas > Labels: newbie > > When making an sql statement that incorporates a join to a table and then a > self join to that table to get a parent value , Drill brings back > inconsistent results. > Here is the sql in postgres with correct output: > {code:sql} > select trx.categoryguid, > cat.categoryname, w1.categoryname as parentcat > from transactions trx > join categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID) > join categories w1 on (cat.categoryparentguid = w1.categoryguid) > where cat.categoryparentguid IS NOT NULL; > {code} > Output: > ||categoryid||categoryname||parentcategory|| > |id1|restaurants|food&Dining| > |id1|restaurants|food&Dining| > |id2|Coffee Shops|food&Dining| > |id2|Coffee Shops|food&Dining| > When run in Drill with correct storage prefix: > {code:sql} > select trx.categoryguid, > cat.categoryname, w1.categoryname as parentcat > from db.schema.transactions trx > join db.schema.categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID) > join db.schema.wpfm_categories w1 on (cat.categoryparentguid = > w1.categoryguid) > where cat.categoryparentguid IS NOT NULL > {code} > Results are: > ||categoryid||categoryname||parentcategory|| > |id1|restaurants|null| > |id1|restaurants|null| > |id2|Coffee Shops|null| > |id2|Coffee Shops|null| > Physical plan is: > {code:sql} > 00-00 Screen : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) > categoryname, VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = > {110.0 rows, 110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64293 > 00-01 Project(categoryguid=[$0], categoryname=[$1], parentcat=[$2]) : > rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, > VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, > 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64292 > 00-02 Project(categoryguid=[$9], categoryname=[$41], parentcat=[$47]) > : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, > VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, > 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64291 > 00-03 Jdbc(sql=[SELECT * > FROM "public"."transactions" > INNER JOIN (SELECT * > FROM "public"."categories" > WHERE "categoryparentguid" IS NOT NULL) AS "t" ON > "transactions"."categoryguid" = "t"."categoryguid" > INNER JOIN "public"."categories" AS "categories0" ON "t"."categoryparentguid" > = "categories0"."categoryguid"]) : rowType = RecordType(VARCHAR(255) > transactionguid, VARCHAR(255) relatedtransactionguid, VARCHAR(255) > transactioncode, DECIMAL(1, 0) transactionpending, VARCHAR(50) > transactionrefobjecttype, VARCHAR(255) transactionrefobjectguid, > VARCHAR(1024) transactionrefobjectvalue, TIMESTAMP(6) transactiondate, > VARCHAR(256) transactiondescription, VARCHAR(50) categoryguid, VARCHAR(3) > transactioncurrency, DECIMAL(15, 3) transactionoldbalance, DECIMAL(13, 3) > transactionamount, DECIMAL(15, 3) transactionnewbalance, VARCHAR(512) > transactionnotes, DECIMAL(2, 0) transactioninstrumenttype, VARCHAR(20) > transactioninstrumentsubtype, VARCHAR(20) transactioninstrumentcode, > VARCHAR(50) transactionorigpartyguid, VARCHAR(255) > transactionorigaccountguid, VARCHAR(50) transactionrecpartyguid, VARCHAR(255) > transactionrecaccountguid, VARCHAR(256) transactionstatementdesc, DECIMAL(1, > 0) transactionsplit, DECIMAL(1, 0) transactionduplicated, DECIMAL(1, 0) > transactionrecategorized, TIMESTAMP(6) transactioncreatedat, TIMESTAMP(6) > transactionupdatedat, VARCHAR(50) transactionmatrulerefobjtype, VARCHAR(50) > transactionmatrulerefobjguid, VARCHAR(50) transactionmatrulerefobjvalue, > VARCHAR(50) transactionuserruleguid, DECIMAL(2, 0) transactionsplitorder, > TIMESTAMP(6) transactionprocessedat, TIMESTAMP(6) > transactioncategoryassignat, VARCHAR(50) transactionsystemcategoryguid, > VARCHAR(50) transactionorigmandateid, VARCHAR(100) fingerprint, VARCHAR(50) > categoryguid0, VARCHAR(50) categoryparentguid, DECIMAL(3, 0) categorytype, > VARCHAR(50) categoryname, VARCHAR(50) categorydescription, VARCHAR(50) > partyguid, VARCHAR(50) categoryguid1, VARCHAR(50) categoryparentguid0, > DECIMAL(3, 0) categorytype0, VARCHAR(50) categoryname0, VARCHAR(50) > categorydescription0, VARCHAR(50) partyguid0): rowcount = 100.0, cumulative > cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64259 > {code} > I worked around it by creating a view on postgres but not ideal. Thanks in > advance. > First Drill Jira Bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029)