[ https://issues.apache.org/jira/browse/DRILL-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anton Gozhiy closed DRILL-7083. ------------------------------- Resolution: Fixed > Wrong data type for explicit partition column beyond file depth > --------------------------------------------------------------- > > Key: DRILL-7083 > URL: https://issues.apache.org/jira/browse/DRILL-7083 > Project: Apache Drill > Issue Type: Bug > Affects Versions: 1.15.0 > Reporter: Paul Rogers > Priority: Minor > Fix For: 1.17.0 > > > Consider the simple case in DRILL-7082. That ticket talks about implicit > partition columns created by the wildcard. Consider a very similar case: > {code:sql} > SELECT a, b, c, dir0, dir1 FROM `myTable` > {code} > Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}: > {noformat} > myTable > |- file1.csv > |- nested > |- file2.csv > {noformat} > If the query is run in "stock" Drill, the planner will place both files > within a single scan operator as described in DRILL-7082. The result schema > will be: > {noformat} > (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT) > {noformat} > Notice that last column: why is "dir1" a (nullable) INT? The partition > mechanism only recognizes partitions that actually exist, leaving the Project > operator to fill in (with a Nullable INT) any partitions that don't exist > (any directory levels not actually seen by the scan operator.) > Now, using the same trick as in DRILL-7082, try the query > {code:sql} > SELECT a, b, c, dir0 FROM `myTable` > {code} > Again, the trick causes Drill to read each file in a separate scan operator > (simulating what happens when queries run at scale.) > The scan operator for {{file1.csv}} will see no partitions, so it will omit > "dir0" and the Project operator will helpfully fill in a Nullable INT. The > scan operator for {{file2.csv}} sees one level of partition, so sets {{dir0}} > to {{nested}} as a Nullable VARCHAR. > What does the client see? Two records: one with "dir0" as a Nullable INT, the > other as a Nullable VARCHAR. Client such as JDBC and ODBC see a hard schema > change between the two records. > The two cases described above are really two versions of the same issue. > Clients expect that, if they use the "dir0", "dir1", ... columns, that the > type is always Nullable Varchar so that the schema stays consistent across > batches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)