[ https://issues.apache.org/jira/browse/DRILL-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers reassigned DRILL-7055: ---------------------------------- Assignee: (was: Paul Rogers) > Revise SELECT * to exclude partitions > ------------------------------------- > > Key: DRILL-7055 > URL: https://issues.apache.org/jira/browse/DRILL-7055 > Project: Apache Drill > Issue Type: Bug > Affects Versions: 1.15.0 > Reporter: Paul Rogers > Priority: Major > > Suppose you have a partitioned table: > {noformat} > myTable > 2018 > file1.csv > 2019 > file2.csv > {noformat} > For some time, Drill has included the partitions in a {{SELECT *}} query: > {noformat} > SELECT * FROM `myTable`; > col1.1, col1.2, 2018 > col2.1, col2.2, 2019 > {noformat} > There are two subtle issues with this behavior: > * The behavior of partitions is not consistent with the other file metadata > (implicit) columns. > * Because directory depth can vary, and scan order is random, a hard schema > change can occur if Drill starts scanning shallow files before deep files. > This ticked proposes to change the partition behavior to be like that of > other file metadata columns: they are included only when requested: > {noformat} > SELECT * FROM `myTable`; > col1.1, col1.2 > col2.1, col2.2 > SELECT *, dir0, filename FROM `myTable`; > col1.1, col1.2, 2018, file1.csv > col2.1, col2.2, 2019, file2.csv > {noformat} > With this change, there is no possibility of a hard schema change: the user > predefines the desired partitions. > Unfortunately, with the existing readers, a reader that does not have given > partition will omit that partition column and will instead leave it to the > projection operator to fill in the column, which it will do with a Nullable > INT. (The new row-set based scan mechanism handles this case correctly.) > h4. Risks > Note that this change does change user-visible behavior. If a user has been > able to get {{SELECT *}} to work with partition columns, the query will have > to change to include partition columns. However, it may be that the risk of > such a breaking change is low because: > * Users are generally discouraged from using {{SELECT *}} in a production > query. > * Use of {{SELECT *}} in a non-uniform partitioning structure would have > caused failures due to the hard schema change noted above. > For this reason, the benefits of the change appear to outweigh the risks. > h4. Technical Background > In the last year, Calcite appears to have added the ability to specify a > wildcard plus extra columns. When used with implicit columns, we can now say: > {code:sql} > SELECT *, filename FROM myTable; > {code} > However, while the readers (at least the CSV reader) can handle this case, > the {{ProjectRecordBatch}} cannot. > Modify the {{TestCsv.java}} test case with the following test: > {code:java} > @Test > public void testImplicitColWildcard() throws IOException { > String sql = "SELECT *, filename FROM `dfs.data`.`%s`"; > RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet(); > actual.print(); > TupleMetadata expectedSchema = new SchemaBuilder() > .add("a", MinorType.VARCHAR) > .add("b", MinorType.VARCHAR) > .add("c", MinorType.VARCHAR) > .addNullable("filename", MinorType.VARCHAR) > .buildSchema(); > RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema) > .addRow("10", "foo", "bar", CASE2_FILE_NAME) > .build(); > RowSetUtilities.verify(expected, actual); > } > {code} > The output of the {{actual.print()}} is: > {noformat} > #: a, b, c, filename > 0: "10", "foo", "bar", "case2.csv" > {noformat} > Now, try the same thing, but substitute "dir0" for "filename". We would > expect to see something like the above. What we actually see is: > {noformat} > #: a, b, c, dir0, dir00 > 0: "10", "foo", "bar", null, null > {noformat} > Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see > the same thing on the master branch, put the CSV file under a directory and > query the directory. > The problem is traced to > [here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]: > {code:java} > private boolean isImplicitFileColumn(ValueVector vvIn) { > return > ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName()) > != null; > } > {code} > This has two problems: > 1. It creates a map of implicit column names, but does not handle parsing > names like "dir0". > 2. It creates the map over and over: once per column per schema change. Very > inefficient. > The solution is to modify the code to use the {{isPartitionColumn()}} method > in {{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project > operator instance and reuse it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)