[ 
https://issues.apache.org/jira/browse/DRILL-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7055:
----------------------------------

    Assignee:     (was: Paul Rogers)

> Revise SELECT * to exclude partitions
> -------------------------------------
>
>                 Key: DRILL-7055
>                 URL: https://issues.apache.org/jira/browse/DRILL-7055
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Suppose you have a partitioned table:
> {noformat}
> myTable
>   2018
>     file1.csv
>   2019
>     file2.csv
> {noformat}
> For some time, Drill has included the partitions in a {{SELECT *}} query:
> {noformat}
> SELECT * FROM `myTable`;
> col1.1, col1.2, 2018
> col2.1, col2.2, 2019
> {noformat}
> There are two subtle issues with this behavior:
> * The behavior of partitions is not consistent with the other file metadata 
> (implicit) columns.
> * Because directory depth can vary, and scan order is random, a hard schema 
> change can occur if Drill starts scanning shallow files before deep files.
> This ticked proposes to change the partition behavior to be like that of 
> other file metadata columns: they are included only when requested:
> {noformat}
> SELECT * FROM `myTable`;
> col1.1, col1.2
> col2.1, col2.2
> SELECT *, dir0, filename FROM `myTable`;
> col1.1, col1.2, 2018, file1.csv
> col2.1, col2.2, 2019, file2.csv
> {noformat}
> With this change, there is no possibility of a hard schema change: the user 
> predefines the desired partitions.
> Unfortunately, with the existing readers, a reader that does not have given 
> partition will omit that partition column and will instead leave it to the 
> projection operator to fill in the column, which it will do with a Nullable 
> INT. (The new row-set based scan mechanism handles this case correctly.)
> h4. Risks
> Note that this change does change user-visible behavior. If a user has been 
> able to get {{SELECT *}} to work with partition columns, the query will have 
> to change to include partition columns. However, it may be that the risk of 
> such a breaking change is low because:
> * Users are generally discouraged from using {{SELECT *}} in a production 
> query.
> * Use of {{SELECT *}} in a non-uniform partitioning structure would have 
> caused failures due to the hard schema change noted above.
> For this reason, the benefits of the change appear to outweigh the risks.
> h4. Technical Background
> In the last year, Calcite appears to have added the ability to specify a 
> wildcard plus extra columns. When used with implicit columns, we can now say:
> {code:sql}
> SELECT *, filename FROM myTable;
> {code}
> However, while the readers (at least the CSV reader) can handle this case, 
> the {{ProjectRecordBatch}} cannot.
> Modify the {{TestCsv.java}} test case with the following test:
> {code:java}
>   @Test
>   public void testImplicitColWildcard() throws IOException {
>     String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
>     RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
>     actual.print();
>     TupleMetadata expectedSchema = new SchemaBuilder()
>         .add("a", MinorType.VARCHAR)
>         .add("b", MinorType.VARCHAR)
>         .add("c", MinorType.VARCHAR)
>         .addNullable("filename", MinorType.VARCHAR)
>         .buildSchema();
>     RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
>         .addRow("10", "foo", "bar", CASE2_FILE_NAME)
>         .build();
>     RowSetUtilities.verify(expected, actual);
>   }
> {code}
> The output of the {{actual.print()}} is:
> {noformat}
> #: a, b, c, filename
> 0: "10", "foo", "bar", "case2.csv"
> {noformat}
> Now, try the same thing, but substitute "dir0" for "filename". We would 
> expect to see something like the above. What we actually see is:
> {noformat}
> #: a, b, c, dir0, dir00
> 0: "10", "foo", "bar", null, null
> {noformat}
> Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see 
> the same thing on the master branch, put the CSV file under a directory and 
> query the directory.
> The problem is traced to 
> [here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:
> {code:java}
>   private boolean isImplicitFileColumn(ValueVector vvIn) {
>     return 
> ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName())
>  != null;
>   }
> {code}
> This has two problems:
> 1. It creates a map of implicit column names, but does not handle parsing 
> names like "dir0".
> 2. It creates the map over and over: once per column per schema change. Very 
> inefficient.
> The solution is to modify the code to use the {{isPartitionColumn()}} method 
> in {{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project 
> operator instance and reuse it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to