Yaroslav created DRILL-8508: ------------------------------- Summary: Choosing the best suitable major type for a partially missing parquet column Key: DRILL-8508 URL: https://issues.apache.org/jira/browse/DRILL-8508 Project: Apache Drill Issue Type: Improvement Reporter: Yaroslav Attachments: people.tar.gz
{*}NOTE{*}: This issue requires and assumes DRILL-8507 bug to be fixed first. Please do not proceed to this one until that issue would be solved. h3. Prerequisites If a {{ParquetRecordReader}} doesn't find a selected column, it creates a null-filled {{NullableIntVector}} with the column's name and the correct value count set. h3. Problems Hardcoding the minor type (INT) leads to SchemaChangeExceptions and type cast exceptions. Former also happens due to data mode change (REQUIRED -> OPTIONAL). Consider a {{dfs.tmp.people}} table with such parquet files and their schemas: {code:java} /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(REQUIRED)> /tmp/people/1.parquet: id<INT(REQUIRED)>{code} The following query against that table would fail because of minor type change (VARCHAR -> INT): {code:java} apache drill> SELECT name FROM dfs.tmp.people ORDER BY name; Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type. Previous schema: BatchSchema [fields=[[`name` (VARCHAR:OPTIONAL)]], selectionVector=NONE] Incoming schema: BatchSchema [fields=[[`name` (INT:OPTIONAL)]], selectionVector=NONE] Fragment: 0:0 [Error Id: 97625816-0a07-410e-87b1-1d461fb8f00d on node2.vmcluster.com:31010] (state=,code=0) {code} And the following query would fail because of data mode change (REQUIRED -> OPTIONAL): {code:java} apache drill> SELECT age FROM dfs.tmp.people ORDER BY age; Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type. Previous schema: BatchSchema [fields=[[`age` (INT:REQUIRED)]], selectionVector=NONE] Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], selectionVector=NONE] Fragment: 0:0 [Error Id: adce5b82-331c-410d-87f4-c8fc1ba943e6 on node2.vmcluster.com:31010] (state=,code=0) {code} Note that the last query would also fail if we had both parquet files containing the column, but one would have it as REQURIED and other as OPTIONAL, such as here: {code:java} /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(REQUIRED)> /tmp/people/1.parquet: id<INT(REQUIRED)> | age<INT(OPTIONAL)> {code} h3. Solution idea Note that all of the cases above have this {_}partially missing column{_}, meaning that some of the parquet files in a queried table have the column and others do not (or have it as OPTIONAL). If none of the files would contain the column ({_}completely missing{_}), we wouldn't have any chance to guess the major type except defaulting to INT:OPTIONAL. But the case with partially missing column is different in that the correct minor type exists in those parquet files who have the column (and the data mode is obviously OPTIONAL since we create a null-filled vector). So, in theory, we could take the minor type from there create a null-filled vector for a missing column with this type. The solution idea suggested here is based on the fact that {_}schemas of all the parquet files to read is available at planning phase in a foreman{_}. Furthermore, it is already passed to each separate minor fragment (and its parquet readers) so the only thing left to do is to take the minor type from there and use it for missing columns. For data mode issue, however, we also need to catch the partially missing column case and enforce all the readers to return it as OPTIONAL (even if this particular reader have it as REQUIRED). *Expected behavior* So this ticket aims to bring 2 new statements in the Drill's behavior for parquet missing columns: # If at least 1 parquet file contains a selected column, then the null-filled vectors should have its minor type # If at least 1 parquet file does not have a selected column, or have it as OPTIONAL, then ALL of the readers should return this column as OPTIONAL -- This message was sent by Atlassian Jira (v8.20.10#820010)