[
https://issues.apache.org/jira/browse/DRILL-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maksym Rymar updated DRILL-8508:
--------------------------------
Affects Version/s: 1.21.2
> Choosing the best suitable major type for a partially missing parquet column
> ----------------------------------------------------------------------------
>
> Key: DRILL-8508
> URL: https://issues.apache.org/jira/browse/DRILL-8508
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
>
> {*}NOTE{*}: This issue requires and assumes that the DRILL-8507 bug is fixed
> first. Please do not proceed to this one until that issue is resolved.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a
> null-filled {{NullableIntVector}} with the column's name and the correct
> value count set.
> h3. Problems
> Hardcoding the minor type (INT) leads to SchemaChangeExceptions and type cast
> exceptions. The former also occur when the data mode changes (REQUIRED ->
> OPTIONAL). Consider a {{dfs.tmp.people}} table with the following parquet
> files and their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(REQUIRED)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>
> {code}
> The following query against that table fails because of a minor type
> change (VARCHAR -> INT):
> {code:java}
> apache drill> SELECT name FROM dfs.tmp.people ORDER BY name;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`name` (VARCHAR:OPTIONAL)]],
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`name` (INT:OPTIONAL)]],
> selectionVector=NONE]
> Fragment: 0:0
> [Error Id: 97625816-0a07-410e-87b1-1d461fb8f00d on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> And the following query fails because of a data mode change (REQUIRED ->
> OPTIONAL):
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:REQUIRED)]],
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]],
> selectionVector=NONE]
> Fragment: 0:0
> [Error Id: adce5b82-331c-410d-87f4-c8fc1ba943e6 on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> Note that the last query would also fail if both parquet files contained
> the column, but one had it as REQUIRED and the other as OPTIONAL, such as
> here:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(REQUIRED)>
> /tmp/people/1.parquet: id<INT(REQUIRED)> | age<INT(OPTIONAL)>
> {code}
> h3. Solution idea
> Note that all of the cases above involve a {_}partially missing column{_},
> meaning that some of the parquet files in a queried table have the column and
> others do not (or have it as OPTIONAL). If none of the files contained the
> column ({_}completely missing{_}), we would have no way to guess the major
> type and could only default to INT:OPTIONAL.
> But the case of a partially missing column is different in that the correct
> minor type exists in those parquet files that have the column (and the data
> mode is obviously OPTIONAL since we create a null-filled vector). So, in
> theory, we could take the minor type from there and create a null-filled
> vector of this type for the missing column.
> The solution idea suggested here is based on the fact that {_}the schemas of
> all the parquet files to read are available at the planning phase in the
> Foreman{_}. Furthermore, they are already passed to each separate minor
> fragment (and its parquet readers), so the only thing left to do is to take
> the minor type from there and use it for missing columns. For the data mode
> issue, however, we also need to detect the partially missing column case and
> force all the readers to return the column as OPTIONAL (even if a particular
> reader has it as REQUIRED).
> *Expected behavior*
> So this ticket aims to introduce 2 new rules into Drill's behavior for
> missing parquet columns:
> # If at least 1 parquet file contains a selected column, then the
> null-filled vectors should have its minor type
> # If at least 1 parquet file does not have a selected column, or has it as
> OPTIONAL, then ALL of the readers should return this column as OPTIONAL
>
>
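The two rules under "Expected behavior" can be illustrated with a minimal,
self-contained Java sketch. This is not Drill code: the class
MissingColumnTypeSketch, its nested MinorType, DataMode and MajorType types,
the resolve method and the "salary" column are all hypothetical names
introduced for illustration. File schemas are modeled as plain maps from
column name to type. The sketch also assumes that every file containing the
column agrees on its minor type, a case the ticket does not discuss.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are Drill's real classes.
class MissingColumnTypeSketch {
    enum MinorType { INT, VARCHAR }
    enum DataMode { REQUIRED, OPTIONAL }

    static final class MajorType {
        final MinorType minorType;
        final DataMode mode;

        MajorType(MinorType minorType, DataMode mode) {
            this.minorType = minorType;
            this.mode = mode;
        }

        @Override
        public String toString() {
            return minorType + ":" + mode;
        }
    }

    // Resolves the type every reader should report for the given column,
    // based on the schemas of all parquet files in the table.
    static MajorType resolve(String column, List<Map<String, MajorType>> fileSchemas) {
        MinorType minorType = null;
        boolean optional = false;
        for (Map<String, MajorType> schema : fileSchemas) {
            MajorType type = schema.get(column);
            if (type == null) {
                // Rule 2: a file without the column forces OPTIONAL.
                optional = true;
                continue;
            }
            if (minorType == null) {
                // Rule 1: take the minor type from a file that has the column.
                // Assumption: all files having the column agree on it.
                minorType = type.minorType;
            }
            if (type.mode == DataMode.OPTIONAL) {
                // Rule 2: a file with the column as OPTIONAL forces OPTIONAL.
                optional = true;
            }
        }
        if (minorType == null) {
            // Completely missing column: fall back to INT:OPTIONAL.
            return new MajorType(MinorType.INT, DataMode.OPTIONAL);
        }
        return new MajorType(minorType, optional ? DataMode.OPTIONAL : DataMode.REQUIRED);
    }

    // Schemas of /tmp/people/0.parquet and /tmp/people/1.parquet from the ticket.
    static List<Map<String, MajorType>> peopleSchemas() {
        MajorType intRequired = new MajorType(MinorType.INT, DataMode.REQUIRED);
        MajorType varcharOptional = new MajorType(MinorType.VARCHAR, DataMode.OPTIONAL);
        return List.of(
            Map.of("id", intRequired, "name", varcharOptional, "age", intRequired),
            Map.of("id", intRequired));
    }

    public static void main(String[] args) {
        List<Map<String, MajorType>> schemas = peopleSchemas();
        // "salary" is a made-up column that exists in neither file.
        for (String column : List.of("id", "name", "age", "salary")) {
            System.out.println(column + " -> " + resolve(column, schemas));
        }
        // Prints:
        // id -> INT:REQUIRED
        // name -> VARCHAR:OPTIONAL
        // age -> INT:OPTIONAL
        // salary -> INT:OPTIONAL
    }
}
```

With the example table, both failing queries would see a single stable type
across readers: name as VARCHAR:OPTIONAL and age as INT:OPTIONAL.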
--
This message was sent by Atlassian Jira
(v8.20.10#820010)