Yaroslav created DRILL-8508:
-------------------------------
Summary: Choosing the best suitable major type for a partially
missing parquet column
Key: DRILL-8508
URL: https://issues.apache.org/jira/browse/DRILL-8508
Project: Apache Drill
Issue Type: Improvement
Reporter: Yaroslav
Attachments: people.tar.gz
{*}NOTE{*}: This issue requires and assumes DRILL-8507 bug to be fixed first.
Please do not proceed to this one until that issue would be solved.
h3. Prerequisites
If a {{ParquetRecordReader}} doesn't find a selected column, it creates a
null-filled {{NullableIntVector}} with the column's name and the correct value
count set.
h3. Problems
Hardcoding the minor type (INT) leads to SchemaChangeExceptions and type cast
exceptions. Former also happens due to data mode change (REQUIRED -> OPTIONAL).
Consider a {{dfs.tmp.people}} table with such parquet files and their schemas:
{code:java}
/tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> |
age<INT(REQUIRED)>
/tmp/people/1.parquet: id<INT(REQUIRED)>{code}
The following query against that table would fail because of minor type change
(VARCHAR -> INT):
{code:java}
apache drill> SELECT name FROM dfs.tmp.people ORDER BY name;
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External
Sort. Please enable Union type.
Previous schema: BatchSchema [fields=[[`name` (VARCHAR:OPTIONAL)]],
selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`name` (INT:OPTIONAL)]],
selectionVector=NONE]
Fragment: 0:0
[Error Id: 97625816-0a07-410e-87b1-1d461fb8f00d on node2.vmcluster.com:31010]
(state=,code=0)
{code}
And the following query would fail because of data mode change (REQUIRED ->
OPTIONAL):
{code:java}
apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External
Sort. Please enable Union type.
Previous schema: BatchSchema [fields=[[`age` (INT:REQUIRED)]],
selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]],
selectionVector=NONE]
Fragment: 0:0
[Error Id: adce5b82-331c-410d-87f4-c8fc1ba943e6 on node2.vmcluster.com:31010]
(state=,code=0)
{code}
Note that the last query would also fail if we had both parquet files
containing the column, but one would have it as REQURIED and other as OPTIONAL,
such as here:
{code:java}
/tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> |
age<INT(REQUIRED)>
/tmp/people/1.parquet: id<INT(REQUIRED)> | age<INT(OPTIONAL)>
{code}
h3. Solution idea
Note that all of the cases above have this {_}partially missing column{_},
meaning that some of the parquet files in a queried table have the column and
others do not (or have it as OPTIONAL). If none of the files would contain the
column ({_}completely missing{_}), we wouldn't have any chance to guess the
major type except defaulting to INT:OPTIONAL.
But the case with partially missing column is different in that the correct
minor type exists in those parquet files who have the column (and the data mode
is obviously OPTIONAL since we create a null-filled vector). So, in theory, we
could take the minor type from there create a null-filled vector for a missing
column with this type.
The solution idea suggested here is based on the fact that {_}schemas of all
the parquet files to read is available at planning phase in a foreman{_}.
Furthermore, it is already passed to each separate minor fragment (and its
parquet readers) so the only thing left to do is to take the minor type from
there and use it for missing columns. For data mode issue, however, we also
need to catch the partially missing column case and enforce all the readers to
return it as OPTIONAL (even if this particular reader have it as REQUIRED).
*Expected behavior*
So this ticket aims to bring 2 new statements in the Drill's behavior for
parquet missing columns:
# If at least 1 parquet file contains a selected column, then the null-filled
vectors should have its minor type
# If at least 1 parquet file does not have a selected column, or have it as
OPTIONAL, then ALL of the readers should return this column as OPTIONAL
--
This message was sent by Atlassian Jira
(v8.20.10#820010)