Yaroslav created DRILL-8508:
-------------------------------

             Summary: Choosing the best suitable major type for a partially 
missing parquet column
                 Key: DRILL-8508
                 URL: https://issues.apache.org/jira/browse/DRILL-8508
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Yaroslav
         Attachments: people.tar.gz

{*}NOTE{*}: This issue assumes that the DRILL-8507 bug has been fixed first. 
Please do not proceed with this one until that issue is resolved.
h3. Prerequisites

If a {{ParquetRecordReader}} doesn't find a selected column, it creates a 
null-filled {{NullableIntVector}} with the column's name and the correct value 
count set.
h3. Problems

Hardcoding the minor type (INT) leads to SchemaChangeExceptions and type cast 
exceptions. The former also happen because of a data mode change (REQUIRED -> 
OPTIONAL). Consider a {{dfs.tmp.people}} table with the following parquet files 
and schemas:
{code:java}
/tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
age<INT(REQUIRED)>
/tmp/people/1.parquet: id<INT(REQUIRED)>{code}
The following query against that table fails because of a minor type change 
(VARCHAR -> INT):
{code:java}
apache drill> SELECT name FROM dfs.tmp.people ORDER BY name;
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.
Previous schema: BatchSchema [fields=[[`name` (VARCHAR:OPTIONAL)]], 
selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`name` (INT:OPTIONAL)]], 
selectionVector=NONE]
Fragment: 0:0
[Error Id: 97625816-0a07-410e-87b1-1d461fb8f00d on node2.vmcluster.com:31010] 
(state=,code=0)
{code}
 

And the following query fails because of a data mode change (REQUIRED -> 
OPTIONAL):
{code:java}
apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.
Previous schema: BatchSchema [fields=[[`age` (INT:REQUIRED)]], 
selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], 
selectionVector=NONE]
Fragment: 0:0
[Error Id: adce5b82-331c-410d-87f4-c8fc1ba943e6 on node2.vmcluster.com:31010] 
(state=,code=0)
{code}
Note that the last query would also fail if both parquet files contained the 
column, but one had it as REQUIRED and the other as OPTIONAL, as here:
{code:java}
/tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
age<INT(REQUIRED)>
/tmp/people/1.parquet: id<INT(REQUIRED)> | age<INT(OPTIONAL)>
{code}
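The failures above can be reproduced on paper with a small model. The following is a minimal sketch, not Drill code: {{SchemaChangeSketch}}, {{readerType}} and {{typesSeenDownstream}} are hypothetical names, and types are plain strings. It only assumes what is stated in the Prerequisites section, namely that a reader which does not find a column reports it as INT:OPTIONAL.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the current behavior; none of these names exist in Drill.
class SchemaChangeSketch {
  // Type a reader reports today for a column it cannot find in its file.
  static final String DEFAULT_MISSING_TYPE = "INT:OPTIONAL";

  // Type reported by a single reader: the file's own type, or the hardcoded default.
  static String readerType(String column, Map<String, String> fileSchema) {
    return fileSchema.getOrDefault(column, DEFAULT_MISSING_TYPE);
  }

  // Distinct types a downstream operator would see for the column across all files.
  // More than one entry corresponds to a schema change between batches.
  static Set<String> typesSeenDownstream(String column, List<Map<String, String>> files) {
    Set<String> types = new LinkedHashSet<>();
    for (Map<String, String> fileSchema : files) {
      types.add(readerType(column, fileSchema));
    }
    return types;
  }

  // Schemas of /tmp/people/0.parquet and /tmp/people/1.parquet from the first example.
  static List<Map<String, String>> peopleFiles() {
    Map<String, String> file0 = new HashMap<>();
    file0.put("id", "INT:REQUIRED");
    file0.put("name", "VARCHAR:OPTIONAL");
    file0.put("age", "INT:REQUIRED");
    Map<String, String> file1 = new HashMap<>();
    file1.put("id", "INT:REQUIRED");
    return Arrays.asList(file0, file1);
  }

  public static void main(String[] args) {
    for (String column : new String[] {"id", "name", "age"}) {
      Set<String> types = typesSeenDownstream(column, peopleFiles());
      String verdict = types.size() > 1 ? " (schema change)" : "";
      System.out.println(column + " -> " + types + verdict);
    }
  }
}
```

In this model only {{id}} yields a single type. {{name}} differs in minor type and {{age}} differs in data mode, matching the two errors shown above.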
h3. Solution idea

Note that all of the cases above involve a {_}partially missing column{_}, 
meaning that some of the parquet files in a queried table have the column and 
others do not (or have it as OPTIONAL). If none of the files contained the 
column ({_}completely missing{_}), we would have no way to guess the 
major type other than defaulting to INT:OPTIONAL.

But the partially missing column case is different in that the correct 
minor type exists in those parquet files that have the column (and the data mode 
is necessarily OPTIONAL, since we create a null-filled vector). So, in theory, we 
could take the minor type from there and create a null-filled vector of that 
type for the missing column.

The solution idea suggested here is based on the fact that {_}the schemas of all 
the parquet files to be read are available at the planning phase in the foreman{_}. 
Furthermore, they are already passed to each minor fragment (and its 
parquet readers), so the only thing left to do is to take the minor type from 
there and use it for missing columns. For the data mode issue, however, we also 
need to detect the partially missing column case and force all the readers to 
return the column as OPTIONAL (even if a particular reader has it as REQUIRED).

*Expected behavior*

So this ticket aims to introduce 2 new rules into Drill's behavior for 
missing parquet columns:
 # If at least 1 parquet file contains a selected column, then the null-filled 
vectors should have its minor type
 # If at least 1 parquet file does not have a selected column, or has it as 
OPTIONAL, then ALL of the readers should return this column as OPTIONAL
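
The two rules can be sketched as a single resolution step over the per-file schemas. This is only an illustration under stated assumptions, not the proposed patch: {{MissingColumnTypeSketch}}, its nested types and {{resolve}} are hypothetical names rather than Drill classes. The sketch also assumes that files which contain the column agree on its minor type (the first one seen is used), a case this ticket does not discuss.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the expected behavior; none of these names are Drill classes.
class MissingColumnTypeSketch {
  enum MinorType { INT, VARCHAR }
  enum DataMode { REQUIRED, OPTIONAL }

  static final class MajorType {
    final MinorType minorType;
    final DataMode mode;

    MajorType(MinorType minorType, DataMode mode) {
      this.minorType = minorType;
      this.mode = mode;
    }

    @Override
    public String toString() {
      return minorType + ":" + mode;
    }
  }

  // fileSchemas holds one map per parquet file: column name -> type in that file.
  static MajorType resolve(String column, List<Map<String, MajorType>> fileSchemas) {
    MinorType minorType = null;
    boolean optional = false;
    for (Map<String, MajorType> schema : fileSchemas) {
      MajorType type = schema.get(column);
      if (type == null) {
        optional = true;            // rule 2: the column is missing in this file
        continue;
      }
      if (minorType == null) {
        minorType = type.minorType; // rule 1 (assumption: first file seen wins)
      }
      if (type.mode == DataMode.OPTIONAL) {
        optional = true;            // rule 2: some file has the column as OPTIONAL
      }
    }
    if (minorType == null) {
      // Completely missing column: nothing to infer, keep the INT:OPTIONAL default.
      return new MajorType(MinorType.INT, DataMode.OPTIONAL);
    }
    return new MajorType(minorType, optional ? DataMode.OPTIONAL : DataMode.REQUIRED);
  }

  // Schemas of /tmp/people/0.parquet and /tmp/people/1.parquet from the first example.
  static List<Map<String, MajorType>> peopleSchemas() {
    Map<String, MajorType> file0 = new HashMap<>();
    file0.put("id", new MajorType(MinorType.INT, DataMode.REQUIRED));
    file0.put("name", new MajorType(MinorType.VARCHAR, DataMode.OPTIONAL));
    file0.put("age", new MajorType(MinorType.INT, DataMode.REQUIRED));
    Map<String, MajorType> file1 = new HashMap<>();
    file1.put("id", new MajorType(MinorType.INT, DataMode.REQUIRED));
    return Arrays.asList(file0, file1);
  }

  public static void main(String[] args) {
    for (String column : new String[] {"id", "name", "age", "salary"}) {
      System.out.println(column + " -> " + resolve(column, peopleSchemas()));
    }
  }
}
```

For the {{dfs.tmp.people}} example this gives VARCHAR:OPTIONAL for {{name}} and INT:OPTIONAL for {{age}} in every reader, while {{id}} stays INT:REQUIRED. A column found in no file ({{salary}} here, a made-up name) still falls back to INT:OPTIONAL.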

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
