[ 
https://issues.apache.org/jira/browse/DRILL-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maksym Rymar updated DRILL-8508:
--------------------------------
    Reviewer: Maksym Rymar  (was: Maksym Rymar)

> Choosing the best suitable major type for a partially missing parquet column
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-8508
>                 URL: https://issues.apache.org/jira/browse/DRILL-8508
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Yaroslav
>            Priority: Major
>         Attachments: people.tar.gz
>
>
> {*}NOTE{*}: This issue requires and assumes DRILL-8507 bug to be fixed first. 
> Please do not proceed to this one until that issue would be solved.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a 
> null-filled {{NullableIntVector}} with the column's name and the correct 
> value count set.
> h3. Problems
> Hardcoding the minor type (INT) leads to SchemaChangeExceptions and type cast 
> exceptions. Former also happens due to data mode change (REQUIRED -> 
> OPTIONAL). Consider a {{dfs.tmp.people}} table with such parquet files and 
> their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
> age<INT(REQUIRED)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> The following query against that table would fail because of minor type 
> change (VARCHAR -> INT):
> {code:java}
> apache drill> SELECT name FROM dfs.tmp.people ORDER BY name;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`name` (VARCHAR:OPTIONAL)]], 
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`name` (INT:OPTIONAL)]], 
> selectionVector=NONE]
> Fragment: 0:0
> [Error Id: 97625816-0a07-410e-87b1-1d461fb8f00d on node2.vmcluster.com:31010] 
> (state=,code=0)
> {code}
> And the following query would fail because of data mode change (REQUIRED -> 
> OPTIONAL):
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:REQUIRED)]], 
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], 
> selectionVector=NONE]
> Fragment: 0:0
> [Error Id: adce5b82-331c-410d-87f4-c8fc1ba943e6 on node2.vmcluster.com:31010] 
> (state=,code=0)
> {code}
> Note that the last query would also fail if we had both parquet files 
> containing the column, but one would have it as REQURIED and other as 
> OPTIONAL, such as here:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
> age<INT(REQUIRED)>
> /tmp/people/1.parquet: id<INT(REQUIRED)> | age<INT(OPTIONAL)>
> {code}
> h3. Solution idea
> Note that all of the cases above have this {_}partially missing column{_}, 
> meaning that some of the parquet files in a queried table have the column and 
> others do not (or have it as OPTIONAL). If none of the files would contain 
> the column ({_}completely missing{_}), we wouldn't have any chance to guess 
> the major type except defaulting to INT:OPTIONAL.
> But the case with partially missing column is different in that the correct 
> minor type exists in those parquet files who have the column (and the data 
> mode is obviously OPTIONAL since we create a null-filled vector). So, in 
> theory, we could take the minor type from there and create a null-filled 
> vector for a missing column with this type.
> The solution idea suggested here is based on the fact that {_}schemas of all 
> the parquet files to read is available at planning phase in a foreman{_}. 
> Furthermore, it is already passed to each separate minor fragment (and its 
> parquet readers) so the only thing left to do is to take the minor type from 
> there and use it for missing columns. For data mode issue, however, we also 
> need to catch the partially missing column case and enforce all the readers 
> to return it as OPTIONAL (even if this particular reader have it as REQUIRED).
> *Expected behavior*
> So this ticket aims to bring 2 new statements in the Drill's behavior for 
> parquet missing columns:
>  # If at least 1 parquet file contains a selected column, then the 
> null-filled vectors should have its minor type
>  # If at least 1 parquet file does not have a selected column, or have it as 
> OPTIONAL, then ALL of the readers should  return this column as OPTIONAL
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to