[jira] [Updated] (DRILL-8507) Missing parquet columns quoted with backticks conflict with existing ones

Yaroslav (Jira) Wed, 28 Aug 2024 08:08:14 -0700


     [ 
https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yaroslav updated DRILL-8507:
----------------------------
    Description: 
h3. Prerequisites

If a {{ParquetRecordReader}} doesn't find a selected column, it creates a 
null-filled {{NullableIntVector}} with the column's name and the correct value 
count set. The field name for the vector is derived from {{SchemaPath#toExpr}} 
method, which always enquotes the outcome string with backticks.
h3. Problems

This causes some wrong field name equality checks (comparing two strings of 
field names, non-quoted and quoted, returns false, but essentially is supposed 
to return true) that lead to some errors.

For example, the errors occur when you select a column from a table where some 
parquet files contain it and some do not. Consider a {{dfs.tmp.people}} table 
with such parquet files and their schemas:
{code:java}
/tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
age<INT(OPTIONAL)>
/tmp/people/1.parquet: id<INT(REQUIRED)>{code}
Now let's try to use an operator that doesn't support schema change. For 
example, {{{}ORDER BY{}}}:
{code:java}
apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.
Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], 
selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` 
(INT:OPTIONAL)]], selectionVector=NONE]
Fragment: 0:0
[Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] 
(state=,code=0)

{code}
{{ORDER BY }}error clearly shows us that{{ ``age``}} is an extra column here 
and the incoming schema should only have the unquoted field to match the 
previous schema.

Another example is in {{UNION ALL}} operator:
{code:java}
apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch 
have different field counthas!
Fragment: 0:0
Please, refer to logs for more information.
[Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] 
(state=,code=0)

{code}
Again, "different field counts" issue is caused by an extra quoted column that 
counts as different field.
h3. Solution

The solution for these errors would be to replace {{SchemaPath#toExpr}} call 
with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't enquote the outcome 
string. Simply enough, but note that we used to use 
{{SchemaPath#getAsUnescapedPath}} before DRILL-4264  where we switched to 
{{{}SchemaPath#toExpr{}}}. The author even put a comment:
{code:java}
// col.toExpr() is used here as field name since we don't want to see these 
fields in the existing maps{code}
So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. But, 
honestly, I don't really understand the motivation for this, even with the 
comment. I don't really understand what "existing maps" are here. Maybe someone 
from the community can help here, but I will further consider it as a mistake, 
simply because it causes the above problems.
h3. Regressions

The change brings some regressions detected by unit tests. Those I found fail 
because they changed their execution flow after the fix in these places:
 * FieldIdUtil#getFieldId
 * ScanBatch$Mutator#addField

Before the fix, the field name equality checks returned false from comparing 
quoted and unquoted field names. The failing tests relied on that behavior and 
worked only with this condition. But after the fix, when we we compare two 
quoted names and return true, we fall to a different branch and the tests 
aren't ready for that.

But obviously the change _is fixing the problem_ so the tests that relied on 
that problem should now be adjusted.

Please see more technical details on each failed unit test in the linked PR.

 
 
 
 

 

  was:
h3. Prerequisites

If a {{ParquetRecordReader}} doesn't find a selected column, it creates a 
null-filled {{NullableIntVector}} with the column's name and the correct value 
count set. The field name for the vector is derived from {{SchemaPath#toExpr}} 
method, which always enquotes the outcome string with backticks.
h3. Problems

This causes some wrong field name equality checks (comparing two strings of 
field names, non-quoted and quoted, returns false, but essentially is supposed 
to return true) that lead to some errors.

For example, the errors occur when you select a column from a table where some 
parquet files contain it and some do not. Consider a {{dfs.tmp.people}} table 
with such parquet files and their schemas:
{code:java}
/tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
age<INT(OPTIONAL)>
/tmp/people/1.parquet: id<INT(REQUIRED)>{code}
Now let's try to use an operator that doesn't support schema change. For 
example, {{{}ORDER BY{}}}:
{code:java}
apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.
Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], 
selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` 
(INT:OPTIONAL)]], selectionVector=NONE]
Fragment: 0:0
[Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] 
(state=,code=0)

{code}
{{ORDER BY error clearly shows us that ``age``}} is an extra column here and 
the incoming schema should only have the unquoted field to match the previous 
schema.

Another example is in {{UNION ALL}} operator:
{code:java}
apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch 
have different field counthas!
Fragment: 0:0
Please, refer to logs for more information.
[Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] 
(state=,code=0)

{code}
Again, "different field counts" issue is caused by an extra quoted column that 
counts as different field.
h3. Solution

The solution for these errors would be to replace {{SchemaPath#toExpr}} call 
with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't enquote the outcome 
string. Simply enough, but note that we used to use 
{{SchemaPath#getAsUnescapedPath}} before DRILL-4264  where we switched to 
{{{}SchemaPath#toExpr{}}}. The author even put a comment:
{code:java}
// col.toExpr() is used here as field name since we don't want to see these 
fields in the existing maps{code}
So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. But, 
honestly, I don't really understand the motivation for this, even with the 
comment. I don't really understand what "existing maps" are here. Maybe someone 
from the community can help here, but I will further consider it as a mistake, 
simply because it causes the above problems.
h3. Regressions

The change brings some regressions detected by unit tests. Those I found fail 
because they changed their execution flow after the fix in these places:
 * FieldIdUtil#getFieldId
 * ScanBatch$Mutator#addField

Before the fix, the field name equality checks returned false from comparing 
quoted and unquoted field names. The failing tests relied on that behavior and 
worked only with this condition. But after the fix, when we we compare two 
quoted names and return true, we fall to a different branch and the tests 
aren't ready for that.

But obviously the change _is fixing the problem_ so the tests that relied on 
that problem should now be adjusted.

Please see more technical details on each failed unit test in the linked PR.

 
 
 
 

 


> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
>                 Key: DRILL-8507
>                 URL: https://issues.apache.org/jira/browse/DRILL-8507
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Yaroslav
>            Priority: Major
>         Attachments: people.tar.gz
>
>
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a 
> null-filled {{NullableIntVector}} with the column's name and the correct 
> value count set. The field name for the vector is derived from 
> {{SchemaPath#toExpr}} method, which always enquotes the outcome string with 
> backticks.
> h3. Problems
> This causes some wrong field name equality checks (comparing two strings of 
> field names, non-quoted and quoted, returns false, but essentially is 
> supposed to return true) that lead to some errors.
> For example, the errors occur when you select a column from a table where 
> some parquet files contain it and some do not. Consider a {{dfs.tmp.people}} 
> table with such parquet files and their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
> age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change. For 
> example, {{{}ORDER BY{}}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], 
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` 
> (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] 
> (state=,code=0)
> {code}
> {{ORDER BY }}error clearly shows us that{{ ``age``}} is an extra column here 
> and the incoming schema should only have the unquoted field to match the 
> previous schema.
> Another example is in {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch 
> have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] 
> (state=,code=0)
> {code}
> Again, "different field counts" issue is caused by an extra quoted column 
> that counts as different field.
> h3. Solution
> The solution for these errors would be to replace {{SchemaPath#toExpr}} call 
> with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't enquote the outcome 
> string. Simply enough, but note that we used to use 
> {{SchemaPath#getAsUnescapedPath}} before DRILL-4264  where we switched to 
> {{{}SchemaPath#toExpr{}}}. The author even put a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these 
> fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. 
> But, honestly, I don't really understand the motivation for this, even with 
> the comment. I don't really understand what "existing maps" are here. Maybe 
> someone from the community can help here, but I will further consider it as a 
> mistake, simply because it causes the above problems.
> h3. Regressions
> The change brings some regressions detected by unit tests. Those I found fail 
> because they changed their execution flow after the fix in these places:
>  * FieldIdUtil#getFieldId
>  * ScanBatch$Mutator#addField
> Before the fix, the field name equality checks returned false from comparing 
> quoted and unquoted field names. The failing tests relied on that behavior 
> and worked only with this condition. But after the fix, when we we compare 
> two quoted names and return true, we fall to a different branch and the tests 
> aren't ready for that.
> But obviously the change _is fixing the problem_ so the tests that relied on 
> that problem should now be adjusted.
> Please see more technical details on each failed unit test in the linked PR.
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (DRILL-8507) Missing parquet columns quoted with backticks conflict with existing ones

Reply via email to