[jira] [Commented] (DRILL-8507) Missing parquet columns quoted with backticks conflict with existing ones

ASF GitHub Bot (Jira) Fri, 30 Aug 2024 04:22:04 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878064#comment-17878064
 ]


ASF GitHub Bot commented on DRILL-8507:
---------------------------------------

ychernysh commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2320908961

   Hi @paul-rogers ,
   Thanks for your comments! Let me clarify some points here:
   > Given this, I don't think the problem is with the column name (which is 
what that referenced code change handles).
   
   The thing is that regardless of whether we did or did not succeed with 
"guessing" the major type for the missing column, the solution won't work until 
we solve the backticks problem, because essentially instead of creating _the 
missing column_ (unquoted), we currently create _a brand new column_ (quoted) 
which has nothing to do with the "real" missing one. Please see my `ORDER BY` 
example in [DRILL-8507](https://issues.apache.org/jira/browse/DRILL-8507) and 
the error from there:
   ```
   Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.
   Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], 
selectionVector=NONE]
   Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` 
(INT:OPTIONAL)]], selectionVector=NONE]
   ```
   Note that all the major types here are guessed correctly (`INT:OPTIONAL`), 
but we fail because of different field counts. This happens because in the 
incoming schema we have 2 fields: quoted and unquoted, while it is supposed to 
have only one (unquoted, based on previous schema).
   This is the reason I reported the issue as 2 separate jiras and divided the 
PR into separate commits.
   So before proceeding to any type guessing logic, we have to get rid of the 
quoting problem first.
   > Also, note that this fix works ONLY in one direction (column appears, then 
disappears), and ONLY within a single thread: it can't solve the same problem 
if the two files are read in different threads and sent to the SORT to 
reconcile.
   
   Thanks to [AbstractParquetRowGroupScan's 
schema](https://github.com/apache/drill/blob/5a3b1785c119a5e3986b6cb269a48ba76db94a3b/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetRowGroupScan.java#L46),
 which this (DRILL-8508) solution is based on, all readers in all fragments on 
all machines are aware about the overall (constructed from all parquet files to 
read) table schema. Foreman scans the footers of all the files and merges a 
single schema from them back in the planning phase and then sends it to each 
fragment. This is exactly what gives us the opportunity to "see the future" at 
reading phase, because all the metadata (even for future files) is already 
available and merged into a single schema. 
   With that, we are not just _propagating the old column type_, but we are 
propagating _the certainly correct_ column type. And we **are sure** the type 
is correct because it was resolved back in the planner.
   So, in fact, the fix works in ALL directions and within ALL threads and 
machines.
   
   > Further, we are changing the mode to OPTIONAL as required so we can fill 
the vector with NULL values. However, change of mode (i.e. nullability) is a 
schema change and will cause the SORT to fail. We have to have known, on the 
previous file, that the column will be missing in this file, so that we can 
create the original column as OPTIONAL.
   
   Exactly, and that is why I added this _enforcing OPTIONAL_ logic. That is, 
even if the particular parquet reader is going to read REQUIRED parquet column, 
we enforce it to put it in NULLABLE value vector to get consistent schema with 
missing column in some file. Note that we are able to so, because _we know at 
the reading phase that the column is partially missing_ thanks to the 
aforementioned schema propagated to all readers.




> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
>                 Key: DRILL-8507
>                 URL: https://issues.apache.org/jira/browse/DRILL-8507
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.21.2
>            Reporter: Yaroslav
>            Priority: Major
>         Attachments: people.tar.gz
>
>
> {*}NOTE{*}: I worked on this issue along with DRILL-8508. It turned out that 
> it required this bug to get fixed first. And since these 2 are about a little 
> bit different things it was decided to report them as separate issues. So, 
> I'm going to link this issue as a requirement for that issue and open one PR 
> for both (if it's allowed..). I think single PR would make it easier to 
> review the code since the issues are quite related anyway.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a 
> null-filled {{NullableIntVector}} with the column's name and the correct 
> value count set. The field name for the vector is derived from 
> {{SchemaPath#toExpr}} method, which always enquotes the outcome string with 
> backticks.
> h3. Problems
> This causes some wrong field name equality checks (comparing two strings of 
> field names, non-quoted and quoted, returns false, but essentially is 
> supposed to return true) that lead to some errors.
> For example, the errors occur when you select a column from a table where 
> some parquet files contain it and some do not. Consider a {{dfs.tmp.people}} 
> table with such parquet files and their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
> age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change. For 
> example, {{{}ORDER BY{}}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], 
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` 
> (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] 
> (state=,code=0)
> {code}
> ORDER BY error clearly shows us that ``age`` is an extra column here and the 
> incoming schema should only have the unquoted field to match the previous 
> schema.
> Another example is in {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch 
> have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] 
> (state=,code=0)
> {code}
> Again, "different field counts" issue is caused by an extra quoted column 
> that counts as different field.
> h3. Solution
> The solution for these errors would be to replace {{SchemaPath#toExpr}} call 
> with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't enquote the outcome 
> string. Simply enough, but note that we used to use 
> {{SchemaPath#getAsUnescapedPath}} before DRILL-4264  where we switched to 
> {{{}SchemaPath#toExpr{}}}. The author even put a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these 
> fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. 
> But, honestly, I don't really understand the motivation for this, even with 
> the comment. I don't really understand what "existing maps" are here. Maybe 
> someone from the community can help here, but I will further consider it as a 
> mistake, simply because it causes the above problems.
> h3. Regressions
> The change brings some regressions detected by unit tests. Those I found fail 
> because they changed their execution flow after the fix in these places:
>  * 
> [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
>  * 
> [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
> Before the fix, the field name equality checks returned false from comparing 
> quoted and unquoted field names. The failing tests relied on that behavior 
> and worked only with this condition. But after the fix, when we compare two 
> quoted names and return true, we fall to a different branch and the tests 
> aren't ready for that.
> But obviously the change _is fixing the problem_ so the tests that relied on 
> that problem should now be adjusted.
> Please see more technical details on each failed unit test in the linked PR.
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (DRILL-8507) Missing parquet columns quoted with backticks conflict with existing ones

Reply via email to