[
https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879726#comment-17879726
]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------
paul-rogers commented on code in PR #2937:
URL: https://github.com/apache/drill/pull/2937#discussion_r1746149745
##########
exec/java-exec/src/test/java/org/apache/drill/TestParquetPartiallyMissingColumns.java:
##########
@@ -0,0 +1,128 @@
+package org.apache.drill;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.BatchSchemaBuilder;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.ClusterTest;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.nio.file.Paths;
+
+/**
+ * Covers querying a table in which some parquet files do contain selected
columns, and
+ * others do not (or have them as OPTIONALs).
+ *
+ * Expected behavior for the missing columns is following:
+ * 1) If at least 1 parquet file to be read has the column, take the minor
type from there.
+ * Otherwise, default to INT.
+ * 2) If at least 1 parquet file to be read doesn't have the column, or has it
as OPTIONAL,
+ * enforce the overall scan output schema to have it as OPTIONAL
+ *
+ * We need to control ordering of scanning batches to cover different
erroneous cases, and we assume
+ * that parquet files in a table would be read in alphabetic order (not a real
use case though). So
+ * we name our files 0.parquet and 1.parquet expecting that they would be
scanned in that order
+ * (not guaranteed though, but seems to work). We use such tables for such
scenarios:
+ *
+ * - parquet/partially_missing/o_m
> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
> Key: DRILL-8507
> URL: https://issues.apache.org/jira/browse/DRILL-8507
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
>
> {*}NOTE{*}: I worked on this issue along with DRILL-8508. It turned out that
> it required this bug to get fixed first. And since these 2 are about a little
> bit different things it was decided to report them as separate issues. So,
> I'm going to link this issue as a requirement for that issue and open one PR
> for both (if it's allowed..). I think single PR would make it easier to
> review the code since the issues are quite related anyway.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a
> null-filled {{NullableIntVector}} with the column's name and the correct
> value count set. The field name for the vector is derived from
> {{SchemaPath#toExpr}} method, which always enquotes the outcome string with
> backticks.
> h3. Problems
> This causes some wrong field name equality checks (comparing two strings of
> field names, non-quoted and quoted, returns false, but essentially is
> supposed to return true) that lead to some errors.
> For example, the errors occur when you select a column from a table where
> some parquet files contain it and some do not. Consider a {{dfs.tmp.people}}
> table with such parquet files and their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> |
> age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change. For
> example, {{{}ORDER BY{}}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]],
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age``
> (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> ORDER BY error clearly shows us that ``age`` is an extra column here and the
> incoming schema should only have the unquoted field to match the previous
> schema.
> Another example is in {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch
> have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> Again, "different field counts" issue is caused by an extra quoted column
> that counts as different field.
> h3. Solution
> The solution for these errors would be to replace {{SchemaPath#toExpr}} call
> with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't enquote the outcome
> string. Simply enough, but note that we used to use
> {{SchemaPath#getAsUnescapedPath}} before DRILL-4264 where we switched to
> {{{}SchemaPath#toExpr{}}}. The author even put a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these
> fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision.
> But, honestly, I don't really understand the motivation for this, even with
> the comment. I don't really understand what "existing maps" are here. Maybe
> someone from the community can help here, but I will further consider it as a
> mistake, simply because it causes the above problems.
> h3. Regressions
> The change brings some regressions detected by unit tests. Those I found fail
> because they changed their execution flow after the fix in these places:
> *
> [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> *
> [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
> Before the fix, the field name equality checks returned false from comparing
> quoted and unquoted field names. The failing tests relied on that behavior
> and worked only with this condition. But after the fix, when we compare two
> quoted names and return true, we fall to a different branch and the tests
> aren't ready for that.
> But obviously the change _is fixing the problem_ so the tests that relied on
> that problem should now be adjusted.
> Please see more technical details on each failed unit test in the linked PR.
>
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)