[
https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878640#comment-17878640
]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------
ychernysh commented on code in PR #2937:
URL: https://github.com/apache/drill/pull/2937#discussion_r1741155460
##########
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetTableMetadataUtils.java:
##########
@@ -661,6 +663,12 @@ static Map<SchemaPath, TypeProtos.MajorType>
resolveFields(MetadataBase.ParquetT
// row groups in the file have the same schema, so using the first one
Map<SchemaPath, TypeProtos.MajorType> fileColumns =
getFileFields(parquetTableMetadata, file);
fileColumns.forEach((columnPath, type) -> putType(columns, columnPath,
type));
+ // If at least 1 parquet file to read doesn't contain a column, enforce
this column
+ // DataMode to OPTIONAL in the overall table schema
Review Comment:
The first item is about resolving different data types even if there are no
missing columns, which I didn't cover.
`but only if the other types are REQUIRED` - is this condition necessary?
Regarding REPEATED - I haven't covered it in any way.
In theory, implementing these should not be that hard..l
> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
> Key: DRILL-8507
> URL: https://issues.apache.org/jira/browse/DRILL-8507
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
>
> {*}NOTE{*}: I worked on this issue along with DRILL-8508. It turned out that
> it required this bug to get fixed first. And since these 2 are about a little
> bit different things it was decided to report them as separate issues. So,
> I'm going to link this issue as a requirement for that issue and open one PR
> for both (if it's allowed..). I think single PR would make it easier to
> review the code since the issues are quite related anyway.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a
> null-filled {{NullableIntVector}} with the column's name and the correct
> value count set. The field name for the vector is derived from
> {{SchemaPath#toExpr}} method, which always enquotes the outcome string with
> backticks.
> h3. Problems
> This causes some wrong field name equality checks (comparing two strings of
> field names, non-quoted and quoted, returns false, but essentially is
> supposed to return true) that lead to some errors.
> For example, the errors occur when you select a column from a table where
> some parquet files contain it and some do not. Consider a {{dfs.tmp.people}}
> table with such parquet files and their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> |
> age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change. For
> example, {{{}ORDER BY{}}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External
> Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]],
> selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age``
> (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> ORDER BY error clearly shows us that ``age`` is an extra column here and the
> incoming schema should only have the unquoted field to match the previous
> schema.
> Another example is in {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch
> have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010]
> (state=,code=0)
> {code}
> Again, "different field counts" issue is caused by an extra quoted column
> that counts as different field.
> h3. Solution
> The solution for these errors would be to replace {{SchemaPath#toExpr}} call
> with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't enquote the outcome
> string. Simply enough, but note that we used to use
> {{SchemaPath#getAsUnescapedPath}} before DRILL-4264 where we switched to
> {{{}SchemaPath#toExpr{}}}. The author even put a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these
> fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision.
> But, honestly, I don't really understand the motivation for this, even with
> the comment. I don't really understand what "existing maps" are here. Maybe
> someone from the community can help here, but I will further consider it as a
> mistake, simply because it causes the above problems.
> h3. Regressions
> The change brings some regressions detected by unit tests. Those I found fail
> because they changed their execution flow after the fix in these places:
> *
> [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> *
> [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
> Before the fix, the field name equality checks returned false from comparing
> quoted and unquoted field names. The failing tests relied on that behavior
> and worked only with this condition. But after the fix, when we compare two
> quoted names and return true, we fall to a different branch and the tests
> aren't ready for that.
> But obviously the change _is fixing the problem_ so the tests that relied on
> that problem should now be adjusted.
> Please see more technical details on each failed unit test in the linked PR.
>
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)