KazydubB commented on a change in pull request #1954: DRILL-7509: Incorrect
TupleSchema is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366897908
##########
File path:
metastore/metastore-api/src/main/java/org/apache/drill/metastore/util/SchemaPathUtils.java
##########
@@ -63,6 +64,30 @@ public static ColumnMetadata getColumnMetadata(SchemaPath
schemaPath, TupleMetad
return colMetadata;
}
+ /**
+ * Checks if field indetified by the schema path is child in either {@code
DICT} or {@code REPEATED MAP}.
+ * For such fields, nested in {@code DICT} or {@code REPEATED MAP},
+ * filters can't be removed based on Parquet statistics.
+ * @param schemaPath schema path used in filter
+ * @param schema schema containing all the fields in the file
+ * @return {@literal true} if field is nested inside {@code DICT} (is {@code
`key`} or {@code `value`})
+ * or inside {@code REPEATED MAP} field, {@literal false} otherwise.
+ */
+ public static boolean isFieldNestedInDictOrRepeatedMap(SchemaPath
schemaPath, TupleMetadata schema) {
Review comment:
This method checks whether a schema path in a filter, e.g. `...
WHERE mapcol['a'] IS NULL`, references a `DICT`'s `value` (accessed by some
key). If the `value` is `OPTIONAL INT` (which is also the default type for an
absent column), statistics are first looked up for the 'field' identified by
the schema path. When the `value` is retrieved by key, as in the example above,
the schema path is `` `mapcol`.`a` ``, which is not present in the statistics
(statistics do exist for the value itself, whose schema path is ``
`mapcol`.`map`.`value` ``). The field is therefore treated as an absent column,
and every row is considered to match the filter.
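
A minimal sketch of the lookup miss described above. Everything here is
hypothetical and only for illustration: `StatsLookupSketch`, the plain `Map`
keyed by path strings, and the null counts are stand-ins, not Drill's actual
metadata classes or API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for per-column Parquet statistics keyed by column path.
final class StatsLookupSketch {

  // Statistics exist only for the physical leaf columns of the DICT.
  static Map<String, Long> sampleStatistics() {
    Map<String, Long> nullCounts = new HashMap<>();
    nullCounts.put("`mapcol`.`map`.`key`", 0L);
    nullCounts.put("`mapcol`.`map`.`value`", 0L);
    return nullCounts;
  }

  static Optional<Long> nullCount(Map<String, Long> stats, String path) {
    return Optional.ofNullable(stats.get(path));
  }

  // Naive rule (assumption for this sketch): no statistics for the path means
  // the column is absent, so IS NULL is taken to match every row.
  static boolean naiveIsNullMatchesAllRows(Map<String, Long> stats, String filterPath) {
    return !nullCount(stats, filterPath).isPresent();
  }

  public static void main(String[] args) {
    Map<String, Long> stats = sampleStatistics();
    // Path produced by the filter mapcol['a'] IS NULL: no statistics entry.
    System.out.println(naiveIsNullMatchesAllRows(stats, "`mapcol`.`a`"));
    // Physical path of the DICT value: statistics are present.
    System.out.println(naiveIsNullMatchesAllRows(stats, "`mapcol`.`map`.`value`"));
  }
}
```

The point of the sketch is only that the path built from the filter and the
path under which statistics are stored differ, so a lookup by the former finds
nothing.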
While making the changes to resolve correct metadata for the `DICT`, I saw
that the `DataMode` of the `key` and `value` fields was always `REPEATED`. This
was because the data mode was determined from Parquet's max `repetition` and
`definition` level values (see changes in `ParquetTableMetadataUtils.java`),
computed for the whole schema path down to the leaf field. The algorithm was:
if `repetition >= 1`, then the field is `REPEATED`. This means that if there is
at least one `REPEATED` member in the schema, each of its children is going to
be `REPEATED`. A `DICT` has the nested `repeated group` mentioned above, so its
`key` and `value` ended up `REPEATED`. To retain the original data mode,
Parquet's `Type.Repetition` was added to column metadata v4 and is used
instead. This filtering worked before for the `DICT` because the `value` was
`REPEATED` and statistics were not used for `REPEATED` fields. Now that the
data mode is retained, this case needs to be handled explicitly.
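
To illustrate the difference between the two ways of deriving the data mode,
here is a rough sketch. The enums and method names are hypothetical stand-ins
(the `Repetition` enum only mirrors the constant names of Parquet's
`Type.Repetition`), the `definition`-level branch is an assumption, and the
level values in `main` are illustrative for an `optional` value nested in a
`repeated group`.

```java
// Hypothetical sketch; not the code from ParquetTableMetadataUtils.java.
final class DataModeSketch {

  enum DataMode { REQUIRED, OPTIONAL, REPEATED }

  // Mirrors the constant names of Parquet's Type.Repetition.
  enum Repetition { REQUIRED, OPTIONAL, REPEATED }

  // Level-based rule: levels are maxima over the whole path down to the leaf,
  // so one repeated ancestor makes every descendant look REPEATED.
  static DataMode fromLevels(int maxRepetitionLevel, int maxDefinitionLevel) {
    if (maxRepetitionLevel >= 1) {
      return DataMode.REPEATED;
    }
    // Assumption: definition level 0 means REQUIRED, anything else OPTIONAL.
    return maxDefinitionLevel == 0 ? DataMode.REQUIRED : DataMode.OPTIONAL;
  }

  // Repetition-based rule: uses the field's own declared repetition only.
  static DataMode fromRepetition(Repetition repetition) {
    switch (repetition) {
      case REQUIRED:
        return DataMode.REQUIRED;
      case OPTIONAL:
        return DataMode.OPTIONAL;
      case REPEATED:
        return DataMode.REPEATED;
      default:
        throw new IllegalArgumentException("Unknown repetition: " + repetition);
    }
  }

  public static void main(String[] args) {
    // DICT value declared as optional, nested inside a repeated group.
    System.out.println(fromLevels(1, 3));
    System.out.println(fromRepetition(Repetition.OPTIONAL));
  }
}
```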
`REPEATED MAP` is included because its fields were also determined to be
`REPEATED`, so including it preserves the previous behavior. Currently,
however, this method is likely to come into play if the `REPEATED MAP` contains
an `OPTIONAL INT`. For a repeated map, its statistics are not found because the
filter uses the simple name, e.g. `` `struct_array`.`a` `` (note that indexes
are omitted, as they are not retained in the Parquet schema), whereas the
_actual_ structure is `` `struct_array`.`bag`.`array_element`.`a` ``.
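
For reference, the kind of check discussed in this comment could look roughly
like the sketch below. It is not the implementation from this PR: `Kind`,
`Column`, and the `List<String>` path are hypothetical simplifications of
`TupleMetadata` and `SchemaPath`, chosen only to keep the example
self-contained.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a schema tree.
final class NestedCheckSketch {

  enum Kind { PRIMITIVE, MAP, REPEATED_MAP, DICT }

  static final class Column {
    final Kind kind;
    final Map<String, Column> children = new HashMap<>();

    Column(Kind kind) {
      this.kind = kind;
    }

    Column child(String name, Column column) {
      children.put(name, column);
      return this;
    }
  }

  // Returns true if any ancestor of the last path segment is a DICT or a
  // REPEATED MAP; the last segment itself is not inspected.
  static boolean isNestedInDictOrRepeatedMap(List<String> path, Map<String, Column> schema) {
    Map<String, Column> level = schema;
    for (int i = 0; i < path.size() - 1; i++) {
      Column column = level.get(path.get(i));
      if (column == null) {
        return false; // unknown ancestor, nothing to conclude
      }
      if (column.kind == Kind.DICT || column.kind == Kind.REPEATED_MAP) {
        return true;
      }
      level = column.children;
    }
    return false;
  }

  // Sample schema: a DICT, a REPEATED MAP, a plain MAP and a top-level column.
  static Map<String, Column> sampleSchema() {
    Map<String, Column> schema = new HashMap<>();
    schema.put("mapcol", new Column(Kind.DICT));
    schema.put("struct_array",
        new Column(Kind.REPEATED_MAP).child("a", new Column(Kind.PRIMITIVE)));
    schema.put("plain", new Column(Kind.MAP).child("x", new Column(Kind.PRIMITIVE)));
    schema.put("id", new Column(Kind.PRIMITIVE));
    return schema;
  }

  public static void main(String[] args) {
    Map<String, Column> schema = sampleSchema();
    System.out.println(isNestedInDictOrRepeatedMap(Arrays.asList("mapcol", "a"), schema));
    System.out.println(isNestedInDictOrRepeatedMap(Arrays.asList("plain", "x"), schema));
  }
}
```

A caller would skip statistics-based filter pruning whenever such a check
returns `true`, since the path used in the filter does not match the path under
which the statistics are stored.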