[GitHub] [drill] KazydubB commented on a change in pull request #1954: DRILL-7509: Incorrect TupleSchema is created for DICT column when querying Parquet files

GitBox Wed, 15 Jan 2020 06:16:58 -0800

KazydubB commented on a change in pull request #1954: DRILL-7509: Incorrect 
TupleSchema is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366897908


 ##########
 File path: metastore/metastore-api/src/main/java/org/apache/drill/metastore/util/SchemaPathUtils.java
 ##########
 @@ -63,6 +64,30 @@ public static ColumnMetadata getColumnMetadata(SchemaPath schemaPath, TupleMetad
     return colMetadata;
   }
 
+  /**
+   * Checks if field indetified by the schema path is child in either {@code DICT} or {@code REPEATED MAP}.
+   * For such fields, nested in {@code DICT} or {@code REPEATED MAP},
+   * filters can't be removed based on Parquet statistics.
+   * @param schemaPath schema path used in filter
+   * @param schema schema containing all the fields in the file
+   * @return {@literal true} if field is nested inside {@code DICT} (is {@code `key`} or {@code `value`})
+   *         or inside {@code REPEATED MAP} field, {@literal false} otherwise.
+   */
+  public static boolean isFieldNestedInDictOrRepeatedMap(SchemaPath schemaPath, TupleMetadata schema) {
 
 Review comment:
   This method checks whether a schema path used in a filter, e.g.
`... WHERE mapcol['a'] IS NULL`, references a `DICT`'s `value` (accessed by
some key). If the `value` is `OPTIONAL INT` (which is also the default type for
an absent column), statistics are first looked up for the 'field' identified by
the schema path. When the `value` is retrieved by key, as in the example above,
the schema path is `` `mapcol`.`a` ``, which is not present in the statistics
(there are statistics for the value itself, but under the schema path
`` `mapcol`.`map`.`value` ``). The path is therefore treated as an absent
column, resulting in every row matching the filter.
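
   To make the lookup miss concrete, here is a minimal sketch. It is not Drill's actual code: `StatsLookupSketch`, `ColumnStats` and `allRowsMatchIsNull` are hypothetical names, statistics are keyed by plain path strings, and the counts are made-up illustration values. It only shows why a path that is missing from the statistics ends up matching every row for `IS NULL`.

```java
import java.util.HashMap;
import java.util.Map;

public class StatsLookupSketch {
  // Hypothetical stand-in for per-column statistics; only null and row counts matter here.
  static final class ColumnStats {
    final long nullCount;
    final long rowCount;

    ColumnStats(long nullCount, long rowCount) {
      this.nullCount = nullCount;
      this.rowCount = rowCount;
    }
  }

  // Returns true when, judging by statistics alone, every row satisfies `path IS NULL`.
  static boolean allRowsMatchIsNull(Map<String, ColumnStats> statsByPath, String path) {
    ColumnStats stats = statsByPath.get(path);
    if (stats == null) {
      // No statistics under this path: it is treated as an absent (all-null) column.
      return true;
    }
    return stats.nullCount == stats.rowCount;
  }

  public static void main(String[] args) {
    Map<String, ColumnStats> statsByPath = new HashMap<>();
    // Statistics exist only under the physical Parquet paths (illustrative counts).
    statsByPath.put("`mapcol`.`map`.`key`", new ColumnStats(0, 100));
    statsByPath.put("`mapcol`.`map`.`value`", new ColumnStats(10, 100));

    // The path taken from the filter is not a key in the map, so it looks absent.
    System.out.println(allRowsMatchIsNull(statsByPath, "`mapcol`.`a`"));
    // The physical path is found and its statistics show that non-null values exist.
    System.out.println(allRowsMatchIsNull(statsByPath, "`mapcol`.`map`.`value`"));
  }
}
```

   The first call reports that all rows match, although the statistics stored under the physical path show that most values are not null.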
   
   While making the changes to resolve the correct metadata for `DICT`, I saw
that the `DataMode` of the `key` and `value` fields was always `REPEATED`. This
was because the data mode was determined from Parquet's max `repetition` and
`definition` level values (see the changes in `ParquetTableMetadataUtils.java`),
which are computed over the whole schema down to the leaf field. The algorithm
was: if `repetition >= 1`, the field is `REPEATED`. This means that if there is
at least one `REPEATED` member in the schema, each of its children is going to
be `REPEATED`. A `DICT` has a nested `repeated group`, mentioned above, so `key`
and `value` came out as `REPEATED`. To retain the original data mode, Parquet's
`Type.Repetition` was added to column metadata v4 and is used instead.

   This filtering worked for `DICT` before because the `value` was `REPEATED`
and statistics were not used for `REPEATED` fields. Now that the data mode is
retained, this case has to be handled explicitly. `REPEATED MAP` is included
because its fields were also determined to be `REPEATED`, so including it
preserves the previous behaviour. Currently, though, this method is likely to be
used when the `REPEATED MAP` contains an `OPTIONAL INT`. For a repeated map,
the statistics are not found because the lookup uses the simple name, e.g.
`` `struct_array`.`a` `` (note that indexes are omitted, as they are not
retained in the Parquet schema), whereas the _actual_ structure is
`` `struct_array`.`bag`.`array_element`.`a` ``.
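
   The difference between the two ways of deriving the data mode can be sketched as follows. This is a simplified model, not the code in `ParquetTableMetadataUtils.java`: `DataModeSketch` and its methods are hypothetical, a column is reduced to the list of repetitions on the path from the root to the leaf, the fallback on the definition level is an assumption, and the repetition of the outer `mapcol` group is assumed to be `OPTIONAL`.

```java
import java.util.Arrays;
import java.util.List;

public class DataModeSketch {
  enum Repetition { REQUIRED, OPTIONAL, REPEATED }

  // Max repetition level of a leaf: number of REPEATED fields on the path from the root.
  static int maxRepetitionLevel(List<Repetition> path) {
    int level = 0;
    for (Repetition r : path) {
      if (r == Repetition.REPEATED) {
        level++;
      }
    }
    return level;
  }

  // Max definition level of a leaf: number of non-REQUIRED fields on the path.
  static int maxDefinitionLevel(List<Repetition> path) {
    int level = 0;
    for (Repetition r : path) {
      if (r != Repetition.REQUIRED) {
        level++;
      }
    }
    return level;
  }

  // Old rule: any repetition level >= 1 makes the leaf REPEATED.
  // The OPTIONAL/REQUIRED fallback on the definition level is an assumption.
  static Repetition modeFromMaxLevels(List<Repetition> path) {
    if (maxRepetitionLevel(path) >= 1) {
      return Repetition.REPEATED;
    }
    return maxDefinitionLevel(path) >= 1 ? Repetition.OPTIONAL : Repetition.REQUIRED;
  }

  // New rule: use the leaf's own repetition, ignoring its ancestors.
  static Repetition modeFromOwnRepetition(List<Repetition> path) {
    return path.get(path.size() - 1);
  }

  public static void main(String[] args) {
    // `mapcol` (assumed OPTIONAL) -> `map` (repeated group) -> `value` (OPTIONAL)
    List<Repetition> valuePath =
        Arrays.asList(Repetition.OPTIONAL, Repetition.REPEATED, Repetition.OPTIONAL);
    System.out.println(modeFromMaxLevels(valuePath));     // REPEATED
    System.out.println(modeFromOwnRepetition(valuePath)); // OPTIONAL
  }
}
```

   Under the old rule the `value` leaf inherits `REPEATED` from the enclosing `repeated group`, which is why statistics were never consulted for it; with its own repetition retained it becomes `OPTIONAL`, and the filter path has to be checked separately.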
