ihuzenko commented on a change in pull request #1954: DRILL-7509: Incorrect
TupleSchema is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366483635
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetTableMetadataUtils.java
##########
@@ -507,29 +507,67 @@ private static long convertToDrillDateValue(int
dateValue) {
MetadataBase.ParquetTableMetadataBase parquetTableMetadata) {
int precision = 0;
int scale = 0;
- int definitionLevel = 1;
- int repetitionLevel = 0;
MetadataVersion metadataVersion = new
MetadataVersion(parquetTableMetadata.getMetadataVersion());
// only ColumnTypeMetadata_v3 and ColumnTypeMetadata_v4 store information
about scale, precision, repetition level and definition level
if (parquetTableMetadata.hasColumnMetadata() &&
(metadataVersion.compareTo(new MetadataVersion(3, 0)) >= 0)) {
scale = parquetTableMetadata.getScale(name);
precision = parquetTableMetadata.getPrecision(name);
- repetitionLevel = parquetTableMetadata.getRepetitionLevel(name);
- definitionLevel = parquetTableMetadata.getDefinitionLevel(name);
- }
- TypeProtos.DataMode mode;
- if (repetitionLevel >= 1) {
- mode = TypeProtos.DataMode.REPEATED;
- } else if (repetitionLevel == 0 && definitionLevel == 0) {
- mode = TypeProtos.DataMode.REQUIRED;
- } else {
- mode = TypeProtos.DataMode.OPTIONAL;
}
+
+ TypeProtos.DataMode mode = getDataMode(parquetTableMetadata,
metadataVersion, name);
return
TypeProtos.MajorType.newBuilder(ParquetReaderUtility.getType(primitiveType,
originalType, precision, scale))
.setMode(mode)
.build();
}
+ /**
+ * Obtain data mode from table metadata for a column. Algorithm for
retrieving data mode depends on metadata version:
+ * <ul>
+ * <li>starting from version {@code 4.2}, Parquet's {@link
org.apache.parquet.schema.Type.Repetition}
+ * is stored in table metadata itself;</li>
+ * <li>starting from {@code 3.0} to {@code 4.2} (exclusively) the data
mode is
+ * computed based on max {@code definition} and {@code repetition} levels
+ * ({@link
MetadataBase.ParquetTableMetadataBase#getDefinitionLevel(String[])} and
+ * {@link
MetadataBase.ParquetTableMetadataBase#getRepetitionLevel(String[])}
respectively)
+ * obtained from Parquet's schema;
+ *
+ * <p><strong>Note:</strong> this computation may lead to erroneous
results,
+ * when there are few nesting levels.</p>
+ * </li>
+ * <li>prior to {@code 3.0} {@code DataMode.OPTIONAL} is returned.</li>
+ * </ul>
+ * @param tableMetadata Parquet table metadata
+ * @param metadataVersion version of Parquet table metadata
+ * @param name (leaf) column to obtain data mode for
+ * @return data mode of the specified column
+ */
+ private static TypeProtos.DataMode
getDataMode(MetadataBase.ParquetTableMetadataBase tableMetadata,
Review comment:
Could you please simplify the method, for example:
```java
TypeProtos.DataMode mode = TypeProtos.DataMode.OPTIONAL;
if (tableMetadata.hasColumnMetadata()) {
if (metadataVersion.compareTo(new MetadataVersion(4, 2)) >= 0) {
mode =
ParquetReaderUtility.getDataMode(tableMetadata.getRepetition(name));
} else if (metadataVersion.compareTo(new MetadataVersion(3, 0)) >= 0) {
int repetitionLevel = tableMetadata.getRepetitionLevel(name);
if (repetitionLevel >= 1) {
mode = TypeProtos.DataMode.REPEATED;
} else if (repetitionLevel == 0 &&
tableMetadata.getDefinitionLevel(name) == 0) {
mode = TypeProtos.DataMode.REQUIRED;
}
}
}
return mode;
```
Also, since we know that ```MetadataVersion``` is based on major & minor int
numbers, I'd suggest providing a few useful methods to make comparisons easy to
read. For example, we could use methods ```atLeast(major, minor)``` and
```atMost(major, minor)``` similar to Range class. We need to analyze compare
to usages and figure out which method would be useful to add.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services