GitHub user parthchandra commented on a diff in the pull request:
https://github.com/apache/drill/pull/637#discussion_r86042975
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
---
@@ -1000,6 +1053,81 @@ public long getColumnValueCount(SchemaPath column) {
   @Override
   public List<SchemaPath> getPartitionColumns() {
-    return new ArrayList<>(columnTypeMap.keySet());
+    return new ArrayList<>(partitionColTypeMap.keySet());
   }
+
+  public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities udfUtilities,
+      FunctionImplementationRegistry functionImplementationRegistry, OptionManager optionManager) {
+    if (fileSet.size() == 1 || ! (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v3)) {
+      return null; // no pruning for 1 single parquet file or metadata is prior v3.
+    }
+
+    final Set<SchemaPath> schemaPathsInExpr = filterExpr.accept(new ParquetRGFilterEvaluator.FieldReferenceFinder(), null);
+
+    final List<RowGroupMetadata> qualifiedRGs = new ArrayList<>(parquetTableMetadata.getFiles().size());
+    Set<String> qualifiedFileNames = Sets.newHashSet(); // HashSet keeps a fileName unique.
+
+    ParquetFilterPredicate filterPredicate = null;
+
+    for (ParquetFileMetadata file : parquetTableMetadata.getFiles()) {
+      final ImplicitColumnExplorer columnExplorer = new ImplicitColumnExplorer(optionManager, this.columns);
+      Map<String, String> implicitColValues = columnExplorer.populateImplicitColumns(file.getPath(), selectionRoot);
+
+      for (RowGroupMetadata rowGroup : file.getRowGroups()) {
+        ParquetMetaStatCollector statCollector = new ParquetMetaStatCollector(
+            parquetTableMetadata,
+            rowGroup.getColumns(),
+            implicitColValues);
+
+        Map<SchemaPath, ColumnStatistics> columnStatisticsMap = statCollector.collectColStat(schemaPathsInExpr);
--- End diff --
Shouldn't we be able to build the filter predicate once, outside the for
loop? Or does it have to be built inside the loop because it depends on the
implicit columns?
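
To illustrate the suggestion, here is a minimal, self-contained sketch of hoisting the predicate construction out of the loops. It is not Drill code: the class `PredicateHoistSketch`, the methods `buildPredicate` and `countQualified`, and the representation of a row group as a map of per-column maximum values are all hypothetical stand-ins. It also assumes the predicate depends only on the filter expression and not on per-file state such as the implicit column values; if that assumption does not hold in the PR, the hoisting would not apply as written.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the row-group pruning loop; none of these names are Drill APIs.
class PredicateHoistSketch {
    // Counts how many times the (assumed expensive) predicate is built.
    static int buildCount = 0;

    // Stand-in for "materialize the filter expression into a predicate".
    // The filter modeled here is: column >= minValue, evaluated against a row group's max statistic.
    static Predicate<Map<String, Long>> buildPredicate(String column, long minValue) {
        buildCount++;
        // A row group with no statistic for the column cannot be pruned, so it is kept.
        return maxStats -> maxStats.getOrDefault(column, Long.MAX_VALUE) >= minValue;
    }

    // Each file is a list of row groups; each row group is a map of column name to max value.
    static int countQualified(List<List<Map<String, Long>>> files, String column, long minValue) {
        // Built once, before iterating over files and row groups.
        Predicate<Map<String, Long>> predicate = buildPredicate(column, minValue);
        int qualified = 0;
        for (List<Map<String, Long>> rowGroups : files) {
            for (Map<String, Long> maxStats : rowGroups) {
                if (predicate.test(maxStats)) {
                    qualified++;
                }
            }
        }
        return qualified;
    }

    public static void main(String[] args) {
        List<List<Map<String, Long>>> files = List.of(
            List.of(Map.of("a", 5L), Map.of("a", 20L)),
            List.of(Map.of("a", 30L)));
        int qualified = countQualified(files, "a", 10L);
        System.out.println("qualified=" + qualified + " builds=" + buildCount);
    }
}
```

In this sketch the per-row-group statistics are passed to the predicate at evaluation time, so nothing about the predicate itself changes between iterations. Whether the same separation is possible in `applyFilter` depends on how the implicit column values feed into the predicate.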
---