lcspinter commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r618093086


##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -194,6 +210,54 @@ public boolean canProvideBasicStatistics() {
     return stats;
   }
 
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Set<String> filterColumns = Sets.newHashSet();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        // While Iceberg could handle multiple columns, the current pruning is only able to handle filters on a
+        // single column. We keep the logic below to handle multiple columns so that if pruning becomes available
+        // on the executor side we can easily adapt to it as well.
+        if (filterColumns.size() > 1) {

Review comment:
   We collect every column name into the filterColumns set through the columns() method. That method traverses every node recursively, so it might be time-consuming. After that, the size of the set is validated, and if it is greater than 1 we return false. Could we introduce some logic to fail fast, without having to traverse every node? I'm just thinking aloud; I don't know whether it is feasible or not.
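   For illustration, a fail-fast variant could short-circuit the recursion as soon as a second distinct column is seen, instead of collecting everything first and checking the set size afterwards. A minimal sketch follows; it assumes column references appear as ExprNodeColumnDesc nodes and that children are reachable via ExprNodeDesc.getChildren(). The helper name collectColumns and its limit parameter are hypothetical, not part of this PR.

   ```java
   import java.util.List;
   import java.util.Set;

   import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;
   import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;

   final class FailFastColumnCollector {

     /**
      * Collects the names of columns referenced by the predicate into 'found', aborting the
      * traversal as soon as more than 'limit' distinct columns have been seen.
      * Returns false on early exit, true if the whole tree was visited.
      */
     static boolean collectColumns(ExprNodeDesc node, Set<String> found, int limit) {
       if (node instanceof ExprNodeColumnDesc) {
         found.add(((ExprNodeColumnDesc) node).getColumn());
         if (found.size() > limit) {
           return false; // fail fast: more columns than the pruning can handle
         }
       }
       List<ExprNodeDesc> children = node.getChildren();
       if (children != null) {
         for (ExprNodeDesc child : children) {
           if (!collectColumns(child, found, limit)) {
             return false; // propagate the early exit up the recursion
           }
         }
       }
       return true;
     }
   }
   ```

   With a helper like this, the collect-then-count sequence above would collapse into a single guarded call, e.g. `if (!collectColumns(syntheticFilterPredicate, filterColumns, 1)) { return false; }`. Whether the saving matters in practice depends on the size of the synthetic filter predicates, which are typically small.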
########## File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java ########## @@ -194,6 +210,54 @@ public boolean canProvideBasicStatistics() { return stats; } + public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table, + ExprNodeDesc syntheticFilterPredicate) { + try { + Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns(); + if (partitionColumns.size() > 0) { + // Collect the column names from the predicate + Set<String> filterColumns = Sets.newHashSet(); + columns(syntheticFilterPredicate, filterColumns); + + // While Iceberg could handle multiple columns the current pruning only able to handle filters for a + // single column. We keep the logic below to handle multiple columns so if pruning is available on executor + // side the we can easily adapt to it as well. + if (filterColumns.size() > 1) { Review comment: We collect every column name in the filterColumns set through the columns() method. That method is traversing every node recursively, so it might be time-consuming. After that, the size of the set is validated, and if it's greater than 1, return false. Can we introduce some logic, to fail fast, without the need of traversing every node? I'm just thinking aloud, I don't know whether it is feasible or not. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org For additional commands, e-mail: gitbox-h...@hive.apache.org