jbimbert commented on a change in pull request #1298: DRILL-5796: Filter
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r204325743
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
##########
@@ -124,8 +124,7 @@ private static LogicalExpression createIsTruePredicate(LogicalExpression expr) {
*/
private static LogicalExpression createIsFalsePredicate(LogicalExpression expr) {
return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
- //if min value is not false or if there are all nulls -> canDrop
- isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
+ isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() ?
RowsMatch.NONE : checkNull(exprStat)
Review comment:
OK, I found the reason why the tests pass:
1. We need several parquet files, otherwise the process is skipped in
AbstractParquetGroupScan.applyFilter.
2. Some of the parquet files must be dropped, again in
AbstractParquetGroupScan.applyFilter:
if (qualifiedRGs.size() == rowGroupInfos.size()) { return null } ...
3. If at least one of the row groups is SOME, then the filter is applied to
all of them, in ParquetPushDownFilter.doOnMatch L 179.
Together, these 3 conditions explain why the tests pass.
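To make conditions 2 and 3 easier to follow, here is a minimal Java sketch
of the pruning decision. It is a simplified model, not Drill's actual code:
the class PruningSketch, the methods pruneRowGroups and filterStillNeeded,
and the local RowsMatch enum are hypothetical stand-ins, and each row group
is reduced to its match result.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the pruning decision; not Drill's code. */
final class PruningSketch {

  /** Assumed stand-in for the per-row-group match result. */
  enum RowsMatch { ALL, NONE, SOME }

  /**
   * Drops every row group whose match is NONE.
   * Returns null when nothing was dropped, mirroring the
   * "qualifiedRGs.size() == rowGroupInfos.size()" check in item 2.
   */
  static List<RowsMatch> pruneRowGroups(List<RowsMatch> rowGroups) {
    List<RowsMatch> qualified = new ArrayList<>();
    for (RowsMatch match : rowGroups) {
      if (match != RowsMatch.NONE) {
        qualified.add(match);
      }
    }
    if (qualified.size() == rowGroups.size()) {
      return null; // no row group was pruned, so the caller changes nothing
    }
    return qualified;
  }

  /**
   * Item 3: as soon as one remaining row group is SOME, the filter
   * stays in the plan and is evaluated against all remaining row groups.
   */
  static boolean filterStillNeeded(List<RowsMatch> qualified) {
    return qualified.contains(RowsMatch.SOME);
  }

  public static void main(String[] args) {
    List<RowsMatch> kept =
        pruneRowGroups(List.of(RowsMatch.ALL, RowsMatch.NONE, RowsMatch.SOME));
    System.out.println("kept=" + kept + " filterStillNeeded=" + filterStillNeeded(kept));
  }
}
```

Under this model, a wrong ALL for one row group is only visible in the
result when pruning actually happens and no remaining row group is SOME.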
One way to check that they fail is to put ft0.parquet, ft0.parquet and
tt1.parquet in the same folder and run an IS TRUE predicate. The result then
reads F, T, F, T (wrong) instead of F, F (expected)!
I have then written the IS TRUE, IS FALSE, IS NOT TRUE and IS NOT FALSE
predicates based on the following cases:
a. ST:[min: true, max: true, num_nulls: ?]
b. ST:[min: false, max: false, num_nulls: ?]
c. ST:[min: false, max: true, num_nulls: ?]
d. num_nulls = RC (row count)
and checked all cases.
I also introduced 4 helper functions for code readability: minIsTrue,
minIsFalse, maxIsTrue and maxIsFalse.
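The case analysis above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the code in the PR: the class
IsPredicateSketch, its Stats holder, the hasNoNulls helper and the local
RowsMatch enum are hypothetical, num_nulls is assumed to be known, and
min/max are assumed to be null when the row group holds only nulls. Only the
four helper names (minIsTrue, minIsFalse, maxIsTrue, maxIsFalse) and
isAllNulls come from this thread; their bodies here are guesses.

```java
/** Hypothetical, simplified model of the case analysis; not the PR's code. */
final class IsPredicateSketch {

  enum RowsMatch { ALL, NONE, SOME }

  /** Assumed stand-in for the boolean column statistics of one row group. */
  static final class Stats {
    final Boolean min;   // assumed null when the row group holds only nulls
    final Boolean max;   // assumed null when the row group holds only nulls
    final long numNulls;
    final long rowCount;

    Stats(Boolean min, Boolean max, long numNulls, long rowCount) {
      this.min = min;
      this.max = max;
      this.numNulls = numNulls;
      this.rowCount = rowCount;
    }
  }

  // Case d: every row is null.
  static boolean isAllNulls(Stats s) { return s.numNulls == s.rowCount; }
  static boolean hasNoNulls(Stats s) { return s.numNulls == 0; }

  // min == true: every non-null value is true (case a).
  static boolean minIsTrue(Stats s) { return s.min != null && s.min; }
  // min == false: at least one value is false (cases b and c).
  static boolean minIsFalse(Stats s) { return s.min != null && !s.min; }
  // max == true: at least one value is true (cases a and c).
  static boolean maxIsTrue(Stats s) { return s.max != null && s.max; }
  // max == false: every non-null value is false (case b).
  static boolean maxIsFalse(Stats s) { return s.max != null && !s.max; }
  // Case c (minIsFalse && maxIsTrue) always falls through to SOME below.

  /** x IS TRUE: matches true values only; false and null do not match. */
  static RowsMatch isTrue(Stats s) {
    if (isAllNulls(s) || maxIsFalse(s)) return RowsMatch.NONE;
    if (minIsTrue(s) && hasNoNulls(s)) return RowsMatch.ALL;
    return RowsMatch.SOME;
  }

  /** x IS FALSE: matches false values only; true and null do not match. */
  static RowsMatch isFalse(Stats s) {
    if (isAllNulls(s) || minIsTrue(s)) return RowsMatch.NONE;
    if (maxIsFalse(s) && hasNoNulls(s)) return RowsMatch.ALL;
    return RowsMatch.SOME;
  }

  /** x IS NOT TRUE: matches false and null values. */
  static RowsMatch isNotTrue(Stats s) {
    if (isAllNulls(s) || maxIsFalse(s)) return RowsMatch.ALL;
    if (minIsTrue(s) && hasNoNulls(s)) return RowsMatch.NONE;
    return RowsMatch.SOME;
  }

  /** x IS NOT FALSE: matches true and null values. */
  static RowsMatch isNotFalse(Stats s) {
    if (isAllNulls(s) || minIsTrue(s)) return RowsMatch.ALL;
    if (maxIsFalse(s) && hasNoNulls(s)) return RowsMatch.NONE;
    return RowsMatch.SOME;
  }

  public static void main(String[] args) {
    Stats[] cases = {
      new Stats(true, true, 0, 10),    // case a, no nulls
      new Stats(false, false, 0, 10),  // case b, no nulls
      new Stats(false, true, 2, 10),   // case c, some nulls
      new Stats(null, null, 10, 10)    // case d, all nulls
    };
    for (Stats s : cases) {
      System.out.println(isTrue(s) + " " + isFalse(s) + " "
          + isNotTrue(s) + " " + isNotFalse(s));
    }
  }
}
```

In this sketch the null count decides between ALL and SOME (or NONE and
SOME), because IS TRUE and IS FALSE never match a null, while IS NOT TRUE
and IS NOT FALSE always do.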