jbimbert commented on a change in pull request #1298: DRILL-5796: Filter
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r203966145
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
##########
@@ -124,8 +124,7 @@ private static LogicalExpression
createIsTruePredicate(LogicalExpression expr) {
*/
private static LogicalExpression createIsFalsePredicate(LogicalExpression
expr) {
return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
- //if min value is not false or if there are all nulls -> canDrop
- isAllNulls(exprStat, evaluator.getRowCount()) ||
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
+ isAllNulls(exprStat, evaluator.getRowCount()) ||
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() ?
RowsMatch.NONE : checkNull(exprStat)
Review comment:
blnTbl/0_0_1.parquet => ST:[min: false, max: false, num_nulls: 0] : 8 tests in testBooleanPredicate()
tfTbl/ft0.parquet => ST:[min: false, max: true, num_nulls: 0] : 4 tests in testBooleanPredicate()
example1:
select * from `java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_1.parquet` where col_bln is false
returns (false, false, false)
example2:
select * from `java-exec/src/test/resources/parquetFilterPush/tfTbl/ft0.parquet` where a is true [resp. false]
returns true [resp. false]
Finally, when running the query
select * from dfs.tmp.`blnTbl` where col_bln is false
with blnTbl containing only 0_0_0.parquet (T,T,T) and 0_0_1.parquet (F,F,F),
the physical plan reads:
00-00 Screen : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {9.3 rows, 12.3 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 523
00-01 Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {9.0 rows, 12.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 522
00-02 Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {6.0 rows, 9.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 521
00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/blnTbl/0_0_1.parquet]], selectionRoot=file:/tmp/blnTbl, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY col_bln): rowcount = 3.0, cumulative cost = {3.0 rows, 6.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 520
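The plan no longer contains a Filter operator, since the predicate returns
NONE for 0_0_0.parquet (so that file is pruned) and ALL for 0_0_1.parquet
(so every row of the remaining file already matches).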
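
To make the reasoning concrete, here is a minimal, self-contained sketch of how an
"is false" predicate could map boolean column statistics to NONE / ALL / SOME, and
how those results decide which files are scanned and whether a filter is still
needed. This is an illustration only, not the code in this PR: the class
IsFalsePruningSketch, the BoolStats holder, and the isFalse helper are hypothetical
names, and the statistics values are the ones quoted above (3 rows per file is
taken from the (T,T,T) / (F,F,F) description).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IsFalsePruningSketch {

  /** How many rows of a row group can satisfy the predicate. */
  enum RowsMatch { ALL, NONE, SOME }

  /** Hypothetical stand-in for the boolean column statistics of one row group. */
  static final class BoolStats {
    final boolean min;
    final boolean max;
    final long numNulls;
    final long rowCount;

    BoolStats(boolean min, boolean max, long numNulls, long rowCount) {
      this.min = min;
      this.max = max;
      this.numNulls = numNulls;
      this.rowCount = rowCount;
    }
  }

  /** Simplified "col is false" evaluation against statistics (assumed logic). */
  static RowsMatch isFalse(BoolStats s) {
    if (s.numNulls == s.rowCount) {
      return RowsMatch.NONE;          // only nulls: no row is false
    }
    if (s.min) {
      return RowsMatch.NONE;          // min == true: every non-null value is true
    }
    if (s.numNulls == 0 && !s.max) {
      return RowsMatch.ALL;           // no nulls and max == false: every row is false
    }
    return RowsMatch.SOME;            // mixed values or some nulls: must filter rows
  }

  public static void main(String[] args) {
    Map<String, BoolStats> files = new LinkedHashMap<>();
    files.put("0_0_0.parquet", new BoolStats(true, true, 0, 3));    // (T,T,T)
    files.put("0_0_1.parquet", new BoolStats(false, false, 0, 3));  // (F,F,F)

    List<String> kept = new ArrayList<>();
    boolean filterNeeded = false;
    for (Map.Entry<String, BoolStats> e : files.entrySet()) {
      RowsMatch m = isFalse(e.getValue());
      if (m == RowsMatch.NONE) {
        continue;                     // prune the file entirely
      }
      kept.add(e.getKey());
      if (m == RowsMatch.SOME) {
        filterNeeded = true;          // at least one file needs row-level filtering
      }
    }
    // prints: scan [0_0_1.parquet], filter needed: false
    System.out.println("scan " + kept + ", filter needed: " + filterNeeded);
  }
}
```

Under these assumptions, tfTbl/ft0.parquet (min: false, max: true, num_nulls: 0)
would evaluate to SOME, so a query over it would keep its filter.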