[ https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173533#comment-16173533 ]
ASF GitHub Bot commented on DRILL-5795:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/949#discussion_r140036046

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -1095,7 +1104,7 @@ public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities udfUtili

         final Set<SchemaPath> schemaPathsInExpr = filterExpr.accept(new ParquetRGFilterEvaluator.FieldReferenceFinder(), null);

    -    final List<RowGroupMetadata> qualifiedRGs = new ArrayList<>(parquetTableMetadata.getFiles().size());
    +    final List<RowGroupInfo> qualifiedRGs = new ArrayList<>(rowGroupInfos.size());
    --- End diff --

    Never mind the previous comment. It's probably better to use RowGroupInfos throughout the code.


> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
>                 Key: DRILL-5795
>                 URL: https://issues.apache.org/jira/browse/DRILL-5795
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>              Labels: doc-impacting
>
> DRILL-1950 implemented filter pushdown for Parquet files, but only for the
> case of one rowgroup per Parquet file. With multiple rowgroups per file, it
> detects that a rowgroup can be pruned but then tells the drillbit to read
> the whole file, which leads to performance issues.
> Having multiple rowgroups per file helps to handle partitioned datasets and
> still read only the relevant subset of data without ending up with more
> files than really needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
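
The behaviour described in the issue (pruning at rowgroup granularity rather than falling back to whole files) can be illustrated with a minimal sketch. It assumes a single integer column, an equality filter, and per-rowgroup min/max statistics. The class `RowGroupPruningSketch`, the type `RowGroupStats`, and the method `qualify` are hypothetical names made up for illustration; they are not Drill's `RowGroupInfo`, `ParquetGroupScan`, or `ParquetRGFilterEvaluator`, and this is not the code from the pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: illustrative types only, NOT Drill's actual classes.
final class RowGroupPruningSketch {

    /** Illustrative min/max statistics of one integer column for one rowgroup. */
    static final class RowGroupStats {
        final String file;
        final int rowGroupIndex;
        final long min;
        final long max;

        RowGroupStats(String file, int rowGroupIndex, long min, long max) {
            this.file = file;
            this.rowGroupIndex = rowGroupIndex;
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Keeps only the rowgroups whose [min, max] range could contain a row
     * with column == value. The result is a list of rowgroups, not files,
     * so a file with one matching rowgroup is not read in full.
     */
    static List<RowGroupStats> qualify(List<RowGroupStats> rowGroups, long value) {
        List<RowGroupStats> qualified = new ArrayList<>(rowGroups.size());
        for (RowGroupStats rg : rowGroups) {
            if (value >= rg.min && value <= rg.max) {
                qualified.add(rg);
            }
        }
        return qualified;
    }

    public static void main(String[] args) {
        List<RowGroupStats> rowGroups = new ArrayList<>();
        rowGroups.add(new RowGroupStats("a.parquet", 0, 1, 100));
        rowGroups.add(new RowGroupStats("a.parquet", 1, 101, 200));
        rowGroups.add(new RowGroupStats("b.parquet", 0, 150, 300));

        // Filter: column = 180
        for (RowGroupStats rg : qualify(rowGroups, 180)) {
            System.out.println(rg.file + " rowgroup " + rg.rowGroupIndex);
        }
    }
}
```

With the sample data, rowgroup 0 of `a.parquet` is pruned while rowgroup 1 of the same file is kept, which is the multi-rowgroup case the issue is about: the scan is planned from the qualified rowgroups rather than from the list of files.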