[ https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206794#comment-16206794 ]
ASF GitHub Bot commented on DRILL-5795:
---------------------------------------

GitHub user priteshm commented on the issue:

    https://github.com/apache/drill/pull/949

    @paul-rogers, @kkhatua can you provide some more information on the test case that failed? Hopefully, @dprofeta can replicate it in his environment.


> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
>                 Key: DRILL-5795
>                 URL: https://issues.apache.org/jira/browse/DRILL-5795
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.12.0
>
>         Attachments: multirowgroup_overlap.parquet
>
>
> DRILL-1950 implemented filter pushdown for Parquet files, but only for the
> case of one row group per Parquet file. When a file contains multiple row
> groups, Drill detects that a row group can be pruned but then tells the
> drillbit to read the whole file, which leads to performance issues.
> Having multiple row groups per file helps to handle partitioned datasets and
> still read only the relevant subset of data without ending up with more files
> than needed.
> For instance, say you have a Parquet file composed of RG1 and RG2 with
> only one column a. Min/max in RG1 are 1-2 and min/max in RG2 are 2-3.
> If I run "select a from file where a=3", today it will read the whole file;
> with the patch it will read only RG2.
> *For documentation*
> The Support / Other section in
> https://drill.apache.org/docs/parquet-filter-pushdown/ should be updated:
> after the fix, files with multiple row groups will be supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
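
The RG1/RG2 example in the issue description can be sketched as a small min/max check. This is only an illustration of the pruning idea, not Drill's actual implementation: the `RowGroupStats` class and the `row_groups_to_read` function are hypothetical names, and only an equality filter on a single integer column is assumed.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical holder for one row group's min/max statistics on column a."""
    name: str
    min_value: int
    max_value: int


def row_groups_to_read(row_groups, target):
    """Return the names of row groups whose [min, max] range may contain target.

    A row group is pruned when the equality predicate a = target cannot
    match any of its rows, i.e. target lies outside its min/max range.
    """
    return [rg.name for rg in row_groups
            if rg.min_value <= target <= rg.max_value]


# The file from the issue description: RG1 has min/max 1-2, RG2 has min/max 2-3.
file_row_groups = [RowGroupStats("RG1", 1, 2), RowGroupStats("RG2", 2, 3)]

# Equivalent of: select a from file where a=3
print(row_groups_to_read(file_row_groups, 3))  # ['RG2']
```

With `a=2` both row groups would be kept, because the two ranges overlap at 2. Statistics can only rule a row group out; they cannot confirm that a matching row exists.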