Hi Jinfeng,

Netflix already has this working in Presto with the current Parquet version, so the fundamentals are all there.

I wish we had the resources to do this ourselves, as this is massively important to us, and I would think the performance gain is so substantial that it would be of high value to many others. If it helps, I would be more than happy to commission this work or offer a bounty, if that is considered appropriate, and then assign resources to such problems as soon as we can.

Regards,
-Stefán

On Thu, Jun 1, 2017 at 4:46 AM, Jinfeng Ni <j...@apache.org> wrote:

> Kunal is correct that Drill currently supports filter pruning at the parquet
> row group level, using min/max statistics. Such support is limited to
> numeric/timestamp types, due to the potentially corrupted varchar min/max
> issue that Kunal mentioned.
>
> For now, Drill does not support dictionary-based pruning. It would be great
> if someone in the community could contribute to make it happen. That
> would probably require a lot of work in the Parquet reader during execution
> time.
>
> On Wed, May 31, 2017 at 5:47 PM, Kunal Khatua <kkha...@mapr.com> wrote:
>
> > I might not be completely accurate, but the min/max technique lets you
> > figure out whether a string-based filter value could exist in a row group
> > (currently, Drill doesn't check at the page level). The comparison can be
> > incorrect in cases where the bytes of a string are not interpreted as
> > unsigned bytes. Drill applies the Parquet filter pushdown
> > during planning time.
> >
> > However, for dictionary-encoded fields, the reader/scanner would need to
> > decode the dictionary page to identify whether a filter condition's value
> > is present in the subsequent data pages. This would (most likely) be done
> > during execution time, and I don't believe Drill does that yet.
> > ________________________________
> > From: Stefán Baxter <ste...@activitystream.com>
> > Sent: Wednesday, May 31, 2017 5:08:23 PM
> > To: user
> > Subject: Re: Parquet filter pushdown and string fields that use dictionary
> > encoding
> >
> > Thank you, Kunal.
> >
> > Can you please explain to me why min/max values would be relevant for
> > dictionary-encoded fields? (I think I may be completely misunderstanding
> > how they work.)
> >
> > Regards,
> > -Stefán
> >
> > On Wed, May 31, 2017 at 5:55 PM, Kunal Khatua <kkha...@mapr.com> wrote:
> >
> > > Even though filter pushdown is supported in Drill, it is limited to
> > > pushing down numeric values, including dates. We do not support
> > > pushdown of varchar because of this bug in the parquet library:
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-686
> > >
> > > This correctness issue in comparisons is what makes relying on the
> > > Parquet library's min/max statistics unreliable.
> > >
> > > ________________________________
> > > From: Stefán Baxter <ste...@activitystream.com>
> > > Sent: Monday, May 29, 2017 1:41:30 PM
> > > To: user
> > > Subject: Parquet filter pushdown and string fields that use dictionary
> > > encoding
> > >
> > > Hi,
> > >
> > > I would like to verify that my understanding of parquet filter pushdown
> > > in Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is
> > > correct.
> > >
> > > Is it correctly understood that Drill does not support predicate
> > > pushdown for string fields when dictionary-based string encoding is
> > > enabled? (It looks like Presto can do this.)
> > > We save a lot of space using dictionary encoding (not enabled in Drill
> > > 1.10 by default), and if my understanding of how it works is correct, the
> > > segment's dictionary could be used to determine whether a value is in a
> > > segment, or whether the segment can be pruned/skipped when filtering on
> > > columns that are compressed/encoded using a dictionary.
> > >
> > > I may be misunderstanding how this works; perhaps the dictionary is
> > > created for the file as a whole and not for individual sections. But I
> > > know that min/max values would not be a good way to determine whether a
> > > segment needs to be scanned.
> > >
> > > I was hoping we could partition on field(s) with lower cardinality
> > > to create partitions for typical partition pruning, and then sort the
> > > contents of individual files by session/customer IDs (which include
> > > alphanumeric characters here) so that each segment would contain only a
> > > relatively small number of those unique values, to facilitate "segment
> > > pruning" when looking for data belonging to individual
> > > sessions/customers.
> > >
> > > Best regards,
> > > -Stefán Baxter