Hi Jinfeng,

Netflix already has this working in Presto with current Parquet version so
the fundamentals are all there.

I wish we had the resources to do this ourselves, as this is massively
important to us, and I would think the performance gain is substantial
enough to be of high value to others as well.

If it helps, I would be more than happy to commission this work or offer
a bounty, if that is considered appropriate, and to assign resources to
such problems as soon as we can.

Regards,
 -Stefán

On Thu, Jun 1, 2017 at 4:46 AM, Jinfeng Ni <j...@apache.org> wrote:

> Kunal is correct that Drill currently supports filter pruning at the
> Parquet row-group level, using min/max statistics. Such support is limited
> to numeric/timestamp types, due to the potential corrupted varchar min/max
> issue Kunal mentioned.
>
> For now Drill does not support dictionary-based pruning. It would be great
> if someone in the community could contribute to make it happen.  That
> probably would require lots of work in Parquet reader during execution
> time.
>
> On Wed, May 31, 2017 at 5:47 PM, Kunal Khatua <kkha...@mapr.com> wrote:
>
> >
> > I might not be completely accurate, but the min/max technique allows you
> > to figure out whether a string-based filter value potentially exists in a
> > row group (currently, Drill doesn't check at the page level). The
> > comparison can be incorrect in cases where the bytes of the text are not
> > interpreted as unsigned bytes. Drill applies the Parquet filter pushdown
> > at planning time.
> >
> >
> > However, for dictionary-encoded fields, the Reader/Scanner would need to
> > decode the Dictionary page to identify whether a filter condition's value
> > is present in the subsequent data pages. This would (most likely) be done
> > during execution time, and I don't believe Drill does that as yet.
> >
> >
> >
> >
> > ________________________________
> > From: Stefán Baxter <ste...@activitystream.com>
> > Sent: Wednesday, May 31, 2017 5:08:23 PM
> > To: user
> > Subject: Re: Parquet filter pushdown and string fields that use
> dictionary
> > encoding
> >
> > Thank you Kunal.
> >
> > Can you please explain to me why min/max values would be relevant for
> > dictionary encoded fields? (I think I may be completely misunderstanding
> > how they work)
> >
> > Regards,
> >  -Stefán
> >
> > On Wed, May 31, 2017 at 5:55 PM, Kunal Khatua <kkha...@mapr.com> wrote:
> >
> > > Even though filter pushdown is supported in Drill, it is limited to
> > > pushing down numeric values, including dates. We do not support pushdown
> > > of varchar because of this bug in the Parquet library:
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-686
> > >
> > >
> > > The issue of comparison correctness is what makes relying on the
> > > Parquet library's min/max statistics unreliable.
> > >
> > >
> > > ________________________________
> > > From: Stefán Baxter <ste...@activitystream.com>
> > > Sent: Monday, May 29, 2017 1:41:30 PM
> > > To: user
> > > Subject: Parquet filter pushdown and string fields that use dictionary
> > > encoding
> > >
> > > Hi,
> > >
> > > I would like to verify that my understanding of parquet filter pushdown
> > in
> > > Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is
> > correct.
> > >
> > > Is it correctly understood that Drill does not support predicate
> > > push-down for string fields when dictionary-based string encoding is
> > > enabled? (It looks like Presto can do this.)
> > >
> > > We save a lot of space using dictionary encoding (not enabled in Drill
> > > 1.10 by default), and if my understanding of how it works is correct,
> > > then the segment dictionary could be used to determine whether a value
> > > is in a segment, or whether it can be pruned/skipped when filtering on
> > > columns that are compressed/encoded using a dictionary.
> > >
> > > I may be misunderstanding how this works, and perhaps the dictionary is
> > > created for the file as a whole rather than for individual sections, but
> > > I know that min/max values would not be a good way to determine whether
> > > a segment scan is needed.
> > >
> > > I was hoping we could use partitioning on field(s) with lower
> > > cardinality to create partitions for typical partition pruning, and then
> > > sort the contents of individual fields by session/customer IDs (which
> > > include alphanumeric characters here) so that segments would only
> > > contain a relatively low number of those unique values, facilitating
> > > "segment pruning" when looking for data belonging to individual
> > > sessions/customers.
> > >
> > > Best regards,
> > >  -Stefán Baxter
> > >
> >
>
