Hi Jason,

I'd be interested in contributing these.  I've had a fairly good look
through the codebase and have reviewed some of the other push down
implementations (e.g. for MongoDB), so I think following the same pattern
makes sense.

The only question will be whether to "map" the plan to the filter objects
provided by the Parquet library (even though these objects won't be passed
to the Parquet library in the main reader, as we'll be doing the filtering
ourselves) or whether to just pass in the plan's filter.

Furthermore, having reviewed some of the Parquet push down work in, say,
Spark, there are instances where filters will be incompatible with push
down, which we'll need to detect and handle.
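
To make the first option concrete, here's a rough sketch of what mapping a
simple comparison from the plan onto the Parquet library's filter objects
might look like. Everything here is hypothetical - the converter class and
the simplified Comparison type aren't existing Drill classes, and the
FilterApi/FilterPredicate package name depends on the parquet-mr version -
but returning null is one way to flag a filter that isn't push-down
compatible:

  // Hypothetical sketch: map a "column op literal" comparison from the plan
  // onto a parquet-mr FilterPredicate. Returning null means "can't push this
  // down", so the caller falls back to Drill's own filter operator.
  import org.apache.parquet.filter2.predicate.FilterApi;
  import org.apache.parquet.filter2.predicate.FilterPredicate;

  public class ParquetFilterBuilder {

    /** Simplified stand-in for a comparison pulled out of the Drill filter plan. */
    public static class Comparison {
      public final String column;   // e.g. "customerId"
      public final String op;       // "=", "<" or ">"
      public final Object literal;  // only int literals handled in this sketch

      public Comparison(String column, String op, Object literal) {
        this.column = column;
        this.op = op;
        this.literal = literal;
      }
    }

    /** Returns a FilterPredicate, or null if the comparison can't be pushed down. */
    public static FilterPredicate toParquetPredicate(Comparison c) {
      if (c.literal instanceof Integer) {
        Integer v = (Integer) c.literal;
        switch (c.op) {
          case "=": return FilterApi.eq(FilterApi.intColumn(c.column), v);
          case "<": return FilterApi.lt(FilterApi.intColumn(c.column), v);
          case ">": return FilterApi.gt(FilterApi.intColumn(c.column), v);
        }
      }
      return null;  // unsupported type or operator - not push-down compatible
    }
  }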

Happy to do some more research on the best way to implement this, and we
can discuss over email or otherwise!


On Sat, Jan 10, 2015 at 3:26 AM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

> Hey Adam,
>
> I think the row group filtering should be pretty straightforward with the
> current code, and considering the common use case of Parquet being
> long-term storage of large volumes of legacy data, we should be avoiding
> as much reading as possible. The major consideration for implementing it
> would be tying the Drill expression evaluation into the reader, which
> would give us the most flexibility in terms of pushing down any filter
> expressions.
>
> Unfortunately, skipping pages will take a little more work, as Parquet
> currently does not line up pages across columns. For smaller value types,
> like boolean, a column may only have a few pages in a row group, while a
> varbinary column may be split into a few dozen pages in the same row
> group. This just requires us to take the code we have now and separate the
> forward movement through the file from the copying into our vectors,
> effectively implementing a skip(numRows) capability, as reading and
> progressing through the file are currently a bit mixed together.
>
> The consideration above is not a problem for row groups because they are
> defined to be record-aligned.
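>
> To sketch the separation I mean (hypothetical names, nothing like the
> actual reader classes): the cursor below can either copy the next run of
> values into a vector or just discard them, walking pages via their headers
> either way.
>
>   // Hypothetical sketch only. The point is to separate advancing through a
>   // column chunk from copying into value vectors, so skip(count) never
>   // materializes the values it jumps over.
>   abstract class ColumnChunkCursor {
>     protected int remainingInPage;  // values left in the current page
>
>     protected abstract void loadNextPage();  // read the next page header + data
>     protected abstract void copyFromPage(ValueVectorStub dest, int count);
>     protected abstract void discardFromPage(int count);  // decode-and-drop or seek
>
>     /** Copy the next 'count' values into a Drill value vector. */
>     void read(ValueVectorStub dest, int count) {
>       while (count > 0) {
>         if (remainingInPage == 0) loadNextPage();
>         int n = Math.min(count, remainingInPage);
>         copyFromPage(dest, n);
>         remainingInPage -= n;
>         count -= n;
>       }
>     }
>
>     /** Advance past 'count' values without copying them anywhere. */
>     void skip(int count) {
>       while (count > 0) {
>         if (remainingInPage == 0) loadNextPage();
>         int n = Math.min(count, remainingInPage);
>         discardFromPage(n);
>         remainingInPage -= n;
>         count -= n;
>       }
>     }
>   }
>
>   class ValueVectorStub { /* stand-in for a Drill value vector */ }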
>
> Would you be interested in trying to contribute some of these enhancements?
> It's been on my plate for a while, but I've been focused on closing a
> number of outstanding bugs lately. I can certainly meet in a hangout or
> just answer questions via e-mail if you need help navigating the current
> code.
>
> -Jason
>
> On Thu, Jan 8, 2015 at 10:27 PM, Adam Gilmore <a...@pharmadata.net.au>
> wrote:
>
> > What about starting with something simple?
> >
> > For example, why don't we try to filter out row groups or pages whose
> > column statistics explicitly DON'T match the filters we're requesting?
> > That is, if we searched for, say, "customerId = 1" and the statistics for
> > customerId in that row group said min: 2, max: 5, we skip the whole row
> > group.
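> >
> > As a rough sketch (hypothetical helper name, and assuming we've already
> > pulled the min/max statistics for the column out of the row group
> > metadata), the check itself is trivial:
> >
> >   // Hypothetical sketch: can we skip a whole row group for an equality
> >   // filter, given the column's min/max statistics from the footer metadata?
> >   static boolean canSkipForEquality(long filterValue, long min, long max) {
> >     // If the requested value falls outside [min, max], no row in this
> >     // row group can possibly match, so skip it entirely.
> >     return filterValue < min || filterValue > max;
> >   }
> >
> >   // e.g. customerId = 1 against a row group with min: 2, max: 5
> >   // canSkipForEquality(1, 2, 5) -> true, skip the row group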
> >
> > Then, you still push everything through the filter operator at the end
> > to filter on individual values.
> >
> > This would at least give a massive performance increase when scanning a
> > smaller subset of rows in a large Parquet file.
> >
> > Then, as a phase two, you could go further and have the reader do the
> > majority of the filtering work, even on individual values.
> >
> > To me, this would be as simple as creating an optimizer rule to build a
> > few Parquet filters and then having the row group scan match against
> > those before starting to read a row group/page.
> >
> > What do you think?
> >
> >
> > Regards,
> >
> >
> > *Adam Gilmore*
> >
> > Director of Technology, PharmaData
> >
> > www.PharmaData.net.au <http://www.pharmadata.net.au/>
> >
> > On Fri, Jan 9, 2015 at 3:36 AM, Jason Altekruse <altekruseja...@gmail.com>
> > wrote:
> >
> >> You are correct that we do need a hybrid approach to meet both cases.
> >> Just one thing I would add: in cases where we have nested and repeated
> >> types, there is no architectural reason why we cannot make vectorized
> >> copies of the data. We do represent nesting and repetition slightly
> >> differently, so we cannot simply make a vectorized copy of the definition
> >> and repetition levels into our data structure. For example, we use
> >> offsets to denote the cutoffs of repeated types, rather than a list of
> >> lengths of each list (which is what effectively happens in parquet once
> >> the repetition levels have been run length encoded), to allow random
> >> access to the values in our vectors.
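> >>
> >> To make the offsets-versus-lengths difference concrete, here is a toy
> >> sketch (not Drill code) of the conversion we would effectively have to
> >> do when copying repeated values:
> >>
> >>   // Toy sketch: parquet's run-length-encoded repetition levels effectively
> >>   // give you the length of each repeated list; Drill's repeated vectors
> >>   // store cumulative offsets instead, so any list can be located in O(1).
> >>   static int[] lengthsToOffsets(int[] listLengths) {
> >>     int[] offsets = new int[listLengths.length + 1];
> >>     for (int i = 0; i < listLengths.length; i++) {
> >>       offsets[i + 1] = offsets[i] + listLengths[i];
> >>     }
> >>     return offsets;
> >>   }
> >>
> >>   // lengths {2, 0, 3} -> offsets {0, 2, 2, 5}; list i spans
> >>   // [offsets[i], offsets[i+1]) in the values vector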
> >>
> >> We also do not make a distinction about the level in the schema at which
> >> a value became null; in Drill, only leaves in the schema can be null. On
> >> the other hand, parquet stores a definition level rather than a simple
> >> nullability bit at each leaf node in the schema. This captures the
> >> nullability of the entire ancestry of a leaf node, redundantly storing
> >> much of the data, but then encoding it efficiently in most cases.
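> >>
> >> For a single leaf the translation is simple enough - a toy sketch:
> >>
> >>   // Toy sketch: a leaf value is null whenever its definition level is less
> >>   // than the leaf's maximum definition level. Drill only keeps that one bit
> >>   // per value, whereas the definition level also records which ancestor in
> >>   // the schema was missing.
> >>   static boolean isNullAtLeaf(int definitionLevel, int maxDefinitionLevel) {
> >>     return definitionLevel < maxDefinitionLevel;
> >>   }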
> >>
> >> These two differences require a little extra work, but it would be very
> >> doable. We have just taken the performance hit for now and are hoping to
> >> get back to it if we see use cases that require greater performance for
> >> full table scans on nested/repeated data.
> >>
> >> - Jason
> >>
> >> On Thu, Jan 8, 2015 at 7:45 AM, Jacques Nadeau <jacq...@apache.org>
> >> wrote:
> >>
> >> > That is correct.
> >> >
> >> > On Wed, Jan 7, 2015 at 7:57 PM, Adam Gilmore <a...@pharmadata.net.au>
> >> > wrote:
> >> >
> >> > > That makes a lot of sense.  Just one question regarding handling
> >> > > complex types - do you mean maps/arrays/etc. (repetitions in
> >> > > Parquet)? As in, if I created a Parquet table from some JSON files
> >> > > with a rather complex/nested structure, it would fall back to
> >> > > individual copies?
> >> > >
> >> > >
> >> > > Regards,
> >> > >
> >> > >
> >> > > *Adam Gilmore*
> >> > >
> >> > > Director of Technology, PharmaData
> >> > >
> >> > > www.PharmaData.net.au <http://www.pharmadata.net.au/>
> >> > >
> >> > > On Thu, Jan 8, 2015 at 12:05 PM, Jason Altekruse <altekruseja...@gmail.com>
> >> > > wrote:
> >> > >
> >> > >> The parquet library provides an interface for accessing individual
> >> > >> values of each column (as well as a record assembly interface for
> >> > >> populating Java objects). As parquet is columnar, and the Drill
> >> > >> in-memory storage format is also columnar, we can get much better
> >> > >> read performance on queries where most of the data is needed if we
> >> > >> do copies of long runs of values rather than a large number of
> >> > >> individual copies.
> >> > >>
> >> > >> This obviously does not give us great performance for point queries,
> >> > >> where a small subset of the data is needed. While those use cases
> >> > >> are prevalent and we are hoping to fix this soon, when we wrote the
> >> > >> original implementation we were interested in stretching the bounds
> >> > >> of how fast we could load a large volume of data into the engine.
> >> > >>
> >> > >> The second reader, written to handle complex types, does use the
> >> > >> current 'columnar' interface exposed by the parquet library, but it
> >> > >> still requires us to make individual copies for each value. Even
> >> > >> when we experimented with early versions of the pushdown provided by
> >> > >> the parquet codebase, we were unable to match the performance of
> >> > >> reading and filtering the data ourselves. This was not fully
> >> > >> explored, and a number of enhancements have been made to the parquet
> >> > >> mainline that may give us the performance we are looking for in
> >> > >> these cases. We haven't had time to revisit it so far.
> >> > >>
> >> > >> -Jason Altekruse
> >> > >>
> >> > >> On Wed, Jan 7, 2015 at 4:04 PM, Adam Gilmore <dragoncu...@gmail.com>
> >> > >> wrote:
> >> > >>
> >> > >> > Out of interest, is there a reason Drill effectively implemented
> >> > >> > its own Parquet reader as opposed to using the reading classes
> >> > >> > from the Parquet project itself?  Were there particular
> >> > >> > performance reasons for this?
> >> > >> >
> >> > >> > On Thu, Jan 8, 2015 at 2:22 AM, Jason Altekruse <altekruseja...@gmail.com>
> >> > >> > wrote:
> >> > >> >
> >> > >> > > Just made one; I put some comments there from the design
> >> > >> > > discussions we have had in the past.
> >> > >> > >
> >> > >> > > https://issues.apache.org/jira/browse/DRILL-1950
> >> > >> > >
> >> > >> > > - Jason Altekruse
> >> > >> > >
> >> > >> > > On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com>
> >> > >> > > wrote:
> >> > >> > >
> >> > >> > > > Just a quick follow-up on this - is there a JIRA item for
> >> > >> > > > implementing push down predicates for Parquet scans, or do we
> >> > >> > > > need to create one?
> >> > >> > > >
> >> > >> > > > On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com>
> >> > >> > > > wrote:
> >> > >> > > >
> >> > >> > > > > Hi Adam,
> >> > >> > > > >
> >> > >> > > > > I have a few thoughts that might explain the difference in
> >> > >> > > > > query times. Drill is able to read a subset of the data from
> >> > >> > > > > a parquet file when selecting only a few columns out of a
> >> > >> > > > > large file, so Drill will give you faster read performance if
> >> > >> > > > > you ask for 3 columns instead of 10. However, we are still
> >> > >> > > > > working on further optimizing the reader by making use of the
> >> > >> > > > > statistics contained in the block and page metadata, which
> >> > >> > > > > will allow us to skip reading a subset of a column, as the
> >> > >> > > > > parquet writer can store min/max values for blocks of data.
> >> > >> > > > >
> >> > >> > > > > If you ran a query that was summing over a column, the
> >> > >> > > > > reason it was faster is that it avoided the bunch of
> >> > >> > > > > individual value copies we make as we filter out the records
> >> > >> > > > > that are not needed. That filtering currently takes place in
> >> > >> > > > > a separate filter operator and should be pushed down into the
> >> > >> > > > > read operation, to make use of the file metadata and
> >> > >> > > > > eliminate some of the reads.
> >> > >> > > > >
> >> > >> > > > > -Jason
> >> > >> > > > >
> >> > >> > > > >
> >> > >> > > > >
> >> > >> > > > > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com>
> >> > >> > > > > wrote:
> >> > >> > > > >
> >> > >> > > > > > Hi guys,
> >> > >> > > > > >
> >> > >> > > > > > I have a question re Parquet.  I'm not sure if this is a
> >> > >> > > > > > Drill question or a Parquet one, but thought I'd start
> >> > >> > > > > > here.
> >> > >> > > > > >
> >> > >> > > > > > I have a sample dataset of ~100M rows in a Parquet file.
> >> > >> > > > > > It's quick to sum a single column across the whole dataset.
> >> > >> > > > > >
> >> > >> > > > > > I have a column which has approx 100 unique values (e.g. a
> >> > >> > > > > > customer ID). When I filter on that column by one of those
> >> > >> > > > > > values (to reduce the set to ~1M rows), the query takes
> >> > >> > > > > > longer.
> >> > >> > > > > >
> >> > >> > > > > > This doesn't make a lot of sense to me - I would have
> >> > >> > > > > > expected the Parquet format to only bring back segments
> >> > >> > > > > > that match the filter and only sum those values. I would
> >> > >> > > > > > expect that to make the query orders of magnitude faster,
> >> > >> > > > > > not slower.
> >> > >> > > > > >
> >> > >> > > > > > Other columnar formats I've used (e.g. ORCFile, SQL Server
> >> > >> > > > > > Columnstore) have acted this way, so I can't quite
> >> > >> > > > > > understand why Parquet doesn't act the same.
> >> > >> > > > > >
> >> > >> > > > > > Can anyone suggest what I'm doing wrong?
> >> > >> > > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
