Hey Adam,

I think row group filtering should be pretty straightforward with the
current code, and given that parquet is commonly used as long-term storage
for large volumes of legacy data, we should avoid as much reading as
possible. The main consideration in implementing it is tying Drill's
expression evaluation into the reader, which would give us the most
flexibility for pushing down arbitrary filter expressions.
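
To make that concrete, here is a rough sketch of the kind of row group
pruning I have in mind. This is not code that exists in Drill today; it
assumes the parquet-mr footer API (ParquetFileReader, BlockMetaData,
ColumnChunkMetaData, Statistics), and the method and parameter names are
just for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import parquet.column.statistics.LongStatistics;
import parquet.column.statistics.Statistics;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.BlockMetaData;
import parquet.hadoop.metadata.ColumnChunkMetaData;
import parquet.hadoop.metadata.ParquetMetadata;

public class RowGroupPruner {

  // Keep only the row groups whose min/max range for the filtered column
  // could possibly contain the value we want (e.g. customerId = 1).
  public static List<BlockMetaData> pruneRowGroups(Configuration conf, Path file,
      String filterColumn, long filterValue) throws IOException {
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
    List<BlockMetaData> keep = new ArrayList<BlockMetaData>();
    for (BlockMetaData rowGroup : footer.getBlocks()) {
      boolean canSkip = false;
      for (ColumnChunkMetaData column : rowGroup.getColumns()) {
        if (!column.getPath().toDotString().equals(filterColumn)) {
          continue;
        }
        Statistics stats = column.getStatistics();
        if (stats instanceof LongStatistics && !stats.isEmpty()) {
          LongStatistics s = (LongStatistics) stats;
          // e.g. searching for 1 with min: 2, max: 5 -> skip the whole row group
          canSkip = filterValue < s.getMin() || filterValue > s.getMax();
        }
        break;
      }
      if (!canSkip) {
        keep.add(rowGroup);
      }
    }
    return keep;
  }
}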

Unfortunately, skipping pages will take a little more work, because parquet
does not line up pages across columns. For smaller value types like
boolean, a row group may contain only a few pages, while a varbinary column
in the same row group may be split across a few dozen pages. Supporting
this requires taking the code we have now and separating movement forward
through the file from copying into our value vectors, effectively
implementing a skip(numRows) operation; currently, reading values and
progressing through the file are somewhat mixed together.

The consideration above is not a problem for row groups because they are
defined to be record-aligned.
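
To sketch what I mean by separating the two concerns, the column reader
would need something shaped roughly like the toy class below. None of these
classes or fields exist in Drill today; it just illustrates skip() choosing
between skipping whole untouched pages and decoding a partial page without
copying it into vectors:

import java.util.Iterator;

// Illustration only: a reader that can advance past rows without copying
// them into value vectors. Real pages also carry compressed bytes and
// definition/repetition levels, which is where the actual work would be.
public class PageSkippingReader {

  public static class Page {
    final int valueCount;
    public Page(int valueCount) { this.valueCount = valueCount; }
  }

  private final Iterator<Page> pages;
  private Page currentPage;
  private int positionInPage;

  public PageSkippingReader(Iterator<Page> pages) {
    this.pages = pages;
  }

  // Move forward numRows values in the file without materializing them.
  public void skip(long numRows) {
    long remaining = numRows;
    while (remaining > 0) {
      if (currentPage == null || positionInPage == currentPage.valueCount) {
        currentPage = pages.next();   // advance the file position only
        positionInPage = 0;
      }
      int leftInPage = currentPage.valueCount - positionInPage;
      if (positionInPage == 0 && remaining >= leftInPage) {
        // untouched page skipped entirely: never decompress or decode it
        positionInPage = currentPage.valueCount;
        remaining -= leftInPage;
      } else {
        // partial page: decode to find value boundaries, but copy nothing
        int toSkip = (int) Math.min(remaining, leftInPage);
        positionInPage += toSkip;
        remaining -= toSkip;
      }
    }
  }
}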

Would you be interested in trying to contribute some of these enhancements?
It's been on my plate for a while, but I've been focused on closing a
number of outstanding bugs lately. I can certainly meet in a hangout or
just answer questions via e-mail if you need help navigating the current
code.

-Jason

On Thu, Jan 8, 2015 at 10:27 PM, Adam Gilmore <a...@pharmadata.net.au>
wrote:

> What about starting with something simple?
>
> For example, why don't we try to filter out row groups or pages whose
> statistics for columns explicitly DON'T match the filters we're
> requesting.  That is, if we searched for say "customerId = 1" and the
> statistics for customerId in that row group said min: 2 max: 5, we skip the
> whole row group.
>
> Then, you still push it through the filter at the end to do the filter on
> individual values.
>
> This would at least have massive performance increases when scanning a
> smaller subset of rows in a large Parquet file.
>
> Then as a phase two, you can get further down to making the reader do the
> majority of the filter work for you, even with individual values.
>
> To me, this would be as simple as creating an optimizer rule to create a
> few Parquet filters and then have the row group scan match against those
> before starting to read a row group/page.
>
> What do you think?
>
>
> Regards,
>
>
> *Adam Gilmore*
>
>
> On Fri, Jan 9, 2015 at 3:36 AM, Jason Altekruse <altekruseja...@gmail.com>
> wrote:
>
>> You are correct that we do need a hybrid approach to meet both cases. Just
>> one thing I would add: in cases where we have nested and repeated types,
>> there is no architectural reason why we cannot make vectorized copies of
>> the data. We do represent nesting and repetition slightly differently, so
>> we cannot simply make a vectorized copy of the definition and repetition
>> levels into our data structures. For example, we use offsets to denote the
>> cutoffs of repeated types, rather than a list of lengths of each list
>> (which is effectively what parquet ends up with once the repetition levels
>> have been run-length encoded), to allow random access to the values in our
>> vectors.
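>>
>> A tiny made-up example of the difference (not actual Drill or parquet
>> code, just the two layouts side by side for the lists [[a, b], [c], [d, e, f]]):
>>
>>   char[] values  = {'a', 'b', 'c', 'd', 'e', 'f'};
>>   int[]  lengths = {2, 1, 3};     // roughly what parquet's RLE'd repetition levels give
>>   int[]  offsets = {0, 2, 3, 6};  // what Drill stores: list i is values[offsets[i]..offsets[i+1])
>>
>> The offsets form lets us jump straight to list i without summing the
>> lengths in front of it.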
>>
>> We also do not track the level in the schema at which a value became
>> null; in Drill, only leaves in the schema can be null. Parquet, on the
>> other hand, stores a definition level rather than a simple nullability bit
>> at each leaf node in the schema. The definition level captures the
>> nullability of a leaf's entire ancestry, which stores much of that
>> information redundantly but usually encodes it efficiently.
>>
>> Handling these two differences requires a little extra work, but it is
>> very doable. For now we have simply taken the performance hit, and we hope
>> to get back to it if we see use cases that need faster full table scans on
>> nested/repeated data.
>>
>> - Jason
>>
>> On Thu, Jan 8, 2015 at 7:45 AM, Jacques Nadeau <jacq...@apache.org>
>> wrote:
>>
>> > That is correct.
>> >
>> > On Wed, Jan 7, 2015 at 7:57 PM, Adam Gilmore <a...@pharmadata.net.au>
>> > wrote:
>> >
>> > > That makes a lot of sense.  Just one question with regard to handling
>> > > complex types - do you mean maps/arrays/etc. (repetitions in Parquet)?
>> > > As in, if I created a Parquet table from some JSON files with a rather
>> > > complex/nested structure, would it fall back to individual copies?
>> > >
>> > >
>> > > Regards,
>> > >
>> > >
>> > > *Adam Gilmore*
>> > >
>> > >
>> > > On Thu, Jan 8, 2015 at 12:05 PM, Jason Altekruse <altekruseja...@gmail.com>
>> > > wrote:
>> > >
>> > >> The parquet library provides an interface for accessing individual values
>> > >> of each column (as well as a record assembly interface for populating java
>> > >> objects). As parquet is columnar, and the Drill in-memory storage format is
>> > >> also columnar, we can get much better read performance on queries where
>> > >> most of the data is needed if we do copies of long runs of values rather
>> > >> than a large number of individual copies.
>> > >>
>> > >> This obviously does not give us great performance for point queries, where
>> > >> a small subset of the data is needed. These use cases are common and we are
>> > >> hoping to address them soon, but when we wrote the original implementation
>> > >> we were focused on stretching the bounds of how fast we could load a large
>> > >> volume of data into the engine.
>> > >>
>> > >> The second reader that was written to handle complex types does use the
>> > >> current 'columnar' interface exposed by the parquet library, but it still
>> > >> requires us to make individual copies for each value. When we experimented
>> > >> with early versions of the project pushdown provided by the parquet
>> > >> codebase, we were unable to match the performance of reading and filtering
>> > >> the data ourselves. This was not fully explored, and a number of
>> > >> enhancements have been made to the parquet mainline that may give us the
>> > >> performance we are looking for in these cases. We haven't had time to
>> > >> revisit it so far.
>> > >>
>> > >> -Jason Altekruse
>> > >>
>> > >> On Wed, Jan 7, 2015 at 4:04 PM, Adam Gilmore <dragoncu...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > Out of interest, is there a reason Drill implemented effectively its own
>> > >> > Parquet reading implementation as opposed to using the reading classes
>> > >> > from the Parquet project itself?  Were there particular performance
>> > >> > reasons for this?
>> > >> >
>> > >> > On Thu, Jan 8, 2015 at 2:22 AM, Jason Altekruse <altekruseja...@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Just made one; I put some comments there from the design discussions
>> > >> > > we have had in the past.
>> > >> > >
>> > >> > > https://issues.apache.org/jira/browse/DRILL-1950
>> > >> > >
>> > >> > > - Jason Altekruse
>> > >> > >
>> > >> > > On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Just a quick follow up on this - is there a JIRA item for implementing
>> > >> > > > push down predicates for Parquet scans, or do we need to create one?
>> > >> > > >
>> > >> > > > On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com>
>> > >> > > > wrote:
>> > >> > > >
>> > >> > > > > Hi Adam,
>> > >> > > > >
>> > >> > > > > I have a few thoughts that might explain the difference in query
>> > >> > > > > times. Drill is able to read a subset of the data from a parquet
>> > >> > > > > file when selecting only a few columns out of a large file, so
>> > >> > > > > asking for 3 columns instead of 10 will give you faster read
>> > >> > > > > performance. However, we are still working on further optimizing
>> > >> > > > > the reader by making use of the statistics contained in the block
>> > >> > > > > and page meta-data, which will allow us to skip reading a subset of
>> > >> > > > > a column, as the parquet writer can store min/max values for blocks
>> > >> > > > > of data.
>> > >> > > > >
>> > >> > > > > If you ran a query that was just summing over a column, the reason
>> > >> > > > > it was faster is that it avoided a bunch of individual value copies
>> > >> > > > > that happen as we filter out the records that are not needed. That
>> > >> > > > > filtering currently takes place in a separate filter operator and
>> > >> > > > > should be pushed down into the read operation to make use of the
>> > >> > > > > file meta-data and eliminate some of the reads.
>> > >> > > > >
>> > >> > > > > -Jason
>> > >> > > > >
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com>
>> > >> > > > > wrote:
>> > >> > > > >
>> > >> > > > > > Hi guys,
>> > >> > > > > >
>> > >> > > > > > I have a question re Parquet.  I'm not sure if this is a Drill
>> > >> > > > > > question or a Parquet one, but thought I'd start here.
>> > >> > > > > >
>> > >> > > > > > I have a sample dataset of ~100M rows in a Parquet file.  It's
>> > >> > > > > > quick to sum a single column across the whole dataset.
>> > >> > > > > >
>> > >> > > > > > I have a column which has approx 100 unique values (e.g. a
>> > >> > > > > > customer ID).  When I filter on that column by one of those
>> > >> > > > > > values (to reduce the set to ~1M values), the query takes longer.
>> > >> > > > > >
>> > >> > > > > > This doesn't make a lot of sense to me - I would have expected
>> > >> > > > > > the Parquet format to only bring back segments that match and
>> > >> > > > > > only sum those values.  I would expect that to make the query
>> > >> > > > > > orders of magnitude faster, not slower.
>> > >> > > > > >
>> > >> > > > > > Other columnar formats I've used (e.g. ORCFile, SQL Server
>> > >> > > > > > Columnstore) have acted this way, so I can't quite understand why
>> > >> > > > > > Parquet doesn't act the same.
>> > >> > > > > >
>> > >> > > > > > Can anyone suggest what I'm doing wrong?
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>
