The parquet library provides an interface for accessing individual values
of each column (as well as a record assembly interface for populating Java
objects). Because parquet is columnar, and the Drill in-memory storage
format is also columnar, we get much better read performance on queries
that touch most of the data if we copy long runs of values at once rather
than making a large number of individual copies.
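
For illustration, here is a minimal sketch of that difference (the class
and method names are made up for this example, not the actual Drill vector
code): copying a run of decoded values into an output buffer with one bulk
call versus copying them one value at a time.

  import java.nio.ByteBuffer;
  import java.nio.ByteOrder;

  public class BulkCopyExample {

    // One method call (and bounds check) per value.
    static void copyIndividually(long[] decodedRun, ByteBuffer vector) {
      for (long value : decodedRun) {
        vector.putLong(value);
      }
    }

    // Copy the whole run with a single bulk operation, then advance the
    // buffer position past the bytes just written (the view buffer does
    // not move the original buffer's position).
    static void copyAsRun(long[] decodedRun, ByteBuffer vector) {
      vector.asLongBuffer().put(decodedRun);
      vector.position(vector.position() + decodedRun.length * Long.BYTES);
    }

    public static void main(String[] args) {
      long[] run = {1L, 2L, 3L, 4L, 5L};
      ByteBuffer vector = ByteBuffer.allocateDirect(run.length * Long.BYTES)
          .order(ByteOrder.LITTLE_ENDIAN);
      copyAsRun(run, vector);   // or copyIndividually(run, vector)
    }
  }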

This obviously does not give us great performance for point queries, where
only a small subset of the data is needed. Those use cases are common, and
we are hoping to address them soon, but when we wrote the original
implementation we were focused on pushing the limits of how quickly we
could load a large volume of data into the engine.

The second reader, which was written to handle complex types, does use the
current 'columnar' interface exposed by the parquet library, but it still
requires us to make an individual copy for each value. Even when we
experimented with early versions of the projection pushdown provided by
the parquet codebase, we were unable to match the performance of reading
and filtering the data ourselves. This was not fully explored, and a
number of enhancements have been made to the parquet mainline that may
give us the performance we are looking for in these cases; we just haven't
had time to revisit it yet.
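
As a rough illustration of what "reading and filtering the data ourselves"
means (hypothetical method names, not the actual Drill or parquet APIs):
evaluating the predicate while decoding a column lets us copy only the
surviving values, instead of materializing every value and discarding most
of them in a separate filter pass.

  import java.util.ArrayList;
  import java.util.List;

  public class ReadTimeFilterExample {

    // Two-pass approach: copy every decoded value, then filter separately.
    static List<Long> readThenFilter(long[] decodedColumn, long match) {
      List<Long> materialized = new ArrayList<>();
      for (long v : decodedColumn) {
        materialized.add(v);          // individual copy for every value
      }
      List<Long> kept = new ArrayList<>();
      for (long v : materialized) {   // second pass over mostly-discarded data
        if (v == match) {
          kept.add(v);
        }
      }
      return kept;
    }

    // Read-time filtering: apply the predicate during the scan and copy
    // only the values that pass it.
    static List<Long> filterWhileReading(long[] decodedColumn, long match) {
      List<Long> kept = new ArrayList<>();
      for (long v : decodedColumn) {
        if (v == match) {
          kept.add(v);
        }
      }
      return kept;
    }
  }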

-Jason Altekruse

On Wed, Jan 7, 2015 at 4:04 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:

> Out of interest, is there a reason Drill implemented effectively its own
> Parquet reading implementation as opposed to using the reading classes from
> the Parquet project itself?  Were there particular performance reasons for
> this?
>
> On Thu, Jan 8, 2015 at 2:22 AM, Jason Altekruse <altekruseja...@gmail.com>
> wrote:
>
> > Just made one; I put some comments there from the design discussions we
> > have had in the past.
> >
> > https://issues.apache.org/jira/browse/DRILL-1950
> >
> > - Jason Altekruse
> >
> > On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com>
> > wrote:
> >
> > > Just a quick follow up on this - is there a JIRA item for implementing
> > > push down predicates for Parquet scans or do we need to create one?
> > >
> > > On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com>
> > > wrote:
> > >
> > > > Hi Adam,
> > > >
> > > > I have a few thoughts that might explain the difference in query
> > > > times. Drill is able to read a subset of the data from a parquet
> > > > file when selecting only a few columns out of a large file, so in
> > > > terms of read performance Drill will give you faster results if you
> > > > ask for 3 columns instead of 10. However, we are still working on
> > > > further optimizing the reader by making use of the statistics
> > > > contained in the block and page meta-data, which will allow us to
> > > > skip reading a subset of a column, as the parquet writer can store
> > > > min/max values for blocks of data.
> > > >
> > > > If you ran a query that was summing over a column, the reason it
> > > > was faster is that it avoided a bunch of individual value copies as
> > > > we filtered out the records that were not needed. This currently
> > > > takes place in a separate filter operator and should be pushed down
> > > > into the read operation to make use of the file meta-data and
> > > > eliminate some of the reads.
> > > >
> > > > -Jason
> > > >
> > > >
> > > >
> > > > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I have a question re Parquet.  I'm not sure if this is a Drill
> > > > > question or a Parquet one, but thought I'd start here.
> > > > >
> > > > > I have a sample dataset of ~100M rows in a Parquet file.  It's
> > > > > quick to sum a single column across the whole dataset.
> > > > >
> > > > > I have a column which has approx 100 unique values (e.g. a
> > > > > customer ID).  When I filter on that column by one of those values
> > > > > (to reduce the set to ~1M values), the query takes longer.
> > > > >
> > > > > This doesn't make a lot of sense to me - I would have expected the
> > > > > Parquet format to only bring back segments that match that filter
> > > > > and only sum those values.  I would expect that this would make
> > > > > the query orders of magnitude faster, not slower.
> > > > >
> > > > > Other columnar formats I've used (e.g. ORCFile, SQL Server
> > > > > Columnstore) have acted this way, so I can't quite understand why
> > > > > Parquet doesn't act the same.
> > > > >
> > > > > Can anyone suggest what I'm doing wrong?
> > > > >
> > > >
> > >
> >
>
