Just made one; I put some comments there from the design discussions we have had in the past.
https://issues.apache.org/jira/browse/DRILL-1950

- Jason Altekruse

On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:

> Just a quick follow up on this - is there a JIRA item for implementing
> push down predicates for Parquet scans or do we need to create one?
>
> On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com>
> wrote:
>
> > Hi Adam,
> >
> > I have a few thoughts that might explain the difference in query times.
> > Drill is able to read a subset of the data from a Parquet file when
> > selecting only a few columns out of a large file. Drill will give you
> > faster results if you ask for 3 columns instead of 10 in terms of read
> > performance. However, we are still working on further optimizing the
> > reader by making use of the statistics contained in the block and page
> > metadata, which will allow us to skip reading a subset of a column, as
> > the Parquet writer can store min/max values for blocks of data.
> >
> > If you ran a query that was summing over a column, the reason it was
> > faster is because it avoided a bunch of individual value copies as we
> > filtered out the records that were not needed. This currently takes
> > place in a separate filter operator and should be pushed down into the
> > read operation to make use of the file metadata and eliminate some of
> > the reads.
> >
> > -Jason
> >
> > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com>
> > wrote:
> >
> > > Hi guys,
> > >
> > > I have a question re Parquet. I'm not sure if this is a Drill
> > > question or a Parquet one, but thought I'd start here.
> > >
> > > I have a sample dataset of ~100M rows in a Parquet file. It's quick
> > > to sum a single column across the whole dataset.
> > >
> > > I have a column which has approx. 100 unique values (e.g. a customer
> > > ID). When I filter on that column by one of those values (to reduce
> > > the set to ~1M values), the query takes longer.
> > >
> > > This doesn't make a lot of sense to me - I would have expected the
> > > Parquet format to only bring back segments that match that and only
> > > sum those values. I would expect that this would make the query
> > > magnitudes faster, not slower.
> > >
> > > Other columnar formats I've used (e.g. ORCFile, SQL Server
> > > Columnstore) have acted this way, so I can't quite understand why
> > > Parquet doesn't act the same.
> > >
> > > Can anyone suggest what I'm doing wrong?
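[Editor's note: the statistics-based skipping discussed in this thread can be sketched in a few lines. The following is a minimal, self-contained Python sketch of the general technique, not Drill's or Parquet's actual reader; the `RowGroup` class and `scan_with_pushdown` function are illustrative names invented for this example. The idea: each row group carries min/max metadata for a column, so an equality predicate can skip any group whose [min, max] range cannot contain the filter value, without reading the group's data at all.]

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RowGroup:
    # Hypothetical stand-in for a Parquet row group's column chunk:
    # min/max come from cheap metadata; `values` represents the data
    # that is only decoded if the group survives the stats check.
    min_val: int
    max_val: int
    values: List[int] = field(default_factory=list)

def scan_with_pushdown(groups: List[RowGroup], target: int) -> List[int]:
    """Return values equal to `target`, reading only row groups whose
    min/max range could contain it."""
    matches = []
    for g in groups:
        # Metadata-only check: if target lies outside [min, max],
        # the whole group is skipped without touching its data.
        if g.min_val <= target <= g.max_val:
            matches.extend(v for v in g.values if v == target)
    return matches

groups = [
    RowGroup(1, 40, [5, 17, 40]),
    RowGroup(41, 80, [42, 42, 77]),   # only this group is read for target=42
    RowGroup(81, 120, [99, 120]),
]
print(scan_with_pushdown(groups, 42))  # -> [42, 42]
```

How well this prunes in practice depends on how the data is laid out: min/max ranges are only selective when values for the filtered column are clustered within row groups, which is why the behaviour Adam expected also depends on sort order at write time.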