Hi Jason,

Understood - so currently Drill doesn't do predicate pushdown for Parqu

On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

> Hi Adam,
>
> I have a few thoughts that might explain the difference in query times.
> Drill can read a subset of the data in a Parquet file when a query selects
> only a few columns out of a large file, so in terms of read performance
> you will get faster results if you ask for 3 columns instead of 10.
> However, we are still working on further optimizing the reader by making
> use of the statistics contained in the block and page meta-data. These will
> allow us to skip reading parts of a column, as the Parquet writer can store
> min/max values for blocks of data.
>
> If you ran a query that only summed over a column, the reason it was faster
> is that it avoided the many individual value copies we make when filtering
> out the records that are not needed. This filtering currently takes place
> in a separate filter operator; it should be pushed down into the read
> operation to make use of the file meta-data and eliminate some of the reads.
>
> -Jason
>
>
>
> On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > I have a question re Parquet.  I'm not sure if this is a Drill question or
> > Parquet, but thought I'd start here.
> >
> > I have a sample dataset of ~100M rows in a Parquet file.  It's quick to sum
> > a single column across the whole dataset.
> >
> > I have a column which has approx 100 unique values (e.g. a customer ID).
> > When I filter on that column by one of those values (to reduce the set to
> > ~1M values), the query takes longer.
> >
> > This doesn't make a lot of sense to me - I would have expected the Parquet
> > format to only bring back segments that match the filter and only sum those
> > values.  I would expect that this would make the query orders of magnitude
> > faster, not slower.
> >
> > Other columnar formats I've used (e.g. ORCFile, SQL Server Columnstore)
> > have acted this way, so I can't quite understand why Parquet doesn't act
> > the same.
> >
> > Can anyone suggest what I'm doing wrong?
> >
>
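
For anyone following the thread, here is a rough sketch of the idea Jason
describes: using per-block min/max statistics to skip reads instead of
filtering in a separate operator afterwards. It is a toy model in plain
Python, not Drill's reader or the Parquet API. RowGroup, stats,
sum_with_filter_operator and sum_with_pushdown are all made-up names, and
the customer_id/amount columns are assumed purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one block of a columnar file (two columns)."""
    customer_id: List[int]
    amount: List[int]

    def stats(self) -> Tuple[int, int]:
        # Min/max of the filter column, as a writer could record in metadata.
        return min(self.customer_id), max(self.customer_id)


def sum_with_filter_operator(groups: List[RowGroup], target: int) -> Tuple[int, int]:
    """Read every block, then filter row by row. Returns (sum, blocks read)."""
    total = 0
    for g in groups:
        for cid, amt in zip(g.customer_id, g.amount):
            if cid == target:
                total += amt
    return total, len(groups)


def sum_with_pushdown(groups: List[RowGroup], target: int) -> Tuple[int, int]:
    """Consult min/max first and skip blocks that cannot contain the target."""
    total = 0
    blocks_read = 0
    for g in groups:
        lo, hi = g.stats()
        if target < lo or target > hi:
            continue  # statistics prove no row here can match, so skip the read
        blocks_read += 1
        for cid, amt in zip(g.customer_id, g.amount):
            if cid == target:
                total += amt
    return total, blocks_read


# Assumed sample data, clustered by customer_id so the ranges do not overlap.
groups = [
    RowGroup([1, 1, 2], [10, 20, 30]),
    RowGroup([3, 3, 4], [40, 50, 60]),
    RowGroup([5, 6, 6], [70, 80, 90]),
]

print(sum_with_filter_operator(groups, 3))
print(sum_with_pushdown(groups, 3))
```

Both functions return the same sum, but the second one only touches the
blocks whose range can contain the value. Note that this only helps when the
data is sorted or clustered on the filter column; if each customer ID is
spread across every block, every min/max range contains it and nothing can
be skipped.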
