Just made one; I put some comments there from the design discussions we have had in the past.
https://issues.apache.org/jira/browse/DRILL-1950

- Jason Altekruse

On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:

> Just a quick follow up on this - is there a JIRA item for implementing
> push down predicates for Parquet scans or do we need to create one?
>
> On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com>
> wrote:
>
> > Hi Adam,
> >
> > I have a few thoughts that might explain the difference in query times.
> > Drill is able to read a subset of the data from a Parquet file when
> > selecting only a few columns out of a large file. Drill will give you
> > faster results if you ask for 3 columns instead of 10 in terms of read
> > performance. However, we are still working on further optimizing the
> > reader by making use of the statistics contained in the block and page
> > metadata, which will allow us to skip reading a subset of a column, as
> > the Parquet writer can store min/max values for blocks of data.
> >
> > If you ran a query that was summing over a column, the reason it was
> > faster is because it avoided a bunch of individual value copies as we
> > filtered out the records that were not needed. This currently takes
> > place in a separate filter operator and should be pushed down into the
> > read operation to make use of the file metadata and eliminate some of
> > the reads.
> >
> > -Jason
> >
> > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com>
> > wrote:
> >
> > > Hi guys,
> > >
> > > I have a question re Parquet. I'm not sure if this is a Drill
> > > question or a Parquet one, but thought I'd start here.
> > >
> > > I have a sample dataset of ~100M rows in a Parquet file. It's quick
> > > to sum a single column across the whole dataset.
> > >
> > > I have a column which has approx. 100 unique values (e.g. a customer
> > > ID). When I filter on that column by one of those values (to reduce
> > > the set to ~1M values), the query takes longer.
> > >
> > > This doesn't make a lot of sense to me - I would have expected the
> > > Parquet format to only bring back segments that match that and only
> > > sum those values. I would expect that this would make the query
> > > magnitudes faster, not slower.
> > >
> > > Other columnar formats I've used (e.g. ORCFile, SQL Server
> > > Columnstore) have acted this way, so I can't quite understand why
> > > Parquet doesn't act the same.
> > >
> > > Can anyone suggest what I'm doing wrong?
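[Editor's note: the statistics-based skipping discussed in this thread can be sketched in a few lines. The following is a minimal, self-contained Python sketch of the general technique, not Drill's or Parquet's actual reader; the `RowGroup` class and `scan_with_pushdown` function are illustrative names invented for this example. The idea: each row group carries min/max metadata for a column, so an equality predicate can skip any group whose [min, max] range cannot contain the filter value, without reading the group's data at all.]

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RowGroup:
    # Hypothetical stand-in for a Parquet row group's column chunk:
    # min/max come from cheap metadata; `values` represents the data
    # that is only decoded if the group survives the stats check.
    min_val: int
    max_val: int
    values: List[int] = field(default_factory=list)

def scan_with_pushdown(groups: List[RowGroup], target: int) -> List[int]:
    """Return values equal to `target`, reading only row groups whose
    min/max range could contain it."""
    matches = []
    for g in groups:
        # Metadata-only check: if target lies outside [min, max],
        # the whole group is skipped without touching its data.
        if g.min_val <= target <= g.max_val:
            matches.extend(v for v in g.values if v == target)
    return matches

groups = [
    RowGroup(1, 40, [5, 17, 40]),
    RowGroup(41, 80, [42, 42, 77]),   # only this group is read for target=42
    RowGroup(81, 120, [99, 120]),
]
print(scan_with_pushdown(groups, 42))  # -> [42, 42]
```

How well this prunes in practice depends on how the data is laid out: min/max ranges are only selective when values for the filtered column are clustered within row groups, which is why the behaviour Adam expected also depends on sort order at write time.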