That makes a lot of sense.  Just one question regarding handling complex
types - do you mean maps/arrays/etc. (repetitions in Parquet)?  As in, if
I created a Parquet table from some JSON files with a rather
complex/nested structure, it would fall back to individual copies?


Regards,


Adam Gilmore


On Thu, Jan 8, 2015 at 12:05 PM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

> The parquet library provides an interface for accessing individual values
> of each column (as well as a record assembly interface for populating Java
> objects). As Parquet is columnar, and the Drill in-memory storage format is
> also columnar, we get much better read performance on queries where most of
> the data is needed by copying long runs of values rather than making a
> large number of individual copies.
>
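> To make the distinction concrete, here is a rough sketch of the two copy
> strategies. The ValueReader interface below is made up for illustration -
> it is not the actual parquet-mr or Drill API:
>
>     // Illustrative only: an invented reader interface, not parquet-mr's.
>     interface ValueReader {
>       int readInt();                                  // one value per call
>       int readIntBatch(int[] dst, int off, int len);  // a run of values per call
>     }
>
>     class CopyStrategies {
>       // Per-value copies: one method call (and its overhead) per value.
>       static void copyOneByOne(ValueReader in, int[] vector, int count) {
>         for (int i = 0; i < count; i++) {
>           vector[i] = in.readInt();
>         }
>       }
>
>       // Bulk copies: long runs of values moved with far fewer calls,
>       // which is the approach the flat-type reader takes.
>       static void copyInRuns(ValueReader in, int[] vector, int count) {
>         int copied = 0;
>         while (copied < count) {
>           copied += in.readIntBatch(vector, copied, count - copied);
>         }
>       }
>     }
>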
> This obviously does not give us great performance for point queries, where
> a small subset of the data is needed. While these use cases are prevalent
> and we are hoping to fix this issue soon, when we wrote the original
> implementation we were interested in stretching the bounds of how fast we
> could load a large volume of data into the engine.
>
> The second reader, written to handle complex types, does use the current
> 'columnar' interface exposed by the parquet library, but it still requires
> us to make individual copies for each value. When we experimented with
> early versions of the projection pushdown provided by the parquet codebase,
> it could not match the performance of reading and filtering the data
> ourselves. That was not fully explored, though, and a number of
> enhancements have been made to the parquet mainline that may give us the
> performance we are looking for in these cases. We haven't had time to
> revisit it so far.
>
> -Jason Altekruse
>
> On Wed, Jan 7, 2015 at 4:04 PM, Adam Gilmore <dragoncu...@gmail.com>
> wrote:
>
> > Out of interest, is there a reason Drill effectively implemented its own
> > Parquet reader as opposed to using the reading classes from the Parquet
> > project itself?  Were there particular performance reasons for this?
> >
> > On Thu, Jan 8, 2015 at 2:22 AM, Jason Altekruse <altekruseja...@gmail.com>
> > wrote:
> >
> > > Just made one; I put some comments there from the design discussions we
> > > have had in the past.
> > >
> > > https://issues.apache.org/jira/browse/DRILL-1950
> > >
> > > - Jason Altekruse
> > >
> > > On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com>
> > > wrote:
> > >
> > > > Just a quick follow-up on this - is there a JIRA item for implementing
> > > > predicate pushdown for Parquet scans or do we need to create one?
> > > >
> > > > On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Adam,
> > > > >
> > > > > I have a few thoughts that might explain the difference in query times.
> > > > > Drill is able to read a subset of the data from a parquet file when
> > > > > selecting only a few columns out of a large file, so in terms of read
> > > > > performance Drill will give you faster results if you ask for 3 columns
> > > > > instead of 10. However, we are still working on further optimizing the
> > > > > reader by making use of the statistics contained in the block and page
> > > > > meta-data, which will allow us to skip reading a subset of a column, as
> > > > > the parquet writer can store min/max values for blocks of data.
> > > > >
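> > > > > As a rough sketch (invented names here, not the actual Parquet footer
> > > > > API), the stats-based skipping amounts to a range check per block:
> > > > >
> > > > >     class StatsSkip {
> > > > >       // A block whose [min, max] range cannot contain the predicate
> > > > >       // value never needs to be read at all.
> > > > >       static boolean canSkipBlock(long blockMin, long blockMax, long value) {
> > > > >         return value < blockMin || value > blockMax;
> > > > >       }
> > > > >     }
> > > > >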
> > > > > If you ran a query that was summing over a column, the reason it was
> > > > > faster is that it avoided a large number of individual value copies once
> > > > > we filtered out the records that were not needed. This filtering
> > > > > currently takes place in a separate filter operator; it should be pushed
> > > > > down into the read operation to make use of the file meta-data and
> > > > > eliminate some of the reads.
> > > > >
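> > > > > In sketch form (the iterator here is just a stand-in, not Drill's
> > > > > actual operator interface), pushing the filter into the read looks like:
> > > > >
> > > > >     class FilteredScan {
> > > > >       // Filter applied as values are read, so non-matching records are
> > > > >       // never materialized and handed to a separate filter operator.
> > > > >       static long sumWhereEquals(java.util.Iterator<long[]> rows,
> > > > >                                  int filterCol, long filterVal, int sumCol) {
> > > > >         long sum = 0;
> > > > >         while (rows.hasNext()) {
> > > > >           long[] row = rows.next();
> > > > >           if (row[filterCol] == filterVal) {
> > > > >             sum += row[sumCol];
> > > > >           }
> > > > >         }
> > > > >         return sum;
> > > > >       }
> > > > >     }
> > > > >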
> > > > > -Jason
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi guys,
> > > > > >
> > > > > > I have a question re Parquet.  I'm not sure if this is a Drill
> > > > > > question or a Parquet question, but thought I'd start here.
> > > > > >
> > > > > > I have a sample dataset of ~100M rows in a Parquet file.  It's quick
> > > > > > to sum a single column across the whole dataset.
> > > > > >
> > > > > > I have a column with approx. 100 unique values (e.g. a customer ID).
> > > > > > When I filter on that column by one of those values (reducing the set
> > > > > > to ~1M rows), the query takes longer.
> > > > > >
> > > > > > This doesn't make a lot of sense to me - I would have expected the
> > > > > > Parquet format to only bring back segments matching that value and to
> > > > > > sum only those values.  I would expect that to make the query orders
> > > > > > of magnitude faster, not slower.
> > > > > >
> > > > > > Other columnar formats I've used (e.g. ORCFile, SQL Server
> > > > > > Columnstore) have acted this way, so I can't quite understand why
> > > > > > Parquet doesn't act the same.
> > > > > >
> > > > > > Can anyone suggest what I'm doing wrong?
> > > > > >
> > > > >
> > > >
> > >
> >
>
