You are correct that we do need a hybrid approach to meet both cases. Just
one thing I would add: in cases where we have nested and repeated types,
there is no architectural reason why we cannot make vectorized copies of
the data. We do represent nesting and repetition slightly differently, so
we cannot simply make a vectorized copy of the definition and repetition
levels into our data structure. For example, we use offsets to denote the
cutoffs of repeated types, rather than a list of lengths of each list
(which is effectively what Parquet stores once the repetition levels have
been run-length encoded), because offsets allow random access to the
values in our vectors.
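
As a rough illustration of the difference between the two layouts (the
arrays and names here are made up for the example and are not actual Drill
or Parquet structures):

    // Three lists [a, b], [c], [d, e, f] stored over one flat values array.
    // Parquet-style: effectively one length per list once the repetition
    // levels have been run-length decoded.
    int[] lengths = {2, 1, 3};

    // Drill-style: offsets marking where each list starts, plus one extra
    // entry marking the end of the last list.
    int[] offsets = {0, 2, 3, 6};

    // With offsets, random access to list i is O(1):
    int i = 2;
    int start = offsets[i];        // 3
    int end   = offsets[i + 1];    // 6 -> values[3..6) are list i's elements

    // With lengths alone, finding the start of list i requires a prefix sum
    // over all preceding lengths, i.e. O(i) work per lookup.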

We also do not make a distinction about the level in the schema at which a
value became null; only leaf nodes in the schema can be marked null.
Parquet, on the other hand, stores a definition level rather than a simple
nullability bit at each leaf node in the schema. The definition level
captures the nullability of the entire ancestry of the leaf, redundantly
storing much of that information, but encoding it efficiently in most
cases.
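
As a rough sketch of that mapping (the method and variable names are just
for illustration, not actual Drill code): a leaf value is non-null only
when its definition level equals the leaf's maximum definition level, so
the per-leaf bit we keep can be derived like this:

    // Illustrative only: collapse Parquet definition levels for one leaf
    // column into a simple per-value nullability bit. maxDef is the leaf's
    // maximum definition level, i.e. every optional ancestor is present and
    // the value itself is defined.
    static boolean[] toValidityBits(int[] defLevels, int maxDef) {
      boolean[] isSet = new boolean[defLevels.length];
      for (int i = 0; i < defLevels.length; i++) {
        // Any level below the maximum means a null somewhere in the
        // ancestry; Drill only records that the leaf value is null.
        isSet[i] = (defLevels[i] == maxDef);
      }
      return isSet;
    }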

These two differences require a little extra work to bridge, but it would
be very doable. We have simply taken the performance hit for now and are
hoping to get back to it if we see use cases that require greater
performance for full table scans on nested/repeated data.
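
For what it's worth, the offsets side of that extra work could look
something like the following sketch (again purely an illustration, assuming
a single level of repetition where level 0 starts a new list and level 1
continues it), working one run-length-encoded run at a time rather than one
value at a time:

    // Illustrative only: build a Drill-style offsets array from
    // run-length-encoded repetition levels. runLevels[r]/runCounts[r]
    // describe run r; listCount is the number of lists in the batch.
    static int[] runsToOffsets(int[] runLevels, int[] runCounts, int listCount) {
      int[] offsets = new int[listCount + 1];
      int list = 0;        // index of the next list to start
      int valueCount = 0;  // total values seen so far
      for (int r = 0; r < runLevels.length; r++) {
        int n = runCounts[r];
        if (runLevels[r] == 0) {
          // n values that each start a new list: record n list boundaries.
          for (int k = 0; k < n; k++) {
            offsets[list++] = valueCount++;
          }
        } else {
          // n values that continue the current list: no per-value branching,
          // just advance the running value count for the whole run.
          valueCount += n;
        }
      }
      offsets[list] = valueCount;  // close the final list
      return offsets;
    }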

- Jason

On Thu, Jan 8, 2015 at 7:45 AM, Jacques Nadeau <jacq...@apache.org> wrote:

> That is correct.
>
> On Wed, Jan 7, 2015 at 7:57 PM, Adam Gilmore <a...@pharmadata.net.au>
> wrote:
>
> > That makes a lot of sense.  Just one question regarding the handling of
> > complex types - do you mean maps/arrays/etc. (repetitions in Parquet)? As
> > in, if I created a Parquet table from some JSON files with a rather
> > complex/nested structure, it would fall back to individual copies?
> >
> >
> > Regards,
> >
> > Adam Gilmore
> >
> > On Thu, Jan 8, 2015 at 12:05 PM, Jason Altekruse <altekruseja...@gmail.com>
> > wrote:
> >
> >> The parquet library provides an interface for accessing individual values
> >> of each column (as well as a record assembly interface for populating java
> >> objects). As parquet is columnar, and the Drill in-memory storage format is
> >> also columnar, we can get much better read performance on queries where
> >> most of the data is needed if we do copies of long runs of values rather
> >> than a large number of individual copies.
> >>
> >> This obviously does not give us great performance for point queries, where
> >> a small subset of the data is needed. While these use cases are prevalent
> >> and we are hoping to fix this issue soon, when we wrote the original
> >> implementation we were interested in stretching the bounds of how fast we
> >> could load a large volume of data into the engine.
> >>
> >> The second reader that was written to handle complex types does use the
> >> current 'columnar' interface exposed by the parquet library, but it still
> >> requires us to make individual copies for each value. Even as we
> >> experimented with early versions of the project pushdown provided by the
> >> parquet codebase, we were unable to match the performance of reading and
> >> filtering the data ourselves. This was not fully explored, and a number of
> >> enhancements have been made to the parquet mainline that may give us the
> >> performance we are looking for in these cases. We haven't had time to
> >> revisit it so far.
> >>
> >> -Jason Altekruse
> >>
> >> On Wed, Jan 7, 2015 at 4:04 PM, Adam Gilmore <dragoncu...@gmail.com>
> >> wrote:
> >>
> >> > Out of interest, is there a reason Drill implemented effectively its own
> >> > Parquet reading implementation as opposed to using the reading classes
> >> > from the Parquet project itself?  Were there particular performance
> >> > reasons for this?
> >> >
> >> > On Thu, Jan 8, 2015 at 2:22 AM, Jason Altekruse <altekruseja...@gmail.com>
> >> > wrote:
> >> >
> >> > > Just made one; I put some comments there from the design discussions
> >> > > we have had in the past.
> >> > >
> >> > > https://issues.apache.org/jira/browse/DRILL-1950
> >> > >
> >> > > - Jason Altekruse
> >> > >
> >> > > On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Just a quick follow up on this - is there a JIRA item for
> >> > > > implementing push-down predicates for Parquet scans or do we need to
> >> > > > create one?
> >> > > >
> >> > > > On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi Adam,
> >> > > > >
> >> > > > > I have a few thoughts that might explain the difference in query
> >> > > > > times. Drill is able to read a subset of the data from a parquet
> >> > > > > file when selecting only a few columns out of a large file; in
> >> > > > > terms of read performance, Drill will give you faster results if
> >> > > > > you ask for 3 columns instead of 10. However, we are still working
> >> > > > > on further optimizing the reader by making use of the statistics
> >> > > > > contained in the block and page meta-data, which will allow us to
> >> > > > > skip reading a subset of a column, as the parquet writer can store
> >> > > > > min/max values for blocks of data.
> >> > > > >
> >> > > > > If you ran a query that was summing over a column, the reason it
> >> > > > > was faster is that it avoided a bunch of individual value copies as
> >> > > > > we filtered out the records that were not needed. This currently
> >> > > > > takes place in a separate filter operator and should be pushed down
> >> > > > > into the read operation to make use of the file meta-data and
> >> > > > > eliminate some of the reads.
> >> > > > >
> >> > > > > -Jason
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hi guys,
> >> > > > > >
> >> > > > > > I have a question re Parquet.  I'm not sure if this is a Drill
> >> > > > > > question or a Parquet one, but thought I'd start here.
> >> > > > > >
> >> > > > > > I have a sample dataset of ~100M rows in a Parquet file.  It's
> >> > > > > > quick to sum a single column across the whole dataset.
> >> > > > > >
> >> > > > > > I have a column which has approx 100 unique values (e.g. a
> >> > > > > > customer ID).  When I filter on that column by one of those
> >> > > > > > values (to reduce the set to ~1M values), the query takes longer.
> >> > > > > >
> >> > > > > > This doesn't make a lot of sense to me - I would have expected
> >> > > > > > the Parquet format to bring back only the segments that match and
> >> > > > > > to sum only those values.  I would expect that to make the query
> >> > > > > > orders of magnitude faster, not slower.
> >> > > > > >
> >> > > > > > Other columnar formats I've used (e.g. ORCFile, SQL Server
> >> > > > > > Columnstore) have acted this way, so I can't quite understand why
> >> > > > > > Parquet doesn't act the same.
> >> > > > > >
> >> > > > > > Can anyone suggest what I'm doing wrong?
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
