Re: Parquet pushdown filtering

Julien Le Dem Tue, 08 Dec 2015 11:49:05 -0800

Adam: do you want to schedule a hangout?

On Tue, Dec 8, 2015 at 4:59 AM, Adam Gilmore <dragoncu...@gmail.com> wrote:


> That makes sense, yep.  The problem is I guess with my implementation.  I
> will iterate through all Parquet files and try to eliminate ones where the
> filter conflicts with the statistics.  In instances where no files match
> the filter, I end up with an empty set of files for the Parquet scan to
> iterate through.  I suppose I could just pick the schema of the first file
> or something, but that seems like a pretty messy rule.
>
> Julien - I'd be happy to have a chat about this.  I've pretty much got the
> implementation down, but need to solve a few of these little issues.
>
>
> On Fri, Dec 4, 2015 at 5:22 AM, Hanifi GUNES <hanifigu...@gmail.com> wrote:
>
> > Regarding your point #1: I guess Daniel struggled with this limitation as
> > well. I merged a few of his patches, which addressed empty batch (no data)
> > handling in various places during execution. That said, we still have not
> > had time to develop a solid way to handle empty batches with no schema.
> >
> > *- Scan batches don't allow empty batches.  This means if a
> > particular filter filters out *all* rows, we get an exception.*
> > It looks to me like you are referring to no data rather than no schema
> > here. I would expect graceful execution in this case. Do you mind sharing
> > a simple reproduction?
> >
> >
> > -Hanifi
> >
> > 2015-12-03 10:56 GMT-08:00 Julien Le Dem <jul...@dremio.com>:
> >
> > > Hey Adam,
> > > If you have questions about the Parquet side of things, I'm happy to chat.
> > > Julien
> > >
> > > On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <par...@apache.org> wrote:
> > >
> > > > Parquet metadata has the rowCount for every rowGroup, which is also the
> > > > value count for every column in the rowGroup. Isn't that what you need?
> > > >
> > > > On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > > I'm trying to (re)implement pushdown filtering for Parquet with the new
> > > > > > Parquet metadata caching implementation.
> > > > >
> > > > > I've run into a couple of challenges:
> > > > >
> > > > > >    1. Scan batches don't allow empty batches.  This means if a particular
> > > > > >    filter filters out *all* rows, we get an exception.  I haven't read the
> > > > > >    full comments on the relevant JIRA items, but it seems odd that we can't
> > > > > >    query an empty JSON file, for example.  This is a bit of a blocker to
> > > > > >    implement the pushdown filtering properly.
> > > > > >    2. The Parquet metadata doesn't include all the relevant metadata.
> > > > > >    Specifically, count of values is not included, therefore the default
> > > > > >    Parquet statistics filter has issues because it compares the count of
> > > > > >    values with count of nulls to work out if it can drop it.  This isn't
> > > > > >    necessarily a blocker, but it feels ugly simulating there's "1" row in a
> > > > > >    block (just to get around the null comparison).
> > > > >
> > > > > > Also, it feels a bit ugly rehydrating the standard Parquet metadata objects
> > > > > > manually.  I'm not sure I understand why we created our own objects for the
> > > > > > Parquet metadata as opposed to simply writing a custom serializer for those
> > > > > > objects which we store.
> > > > >
> > > > > Thoughts would be great - I'd love to get a patch out for this.
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Julien
> > >
> >
>

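For readers following the thread, the file elimination Adam describes (drop every Parquet file whose statistics conflict with the filter, which can leave the scan with no files at all) can be sketched roughly as below. This is only an illustration under stated assumptions: `FileStats`, `may_match_greater_than`, `prune` and `schema_for_scan` are hypothetical names, not Drill or Parquet APIs, and the filter is limited to a single `column > value` predicate on integer columns. The last function shows the "schema of the first file" fallback that Adam calls messy; it is not a recommended rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file summary; not Drill's actual metadata cache class."""
    path: str
    schema: Tuple[str, ...]
    min_max: Dict[str, Tuple[int, int]]  # column name -> (min, max)


def may_match_greater_than(stats: FileStats, column: str, value: int) -> bool:
    """Return False only when the statistics prove no row has column > value."""
    if column not in stats.min_max:
        return True  # no statistics for this column, so the file must be kept
    _, col_max = stats.min_max[column]
    return col_max > value


def prune(files: List[FileStats], column: str, value: int) -> List[FileStats]:
    """Keep only the files that might contain matching rows."""
    return [f for f in files if may_match_greater_than(f, column, value)]


def schema_for_scan(all_files: List[FileStats],
                    surviving: List[FileStats]) -> Tuple[str, ...]:
    """Fallback from the thread: when every file is pruned, borrow the
    schema of the first original file so the scan still has a schema."""
    source = surviving if surviving else all_files
    return source[0].schema


files = [
    FileStats("a.parquet", ("id", "amount"), {"amount": (0, 50)}),
    FileStats("b.parquet", ("id", "amount"), {"amount": (10, 90)}),
]

kept = prune(files, "amount", 60)        # a.parquet is eliminated (max 50)
none_kept = prune(files, "amount", 100)  # every file is eliminated
fallback_schema = schema_for_scan(files, none_kept)
```

The second call is the problematic case from the thread: the pruned list is empty, so the scan has data from no file and must get its schema from somewhere else.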

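The null comparison from point 2, combined with Parth's suggestion to use the row group's rowCount as the value count, might look like the sketch below. It is not the Parquet library's statistics filter; `ColumnSummary`, `RowGroupSummary`, `is_all_nulls` and `can_drop_for_equals_non_null` are hypothetical names, and treating the row count as the column's value count is the assumption taken from Parth's reply.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ColumnSummary:
    """Hypothetical cached statistics for one column in one row group."""
    num_nulls: int


@dataclass
class RowGroupSummary:
    """Hypothetical cached row group entry."""
    row_count: int
    columns: Dict[str, ColumnSummary]


def is_all_nulls(row_group: RowGroupSummary, column: str) -> bool:
    """Compare the null count with the value count.

    Assumption (from Parth's reply): the row group's row count stands in
    for the column's value count, so no fake "1 row" value is needed.
    """
    return row_group.columns[column].num_nulls == row_group.row_count


def can_drop_for_equals_non_null(row_group: RowGroupSummary, column: str) -> bool:
    """A filter `column = <non-null value>` cannot match a column chunk
    that holds only nulls, so such a row group can be skipped."""
    return is_all_nulls(row_group, column)


all_null_group = RowGroupSummary(row_count=1000,
                                 columns={"amount": ColumnSummary(num_nulls=1000)})
mixed_group = RowGroupSummary(row_count=1000,
                              columns={"amount": ColumnSummary(num_nulls=10)})
```

With these inputs, only `all_null_group` qualifies for dropping; `mixed_group` still has non-null values and would need a min/max check before it could be eliminated.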

-- 
Julien
