Hey Adam,
If you have questions about the Parquet side of things, I'm happy to chat.
Julien

On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <par...@apache.org> wrote:

> Parquet metadata has the rowCount for every rowGroup which is also the
> value count for every column in the rowGroup. Isn't that what you need?
>
> On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dragoncu...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > I'm trying to (re)implement pushdown filtering for Parquet with the new
> > Parquet metadata caching implementation.
> >
> > I've run into a couple of challenges:
> >
> >    1. Scan batches don't allow empty batches.  This means if a particular
> >    filter filters out *all* rows, we get an exception.  I haven't read the
> >    full comments on the relevant JIRA items, but it seems odd that we can't
> >    query an empty JSON file, for example.  This is a bit of a blocker to
> >    implement the pushdown filtering properly.
> >    2. The Parquet metadata doesn't include all the relevant metadata.
> >    Specifically, count of values is not included, therefore the default
> >    Parquet statistics filter has issues because it compares the count of
> >    values with count of nulls to work out if it can drop it.  This isn't
> >    necessarily a blocker, but it feels ugly simulating there's "1" row in a
> >    block (just to get around the null comparison).
> >
> > Also, it feels a bit ugly rehydrating the standard Parquet metadata objects
> > manually.  I'm not sure I understand why we created our own objects for the
> > Parquet metadata as opposed to simply writing a custom serializer for those
> > objects which we store.
> >
> > Thoughts would be great - I'd love to get a patch out for this.
> >
>
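
To make the value-count point concrete, here is a rough sketch of the kind of
check a statistics filter does for an equality predicate, using the row group's
rowCount as the value count the way Parth suggests. It is only an illustration,
not Drill's or Parquet's actual filter code: ColumnStats and
can_drop_for_equals are made-up names, and treating rowCount as the value count
assumes a flat (non-repeated) column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for one column's statistics in one row group."""
    min_value: Optional[int]
    max_value: Optional[int]
    null_count: int


def can_drop_for_equals(stats: ColumnStats, row_count: int, target: int) -> bool:
    """Return True if no row in the row group can satisfy `column = target`."""
    # Assumption: flat column, so the value count equals the row count.
    value_count = row_count
    if value_count == 0:
        return True  # empty row group: nothing can match
    if stats.null_count == value_count:
        return True  # every value is null, so nothing equals target
    if stats.min_value is None or stats.max_value is None:
        return False  # no min/max recorded: cannot prove anything
    # Drop only when target lies outside the recorded [min, max] range.
    return target < stats.min_value or target > stats.max_value
```

With the real count available there is no need to pretend a block has "1" row
just to get past the null comparison.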



-- 
Julien