Parquet metadata has the rowCount for every rowGroup, which is also the
value count for every column in that rowGroup. Isn't that what you need?
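
To make the idea concrete, here is a rough sketch of how a statistics
filter could take the value count from the row count instead of faking a
"1". This is not Drill or Parquet code: ColumnStats, RowGroup and
can_drop_for_equals are hypothetical names made up for illustration, and
the sketch assumes flat (non-repeated) columns, where the row count and
the value count coincide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for cached per-column statistics."""
    null_count: int
    min_value: Optional[int] = None
    max_value: Optional[int] = None


@dataclass
class RowGroup:
    """Hypothetical stand-in for cached row-group metadata."""
    row_count: int
    columns: Dict[str, ColumnStats] = field(default_factory=dict)


def can_drop_for_equals(row_group: RowGroup, column: str, value: int) -> bool:
    """Return True if no row in the row group can satisfy `column = value`."""
    stats = row_group.columns[column]
    # Assumption: for a flat column, value count == row count of the row group.
    value_count = row_group.row_count
    if stats.null_count == value_count:
        # Every value is null, so an equality filter can never match.
        return True
    if stats.min_value is None or stats.max_value is None:
        # No min/max available: be conservative and keep the row group.
        return False
    # Drop only when the value lies outside the [min, max] range.
    return value < stats.min_value or value > stats.max_value
```

With a row group of 100 rows, a column whose null count is 100 can be
dropped for any equality filter, while a column with min 1 and max 10 is
dropped only for values outside that range.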

On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:

> Hi guys,
>
> I'm trying to (re)implement pushdown filtering for Parquet with the new
> Parquet metadata caching implementation.
>
> I've run into a couple of challenges:
>
>    1. Scan batches don't allow empty batches.  This means if a particular
>    filter filters out *all* rows, we get an exception.  I haven't read the
>    full comments on the relevant JIRA items, but it seems odd that we can't
>    query an empty JSON file, for example.  This is a bit of a blocker to
>    implement the pushdown filtering properly.
>    2. The Parquet metadata doesn't include all the relevant metadata.
>    Specifically, count of values is not included, therefore the default
>    Parquet statistics filter has issues because it compares the count of
>    values with count of nulls to work out if it can drop it.  This isn't
>    necessarily a blocker, but it feels ugly simulating there's "1" row in a
>    block (just to get around the null comparison).
>
> Also, it feels a bit ugly rehydrating the standard Parquet metadata objects
> manually.  I'm not sure I understand why we created our own objects for the
> Parquet metadata as opposed to simply writing a custom serializer for those
> objects which we store.
>
> Thoughts would be great - I'd love to get a patch out for this.
>
