Hi guys,

I'm trying to (re)implement pushdown filtering for Parquet with the new
Parquet metadata caching implementation.

I've run into a couple of challenges:

   1. Scan batches don't allow empty batches, so if a filter eliminates
   *all* rows we get an exception.  I haven't read the full comments on the
   relevant JIRA items, but it seems odd that we can't query an empty JSON
   file, for example.  This is a bit of a blocker to implementing pushdown
   filtering properly.
   2. The cached Parquet metadata doesn't include all the relevant
   metadata.  Specifically, the value count is missing, so the default
   Parquet statistics filter misbehaves: it compares the value count with
   the null count to decide whether a row group can be dropped (see the
   sketch after this list).  This isn't necessarily a blocker, but it feels
   ugly to simulate "1" row in a block just to get around the null
   comparison.
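For context, this is roughly the check in parquet-mr's StatisticsFilter
that we fall foul of - a minimal sketch, not the exact code (which varies
by Parquet version), with the class wrapper added only to keep it
self-contained:

    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

    final class StatsFilterSketch {
      // Approximation of the all-nulls test in parquet-mr's
      // StatisticsFilter: a column chunk is "all nulls" when its null
      // count equals its total value count.
      static boolean isAllNulls(ColumnChunkMetaData column) {
        Statistics<?> stats = column.getStatistics();
        return stats.getNumNulls() == column.getValueCount();
      }
    }

With the value count missing from our rehydrated metadata it effectively
reads as 0, so 0 == 0 makes every chunk look all-null and the row group
gets dropped - hence the "1" row hack.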

Also, it feels a bit ugly rehydrating the standard Parquet metadata objects
manually.  I'm not sure I understand why we created our own objects for the
Parquet metadata rather than simply writing a custom serializer for the
standard objects we store.
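To make that concrete, here's the kind of thing I have in mind - a minimal
Jackson serializer sketch for the standard ColumnChunkMetaData (the field
names are illustrative, and a matching deserializer would also be needed
since the class is constructed via a factory):

    import java.io.IOException;

    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

    import com.fasterxml.jackson.core.JsonGenerator;
    import com.fasterxml.jackson.databind.SerializerProvider;
    import com.fasterxml.jackson.databind.ser.std.StdSerializer;

    class ColumnChunkMetaDataSerializer
        extends StdSerializer<ColumnChunkMetaData> {
      ColumnChunkMetaDataSerializer() {
        super(ColumnChunkMetaData.class);
      }

      @Override
      public void serialize(ColumnChunkMetaData md, JsonGenerator gen,
                            SerializerProvider provider) throws IOException {
        gen.writeStartObject();
        gen.writeStringField("path", md.getPath().toDotString());
        // Persisting the value count directly would fix issue 2 above.
        gen.writeNumberField("valueCount", md.getValueCount());
        gen.writeNumberField("totalSize", md.getTotalSize());
        gen.writeEndObject();
      }
    }

Registered with something like new SimpleModule().addSerializer(
ColumnChunkMetaData.class, new ColumnChunkMetaDataSerializer()) on the
ObjectMapper we already use, that would let us cache the parquet-mr
objects as-is and skip the manual rehydration.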

Thoughts would be great - I'd love to get a patch out for this.
