Hi guys, I'm trying to (re)implement pushdown filtering for Parquet with the new Parquet metadata caching implementation.
I've run into a couple of challenges:

1. Scan batches don't allow empty batches, so if a filter happens to filter out *all* rows we get an exception. I haven't read the full comments on the relevant JIRA items, but it seems odd that we can't query an empty JSON file, for example. This is a bit of a blocker for implementing the pushdown filtering properly.

2. The Parquet metadata cache doesn't include all the relevant metadata. Specifically, the value count is not stored, which breaks the default Parquet statistics filter: it compares the value count against the null count to decide whether a row group can be dropped. This isn't necessarily a blocker, but simulating "1" row in a block just to get around the null comparison feels ugly (see the first sketch below). It also feels ugly rehydrating the standard Parquet metadata objects by hand. I don't understand why we created our own objects for the Parquet metadata instead of writing a custom serializer for the parquet-mr objects we store (see the second sketch below).
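To make the second problem concrete, here's a stripped-down illustration of the all-nulls check (purely illustrative names, not the real parquet-mr StatisticsFilter code):

class AllNullsDemo {
    // The statistics filter asks, roughly: are all values in this column
    // chunk null? If so, a predicate like "col = 5" can never match and
    // the row group can be dropped.
    static boolean looksAllNulls(long valueCount, long numNulls) {
        return numNulls == valueCount;
    }

    public static void main(String[] args) {
        // Honest chunk: 100 values, none null -> keep the row group.
        System.out.println(looksAllNulls(100, 0)); // false

        // Rehydrated from our cache, where valueCount defaults to 0:
        // 0 == 0 reads as "all nulls", so the row group is wrongly dropped.
        System.out.println(looksAllNulls(0, 0));   // true

        // The hack: pretend there is 1 value so the comparison can't fire.
        System.out.println(looksAllNulls(1, 0));   // false
    }
}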

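And here's roughly what I mean by a custom serializer - register a Jackson serializer for parquet-mr's ColumnChunkMetaData rather than mirroring it with our own classes. This is a write-side sketch only, and I'm assuming the org.apache.parquet package names and the usual getters; a matching deserializer would rebuild the object via the ColumnChunkMetaData.get(...) factory:

import java.io.IOException;

import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.module.SimpleModule;

public class ColumnChunkMetaDataSerializer extends JsonSerializer<ColumnChunkMetaData> {

  @Override
  public void serialize(ColumnChunkMetaData md, JsonGenerator gen,
                        SerializerProvider provider) throws IOException {
    gen.writeStartObject();
    gen.writeStringField("path", md.getPath().toDotString());
    gen.writeStringField("type", md.getType().name());
    // The field our current cache format drops entirely:
    gen.writeNumberField("valueCount", md.getValueCount());
    gen.writeNumberField("firstDataPageOffset", md.getFirstDataPageOffset());
    gen.writeNumberField("totalSize", md.getTotalSize());
    // min/max/null-count statistics would be written here as well.
    gen.writeEndObject();
  }

  // Hook it into the ObjectMapper used to write the cache file.
  public static ObjectMapper configuredMapper() {
    SimpleModule module = new SimpleModule();
    module.addSerializer(ColumnChunkMetaData.class, new ColumnChunkMetaDataSerializer());
    return new ObjectMapper().registerModule(module);
  }
}

Thoughts would be great - I'd love to get a patch out for this.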