Hey Adam,

If you have questions about the Parquet side of things, I'm happy to chat.

Julien
On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <par...@apache.org> wrote:
> Parquet metadata has the rowCount for every rowGroup, which is also the
> value count for every column in the rowGroup. Isn't that what you need?
>
> On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
> >
> > Hi guys,
> >
> > I'm trying to (re)implement pushdown filtering for Parquet with the new
> > Parquet metadata caching implementation.
> >
> > I've run into a couple of challenges:
> >
> > 1. Scan batches don't allow empty batches. This means that if a particular
> >    filter filters out *all* rows, we get an exception. I haven't read the
> >    full comments on the relevant JIRA items, but it seems odd that we can't
> >    query an empty JSON file, for example. This is a bit of a blocker to
> >    implementing the pushdown filtering properly.
> > 2. The Parquet metadata doesn't include all the relevant metadata.
> >    Specifically, the count of values is not included, so the default
> >    Parquet statistics filter has issues, because it compares the count of
> >    values with the count of nulls to work out whether it can drop a block.
> >    This isn't necessarily a blocker, but it feels ugly to simulate "1" row
> >    in a block just to get around the null comparison.
> >
> > Also, it feels a bit ugly rehydrating the standard Parquet metadata objects
> > manually. I'm not sure I understand why we created our own objects for the
> > Parquet metadata as opposed to simply writing a custom serializer for the
> > objects we store.
> >
> > Thoughts would be great - I'd love to get a patch out for this.

--
Julien
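[Editor's note: the check Adam describes can be sketched in a few lines of Java. This is a minimal illustration, not Drill's or parquet-mr's actual code; the class and method names (`StatsDropCheck`, `canDropAllNulls`) are hypothetical. The idea is that for a predicate matching only non-null values, a row group can be skipped when every value in the column is null, which is only decidable if the metadata carries both the value count (the rowGroup's rowCount, per Parth's point) and the null count.]

```java
// Hypothetical sketch of the all-nulls drop check that a Parquet-style
// statistics filter performs for a predicate matching only non-null values.
public class StatsDropCheck {

    /**
     * @param valueCount total number of values in the column chunk
     *                   (equals the row group's rowCount, nulls included)
     * @param nullCount  number of nulls reported by the column statistics
     * @return true if every value is null, so the row group cannot contain
     *         a match and may be dropped
     */
    static boolean canDropAllNulls(long valueCount, long nullCount) {
        return nullCount == valueCount;
    }

    public static void main(String[] args) {
        // 100 values, all null: the reader can skip this row group.
        System.out.println(canDropAllNulls(100, 100)); // true
        // 100 values, 3 nulls: some values may match, so keep it.
        System.out.println(canDropAllNulls(100, 3));   // false
    }
}
```

This also shows why faking a value count of "1" is ugly: with the real count missing, the equality test above can never fire, so the filter silently loses its ability to drop all-null row groups.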