alamb commented on PR #22024: URL: https://github.com/apache/datafusion/pull/22024#issuecomment-4450886490
> Thanks for the review Andrew.
>
> I think the biggest thing to point out is that it is _not_ possible to implement this sort of sampling externally by passing in a `ParquetAccessPlan`: you don't know what the row groups and pages look like until you open the file. So it has to be done inside of the opener, either directly or via some extension.

I think you *could* do it by:
1. fetching the ParquetMetadata yourself before creating the plan
2. using the ParquetMetadata to come up with the access plan
3. providing the (pre-calculated) ParquetMetadata to the opener to prevent re-parsing it

It is certainly not easy and is a bit ugly, but I think it is possible.

> I'm not sure what other sampling strategies might look like. To me it only really makes sense to sample at the row group / page level. Do you have thoughts on what other sampling strategies for Parquet would look like? I linked to multiple systems which sample at the "block" level. For Parquet that is row groups / pages. The pages (row fraction) part is perhaps a bit more questionable, I'm happy to remove that and add that as a followup if you'd like.

I can imagine people implementing dynamic sampling strategies, for example, or striped blocks if you knew how your data was clustered.

> I'm open to prototyping on some sort of `ParquetAccessPlanOptimizer` but I'm not sure it will end up being a simple abstraction, I suspect it will be quite leaky. That is: every time you want to add a new optimizer you have to change the API to add more inputs / more context or more outputs / things it can change. The adaptive dynamic filter work for example has to touch _a lot more_ than just the `ParquetAccessPlan`. I'd guess we'd end up with a very leaky abstraction. IMO doing this as structured in this PR and factoring out as much code into its own modules and such probably gets us 90% of the wins without forcing us into APIs we then have to constantly churn.

Yeah, I agree we would have to try it out and see how much commonality there is and if this idea is actually possible. I also realize it is not really fair to make you test this out given it is some crazy idea I came up with
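
For illustration only, step 2 of the external approach listed above (turning already-fetched metadata into an access plan) might look roughly like the sketch below, using the "striped blocks" idea as the sampling strategy. `RowGroupInfo`, `RowGroupAccess`, `striped_plan`, and `planned_rows` are hypothetical stand-ins invented for this example; they are not DataFusion or parquet crate APIs. A real version would read row group counts from the fetched Parquet metadata and mark row groups on a `ParquetAccessPlan` instead.

```rust
/// Hypothetical stand-in for the per-row-group facts read from a Parquet
/// footer in step 1 (not a real DataFusion or parquet type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowGroupInfo {
    pub num_rows: u64,
}

/// Hypothetical stand-in for one entry of an access plan: read the row
/// group or skip it entirely.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowGroupAccess {
    Scan,
    Skip,
}

/// Step 2: derive a plan from the metadata. This "striped" strategy keeps
/// every `stride`-th row group, starting at `offset`, and skips the rest.
pub fn striped_plan(
    row_groups: &[RowGroupInfo],
    stride: usize,
    offset: usize,
) -> Vec<RowGroupAccess> {
    assert!(stride > 0, "stride must be non-zero");
    row_groups
        .iter()
        .enumerate()
        .map(|(i, _)| {
            if i % stride == offset % stride {
                RowGroupAccess::Scan
            } else {
                RowGroupAccess::Skip
            }
        })
        .collect()
}

/// Total number of rows the plan would read, useful for checking the
/// effective sampling fraction before handing the plan to the opener.
pub fn planned_rows(row_groups: &[RowGroupInfo], plan: &[RowGroupAccess]) -> u64 {
    row_groups
        .iter()
        .zip(plan)
        .filter(|(_, access)| **access == RowGroupAccess::Scan)
        .map(|(rg, _)| rg.num_rows)
        .sum()
}
```

The sketch covers only plan construction. Fetching the metadata (step 1) and handing the pre-parsed metadata back to the opener (step 3) are where most of the awkwardness mentioned above would live, and they are not shown here.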
