Re: [PR] feat(parquet): row-group and row-range sampling on ParquetSource [datafusion]

via GitHub Thu, 14 May 2026 06:02:01 -0700


alamb commented on PR #22024:
URL: https://github.com/apache/datafusion/pull/22024#issuecomment-4450886490


   > Thanks for the review Andrew.
   > 
   > I think the biggest thing to point out is that it is _not_ possible to 
implement this sort of sampling externally by passing in a `ParquetAccessPlan`: 
you don't know what the row groups and pages look like until you open the file. 
So it has to be done inside of the opener, either directly or via some 
extension.
   
   I think you *could* do it by:
   1. Fetching the ParquetMetadata yourself before creating the plan
   2. Using the ParquetMetadata to come up with the access plan
   3. Providing the (pre-calculated) ParquetMetadata to the opener to prevent 
re-parsing it
   
   It is certainly not easy and is a bit ugly, but I think it is possible.
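   
   A rough sketch of step 2, assuming the metadata has already been fetched. 
`RowGroupInfo`, `Access`, `SplitMix64`, `sample_row_groups` and `rows_to_scan` 
are hypothetical stand-ins invented for illustration; they are not the 
DataFusion or parquet crate types, and a real version would build an actual 
`ParquetAccessPlan` from the real metadata instead:

```rust
/// Hypothetical stand-in for the per-row-group information in Parquet metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupInfo {
    pub num_rows: u64,
}

/// Hypothetical stand-in for a per-row-group decision in an access plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    Scan,
    Skip,
}

/// Small deterministic PRNG (SplitMix64) so the sketch needs no external crates.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Uniform value in [0, 1), built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Derive one decision per row group from pre-fetched metadata.
/// Each row group is kept independently with probability `fraction`.
pub fn sample_row_groups(row_groups: &[RowGroupInfo], fraction: f64, seed: u64) -> Vec<Access> {
    let mut rng = SplitMix64::new(seed);
    row_groups
        .iter()
        .map(|_| if rng.next_f64() < fraction { Access::Scan } else { Access::Skip })
        .collect()
}

/// Total number of rows the plan would read.
pub fn rows_to_scan(row_groups: &[RowGroupInfo], plan: &[Access]) -> u64 {
    row_groups
        .iter()
        .zip(plan)
        .filter(|(_, access)| **access == Access::Scan)
        .map(|(rg, _)| rg.num_rows)
        .sum()
}
```

   The fixed seed keeps the plan reproducible across runs. Steps 1 and 3 
(fetching the metadata and handing it back to the opener) are deliberately left 
out, since they depend on the real APIs.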
   
   
   > I'm not sure what other sampling strategies might look like. To me it only 
really makes sense to sample at the row group / page level. Do you have 
thoughts on what other sampling strategies for Parquet would look like? I 
linked to multiple systems which sample at the "block" level. For parquet that 
is row groups / pages. The pages (row fraction) part is perhaps a bit more 
questionable, I'm happy to remove that and add that as a followup if you'd like.
   
   I can imagine people implementing dynamic sampling strategies, for example, 
or sampling striped blocks if they know how their data is clustered.
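   
   One possible reading of "striped blocks", as a sketch only: keep a fixed 
number of consecutive row groups out of every stripe. The function 
`striped_selection` and its parameters are hypothetical names for illustration 
and are not part of the PR or of DataFusion:

```rust
/// Keep `keep` consecutive row groups out of every `period`, with the first
/// kept stripe starting at row group `offset` (taken modulo `period`).
/// Returns one flag per row group: `true` means scan, `false` means skip.
pub fn striped_selection(
    num_row_groups: usize,
    period: usize,
    keep: usize,
    offset: usize,
) -> Vec<bool> {
    assert!(period > 0, "period must be non-zero");
    let shift = period - offset % period;
    (0..num_row_groups)
        .map(|i| (i + shift) % period < keep)
        .collect()
}
```

   If the file is clustered so that neighbouring row groups hold similar 
values, a stripe like this spreads the sample across the whole file rather than 
leaving it to chance.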
   
   
   
   > I'm open to prototyping on some sort of `ParquetAccessPlanOptimizer` but 
I'm not sure it will end up being a simple abstraction, I suspect it will be 
quite leaky. That is: every time you want to add a new optimizer you have to 
change the API to add more inputs / more context or more outputs / things it 
can change. The adaptive dynamic filter work for example has to touch _a lot 
more_ than just the `ParquetAccessPlan`. I'd guess we'd end up with a very 
leaky abstraction. IMO doing this as structured in this PR and factoring out as 
much code into its own modules and such probably gets us 90% of the wins 
without forcing us into APIs we then have to constantly churn.
   
   Yeah, I agree we would have to try it out to see how much commonality there 
is and whether this idea is actually possible. I also realize it is not really 
fair to ask you to test this out, given it is some crazy idea I came up with.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


