Re: [PR] Support reverse parquet scan and fast parquet order inversion at row group level [datafusion]

via GitHub Mon, 24 Nov 2025 07:03:37 -0800


zhuqi-lucas commented on PR #18817:
URL: https://github.com/apache/datafusion/pull/18817#issuecomment-3571228433


   > > > I haven't looked into all of this discussion and code (I just got 
tagged). I've been looking into optimizing sorted scanning in DataFusion and 
IMO where we should land is:
   > > > 
   > > > 1. Via metadata (`FileScanConfig` / `ORDERED BY ...` in SQL) users 
declare a known sort order of their files.
   > > > 2. The planner uses statistics from the files + any `ORDER BY` clauses 
in the query to arrange file ordering to best match the query. The `FileSource` 
implementation can also receive the `ORDER BY` information and optimize scan 
order within a file (e.g. reversing the order of reads which is what I think 
this PR is doing).
   > > > 3. If the planner is able to deduce from file level stats that the 
files can be ordered and the `FileSource` reports that it is able to produce 
batches in sorted order then the optimizer can optimize away the sort 
completely.
   > > > 
   > > > I hope that is helpful.
   > > 
   > > 
   > > Thank you @adriangb , it's helpful for future optimization. My current 
implementation focuses on a specific optimization case: when data is already 
sorted and we need the reverse order, we can flip the scan direction instead of 
reading everything and sorting. The `reverse_scan` in `FileSource` handles both the 
file-level and within-file ordering reversal.
   > > I think these approaches are complementary - my PR handles the reverse 
scan optimization, while your vision provides a framework for broader 
sorted-scan optimizations using file-level statistics and metadata. Would be 
great to build toward that architecture incrementally.
   > 
   > My point is that instead of `enable_reverse_scan: bool` we might want to 
consider a more holistic approach e.g. `try_pushdown_sort(&self, order: 
LexOrdering) -> Result<Option<Arc<dyn ExecutionPlan>>>` either at the 
`ExecutionPlan` level or at the `DataSource` level.
   > 
   > I'm not opposed to this as a step towards that but I'm not sure how 
helpful it is. Seeing something more concrete w.r.t. how this interacts with 
the bigger picture would be helpful IMO.
   
   A high-level sort pushdown is a great idea, @adriangb, and 
reverse scan is one of its policies. I will refactor this PR to work that way, 
thanks!
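
To make the idea concrete, here is a minimal, self-contained sketch of reverse scan as one policy behind a `try_pushdown_sort`-style hook. It is not DataFusion code: `SortKey` and `ToyScan` are hypothetical stand-ins for `LexOrdering` and a file source, row groups are modelled as plain `Vec<i64>`, and the method returns `Option<ToyScan>` rather than `Result<Option<Arc<dyn ExecutionPlan>>>`.

```rust
/// Hypothetical stand-in for one sort key. DataFusion's real `LexOrdering`
/// is richer (arbitrary expressions, null ordering); this is illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
}

/// Toy scan: each inner Vec models one row group, and the concatenation of
/// all row groups is assumed to be sorted according to `declared_order`.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyScan {
    pub declared_order: Vec<SortKey>,
    pub row_groups: Vec<Vec<i64>>,
}

/// Flip the direction of every key in an ordering.
fn reversed(order: &[SortKey]) -> Vec<SortKey> {
    order
        .iter()
        .map(|k| SortKey {
            column: k.column.clone(),
            descending: !k.descending,
        })
        .collect()
}

impl ToyScan {
    /// Returns `Some(scan)` if the requested ordering can be produced without
    /// a sort, and `None` if the caller must keep its own sort.
    pub fn try_pushdown_sort(&self, requested: &[SortKey]) -> Option<ToyScan> {
        if requested.is_empty() || requested.len() > self.declared_order.len() {
            return None;
        }
        let prefix = &self.declared_order[..requested.len()];

        // Policy 1: the declared order already satisfies the request.
        if requested == prefix {
            return Some(self.clone());
        }

        // Policy 2 (reverse scan): the request is the exact inverse of the
        // declared order, so read row groups last-to-first and reverse the
        // rows inside each group.
        if requested == reversed(prefix).as_slice() {
            let row_groups: Vec<Vec<i64>> = self
                .row_groups
                .iter()
                .rev()
                .map(|g| g.iter().rev().copied().collect::<Vec<i64>>())
                .collect();
            return Some(ToyScan {
                declared_order: reversed(&self.declared_order),
                row_groups,
            });
        }

        // No policy applies.
        None
    }

    /// Rows in the order the scan would emit them.
    pub fn rows(&self) -> Vec<i64> {
        self.row_groups.iter().flatten().copied().collect()
    }
}
```

In this sketch, a scan declared as `ts ASC` that is asked for `ts DESC` returns a reversed scan, while a request on an unrelated column returns `None` and the sort stays in the plan. Further policies (for example, reordering files by statistics) could presumably be added as additional branches without changing the hook's signature.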


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

