acking-you commented on issue #15512:
URL: https://github.com/apache/datafusion/issues/15512#issuecomment-2813004760
I briefly looked through the descriptions of these optimizations. For example,
the approach of using statistics to dynamically prune work during
`order by limit` is really cool! @alamb
## Idea
But I have a new idea that seems more universally applicable to
`order by limit` (q23):
```text
┌──────────────────────────────────────┐
│ [Step 1] Filter RowGroups │
│ - Use Parquet metadata to skip RGs │
│ WHERE RG.min(EventTime) > cutoff OR│
│ RG lacks URL stats for '%google%' │
└───────────────────┬──────────────────┘
↓
┌──────────────────────────────────────┐
│ [Step 2] Read EventTime + Filter │
│ - Scan EventTime column in valid RGs │
│ - Apply URL LIKE '%google%' filter │
│ - Sort values → Track top 10 rows │
└───────────────────┬──────────────────┘
↓
┌──────────────────────────────────────┐
│ [Step 3] Record Row Locations │
│ - Map top 10 EventTime to physical │
│ positions (RG_ID + Row_Offset) │
└───────────────────┬──────────────────┘
↓
┌──────────────────────────────────────┐
│ [Step 4] Fetch 10 Rows │
│ - Directly read rows from Parquet │
│ via recorded positions (non-seq) │
└───────────────────┬──────────────────┘
↓
Final Result
```
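
To make steps 1 and 2 concrete, here is a minimal sketch using the `parquet` crate's synchronous reader (this is not what DataFusion does today). The file path and the leaf column indices of `EventTime` and `URL` are placeholders, and I assume `EventTime` is stored as a physical Int64. Note that min/max statistics cannot prune an infix `LIKE '%google%'`, so only the `EventTime` cutoff is usable for skipping row groups:

```rust
use std::collections::BinaryHeap;
use std::fs::File;

use arrow::array::{Array, AsArray};
use arrow::datatypes::Int64Type;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;
use parquet::file::statistics::Statistics;

const K: usize = 10;

/// Returns the (EventTime, row_group, row_offset) of the K smallest
/// EventTime values among rows whose URL contains "google".
fn top_k_locations(path: &str) -> Result<Vec<(i64, usize, usize)>, Box<dyn std::error::Error>> {
    // Max-heap holding the K smallest EventTime values seen so far,
    // so peek() is the current cutoff.
    let mut heap: BinaryHeap<(i64, usize, usize)> = BinaryHeap::new();

    let meta = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?
        .metadata()
        .clone();
    // Placeholder leaf indices of EventTime and URL in the hits schema.
    let (time_idx, url_idx) = (4usize, 13usize);

    for rg in 0..meta.num_row_groups() {
        // [Step 1] Skip the row group if even its smallest EventTime cannot
        // beat the current cutoff. (min_opt() needs a recent parquet release.)
        let cutoff = heap.peek().map_or(i64::MAX, |top| top.0);
        if let Some(Statistics::Int64(s)) = meta.row_group(rg).column(time_idx).statistics() {
            if heap.len() == K && s.min_opt().is_some_and(|&min| min > cutoff) {
                continue;
            }
        }

        // [Step 2] Read only EventTime + URL from this row group.
        // (Reopening the file per row group keeps the sketch simple.)
        let builder = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?;
        let mask = ProjectionMask::leaves(builder.parquet_schema(), [time_idx, url_idx]);
        let reader = builder
            .with_row_groups(vec![rg])
            .with_projection(mask)
            .build()?;

        let mut offset = 0;
        for batch in reader {
            let batch = batch?;
            // Projected columns come out in schema order: EventTime first.
            let times = batch.column(0).as_primitive::<Int64Type>();
            let urls = batch.column(1).as_string::<i32>();
            for i in 0..batch.num_rows() {
                if urls.is_valid(i) && urls.value(i).contains("google") {
                    heap.push((times.value(i), rg, offset + i));
                    if heap.len() > K {
                        heap.pop(); // evict the largest; the cutoff tightens
                    }
                }
            }
            offset += batch.num_rows();
        }
    }

    // [Step 3] What survives are the physical locations to fetch in step 4.
    Ok(heap.into_sorted_vec())
}
```

The cutoff tightens as the heap fills, so later row groups become increasingly skippable; visiting row groups in ascending `EventTime.min` order would tighten it even faster.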
## Explore
Currently, q23 takes approximately 6 seconds to execute. I have confirmed
that DataFusion does not implement the optimization above and still scans
a very large number of rows and columns. By the way, is there a convenient way
in `datafusion-cli` to view statistics on the number of rows and columns
scanned? For now, I directly print the batch information inside the `Like`
expression; the output seems endless, and every batch carries exactly 105
columns, so the amount of data being scanned appears to be very large.
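
For reference, that hack is nothing more than a log line at the top of the like expression's `evaluate`, along these lines (a hypothetical helper, not part of DataFusion):

```rust
use arrow::record_batch::RecordBatch;

/// Debug helper mirroring the hack described above: called at the top of the
/// LIKE expression's `evaluate`, it logs the shape of every batch the filter
/// sees, revealing how many rows and columns actually flow through the scan.
fn log_batch_shape(batch: &RecordBatch) {
    eprintln!(
        "LIKE sees batch: {} rows x {} columns",
        batch.num_rows(),
        batch.num_columns()
    );
}
```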

## Some concerns
Parquet files are composed of RowGroups; is it difficult to read an individual
page within one? In my previous work I have seen this optimization implemented
(on top of DataFusion), but against a custom columnar storage format, which
made it easier to build. There, on very large datasets (similar to the hits
dataset), the query time for "order by limit" was reduced to around 2
seconds.
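
As far as I can tell, the `parquet` crate can already get close to page-granularity reads via the page index. Here is an untested sketch of steps 3 and 4: a `RowSelection` built from the recorded `(row_group, row_offset)` locations lets the reader decode only the pages that contain a selected row (the path and offsets come from the previous step):

```rust
use std::fs::File;

use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::{
    ArrowReaderOptions, ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};

/// [Steps 3-4] Fetch full rows at the given ascending `offsets` within one
/// row group.
fn fetch_rows(
    path: &str,
    rg: usize,
    offsets: &[usize],
) -> Result<Vec<RecordBatch>, Box<dyn std::error::Error>> {
    // Loading the page index lets the reader turn the row selection into
    // page-level skips, so pages containing no selected row are not decoded.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder =
        ParquetRecordBatchReaderBuilder::try_new_with_options(File::open(path)?, options)?;

    // Translate sparse row offsets into alternating skip/select runs.
    let mut selectors = Vec::new();
    let mut cursor = 0;
    for &off in offsets {
        if off > cursor {
            selectors.push(RowSelector::skip(off - cursor));
        }
        selectors.push(RowSelector::select(1));
        cursor = off + 1;
    }

    let reader = builder
        .with_row_groups(vec![rg])
        .with_row_selection(RowSelection::from(selectors))
        .build()?;

    let batches: Vec<RecordBatch> = reader.collect::<Result<_, _>>()?;
    Ok(batches)
}
```

With `LIMIT 10`, this decodes at most a handful of pages per column instead of the whole file.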
## Summary
By first sorting on the `order by` column alone, reading the rest of the data
can be deferred until the top rows are known, which is very effective for the
"order by limit" scenario. I'm not sure whether DataFusion currently does this.