zhuqi-lucas commented on PR #9937:
URL: https://github.com/apache/arrow-rs/pull/9937#issuecomment-4394522328

   Following up on the question of "where's the speedup": I added a more 
compelling benchmark in `14479b7`. It simulates the existing **row-group-level 
reverse** strategy (DataFusion apache/datafusion#18817: forward-decode the 
whole chunk, reverse the value buffer, take N) and compares it against the 
**page-level reverse** strategy enabled by this PR.
   
   Numbers on Apple M-series (`--quick`, 100k INT32 values, 98 pages, no 
dictionary, uncompressed):
   
   | N | row_group_sim | page_reverse | Speedup |
   |---:|---:|---:|---:|
   | 10   | 28.8 µs | 564 ns | **~51x** |
   | 100  | 29.0 µs | 602 ns | **~48x** |
   | 1024 | 28.9 µs | 505 ns | **~57x** |
   
   **Why the speedup**: row-group reverse must decode all 98 pages even when 
only the last 10 values (in reverse order) are wanted; page reverse decodes only 
the pages needed to cover N values, starting from the last one (a single page 
for small N), and reverses their values in place. For a fixed N, the ratio grows 
with the number of pages, i.e. the bigger the row group, the larger the win.
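
A minimal sketch of the two strategies, to make the decode-work difference concrete. This is a toy model, not the benchmark from `14479b7` and not the arrow-rs reader API: a "page" is just a `Vec<i32>`, "decoding" is a copy, and `make_pages`, `row_group_reverse`, and `page_reverse` are hypothetical names. Each function returns the values plus the number of pages it touched.

```rust
/// Toy stand-in for one encoded data page (hypothetical; real pages are
/// encoded bytes that must be decoded by the Parquet reader).
type Page = Vec<i32>;

/// Split 0..total into `num_pages` nearly equal pages.
/// Assumes `num_pages > 0`; the layout is illustrative only.
fn make_pages(total: i32, num_pages: usize) -> Vec<Page> {
    let values: Vec<i32> = (0..total).collect();
    let per_page = ((values.len() + num_pages - 1) / num_pages).max(1);
    values.chunks(per_page).map(|c| c.to_vec()).collect()
}

/// Row-group-level reverse: "decode" every page, reverse the whole
/// buffer, then keep the first `n` values.
fn row_group_reverse(pages: &[Page], n: usize) -> (Vec<i32>, usize) {
    let mut all: Vec<i32> = Vec::new();
    for page in pages {
        all.extend_from_slice(page); // stands in for decoding one page
    }
    all.reverse();
    all.truncate(n);
    (all, pages.len())
}

/// Page-level reverse: walk pages from last to first, emit each page's
/// values in reverse, and stop as soon as `n` values are collected.
fn page_reverse(pages: &[Page], n: usize) -> (Vec<i32>, usize) {
    let mut out: Vec<i32> = Vec::with_capacity(n);
    let mut decoded = 0;
    for page in pages.iter().rev() {
        if out.len() >= n {
            break;
        }
        decoded += 1; // only now do we pay for this page
        let remaining = n - out.len();
        out.extend(page.iter().rev().take(remaining).copied());
    }
    (out, decoded)
}
```

Calling both with `make_pages(100_000, 98)` and `n = 10` yields the same ten values, but the row-group variant touches all 98 pages while the page-level variant touches one. The sketch only counts pages; it does not predict the timings in the table.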
   
   **What this models**: it's the closest in-crate proxy for the `ORDER BY DESC 
LIMIT N` query path that DataFusion currently runs through its row-group 
reverse rule. A real query-engine integration (Phase 2) would emit 
`RecordBatch`es page by page rather than gather a `Vec<i32>`, but the 
underlying decode-work ratio is the same.
   
   **Caveats** (called out so this isn't oversold):
   - Both readers run from in-memory `Bytes`, so I/O latency is not modeled. On 
real S3/object storage, both readers would issue one byte-range read per page, 
so the page-reverse advantage should hold and likely grow, because network 
round-trips dominate decode cost.
   - This bench applies no filter. With `RowFilter` pushdown, both strategies 
pay filter-evaluation cost; the page-reverse advantage compounds because the 
row-group strategy filter-evaluates pages it ultimately discards.
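
The filter caveat can be sketched the same way. This is an illustrative model under stated assumptions, not `RowFilter` itself: pages are plain `Vec<i32>`, the predicate is an ordinary closure, the filter is evaluated over a whole page at a time, and `row_group_filtered` / `page_reverse_filtered` are hypothetical names. Each function returns the selected values and the number of rows the predicate was charged for.

```rust
/// Row-group strategy as described above: evaluate the predicate on
/// every row of every page, then reverse and keep the first `n` matches.
fn row_group_filtered(
    pages: &[Vec<i32>],
    n: usize,
    keep: impl Fn(i32) -> bool,
) -> (Vec<i32>, usize) {
    let mut evaluated = 0;
    let mut kept = Vec::new();
    for page in pages {
        for &v in page {
            evaluated += 1;
            if keep(v) {
                kept.push(v);
            }
        }
    }
    kept.reverse();
    kept.truncate(n);
    (kept, evaluated)
}

/// Page-level strategy: visit pages last to first, charge the predicate
/// for each whole page visited, and stop once `n` matches are collected.
fn page_reverse_filtered(
    pages: &[Vec<i32>],
    n: usize,
    keep: impl Fn(i32) -> bool,
) -> (Vec<i32>, usize) {
    let mut evaluated = 0;
    let mut out = Vec::with_capacity(n);
    for page in pages.iter().rev() {
        if out.len() >= n {
            break;
        }
        evaluated += page.len(); // page-granular filter cost (assumption)
        let remaining = n - out.len();
        out.extend(
            page.iter()
                .rev()
                .copied()
                .filter(|&v| keep(v))
                .take(remaining),
        );
    }
    (out, evaluated)
}
```

Unlike the no-filter case, the page-level variant may have to visit several pages when the predicate is selective, so its advantage depends on how quickly N matching rows turn up near the end of the chunk.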
   

