zhuqi-lucas commented on PR #18817: URL: https://github.com/apache/datafusion/pull/18817#issuecomment-3569492649
> > Thank you @2010YOUY01 for the review and the valid concern.
> >
> > You raise valid concerns about memory overhead, which is what I mentioned as the key risk of this approach.
> >
> > However, I want to clarify that row group reversal alone cannot eliminate the SortExec; it only provides TopK filtering benefits. Without reversing rows within each row group, the data remains in the original order (e.g., ASC when we need DESC), so the sort must stay. I propose we keep the complete optimization but default enable_reverse_scan to false. Once we implement page-level caching in arrow-rs (which will reduce memory overhead significantly), we can consider enabling it by default.
>
> Did you mean 'cannot eliminate the SortExec(TopK)'? Just to confirm, there is no global sort, but it is true that we have to do a `topK` on a whole row group for this naive approach.
>
> I have an intuition that for this kind of workload the bottleneck is the parquet decoding speed, and an extra `TopK` won't introduce much additional overhead, so this naive approach can also be pretty fast.
>
> It makes a lot of sense that it's very hard to implement page/row-level reversal on the `arrow-rs` side, so we have to figure out how to do this at the row-group level.
>
> Summary: Perhaps we can start by adding a few end-to-end benchmarks that reflect your typical production workload. If this PR’s approach shows a clear improvement over the naive approach in [#18817 (comment)](https://github.com/apache/datafusion/pull/18817#issuecomment-3568934764) (I'm happy to do a quick prototype), we should definitely move forward.

Nice point @2010YOUY01, I agree that most of the time will be spent decoding pages. I can change this PR to add a config option that implements [#18817 (comment)](https://github.com/apache/datafusion/pull/18817#issuecomment-3568934764), so we have more options to compare. I agree the easier solution is better.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
