zhuqi-lucas commented on PR #18817: URL: https://github.com/apache/datafusion/pull/18817#issuecomment-3569024898
> Supporting scanning Parquet files in reverse order is an absolutely great idea. I have a few questions. > > Let me first rephrase it to make sure I understand correctly, this PR does: > > 1. For applicable query patterns (topK that has reverse order to the parquet existing order), reverse the row-group scanning order > 2. For each row group, first cache all the result, then reverse the row-level order batch by batch. > > This implementation is quite aggressive, I think it can get a bit tricky to tune it right, to avoid excessive caching, or reversing rows batch by batch become too expensive. > > What if we limit the initial implementation only to reverse the row-group order, similar to what @adriangb is planning to do at file level in #17271 After scanning the last row-group, the topk dynamic filter will automatically get updated and skip the preceding row groups. > > * The benefits are simplicity and lower risk of regressions > * The downside is it's too conservative and can't get the optimal performance. But once we have native reverse parquet decoding support in `arrow-rs` (that is described in the original issue [Fast parquet order inversionĀ #17172](https://github.com/apache/datafusion/issues/17172)), we can implement the reverse scan at the row level as follow-ups. Thank you @2010YOUY01 for review and valid concern: You raise valid concerns about memory overhead is what i mentioned the key risk for this approach. However, I want to clarify that row group reversal alone cannot eliminate the SortExec - it only provides TopK filtering benefits. Without reversing rows within each row group, the data remains in the original order (e.g., ASC when we need DESC), so the sort must stay. I propose we keep the complete optimization but default enable_reverse_scan to false. Once we implement page-level caching in arrow-rs (which will reduce memory overhead significantly), we can consider enabling it by default. And we've been running the full implementation (row group + row-level reversal) in production for very long time with excellent results: 10-100x speedups for time-series queries, well-controlled memory usage (~one row group cached at a time), but we need to note we should not make the row group size big if we enable this feature. And with very small limit, the high memory usage is very short time. Also the reverse time is very small compared to the benefit we remove all sort. And if we want to improve the original scan to support limit the initial implementation only to reverse the row-group order, i think we can add follow-up PRs because this is another optimization which can't remove the sort for optimization so we need to do this in another PR. What's your opinion? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
