2010YOUY01 commented on PR #18817: URL: https://github.com/apache/datafusion/pull/18817#issuecomment-3568934764
Supporting scanning Parquet files in reverse order is an absolutely great idea. I have a few questions.

Let me first rephrase it to make sure I understand correctly. This PR does the following:
1. For applicable query patterns (a TopK whose requested order is the reverse of the Parquet file's existing order), reverse the row-group scanning order.
2. For each row group, first cache all of the results, then reverse the row-level order batch by batch.

This implementation is quite aggressive. I think it can get a bit tricky to tune it right, so that it avoids excessive caching and so that reversing rows batch by batch does not become too expensive.

What if we limit the initial implementation to reversing only the row-group order, similar to what @adriangb is planning to do at the file level in https://github.com/apache/datafusion/issues/17271?

After scanning the last row group, the TopK dynamic filter will automatically get updated and skip the preceding row groups.
- The benefits are simplicity and a lower risk of regressions.
- The downside is that it is too conservative and can't reach optimal performance. But once we have native reverse Parquet decoding support in `arrow-rs` (described in the original issue https://github.com/apache/datafusion/issues/17172), we can implement the reverse scan at the row level as a follow-up.
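To make the row-group-only proposal concrete, here is a minimal toy sketch. It is written in Python rather than DataFusion's Rust, and every name in it (`RowGroup`, `max_value`, `top_k_desc`) is hypothetical; it does not use or mirror any DataFusion or `arrow-rs` API. It assumes a file sorted ascending and a query of the form `ORDER BY value DESC LIMIT k`. The min-heap stands in for the TopK operator, and the check against `heap[0]` stands in for the dynamic filter being compared with each row group's max statistic.

```python
import heapq
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group (hypothetical, not a real API)."""
    values: list

    @property
    def max_value(self):
        # Plays the role of the row group's max statistic.
        return max(self.values)


def top_k_desc(row_groups, k, reverse=True):
    """Return (top-k values in descending order, indices of row groups read).

    With reverse=True the row groups are visited last-to-first; rows inside
    a row group are NOT reversed, the heap takes care of ordering.
    """
    if k <= 0:
        return [], []
    order = range(len(row_groups))
    if reverse:
        order = reversed(order)
    heap = []      # min-heap of the k largest values seen so far
    scanned = []   # row groups that actually had to be read
    for idx in order:
        rg = row_groups[idx]
        # Dynamic-filter analogue: once k values are held, a row group whose
        # max is not above the current k-th value cannot change the result.
        if len(heap) == k and rg.max_value <= heap[0]:
            continue
        scanned.append(idx)
        for v in rg.values:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), scanned


groups = [RowGroup([1, 2, 3]), RowGroup([4, 5, 6]), RowGroup([7, 8, 9])]
print(top_k_desc(groups, 2, reverse=True))
print(top_k_desc(groups, 2, reverse=False))
```

In this toy setup, the reverse visit reads only the last row group for `k = 2`, while the forward visit has to read all three, because each later row group still beats the current threshold. How much a real scan gains depends on how well the row-group statistics separate and on when the filter is actually updated, neither of which this sketch models.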
