2010YOUY01 commented on PR #18817:
URL: https://github.com/apache/datafusion/pull/18817#issuecomment-3568934764

   Supporting reverse-order scans of Parquet files is a great idea. I have a 
few questions.
   
   Let me first rephrase it to make sure I understand correctly. This PR does the following:
   
   1. For applicable query patterns (TopK whose sort order is the reverse of 
the Parquet file's existing order), reverse the row-group scanning order.
   2. For each row group, first cache all the results, then reverse the 
row-level order batch by batch.
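
To make sure we are talking about the same thing, here is a minimal sketch of those two steps as I understand them. It is not the PR's code: `Batch`, `reverse_row_group`, and `reverse_scan` are hypothetical names, and a plain `Vec<i64>` stands in for an Arrow record batch.

```rust
/// Hypothetical stand-in for an Arrow record batch: a single i64 column.
type Batch = Vec<i64>;

/// Step 2: hold every batch of one row group in memory, then emit the
/// batches last-to-first with the rows inside each batch reversed.
fn reverse_row_group(batches: Vec<Batch>) -> Vec<Batch> {
    // The whole row group is cached here before anything is emitted.
    let mut cached = batches;
    cached.reverse();
    for batch in cached.iter_mut() {
        batch.reverse();
    }
    cached
}

/// Step 1: visit the row groups from last to first, reversing each one.
fn reverse_scan(row_groups: Vec<Vec<Batch>>) -> Vec<Batch> {
    row_groups
        .into_iter()
        .rev()
        .flat_map(reverse_row_group)
        .collect()
}
```

The memory cost in this sketch is one full row group at a time, which is the caching concern raised below.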
   
   This implementation is quite aggressive. I think it can be tricky to tune 
it so that it avoids excessive caching and keeps the batch-by-batch row 
reversal from becoming too expensive.
   
   What if we limit the initial implementation to reversing only the row-group 
order, similar to what @adriangb is planning to do at the file level in 
https://github.com/apache/datafusion/issues/17271?
   After scanning the last row group, the TopK dynamic filter will 
automatically get updated and skip the preceding row groups.
   - The benefits are simplicity and lower risk of regressions
   - The downside is that it is conservative and won't reach optimal 
performance. But once we have native reverse Parquet decoding support in 
`arrow-rs` (described in the original issue 
https://github.com/apache/datafusion/issues/17172), we can implement the 
reverse scan at the row level as a follow-up.
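
A rough sketch of the row-group-only idea, under loud assumptions: `RowGroup` and `top_k_desc` are hypothetical names, the TopK dynamic filter is modeled as a plain threshold taken from a min-heap, and the query is `ORDER BY v DESC LIMIT k` over a file sorted ascending. None of this is DataFusion's actual API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical row group: its max statistic plus the decoded values.
struct RowGroup {
    max: i64,
    values: Vec<i64>,
}

/// Returns the k largest values in descending order and the number of row
/// groups that were actually decoded.
fn top_k_desc(row_groups: &[RowGroup], k: usize) -> (Vec<i64>, usize) {
    // Min-heap holding the k largest values seen so far.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    let mut scanned = 0;

    // Visit row groups last-to-first; rows inside a group keep their order.
    for rg in row_groups.iter().rev() {
        // Stand-in for the dynamic filter: once k values are held, a row
        // group whose max cannot beat the current threshold is skipped.
        if heap.len() == k {
            let threshold = heap.peek().map(|r| r.0);
            if threshold.map_or(false, |t| rg.max <= t) {
                continue;
            }
        }
        scanned += 1;
        for &v in &rg.values {
            if heap.len() < k {
                heap.push(Reverse(v));
            } else if heap.peek().map_or(false, |r| v > r.0) {
                heap.pop();
                heap.push(Reverse(v));
            }
        }
    }

    let mut out: Vec<i64> = heap.into_iter().map(|r| r.0).collect();
    out.sort_unstable_by(|a, b| b.cmp(a));
    (out, scanned)
}
```

With three ascending row groups and a small `k`, only the last row group gets decoded in this sketch; the rest are pruned by their max statistic, and no row-level reversal or per-row-group caching is needed.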


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

