zhuqi-lucas commented on issue #7363:
URL: https://github.com/apache/arrow-rs/issues/7363#issuecomment-2851331211

   @alamb @XiangpengHao 
   Updated the polish PR with new commit:
   
   
https://github.com/apache/arrow-rs/pull/7428/commits/d26de886685a8fc658b84d7f4e73b87243df5037
   
   I found that some of the regression comes from page cache misses: when a page
is evicted, it has to be decoded again even with the page cache enabled. For
example, our default batch size for ClickBench is 8192, and in the ClickBench
Q27 benchmark this causes a page cache miss rate of more than 20%, because some
batches are larger than one page. With the commit above, the performance
regression is gone.
   
   
   Explanation details:
   ```
   Most cases:
   
   Assumption & observation: each page contains multiple batches.
   Then our pipeline looks like this:
   Load Page 1
   Load batch 1 -> evaluate predicates -> filter 1 -> load & emit batch 1
   Load batch 2 -> evaluate predicates -> filter 2 -> load & emit batch 2
   Load batch 3 -> evaluate predicates -> filter 3 -> load & emit batch 3
   Load Page 2
   Load batch 4 -> evaluate predicates -> filter 4 -> load & emit batch 4
   Load batch 5 -> evaluate predicates -> filter 5 -> load & emit batch 5
   
   
   But some cases:
   Load Page1
   Load batch 1 -> evaluate predicates -> filter 1 -> load & emit batch 1
   
   Load Page2
   Load batch 2 -> evaluate predicates -> filter 2 -> load & emit batch 2
   
   Load Page3
   Load batch 3 -> evaluate predicates -> filter 3 -> load & emit batch 3
   
   
   When we load Page2, the cache is updated to hold Page2, so the next time we
access Page1 it misses and must be decoded again.
   ```
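The thrashing described above can be sketched with a toy model (hypothetical names, not arrow-rs code): a single-slot cache that holds only the most recently decoded page, touched twice per batch (once to evaluate predicates, once to emit the filtered batch). When batches straddle page boundaries, the emit pass re-decodes pages the predicate pass already evicted:

```rust
// Hypothetical single-slot decoded-page cache: it holds only the most
// recently decoded page, so touching a different page evicts it.
struct OnePageCache {
    cached: Option<usize>, // id of the page currently held decoded
    misses: usize,         // number of times a page had to be (re)decoded
}

impl OnePageCache {
    fn new() -> Self {
        Self { cached: None, misses: 0 }
    }

    // Touch `page`: a hit if it is the cached one, otherwise a miss that
    // evicts the previous page and decodes this one.
    fn touch(&mut self, page: usize) {
        if self.cached != Some(page) {
            self.misses += 1;
            self.cached = Some(page);
        }
    }
}

// Simulate reading `total_rows` rows in batches of `batch_rows`, with pages
// holding `page_rows` values each. Each batch touches its pages twice: once
// for predicate evaluation and once to emit the filtered batch.
fn simulate(page_rows: usize, batch_rows: usize, total_rows: usize) -> usize {
    let mut cache = OnePageCache::new();
    let mut row = 0;
    while row < total_rows {
        let end = (row + batch_rows).min(total_rows);
        for _pass in 0..2 {
            for page in (row / page_rows)..=((end - 1) / page_rows) {
                cache.touch(page);
            }
        }
        row = end;
    }
    cache.misses
}

fn main() {
    // Batches (8192 rows) larger than pages (5000 values): every batch spans
    // a page boundary, so the emit pass misses on pages the predicate pass
    // just evicted.
    println!("straddling batches: {} page decodes", simulate(5000, 8192, 40_000));
    // Batches that fit inside one page: each of the 8 pages is decoded once.
    println!("aligned batches:    {} page decodes", simulate(5000, 2500, 40_000));
}
```

In this toy run the straddling configuration decodes pages 20 times versus the ideal 8, while page-aligned batches decode each page exactly once. A cache that can keep every page a batch spans (as the linked commit aims for) turns those repeated touches into hits, which matches the observation that the regression disappears.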

