ParadoxShmaradox commented on issue #2845:
URL: https://github.com/apache/arrow-datafusion/issues/2845#issuecomment-1180398779

   What I ended up doing was collecting the record batches from the DataFrame. Because I know the batches are pre-sorted by the id column in the Parquet file being read, I could skip non-matching batches and apply the compute kernel filters by hand.
   
   This cut the average filtering time dramatically, from 5 ms to 1 ms, across about 100 partitions.
   I wonder whether a record batch could hold statistics on its data, either pre-computed or computed on demand, which DataFusion could then use during physical plan optimization.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
