tafia opened a new pull request, #2156:
URL: https://github.com/apache/iceberg-rust/pull/2156

   Scanning files is both CPU- and IO-intensive. While we can control the 
IO parallelism via the `concurrency_limit*` arguments, all the work is effectively 
done on the same tokio task, and thus on the same CPU core.
   
   This situation is one of the main reasons why iceberg-rust is much slower 
than pyiceberg when reading large files (my test involved a 10 GB file).
   
   This PR proposes splitting scans into chunks that can be spawned 
independently, enabling CPU parallelism.
   
   In my tests (I have yet to work out how to benchmark this directly in this 
project), reading a 10 GB file took:
   - before: 38 s
   - after: 16 s
   - pyiceberg: 15 s
   
   ## Which issue does this PR close?
   
   I haven't found a specific issue, but several comments refer to 
CPU-bound processing.
   
   ## What changes are included in this PR?
   
   This PR proposes splitting scans into chunks that can be spawned 
independently, enabling CPU parallelism.
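   To illustrate the idea, here is a minimal, hypothetical sketch (not the PR's actual code): CPU-bound work is split into chunks that run on separate workers instead of a single task. The sketch uses `std::thread` so it is self-contained; the PR itself spawns the chunks as independent tokio tasks, which achieves the same spreading of CPU work across cores. The `process_chunks_in_parallel` function and its chunk contents are made up for the example.

```rust
use std::thread;

// Illustrative sketch: each chunk is processed on its own worker in parallel,
// rather than all chunks being handled sequentially on one task/core.
fn process_chunks_in_parallel(chunks: Vec<Vec<u64>>) -> u64 {
    thread::scope(|s| {
        // Spawn one worker per chunk; each does its CPU-bound work independently.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        // Join the workers and combine their partial results.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let chunks = vec![vec![1, 2, 3], vec![4, 5], vec![6]];
    println!("{}", process_chunks_in_parallel(chunks)); // prints 21
}
```

   The key point is that the result is identical to sequential processing; only the scheduling changes, so multiple cores can decode chunks concurrently.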
   
   ## Are these changes tested?
   
   I have added a test to show that the change doesn't affect the output. I 
have yet to find a good benchmark to back up my performance claim. Any 
tips on how I could do this would be welcome!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
