The GitHub Actions job "Bindings Python CI" on 
iceberg-rust.git/spawn-multiple-tasks-per-read has failed.
Run started by GitHub user tafia (triggered by tafia).

Head commit for run:
297d707984bfdfd0cece7a9d9e26d4be8129dabd / Johann Tuffe <[email protected]>
feat: (perf) allow spawning multiple tasks per read

Scanning of all files is both cpu and io intensive. While we can
control the io parallelism via concurrency_limit* arguments, all
the work is effectively done on the same tokio task, thus the
same cpu.

This situation is one of the main reasons why iceberg-rust is much
slower than pyiceberg while reading large files (my test involved
a 10G file).

This PR proposes to split scans into chunks which can be spawned
independently to allow cpu parallelism.
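
As a rough illustration of the chunking idea only (not the code in this
commit), the sketch below splits a workload into chunks and runs each chunk
on its own worker. It uses std::thread::scope from the standard library so
it stays self-contained; the actual change concerns tokio tasks, which are
not used here. The names process_chunk and scan_in_chunks are hypothetical
and do not come from iceberg-rust.

```rust
use std::thread;

/// Hypothetical stand-in for the CPU-heavy work of decoding one chunk of a scan.
fn process_chunk(rows: &[u64]) -> u64 {
    rows.iter().map(|r| r.wrapping_mul(31) % 7).sum::<u64>()
}

/// Splits `rows` into chunks of at most `chunk_size` items and processes
/// each chunk on its own scoped thread, then combines the partial results.
/// Assumption: std threads stand in for independently spawned tokio tasks.
fn scan_in_chunks(rows: &[u64], chunk_size: usize) -> u64 {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    thread::scope(|s| {
        // Spawn one worker per chunk; collecting forces all spawns to
        // happen before any join, so the chunks run concurrently.
        let handles: Vec<_> = rows
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || process_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}

fn main() {
    let rows: Vec<u64> = (0..1_000).collect();
    // Single-worker baseline versus the chunked version.
    let sequential = process_chunk(&rows);
    let parallel = scan_in_chunks(&rows, 128);
    assert_eq!(sequential, parallel);
    println!("sequential and chunked results agree");
}
```

The chunk size is the main tuning knob in a design like this: smaller
chunks spread work across more cores but add spawning and coordination
overhead.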

In my tests (I have yet to find how to benchmark it in this project
directly), reading a 10G file:
- before: 38s
- after: 16s
- pyiceberg: 15s

Report URL: https://github.com/apache/iceberg-rust/actions/runs/22252417857

With regards,
GitHub Actions via GitBox
