So the parallelization of SmallPT uses `dynamic` scheduling:
#pragma omp parallel for schedule(dynamic, 1) private(r) // OpenMP
for (int y=0; y<h; y++){ // Loop over image rows
This matters a lot because some rays do not collide with anything and do no
work, while other rays bounce all over the place. Worse, you often have wide
expanses of "easy" regions (sky, walls, ...). With a static split, the threads
assigned to those regions finish early and sit idle, while the threads assigned
to the complex parts are left to do the heavy work all alone.
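To put rough numbers on that, here is a minimal sketch that models both schedules. It is not OpenMP itself and not SmallPT code: `static_makespan` and `dynamic_makespan` are hypothetical helpers, the per-row costs are made up, and the dynamic model is idealized (no scheduling overhead). Both assume at least one thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Time until the last thread finishes when rows are split up front into
// contiguous, equally sized blocks (a simple model of schedule(static)).
// Assumes threads >= 1.
double static_makespan(const std::vector<double>& row_cost, std::size_t threads) {
    const std::size_t n = row_cost.size();
    const std::size_t block = (n + threads - 1) / threads;  // ceil(n / threads)
    double worst = 0.0;
    for (std::size_t t = 0; t < threads; ++t) {
        double busy = 0.0;
        for (std::size_t y = t * block; y < std::min(n, (t + 1) * block); ++y)
            busy += row_cost[y];
        worst = std::max(worst, busy);
    }
    return worst;
}

// Time until the last thread finishes when each row is handed, in order, to
// whichever thread becomes free first (an idealized schedule(dynamic, 1)).
// Assumes threads >= 1.
double dynamic_makespan(const std::vector<double>& row_cost, std::size_t threads) {
    std::vector<double> free_at(threads, 0.0);
    for (double cost : row_cost) {
        auto next = std::min_element(free_at.begin(), free_at.end());
        *next += cost;  // the earliest-free thread takes the next row
    }
    return *std::max_element(free_at.begin(), free_at.end());
}
```

With six cheap rows of cost 1 followed by two expensive rows of cost 10, on 4 threads, the static model gives one thread both expensive rows (20 time units in total) while the other three finish after 2. The dynamic model spreads the expensive rows over two threads and finishes after 11.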
Example:
[https://www.cs.brandeis.edu/~dilant/WebPage_TA160/initialsllides.pdf](https://www.cs.brandeis.edu/~dilant/WebPage_TA160/initialsllides.pdf)
Image:
[https://gist.githubusercontent.com/mratsim/aaecdb8d77582bcfc2994b0ee66b99d5/raw/6cb890ee5ff68946133ba45fd9013ff555156fa2/2020-05-22_18-33.png](https://gist.githubusercontent.com/mratsim/aaecdb8d77582bcfc2994b0ee66b99d5/raw/6cb890ee5ff68946133ba45fd9013ff555156fa2/2020-05-22_18-33.png)
A runtime without any load balancing wouldn't be able to scale a raytracing
application (which makes it an interesting load balancing benchmark).
I suspect the GCC team broke dynamic scheduling so that it now defaults to
static. An easy way to confirm that is to change the schedule from `dynamic` to
`static`, rerun with GCC 8 and Clang, and see if the perf degradation matches
GCC 10.
I leave that as an exercise to the reader (just joking).
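If someone does want to try it, one option (a variation on editing the pragma by hand, not what SmallPT does) is `schedule(runtime)`, which lets the `OMP_SCHEDULE` environment variable pick the schedule without recompiling. A minimal sketch, where `render_row` and `render_image` are hypothetical stand-ins for the real per-row work and the uneven cost is made up:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for rendering one row: rows in the lower half of the
// image are made far more expensive than the rest.
double render_row(int y, int h) {
    const int iterations = (y > h / 2) ? 200000 : 2000;
    double acc = 0.0;
    for (int i = 1; i <= iterations; ++i)
        acc += std::sqrt(static_cast<double>(i));
    return acc;
}

// schedule(runtime) defers the choice of schedule to OMP_SCHEDULE.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
double render_image(int h) {
    std::vector<double> rows(h);
    #pragma omp parallel for schedule(runtime)
    for (int y = 0; y < h; ++y)
        rows[y] = render_row(y, h);

    double total = 0.0;
    for (double r : rows) total += r;  // summed serially, so the result is deterministic
    return total;
}
```

Built with `-fopenmp` and called from a small `main`, the same binary can then be timed with `OMP_SCHEDULE="static"` and with `OMP_SCHEDULE="dynamic,1"` under each compiler.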