Hi all, I would like to check whether someone could help me figure out an issue I am chasing in a libgomp patch intended to partially address the problem described in BZ#79784 [1].

I have identified that one of the bottlenecks is the global barrier used by both the thread pool and the team, which causes a lot of cache ping-pong on high-core-count machines. It also does not seem to be an aarch64-specific issue as hinted by the bugzilla.

So the optimization I am implementing, which is similar to what the LLVM OpenMP implementation does, is to use a per-OMP-thread barrier to synchronize team/task creation. The activation I have implemented so far is a simple linear one, where the master scans linearly over the child threads (LLVM OpenMP implements some fancier ones that I plan to take a look at as well).

The patch I have come up with so far is quite simple [2] and still requires some polish (documentation, code style, etc.). However, there is one regression that has me scratching my head: cancel-parallel-2. It exercises OpenMP cancellation in an 'omp parallel' construct, and I am failing to understand why the final team barrier (done in gomp_team_end, called by GOMP_parallel_end) is not synchronizing correctly with the team barrier in each OpenMP task.

So any help on the design is appreciated (even if it means I should rethink it for libgomp).

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79784
[2] https://github.com/zatrazz/gcc/tree/azanella/libgomp-scalability
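
For readers who want the general idea without reading the branch, here is a minimal standalone sketch of a per-thread barrier with linear gather and release. It is not the patch and not libgomp code: every name in it (struct slot, run_team, and so on) is hypothetical, it uses plain pthreads and C11 atomics, and it spins with sched_yield instead of futexes. Each worker owns a padded slot, so it only touches its own cache lines, while the master scans the slots linearly. Cancellation, which is where cancel-parallel-2 fails, is not modelled at all.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_THREADS 16

/* Hypothetical per-thread barrier slot.  The two flags are aligned to
   64 bytes (an assumed cache-line size) to limit false sharing.  */
struct slot {
  _Alignas(64) atomic_uint arrived;   /* written by the worker */
  _Alignas(64) atomic_uint released;  /* written by the master */
  unsigned work;                      /* payload published by the worker */
};

struct team {
  unsigned nthreads;                  /* includes the master (id 0) */
  unsigned rounds;
  struct slot slots[MAX_THREADS];
};

struct worker_arg { struct team *team; unsigned id; };

/* Worker side: announce arrival, then spin on its own release flag.  */
static void worker_wait(struct slot *s, unsigned gen)
{
  atomic_store_explicit(&s->arrived, gen, memory_order_release);
  while (atomic_load_explicit(&s->released, memory_order_acquire) != gen)
    sched_yield();
}

static void *worker_main(void *p)
{
  struct worker_arg *a = p;
  struct slot *s = &a->team->slots[a->id];
  for (unsigned gen = 1; gen <= a->team->rounds; gen++) {
    s->work = gen;                    /* the "parallel region" work */
    worker_wait(s, gen);
  }
  return NULL;
}

/* Run ROUNDS barrier rounds with NTHREADS threads.  Returns the number of
   rounds in which the master saw every worker's payload after the gather.  */
unsigned run_team(unsigned nthreads, unsigned rounds)
{
  struct team t;
  struct worker_arg args[MAX_THREADS];
  pthread_t tids[MAX_THREADS];
  unsigned good = 0;

  if (nthreads == 0 || nthreads > MAX_THREADS)
    return 0;
  t.nthreads = nthreads;
  t.rounds = rounds;
  for (unsigned i = 0; i < nthreads; i++) {
    atomic_init(&t.slots[i].arrived, 0);
    atomic_init(&t.slots[i].released, 0);
    t.slots[i].work = 0;
  }
  for (unsigned i = 1; i < nthreads; i++) {
    args[i].team = &t;
    args[i].id = i;
    if (pthread_create(&tids[i], NULL, worker_main, &args[i]) != 0) {
      t.nthreads = i;                 /* continue with the threads we got */
      break;
    }
  }
  for (unsigned gen = 1; gen <= rounds; gen++) {
    int ok = 1;
    /* Linear gather: wait for each worker in turn, then check its payload.  */
    for (unsigned i = 1; i < t.nthreads; i++) {
      while (atomic_load_explicit(&t.slots[i].arrived,
                                  memory_order_acquire) != gen)
        sched_yield();
      ok &= (t.slots[i].work == gen);
    }
    /* Linear release: wake each worker through its own flag.  */
    for (unsigned i = 1; i < t.nthreads; i++)
      atomic_store_explicit(&t.slots[i].released, gen, memory_order_release);
    good += ok;
  }
  for (unsigned i = 1; i < t.nthreads; i++) {
    int rc = pthread_join(tids[i], NULL);
    assert(rc == 0);
    (void) rc;
  }
  return good;
}
```

Built with -pthread, a call such as run_team(4, 1000) runs 1000 barrier rounds with one master and three workers. The master does O(n) work per barrier, which is the cost the fancier LLVM schemes are meant to reduce.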