Hi all, I would like to ask whether someone could help me figure out
an issue I am chasing in a libgomp patch intended to partially
address the problem described in BZ#79784 [1].

I have identified that one of the bottlenecks is the global barrier
used by both the thread pool and the team, which causes a lot of cache
ping-pong on high-core-count machines. And it does not seem to be an
aarch64-specific issue, as hinted by the bugzilla.

So the optimization I am implementing, which is similar to what the LLVM
OpenMP implementation does, is to use a per-OMP-thread barrier to
synchronize team/task creation.  The activation I have implemented
so far is a simple linear one, where the master scans linearly over
the child threads (LLVM OpenMP implements some fancier ones that I
plan to take a look at as well).
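
To make the idea concrete, here is a minimal standalone sketch of a
linear per-thread barrier. It is not the code from the patch and it
uses no libgomp internals: struct slot, run_team and the generation
counter scheme are hypothetical names and choices made up for
illustration, built only on C11 atomics and pthreads. Each worker spins
on its own cache-line-padded flag instead of a shared one, and the
master walks the slots linearly, first to release and then to gather.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_THREADS 64
#define CACHE_LINE 64

/* Hypothetical per-thread barrier slot.  Each flag sits on its own
   cache line so a worker only ever spins on memory that the master
   writes for that worker alone.  */
struct slot
{
  _Alignas (CACHE_LINE) atomic_uint go;   /* master -> worker release */
  _Alignas (CACHE_LINE) atomic_uint done; /* worker -> master arrival */
};

static struct slot slots[MAX_THREADS];
static unsigned rounds_total;
/* Shared counter used only to check the demo; a real implementation
   would avoid touching a shared line like this in the hot path.  */
static atomic_uint work_done;

static void *
worker (void *arg)
{
  struct slot *s = arg;
  for (unsigned gen = 1; gen <= rounds_total; gen++)
    {
      /* Wait until the master publishes this generation in our slot.  */
      while (atomic_load_explicit (&s->go, memory_order_acquire) != gen)
        sched_yield ();
      atomic_fetch_add_explicit (&work_done, 1, memory_order_relaxed);
      /* Signal arrival on our own flag.  */
      atomic_store_explicit (&s->done, gen, memory_order_release);
    }
  return NULL;
}

/* Run NWORKERS threads through ROUNDS release/gather cycles and return
   the number of work items executed.  */
unsigned
run_team (unsigned nworkers, unsigned rounds)
{
  pthread_t tids[MAX_THREADS];
  assert (nworkers <= MAX_THREADS);
  rounds_total = rounds;
  atomic_store (&work_done, 0);
  for (unsigned i = 0; i < nworkers; i++)
    {
      atomic_store (&slots[i].go, 0);
      atomic_store (&slots[i].done, 0);
      pthread_create (&tids[i], NULL, worker, &slots[i]);
    }
  for (unsigned gen = 1; gen <= rounds; gen++)
    {
      /* Linear activation: release each worker in turn.  */
      for (unsigned i = 0; i < nworkers; i++)
        atomic_store_explicit (&slots[i].go, gen, memory_order_release);
      /* Linear gather: wait for each worker in turn.  */
      for (unsigned i = 0; i < nworkers; i++)
        while (atomic_load_explicit (&slots[i].done,
                                     memory_order_acquire) != gen)
          sched_yield ();
    }
  for (unsigned i = 0; i < nworkers; i++)
    pthread_join (tids[i], NULL);
  return atomic_load (&work_done);
}
```

The master does O(n) work per barrier, which is the cost of the linear
scheme; the benefit is that no two workers contend on the same line
while waiting.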

The patch I have come up with so far is quite simple [2] and still
requires some polish (some documentation, code styling, etc.); however,
there is one regression that has me scratching my head: cancel-parallel-2.

The test exercises OpenMP cancellation in an 'omp parallel' construct,
and what I am failing to understand is why the final team barrier
(done in gomp_team_end, called by GOMP_parallel_end) is not
synchronizing correctly with the team barrier in each OpenMP task.

So any help on the design is appreciated (even if it means I should
rethink it for libgomp).

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79784
[2] https://github.com/zatrazz/gcc/tree/azanella/libgomp-scalability
