https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #27 from Alexander Nesterovskiy <alexander.nesterovskiy at intel dot com> ---
The place of interest here is a loop in the mat_times_vec function.
For r253678, autopar creates a single outlined function, mat_times_vec.constprop._loopfn.0.
For r256990, mat_times_vec is inlined into bi_cgstab_block and autopar creates three
outlined functions:
bi_cgstab_block.constprop._loopfn.3
bi_cgstab_block.constprop._loopfn.6
bi_cgstab_block.constprop._loopfn.10
The sum of effective CPU time for these functions across all four threads is very
close for r253678 and r256990.
That looks reasonable, since the same amount of computation is done in both cases.
But there is a significant difference in spinning/wait time.
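
For reference, here is a minimal C stand-in for the kind of loop autopar outlines
here (an illustrative sketch only; the actual benchmark routine is Fortran and more
involved). With -ftree-parallelize-loops, GCC outlines the loop body into an
artificial function with a name like the ones above and dispatches it across
threads through libgomp:

  /* Illustrative mat_times_vec-style kernel; the outlined copy would get a
     name like mat_times_vec.constprop._loopfn.0. */
  void mat_times_vec(double *restrict y, const double *restrict a,
                     const double *restrict x, int n)
  {
      int i;
      for (i = 0; i < n; i++)   /* the loop autopar parallelizes */
          y[i] = a[i] * x[i];
  }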

Measuring with OMP_WAIT_POLICY=ACTIVE (thanks to Jakub for the suggestion) seems to
be more informative: threads never sleep, they are either working or spinning.
r253678 case:
  Main thread 0:      ~0% of thread time spinning (~100% working)
  Worker threads 1-3: ~45% of thread time spinning (~55% working)
r256990 case:
  Main thread 0:      ~20% of thread time spinning (~80% working)
  Worker threads 1-3: ~50% of thread time spinning (~50% working)
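
The spinning is easy to see directly. A minimal sketch (assuming Linux, GCC and
libgomp; spin.c is a made-up name, not part of the benchmark): threads 1-3 finish
immediately and wait at the implicit barrier while thread 0 keeps "working", and
the wait policy decides whether that waiting is charged as CPU time:

  /* spin.c - under OMP_WAIT_POLICY=ACTIVE waiting threads busy-wait, so
     their barrier wait shows up as CPU time; under PASSIVE they block in
     the kernel and consume almost none. */
  #include <omp.h>
  #include <unistd.h>

  int main(void)
  {
  #pragma omp parallel
      {
          if (omp_get_thread_num() == 0)
              sleep(1);   /* thread 0 "works"; threads 1-3 wait at the barrier */
      }
      return 0;
  }

Build and compare (roughly ~3s of CPU time with ACTIVE, from the three spinning
waiters, versus near zero with PASSIVE):

  gcc -fopenmp spin.c -o spin
  OMP_NUM_THREADS=4 OMP_WAIT_POLICY=ACTIVE  time ./spin
  OMP_NUM_THREADS=4 OMP_WAIT_POLICY=PASSIVE time ./spin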

I've attached a second chart comparing CPU time for both cases (r253678 vs
r256990_work_spin); I think it illustrates the difference better than the first
one.
