https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #27 from Alexander Nesterovskiy <alexander.nesterovskiy at intel dot com> ---
The place of interest here is a loop in the mat_times_vec function. For r253678, autopar creates a single function, mat_times_vec.constprop._loopfn.0. For r256990, mat_times_vec is inlined into bi_cgstab_block and autopar creates three functions:
bi_cgstab_block.constprop._loopfn.3
bi_cgstab_block.constprop._loopfn.6
bi_cgstab_block.constprop._loopfn.10

The sum of effective CPU time for these functions across all four threads is very close for r253678 and r256990. That looks reasonable, since the same amount of computation is being done in both cases. But there is a significant difference in spinning/wait time.

Measuring with OMP_WAIT_POLICY=ACTIVE seems to be more informative - threads never sleep, they are either working or spinning (thanks, Jakub).

r253678 case:
Main thread 0:        ~0% of thread time spinning (~100% working)
Worker threads 1-3:  ~45% of thread time spinning  (~55% working)

r256990 case:
Main thread 0:       ~20% of thread time spinning  (~80% working)
Worker threads 1-3:  ~50% of thread time spinning  (~50% working)

I've attached a second chart comparing CPU time for both cases (r253678 vs r256990_work_spin); I think it illustrates the difference better than the first one.