https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #27 from Alexander Nesterovskiy <alexander.nesterovskiy at intel dot com> ---
The place of interest here is a loop in the mat_times_vec function. For r253678, autopar creates a single function, mat_times_vec.constprop._loopfn.0. For r256990, mat_times_vec is inlined into bi_cgstab_block and autopar creates three functions:
bi_cgstab_block.constprop._loopfn.3
bi_cgstab_block.constprop._loopfn.6
bi_cgstab_block.constprop._loopfn.10

The sum of effective CPU time for these functions across all four threads is very close for r253678 and r256990. That looks reasonable, since the same amount of computation is being done in both cases. But there is a significant difference in spinning/wait time.

Measuring with OMP_WAIT_POLICY=ACTIVE seems to be more informative - threads never sleep, they are either working or spinning (thanks, Jakub).

r253678 case:
Main thread 0:        ~0% of thread time spinning (~100% working)
Worker threads 1-3:  ~45% of thread time spinning  (~55% working)

r256990 case:
Main thread 0:       ~20% of thread time spinning  (~80% working)
Worker threads 1-3:  ~50% of thread time spinning  (~50% working)

I've attached a second chart comparing CPU time for both cases (r253678 vs r256990_work_spin); I think it illustrates the difference better than the first one.