https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68238
            Bug ID: 68238
           Summary: Vector cost model overestimates prologue cost for SLPed code
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jgreenhalgh at gcc dot gnu.org
  Target Milestone: ---
              Host: *-*-*
            Target: *-*-*

Created attachment 36663
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36663&action=edit
reduced testcase showing high costs analysis

The attached testcase is derived from a benchmark which shows a performance
regression under GCC 4.9 and GCC 5.2.

At the root of the regression is the runtime profitability calculation, which
decides whether to execute the scalar or the vector code path. GCC 4.9 and 5.2
both return a much higher guess at the minimum number of iterations for the
vector code path to be profitable. Consequently, low values of "size" are sent
down the scalar path and show a drop in performance of roughly the same
magnitude as the number of vector lanes your target can load.

I'm compiling the testcase (on x86_64-none-linux-gnu or
aarch64-none-linux-gnu, though AArch64 vector costs are unreliable in 4.9 and
5.2) with:

  <gcc> -O3 slp-costs.c

On my (x86_64) system with GCC 4.8.2, the cost analysis looks like:

  slp-costs.c:7: note: Cost model analysis:
    Vector inside of loop cost: 32
    Vector prologue cost: 10
    Vector epilogue cost: 0
    Scalar iteration cost: 64
    Scalar outside cost: 1
    Vector outside cost: 10
    prologue iterations: 0
    epilogue iterations: 0
    Calculated minimum iters for profitability: 1

On my (x86_64) system with GCC 5.2, the cost analysis looks like:

  slp-costs.c:7:3: note: Cost model analysis:
    Vector inside of loop cost: 32
    Vector prologue cost: 1033
    Vector epilogue cost: 0
    Scalar iteration cost: 64
    Scalar outside cost: 1
    Vector outside cost: 1033
    prologue iterations: 0
    epilogue iterations: 0
    Calculated minimum iters for profitability: 33

Trunk starts to get this right again after r228751.

I had a look at backporting that patch, but it uses some of the new hash-table
infrastructure, so it won't be a trivial backport.

  slp-costs.c:7:3: note: Cost model analysis:
    Vector inside of loop cost: 32
    Vector prologue cost: 10
    Vector epilogue cost: 0
    Scalar iteration cost: 64
    Scalar outside cost: 1
    Vector outside cost: 10
    prologue iterations: 0
    epilogue iterations: 0
    Calculated minimum iters for profitability: 1
  slp-costs.c:7:3: note: Runtime profitability threshold = 0
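
For anyone wanting to check the arithmetic behind the "Calculated minimum iters
for profitability" lines, here is a rough sketch of the calculation as I
understand it from vect_estimate_min_profitable_iters in tree-vect-loop.c. It
is a simplified paraphrase, not the GCC source: the helper name
min_profitable_iters and its parameter list are hypothetical, and the
vectorization factor (vf) is not printed in the dumps, so vf = 1 below is an
assumption that happens to reproduce the reported numbers.

```c
/* Hypothetical helper: a simplified paraphrase of the vectorizer's
   break-even estimate.  Returns the smallest iteration count at which
   the vector path is expected to beat the scalar path, or -1 if the
   vector loop body is never cheaper per iteration.  */
static int
min_profitable_iters (int vec_inside_cost, int vec_outside_cost,
                      int scalar_iter_cost, int scalar_outside_cost,
                      int vf, int prologue_iters, int epilogue_iters)
{
  int iters;

  /* One vector iteration replaces vf scalar iterations; if that is not
     cheaper, no iteration count makes vectorization pay off.  */
  if (scalar_iter_cost * vf <= vec_inside_cost)
    return -1;

  /* No fixed overhead to amortize.  */
  if (vec_outside_cost <= 0)
    return 1;

  /* Extra fixed cost of the vector path divided by the saving per
     vector iteration (integer division, rounds towards zero).  */
  iters = ((vec_outside_cost - scalar_outside_cost) * vf
           - vec_inside_cost * prologue_iters
           - vec_inside_cost * epilogue_iters)
          / (scalar_iter_cost * vf - vec_inside_cost);

  /* If the scalar path is still no more expensive at that count,
     one more iteration is needed.  */
  if (scalar_iter_cost * vf * iters
      <= vec_inside_cost * iters
         + (vec_outside_cost - scalar_outside_cost) * vf)
    iters++;

  return iters;
}
```

With the costs from the dumps and the assumed vf = 1,
min_profitable_iters (32, 10, 64, 1, 1, 0, 0) gives 1 (the 4.8.2 and trunk
result) and min_profitable_iters (32, 1033, 64, 1, 1, 0, 0) gives 33 (the 5.2
result), so the whole difference comes from the prologue cost.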