https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118818
Bug ID: 118818
Summary: Optimization of divps to rcpps + newton can cause slowdown
Product: gcc
Version: 14.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: benjamin.meier70 at gmail dot com
Target Milestone: ---
Hey
I work a lot with SSE vectorized code, mainly with floats.
gcc optimizes most of the code very well. When I compute reciprocals, I've
noticed that it replaces `divps` by `rcpps` + newton. It seems to be a smart
optimization, but on many machines it's actually slower than `divps`.
E.g. the following test program can be used to test that:
------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <math.h>
#include <xmmintrin.h>
#include <unistd.h>
#define N (1024 * 16)
#define FORCE_INLINE inline __attribute__((always_inline))
__attribute__((aligned(16))) float y[N] = {0};
static FORCE_INLINE __m128 inverse(__m128 x, __m128 one)
{
    // good old 1.0f/x
    return _mm_div_ps(one, x);
}
__attribute__((noinline))
void f(const float *restrict in, float *restrict out)
{
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < N; i += 4)
    {
        __m128 v_in = _mm_load_ps(&in[i]);
        __m128 v_out = inverse(v_in, one);
        _mm_store_ps(&out[i], v_out);
    }
}
uint64_t takeMonotonicTimestampNs(void)
{
    struct timespec tv_start;
    clock_gettime(CLOCK_MONOTONIC_RAW, &tv_start);
    // widen before multiplying so the result does not depend on the width of long
    return ((uint64_t)tv_start.tv_sec * 1000000000u) + (uint64_t)tv_start.tv_nsec;
}
void test_lat(const float *restrict values)
{
    // warm-up run, not timed
    f(values, y);
    uint64_t tsa = takeMonotonicTimestampNs();
    for (int i = 0; i < 100000; ++i)
    {
        f(values, y);
    }
    uint64_t tsb = takeMonotonicTimestampNs();
    printf("%.10f\n", y[N - 1]);
    printf("%.3fms (slow)\n", (tsb - tsa) / 1e6);
}
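int main(void)
{
    // generate some "random" inputs
    srand(0);
    float *values = aligned_alloc(16, N * sizeof(values[0]));
    for (int i = 0; i < N; ++i) {
        // multiply as floats: the int product can overflow
        values[i] = ((float)rand() + 1.0f) * ((float)rand() + 1.0f);
    }
    // runs until interrupted, printing one timing per iteration
    while (1) {
        test_lat(values);
    }
}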
------------------------------
Compile with `divps`: gcc -O3 ./main.c -msse4.2
Compile with `rcpps` + newton: gcc -Ofast ./main.c -msse4.2
With `divps` it's about 25% faster (tested on an Intel(R) Xeon(R) Platinum
8275CL CPU).
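For reference, a minimal sketch of what `rcpps` + newton amounts to, written by
hand with intrinsics: the hardware estimate x0 is refined by one Newton-Raphson
step, x1 = x0 * (2 - a * x0). The names `recip_nr` and `recip_array` are made up
for illustration, and the exact instruction sequence gcc emits may be arranged
differently:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Approximate 1/a for four floats: rcpps estimate plus one
   Newton-Raphson refinement, x1 = x0 * (2 - a * x0). */
static inline __m128 recip_nr(__m128 a)
{
    const __m128 two = _mm_set1_ps(2.0f);
    __m128 x0 = _mm_rcp_ps(a);                 /* low-precision estimate */
    __m128 t  = _mm_sub_ps(two, _mm_mul_ps(a, x0));
    return _mm_mul_ps(x0, t);                  /* refined estimate */
}

/* Hypothetical helper: out[i] ~= 1/in[i]; n must be a multiple of 4.
   Unaligned loads/stores are used so any float buffer works. */
void recip_array(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
    {
        _mm_storeu_ps(&out[i], recip_nr(_mm_loadu_ps(&in[i])));
    }
}
```

Compared to a single `divps`, this is one `rcpps`, two multiplies and a subtract
per vector, and the result is close to but not guaranteed to be bit-identical
with the exact quotient.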
Can this specific optimization be disabled? I mean only the one where div gets
replaced by rcp plus newton. In general gcc's optimizations work very well, so
I'd rather not disable anything else.
Also, is there a reason why this optimization is still used? I believe it was
faster at some point, but maybe that's not the case anymore? I can also see
that icx does not do this optimization.
Thanks a lot