https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121416
--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
For completeness, modifying OpenACC's reduction-cplx-dbl.c to use atomics, i.e.
  #pragma acc parallel num_gangs (32) copyin(ary[0:N]) copy(tsum,tprod)
  #pragma acc loop gang
    for (int ix = 0; ix < N; ix++)
      {
  #pragma acc atomic update
        __real__ tsum += __real__ ary[ix];
  #pragma acc atomic update
        __imag__ tsum += __imag__ ary[ix];
      }
also yields the correct result.
[Here, with atomics, the data is updated on every loop iteration, and not once
per thread/worker and once per team/gang as with reductions.]
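
For reference, a minimal self-contained sketch of the atomic variant follows. The function name atomic_sum, the value of N, and dropping tprod are assumptions made for illustration; they are not taken from reduction-cplx-dbl.c. It relies on the GNU C __real__/__imag__ extensions, and when compiled without -fopenacc the pragmas are ignored and the loop simply runs sequentially on the host.

```c
#define N 1024  /* assumed size, not taken from the testcase */

/* Hypothetical helper: sums ary[0..N-1] into a complex double, updating
   the real and imaginary parts with two separate atomic updates.  */
static _Complex double
atomic_sum (const _Complex double *ary)
{
  _Complex double tsum = 0.0;

#pragma acc parallel num_gangs (32) copyin(ary[0:N]) copy(tsum)
#pragma acc loop gang
  for (int ix = 0; ix < N; ix++)
    {
      /* Each component is a scalar double, so each update is a plain
         atomic read-modify-write performed on every iteration.  */
#pragma acc atomic update
      __real__ tsum += __real__ ary[ix];
#pragma acc atomic update
      __imag__ tsum += __imag__ ary[ix];
    }

  return tsum;
}
```

Note that the two atomics are independent: the real and imaginary parts are each updated atomically, but the complex value as a whole is not. That is sufficient for a sum, where the components do not interact.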