https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dje at gcc dot gnu.org,
                   |                            |meissner at gcc dot gnu.org,
                   |                            |segher at gcc dot gnu.org
          Component|libgomp                     |target

--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Seems to be a powerpc64le backend bug or RA bug.
Reduced testcase for -fopenacc -O1:

program reduction_3
  implicit none
  integer, parameter :: n = 10, vl = 32
  integer :: i
  double precision :: vresult, rv
  double precision, parameter :: e = 0.001
  double precision, dimension (n) :: array

  do i = 1, n
     array(i) = i
  end do

  rv = 0
  vresult = 0
  !$acc parallel vector_length(vl) copy(rv)
  !$acc loop reduction(max:rv) vector
  do i = 1, n
     rv = max (rv, array(i))
  end do
  !$acc end parallel

  do i = 1, n
     vresult = max (vresult, array(i))
  end do

  if (abs (rv - vresult) .ge. e) STOP 11
end program reduction_3

In *.optimized it looks all correct:

  <bb 3> [local count: 437450368]:
  # vect_M.23_45 = PHI <vect_cst__39(2), vect_M.27_34(3)>
  # ivtmp.34_3 = PHI <ivtmp.34_43(2), ivtmp.34_4(3)>
  _2 = (void *) ivtmp.34_3;
  vect__28.26_44 = MEM[base: _2, offset: 0B];
  vect_M.27_34 = MAX_EXPR <vect__28.26_44, vect_M.23_45>;
  ivtmp.34_4 = ivtmp.34_3 + 16;
  if (ivtmp.34_4 != _25)
    goto <bb 3>; [80.00%]
  else
    goto <bb 4>; [20.00%]

  <bb 4> [local count: 437450371]:
  stmp_M.28_8 = .REDUC_MAX (vect_M.27_34);
  *_10 = stmp_M.28_8;

and the loop indeed iterates properly and we end up with { 10.0, 9.0 } vector
which REDUC_MAX ifn should reduce to 10.0.
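
For readers who want to see what the *.optimized loop computes, here is a minimal Python sketch of the same shape: a 2-wide element-wise MAX_EXPR accumulator followed by a horizontal REDUC_MAX. The function names are hypothetical, the initial accumulator is assumed to be all zeros (matching rv = 0), and the lane order in this model is illustrative only; it need not match the order the debugger prints.

```python
def vectorized_max(array, width=2, init=0.0):
    """Model the loop in <bb 3>: element-wise max into a 2-lane accumulator."""
    acc = [init] * width  # assumption: the initial vector is all zeros
    for start in range(0, len(array), width):
        chunk = array[start:start + width]  # one 16-byte vector load
        acc = [max(a, c) for a, c in zip(acc, chunk)]  # MAX_EXPR per lane
    return acc


def reduc_max(vec):
    """Model .REDUC_MAX in <bb 4>: horizontal max over all lanes."""
    return max(vec)


array = [float(i) for i in range(1, 11)]  # array(i) = i for i = 1..10
acc = vectorized_max(array)
result = reduc_max(acc)
print(sorted(acc), result)
```

Whatever the lane order, the accumulator holds 9.0 and 10.0 and a correct horizontal reduction yields 10.0.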

During early RTL opts it also looks correct:

(insn 20 19 21 4 (parallel [
            (set (reg:V2DF 134)
                (smax:V2DF (vec_concat:V2DF (vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
                            (parallel [
                                    (const_int 1 [0x1])
                                ]))
                        (vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
                            (parallel [
                                    (const_int 0 [0])
                                ])))
                    (reg:V2DF 128 [ vect_M.23 ])))
            (clobber (scratch:V2DF))
        ]) 1330 {vsx_reduc_smax_v2df}
     (nil))
(insn 21 20 22 4 (set (reg:DF 123 [ stmp_M.28 ])
        (vec_select:DF (reg:V2DF 134)
            (parallel [
                    (const_int 0 [0])
                ]))) 1219 {vsx_extract_v2df}
     (nil))

Then combine turns that into:

(insn 21 20 22 4 (parallel [
            (set (reg:DF 123 [ stmp_M.28 ])
                (vec_select:DF (smax:V2DF (vec_concat:V2DF (vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
                                (parallel [
                                        (const_int 1 [0x1])
                                    ]))
                            (vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
                                (parallel [
                                        (const_int 0 [0])
                                    ])))
                        (reg:V2DF 128 [ vect_M.23 ]))
                    (parallel [
                            (const_int 1 [0x1])
                        ])))
            (clobber (scratch:DF))
        ]) 1336 {*vsx_reduc_smax_v2df_scalar}
     (expr_list:REG_DEAD (reg:V2DF 128 [ vect_M.23 ])
        (nil)))

That is then split into:

(insn 34 20 35 4 (set (reg:DF 137)
        (vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
            (parallel [
                    (const_int 1 [0x1])
                ]))) -1
     (nil))
(insn 35 34 22 4 (set (reg:DF 123 [ stmp_M.28 ])
        (smax:DF (subreg:DF (reg:V2DF 128 [ vect_M.23 ]) 8)
            (reg:DF 137))) -1
     (nil))

at which point I'm already not sure if it is correct or not.  As I said, at
least in the debugger it shows that the input to this .REDUC_MAX contains the
value { 10, 9 }.  Is the vec_select extracting the second elt (i.e. 9.0), and
is (subreg 8) also the second one?
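
To make the suspicion concrete, here is a small Python model of the two split insns. It rests on one loudly stated assumption, which is exactly the open question: that the vec_select index and the subreg byte offset are both counted in the same (memory) order, so byte offset 8 of a V2DF names the same element as index 1. The helper names are hypothetical and are not GCC APIs.

```python
ELT_BYTES = 8  # one DF element of a V2DF vector is 8 bytes


def vec_select(vec, index):
    """Model (vec_select:DF vec (parallel [(const_int index)]))."""
    return vec[index]


def subreg_df(vec, byte_offset):
    """Model (subreg:DF vec byte_offset).

    Assumption: the offset is counted in the same order as vec_select
    indices, so offset 8 lands on element 1.
    """
    return vec[byte_offset // ELT_BYTES]


def split_reduc_smax(vec):
    """Model insns 34 and 35 from the split above."""
    tmp = vec_select(vec, 1)             # insn 34: reg 137 = element 1
    return max(subreg_df(vec, 8), tmp)   # insn 35: smax of (subreg 8) and reg 137


vec = [10.0, 9.0]  # the value the debugger shows as input to .REDUC_MAX
print(split_reduc_smax(vec), max(vec))
```

Under that assumption both operands of the scalar smax are element 1, so the split computes max(9.0, 9.0) = 9.0 instead of the expected 10.0, and element 0 is never read.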

In the end, that is what happens, the resulting assembly is:

   0x000000001000086c <+32>:    lxvd2x  vs0,0,r9
   0x0000000010000870 <+36>:    addi    r8,r1,-16
   0x0000000010000874 <+40>:    lxvd2x  vs12,0,r8
   0x0000000010000878 <+44>:    xxswapd vs12,vs12
   0x000000001000087c <+48>:    xvmaxdp vs0,vs12,vs0
   0x0000000010000880 <+52>:    xxswapd vs0,vs0
   0x0000000010000884 <+56>:    stxvd2x vs0,0,r8
   0x0000000010000888 <+60>:    xxswapd vs0,vs0
   0x000000001000088c <+64>:    addi    r9,r9,16
   0x0000000010000890 <+68>:    bdnz    0x1000086c <MAIN__._omp_fn.0+32>
=> 0x0000000010000894 <+72>:    lfd     f12,-8(r1)
   0x0000000010000898 <+76>:    xsmaxdp vs0,vs12,vs0
   0x000000001000089c <+80>:    stfd    f0,0(r10)
   0x00000000100008a0 <+84>:    blr

and at that point

x/2fg $r1-16
0x3fffffffed90: 10      9
p $vs0.v2_double
$6 = {10, 9}
p $vs12.v2_double
$7 = {8, 7}

Now, the lfd loads into f12 the second element (i.e. 9), in the debugger it
shows

p $vs12.v2_double
$8 = {0, 9}

after the lfd insn, and xsmaxdp {10, 9}, {0, 9} gives {0, 9} and that is what
we store.

So, does vsx_reduc_smax_v2df_scalar expander need adjustments for
little-endian?
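
The last three instructions can be replayed in a few lines of Python, using only the values printed in the debugger session above. This is a sketch, not an instruction-accurate simulator: it assumes the stack slot at r1-16 holds two little-endian doubles as x/2fg shows them, and that the scalar lane used by lfd and xsmaxdp is the one the debugger prints at index 1 (consistent with vs12 reading {0, 9} after the lfd). SCALAR_LANE and the helper name lfd are illustrative choices.

```python
import struct

# Stack slot at r1-16 as printed by "x/2fg $r1-16": 10 then 9.
slot = struct.pack("<2d", 10.0, 9.0)

SCALAR_LANE = 1  # assumption: index of the scalar lane in the debugger's display


def lfd(memory, offset):
    """Load one 8-byte double from a byte offset (models lfd f12,-8(r1))."""
    return struct.unpack_from("<d", memory, offset)[0]


vs0 = [10.0, 9.0]          # p $vs0.v2_double before the lfd
f12 = lfd(slot, 8)         # -8(r1) is byte 8 of the slot that starts at r1-16
vs12 = [0.0, f12]          # p $vs12.v2_double after the lfd

# xsmaxdp vs0,vs12,vs0 compares only the scalar lanes of its inputs.
stored = max(vs12[SCALAR_LANE], vs0[SCALAR_LANE])
print(f12, stored, max(vs0))
```

In this model the lfd reads 9.0 and the scalar max compares 9.0 with 9.0, so 9.0 is stored while the full-vector maximum is 10.0, which matches the STOP 11 failure in the testcase.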