[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2021-02-19 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

--- Comment #8 from Vladimir Makarov  ---
I've tried different approaches to fix it.  The best patch I have now is in the
attachment.

Unfortunately, the best patch results in two new failures on ppc64 (other
patches are even worse):

gcc.target/powerpc/dform-3.c scan-assembler-not mfvsrd
gcc.target/powerpc/dform-3.c scan-assembler-not mfvsrld

I'll think more about how to avoid these two failures.  If I succeed, I'll
submit a patch.  But there is a chance that the PR will not be fixed at all.

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2021-02-19 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

--- Comment #7 from Vladimir Makarov  ---
Created attachment 50225
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50225&action=edit
A candidate patch

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

--- Comment #6 from Hongtao.liu  ---
(In reply to Alexander Monakov from comment #5)
> afaict LRA is just following IRA decisions, and IRA allocates that pseudo to
> memory due to costs.
> 
> Not sure where strange cost is coming from, but it depends on x86 tuning
> options: with -mtune=skylake we get the expected code, with -mtune=haswell
> we get 128-bit vectors right and extra load for 256-bit, with -mtune=generic
> both cases have extra loads.

in the following code (scan_one_insn in ira-costs.c):

  /* If this insn loads a parameter from its stack slot, then it
     represents a savings, rather than a cost, if the parameter is
     stored in memory.  Record this fact.

     Similarly if we're loading other constants from memory (constant
     pool, TOC references, small data areas, etc) and this is the only
     assignment to the destination pseudo.

     Don't do this if SET_SRC (set) isn't a general operand, if it is
     a memory requiring special instructions to load it, decreasing
     mem_cost might result in it being loaded using the specialized
     instruction into a register, then stored into stack and loaded
     again from the stack.  See PR52208.

     Don't do this if SET_SRC (set) has side effect.  See PR56124.  */
  if (set != 0 && REG_P (SET_DEST (set)) && MEM_P (SET_SRC (set))
      && (note = find_reg_note (insn, REG_EQUIV, NULL_RTX)) != NULL_RTX
      && ((MEM_P (XEXP (note, 0))
           && !side_effects_p (SET_SRC (set)))
          || (CONSTANT_P (XEXP (note, 0))
              && targetm.legitimate_constant_p (GET_MODE (SET_DEST (set)),
                                                XEXP (note, 0))
              && REG_N_SETS (REGNO (SET_DEST (set))) == 1))
      && general_operand (SET_SRC (set), GET_MODE (SET_SRC (set)))
      /* LRA does not use equiv with a symbol for PIC code.  */
      && (! ira_use_lra_p || ! pic_offset_table_rtx
          || ! contains_symbol_ref_p (XEXP (note, 0))))
    {
      enum reg_class cl = GENERAL_REGS;
      rtx reg = SET_DEST (set);
      int num = COST_INDEX (REGNO (reg));

      COSTS (costs, num)->mem_cost
        -= ira_memory_move_cost[GET_MODE (reg)][cl][1] * frequency;
      record_address_regs (GET_MODE (SET_SRC (set)),
                           MEM_ADDR_SPACE (SET_SRC (set)),
                           XEXP (SET_SRC (set), 0), 0, MEM, SCRATCH,
                           frequency * 2);
      counted_mem = true;
    }
---

For this insn:

(insn 9 8 11 3 (set (reg:V2DI 88 [ _16 ])
(mem:V2DI (plus:DI (reg/v/f:DI 91 [ input ])
(reg:DI 89 [ ivtmp.11 ])) [0 MEM[(const __m128i *
{ref-all})input_7(D) + ivtmp.11_40 * 1]+0 S16 A128]))
"/export/users2/liuhongt/tools-build/build_gcc11_master_debug/gcc/include/emmintrin.h":697:10
1405 {movv2di_internal}

mem_cost for r88 is decreased by ira_memory_move_cost[V2DImode][GENERAL_REGS][1],
giving -11808 as an initial value, but in reality it should be decreased by
ira_memory_move_cost[V2DImode][SSE_REGS][1], which would give -5905 as an
initial value.  So too much preference is given to memory here.

Then, in the later record_operand_costs, when IRA finds that r88 is also used
in shift and ior instructions, the mem_cost for r88 increases, but it is still
smaller than the cost of SSE_REGS because of the excessive preference given to
memory above.  Finally, IRA chooses memory for r88 because it has the lowest
cost, which is suboptimal.

a10(r88,l1) costs: SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0,0 SSE_REGS:0,0
MEM:-984,-984

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-12 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P2

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

--- Comment #5 from Alexander Monakov  ---
afaict LRA is just following IRA decisions, and IRA allocates that pseudo to
memory due to costs.

Not sure where strange cost is coming from, but it depends on x86 tuning
options: with -mtune=skylake we get the expected code, with -mtune=haswell we
get 128-bit vectors right and extra load for 256-bit, with -mtune=generic both
cases have extra loads.

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-12 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Jakub Jelinek  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2020-10-12
   Keywords||ra
 CC||jakub at gcc dot gnu.org,
   ||vmakarov at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek  ---
Seems it is LRA that creates this (propagates the MEM load into the ior
instruction, even when the memory location is already loaded for another
instruction earlier).  Perhaps it wouldn't do it if there weren't the subregs
(the load is in V4SImode, while ior uses V2DImode).

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Richard Biener  changed:

   What|Removed |Added

 Depends on||93943
   Target Milestone|--- |8.5

--- Comment #3 from Richard Biener  ---
Maybe the same issue as PR93943?  We seem to be quite trigger-happy with
substituting memory for operands.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93943
[Bug 93943] IRA/LRA happily rematerialize (un-CSEs) loads without register
pressure

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
Intrinsics being type-agnostic causes vector subregs to appear before register
allocation: the pseudo coming from the load has mode V2DI, the shift needs to
be done in mode V4SI, and the bitwise-or and the store are done in mode V2DI
again.  The subreg in the bitwise-or appears to be handled inefficiently.  I
didn't dig deeper into what happens during allocation.
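The mode mismatch described above can be reproduced with a sketch like the
following (not the exact testcase from the PR; the function name mirrors the
generic-vector variant below and is illustrative):

```c
#include <stdint.h>
#include <emmintrin.h>

/* Load in one mode, shift in another, then or/store in the first mode
   again: the pattern that makes vector subregs appear before RA.  */
void intrinsics_double_load_128 (int8_t *out, const int8_t *input)
{
  for (unsigned i = 0; i < 1024; i += 16)
    {
      __m128i in = _mm_loadu_si128 ((const __m128i *) (input + i)); /* V2DI load  */
      __m128i sh = _mm_srli_epi32 (in, 4);                          /* V4SI shift */
      _mm_storeu_si128 ((__m128i *) (out + i),
                        _mm_or_si128 (in, sh));                     /* V2DI ior   */
    }
}
```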

FWIW, using generic vectors makes it possible to avoid introducing such
mismatches, and indeed the variant coded with generic vectors does not have the
extra loads.  For your original code you'll have to convert between generic
vectors and __m128i to use the shuffle intrinsic.  The last paragraphs of the
"Vector Extensions" chapter [1] suggest using a union for that purpose in C; in
C++ reinterpreting via a union is formally UB, so another approach should be
used (probably simply converting via assignment).

[1] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

#include <stdint.h>

typedef uint32_t u32v4 __attribute__((vector_size(16)));

void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    const u32v4 *vin = (const u32v4 *)input;
    u32v4 *vout = (u32v4 *)out;
    for (unsigned i = 0; i < 1024; i += 16) {
        u32v4 in = *vin++;
        *vout++ = in | (in >> 4);
    }
}

Above code on Compiler Explorer: https://godbolt.org/z/MKPvxb
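The union-based conversion suggested above could look like the following
sketch.  The function name and the shuffle constant (0x1b, which reverses the
four lanes) are illustrative choices, not anything from the PR; in C this
type-punning through a union is well-defined, as the linked manual notes:

```c
#include <stdint.h>
#include <emmintrin.h>

typedef uint32_t u32v4 __attribute__((vector_size(16)));

typedef union {
    __m128i mi;   /* intrinsic view       */
    u32v4   gv;   /* generic-vector view  */
} v128;

/* Do the shift/or with generic-vector operators, then use an SSE2
   shuffle intrinsic on the same bits via the union.  */
u32v4 or_shift_then_shuffle(u32v4 in)
{
    v128 t;
    t.gv = in | (in >> 4);
    t.mi = _mm_shuffle_epi32(t.mi, 0x1b);  /* reverse the four 32-bit lanes */
    return t.gv;
}
```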

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-10 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

--- Comment #1 from Peter Cordes  ---
Forgot to include https://godbolt.org/z/q44r13