Re: [PATCH][AArch64] Avoid paradoxical subregs for vector initialisation

2018-05-29 Thread Kyrill Tkachov

Hi Richard,

On 29/05/18 15:26, Richard Sandiford wrote:

Kyrill  Tkachov  writes:

Hi all,

The recent changes to aarch64_expand_vector_init cause an ICE in the
attached testcase.  The register allocator "ICEs with Max. number of
generated reload insns per insn is achieved (90)"

That is because aarch64_expand_vector_init creates a paradoxical subreg to move 
a DImode value
into a V2DI vector:
(insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
  (subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1050 {*aarch64_simd_movv2di}

This is done because we want to express that the whole of the V2DI
vector will be written so that init-regs doesn't try to
zero-initialise it before we overwrite each lane individually anyway.

This can go bad for because if the DImode value is allocated in, say,
x30: the last register in that register class, the V2DI subreg of that
isn't valid or meaningful and that seems to cause the trouble.

It's kinda hard to know what the right solution for this is.
We could emit a duplicate of the value into all the lanes of the vector, but we 
have tests that test against that
(we're trying to avoid unnecessary duplicates)

What this patch does is it defines a pattern for moving a scalar into
lane 0 of a vector using a simple FMOV or LDR and represents that as a
merging with a vector of zeroes.  That way, the instruction represents
a full write of the destination vector but doesn't "read" more bits
from the source than necessary. The zeroing effect is also a more
accurate representation of the semantics of FMOV.

This feels like a hack.  Either the paradoxical subreg of the pseudo
is invalid for some reason (in which case we should ICE when it's formed,
not just in the case of x30 being allocated) or the subreg is valid,
in which case the RA should handle it correctly (and the backend should
give it the information it needs to do that).

I could see the argument for ignoring the problem for expediency if the
patch was a clean-up in its own right, but I think it's wrong to add so
much code to paper over a bug.


I see what you mean. Do you have any thoughts on where in RA we'd go about 
fixing this?
Since I don't know my way around RA I tried in the place I was most comfortable 
changing :)

In the meantime if adding this much code is not desirable to avoid this bug then
there is a more conservative "fix" of removing the maxv == 1 path from 
expand_vector_init
and emitting a duplicate of the first element to all the lanes of the vector 
followed by
a number of inserts.

Thanks,
Kyrill


Thanks,
Richard




Re: [PATCH][AArch64] Avoid paradoxical subregs for vector initialisation

2018-05-29 Thread Richard Sandiford
Kyrill  Tkachov  writes:
> Hi all,
>
> The recent changes to aarch64_expand_vector_init cause an ICE in the
> attached testcase.  The register allocator "ICEs with Max. number of
> generated reload insns per insn is achieved (90)"
>
> That is because aarch64_expand_vector_init creates a paradoxical subreg to 
> move a DImode value
> into a V2DI vector:
> (insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
>  (subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1050 {*aarch64_simd_movv2di}
>
> This is done because we want to express that the whole of the V2DI
> vector will be written so that init-regs doesn't try to
> zero-initialise it before we overwrite each lane individually anyway.
>
> This can go bad for because if the DImode value is allocated in, say,
> x30: the last register in that register class, the V2DI subreg of that
> isn't valid or meaningful and that seems to cause the trouble.
>
> It's kinda hard to know what the right solution for this is.
> We could emit a duplicate of the value into all the lanes of the vector, but 
> we have tests that test against that
> (we're trying to avoid unnecessary duplicates)
>
> What this patch does is it defines a pattern for moving a scalar into
> lane 0 of a vector using a simple FMOV or LDR and represents that as a
> merging with a vector of zeroes.  That way, the instruction represents
> a full write of the destination vector but doesn't "read" more bits
> from the source than necessary. The zeroing effect is also a more
> accurate representation of the semantics of FMOV.

This feels like a hack.  Either the paradoxical subreg of the pseudo
is invalid for some reason (in which case we should ICE when it's formed,
not just in the case of x30 being allocated) or the subreg is valid,
in which case the RA should handle it correctly (and the backend should
give it the information it needs to do that).

I could see the argument for ignoring the problem for expediency if the
patch was a clean-up in its own right, but I think it's wrong to add so
much code to paper over a bug.

Thanks,
Richard


[PATCH][AArch64] Avoid paradoxical subregs for vector initialisation

2018-05-29 Thread Kyrill Tkachov

Hi all,

The recent changes to aarch64_expand_vector_init cause an ICE in the attached 
testcase.
The register allocator "ICEs with Max. number of generated reload insns per insn is 
achieved (90)"

That is because aarch64_expand_vector_init creates a paradoxical subreg to move 
a DImode value
into a V2DI vector:
(insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
(subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1050 {*aarch64_simd_movv2di}

This is done because we want to express that the whole of the V2DI vector will 
be written
so that init-regs doesn't try to zero-initialise it before we overwrite each 
lane individually
anyway.

This can go bad for because if the DImode value is allocated in, say, x30: the 
last register
in that register class, the V2DI subreg of that isn't valid or meaningful and 
that seems to cause the trouble.

It's kinda hard to know what the right solution for this is.
We could emit a duplicate of the value into all the lanes of the vector, but we 
have tests that test against that
(we're trying to avoid unnecessary duplicates)

What this patch does is it defines a pattern for moving a scalar into lane 0 of 
a vector using a simple FMOV or LDR
and represents that as a merging with a vector of zeroes.
That way, the instruction represents a full write of the destination vector but doesn't 
"read" more bits from the source
than necessary. The zeroing effect is also a more accurate representation of 
the semantics of FMOV.

Bootstrapped and tested on aarch64-none-linux-gnu and tested also on 
aarch64_be-none-elf.

Ok for trunk?

Thanks,
Kyrill

2018-05-29  Kyrylo Tkachov  

* config/aarch64/aarch64-simd.md (aarch64_simd_vec_mov):
New define_insn.
* config/aarch64/aarch64.c (aarch64_get_aarch64_simd_vec_mov_code):
New function.
(aarch64_expand_vector_init): Avoid creating paradoxical subregs
of integer modes.

2018-05-29  Kyrylo Tkachov  

* c-c++-common/torture/aarch64-vect-init-1.c: New test.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9f4cb72f5c321136cfcc263bcb7c956365dbc512..292848984081e1fb517b85fc10816e50f5d0c841 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -808,6 +808,25 @@ (define_insn "aarch64_simd_vec_set"
   [(set_attr "type" "neon_ins, neon_from_gp, neon_load1_one_lane")]
 )
 
+
+;; Helper for aarch64_expand_vector_init.
+
+(define_insn "aarch64_simd_vec_mov"
+  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
+	(vec_merge:VALL_F16
+	(vec_duplicate:VALL_F16
+		(match_operand: 1 "general_operand" "w,?r,m"))
+	(match_operand:VALL_F16 2 "aarch64_simd_imm_zero" )
+	(match_operand:SI 3 "immediate_operand")))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (, exact_log2 (INTVAL (operands[3]))) == 0"
+  "@
+   fmov\\t%d0, %d1
+   fmov\\t%d0, %1
+   ldr\\t%0, %1"
+  [(set_attr "type" "fmov,f_mcr,neon_load1_1reg")]
+)
+
 (define_insn "*aarch64_simd_vec_copy_lane"
   [(set (match_operand:VALL_F16 0 "register_operand" "=w")
 	(vec_merge:VALL_F16
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b9759c638b010567f5047ff5129912986a30b49a..fdd8d2f587087180c71ce71003ec074dceeb6242 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13879,6 +13879,44 @@ aarch64_simd_make_constant (rtx vals)
 return NULL_RTX;
 }
 
+/* Return the insn_code to use for MODE from the aarch64_simd_vec_mov
+   family of patterns.  */
+
+insn_code
+aarch64_get_aarch64_simd_vec_mov_code (machine_mode mode)
+{
+  switch (mode)
+{
+  case E_V8QImode:
+	return CODE_FOR_aarch64_simd_vec_movv8qi;
+  case E_V16QImode:
+	return CODE_FOR_aarch64_simd_vec_movv16qi;
+  case E_V4HImode:
+	return CODE_FOR_aarch64_simd_vec_movv4hi;
+  case E_V8HImode:
+	return CODE_FOR_aarch64_simd_vec_movv8hi;
+  case E_V2SImode:
+	return CODE_FOR_aarch64_simd_vec_movv2si;
+  case E_V4SImode:
+	return CODE_FOR_aarch64_simd_vec_movv4si;
+  case E_V2DImode:
+	return CODE_FOR_aarch64_simd_vec_movv2di;
+  case E_V4HFmode:
+	return CODE_FOR_aarch64_simd_vec_movv4hf;
+  case E_V8HFmode:
+	return CODE_FOR_aarch64_simd_vec_movv8hf;
+  case E_V2SFmode:
+	return CODE_FOR_aarch64_simd_vec_movv2sf;
+  case E_V4SFmode:
+	return CODE_FOR_aarch64_simd_vec_movv4sf;
+  case E_V2DFmode:
+	return CODE_FOR_aarch64_simd_vec_movv2df;
+  default:
+	gcc_unreachable ();
+}
+  return CODE_FOR_nothing;
+}
+
 /* Expand a vector initialisation sequence, such that TARGET is
initialised to contain VALS.  */
 
@@ -13967,7 +14005,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	 are equally useless to us, in which case just immediately set the
 	 vector register using the first element.  */
 
-  if (maxv == 1)
+  if (maxv == 1 && GET_MODE_NUNITS (mode).is_constant ())
 	{
 	  /* For vectors of two 64-bit elements, we can do even better.  */
 	  if (n_elts == 2
@@ -13999,12 +14037,24 @@