Re: [PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate rtx

2023-01-18 Thread Richard Sandiford via Gcc-patches
"丁乐华"  writes:
> > I don't think this pattern is correct, because SEL isn't commutative
> > in the vector operands.
>
> Indeed, I think I should invert the PRED operand or the comparison
> operator which produces the PRED operand first.

That would work, but it would no longer be a win.  The vectoriser already
has code to try to reuse existing predicates where possible, to increase
the chances that the operand order of VEC_COND_EXPRs is reasonable.
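
(As a concrete illustration of the non-commutativity point, here is a minimal
C analogue, illustrative only and not GCC code: SEL takes elements from its
first vector operand where the predicate is set and from the second elsewhere,
so the two vector operands can only be swapped if the predicate is inverted
as well.)

```c
/* Element-wise model of SVE SEL: out[i] = pred[i] ? a[i] : b[i].
   Swapping A and B changes the result unless PRED is inverted too,
   i.e. sel (pred, a, b) == sel (!pred, b, a), not sel (pred, b, a).  */
static void
sel (const int *pred, const float *a, const float *b, float *out, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = pred[i] ? a[i] : b[i];
}
```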

> > I think this should be:
> >
> > if (...)
> >  to = XEXP (to, 0);
> > and should be before the REG_P test. We don't want to treat
> > arbitrary duplicates as profitable.
>
> Agree, the adjustment is more rigorous.
>
> > It's not obvious that vec_duplicate is special enough that we should
> > treat it differently from other unary operators. For example,
> > zero_extend and sign_extend don't seem fundamentally more expensive
> > than vec_duplicate.
>
> Juzhe and I also discussed this offline recently.  We also have widening
> vector operations that need to be added; these could be handled in RTL via
> forwarding instead of adding widening GIMPLE internal functions.  We think
> we could add a TARGET HOOK, for example:
> `rtx try_forward (rtx dest, rtx src, rtx use_insn, rtx def_insn)`
>
>
> If it returns NULL_RTX, it means the value cannot be forwarded; otherwise
> the dest part in use_insn is replaced with the returned rtx.
> Letting the backend decide which values can be forwarded has several
> advantages:
> 1. Target-specific insns, such as unspecs, can also be forwarded, and when
>   forwarding, the relevant content can be extracted from def_insn instead
>   of the complete src part.
> 2. By default this hook returns NULL_RTX, which reduces compatibility
>   issues.

Personally, I'm not in favour of a hook along these lines.  I think
it would effectively split the pass between target-independent and
target-specific code, which (a) tends to lead to more duplication
between targets and (b) makes it harder to test for correctness
(as opposed to performance) when updating the target-independent code.

If a value can't be forwarded, then either (a) substitution will fail
to give a valid instruction or (b) the new instruction will be more
costly than the old one (as measured by existing hooks).

The possible downsides (e.g. on register pressure, as you mention below)
are something that target-independent code should deal with, since it
can look at the function as a whole.

> > It's a while since I looked at this code, but I assume that, even after
> > this change, we will still require the new in-loop instruction to be
> > no more expensive than the old in-loop instruction. Is that right?
>
>
> Yeah. Forwarding vec_duplicate may reduce the use of vector registers,
> but it increases the lifetime of scalar registers. If the scalar register
> pressure is higher, this change may become more expensive. This decision
> does not feel very easy to make; is there some way to do this?

Yeah.  But on many architectures, scalar floats are stored in the same
register file as vectors, so whether this is a problem will depend also
on the mode of the scalar.

Also, the cost is different if we eliminate all uses of the duplicate in
the loop vs. if we only eliminate some.

The handling of flag_ira_hoist_pressure is one example of code that
tries to use register pressure to guide optimisation, but I don't
know the code very well.  (Of course, if we did reuse that,
we'd want to commonise it rather than duplicate it.)

Thanks,
Richard


Re: [PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate rtx

2023-01-17 Thread 丁乐华
 I don't think this pattern is correct, because SEL isn't commutative
 in the vector operands.


Indeed, I think I should invert the PRED operand or the comparison
operator which produces the PRED operand first.


I think this should be:

 if (...)
  to = XEXP (to, 0);
 and should be before the REG_P test. We don't want to treat
 arbitrary duplicates as profitable.

Agreed, the adjustment is more rigorous.
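
(A rough sketch of the suggested shape, for illustration only and not the
actual patch; it assumes the helper name from the patch's ChangeLog and the
existing reg_single_def_p predicate in fwprop.cc:)

```c
/* Illustrative sketch: look through a (vec_duplicate (reg)) source before
   the REG_P test, so that only duplicates of a register qualify rather
   than arbitrary duplicates.  reg_single_def_for_src_p is the new helper
   named in the ChangeLog; reg_single_def_p is the existing fwprop.cc
   predicate it wraps.  */
static bool
reg_single_def_for_src_p (rtx src)
{
  if (GET_CODE (src) == VEC_DUPLICATE)
    src = XEXP (src, 0);
  return REG_P (src) && reg_single_def_p (src);
}
```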

It's not obvious that vec_duplicate is special enough that we should
 treat it differently from other unary operators. For example,
 zero_extend and sign_extend don't seem fundamentally more expensive
 than vec_duplicate.

Juzhe and I also discussed this offline recently. We also have widening
vector operations that need to be added; these could be handled in RTL via
forwarding instead of adding widening GIMPLE internal functions. We think
we could add a TARGET HOOK, for example:
`rtx try_forward (rtx dest, rtx src, rtx use_insn, rtx def_insn)`


If it returns NULL_RTX, it means the value cannot be forwarded; otherwise
the dest part in use_insn is replaced with the returned rtx.
Letting the backend decide which values can be forwarded has several
advantages:
1. Target-specific insns, such as unspecs, can also be forwarded, and when
  forwarding, the relevant content can be extracted from def_insn instead
  of the complete src part.
2. By default this HOOK returns NULL_RTX, which reduces compatibility
  issues.


 It's a while since I looked at this code, but I assume that, even after
 this change, we will still require the new in-loop instruction to be
 no more expensive than the old in-loop instruction. Is that right?


Yeah. Forwarding vec_duplicate may reduce the use of vector registers,
but it increases the lifetime of scalar registers. If the scalar register
pressure is higher, this change may become more expensive. This decision
does not feel very easy to make; is there some way to do this?


Best,
Lehua

Re: [PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate rtx

2023-01-17 Thread Jeff Law via Gcc-patches
On 1/17/23 09:00, Richard Sandiford via Gcc-patches wrote:
> But the idea of the fwprop change looks OK to me in principle.
> What we have now seems conservative, based on heuristics that
> haven't been updated in a long time.  So relaxing them a bit seems
> like a good idea.  IIRC Jeff had another case in which the current
> heuristics were too strict.

Two actually, though neither is relevant to this particular problem.

Jeff


Re: [PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate rtx

2023-01-17 Thread Richard Sandiford via Gcc-patches
lehua.d...@rivai.ai writes:
> From: Lehua Ding 
>
> ps: Resend for adjusting the width of each line of text.
>
> Hi,
>
> When I was adding the new RISC-V auto-vectorization function, I found that
> converting `vector-reg1 vop vector-reg2` to `scalar-reg3 vop vector-reg2`
> is not very easy to handle where `vector-reg1` is a vec_duplicate_expr.
> For example, the gimple IR below:
>
> ```gimple
> 
> vect_cst__51 = [vec_duplicate_expr] z_14(D);
>
> 
> vect_iftmp.13_53 = .LEN_COND_ADD(mask__40.9_47, vect__6.12_50, vect_cst__51, 
> { 0.0, ... }, curr_cnt_60);
> ```
>
> I once wanted to add corresponding functions to gimple IR, such as adding
> .LEN_COND_ADD_VS, and then convert .LEN_COND_ADD to .LEN_COND_ADD_VS in
> match.pd.  This method can be realized, but it would cause too many similar
> internal functions to be added to gimple IR, which doesn't feel necessary.
> Later, I tried to combine them in the combine pass but failed.  Finally,
> I thought of adding the ability to forward `(vec_duplicate reg)` in the
> fwprop pass, so I have this patch.
>
> Because the current upstream does not support the RISC-V automatic
> vectorization function, I found an example on SVE that can also be
> optimized and simply tried it.  For the float type, one instruction can be
> removed, for example with the C code below.  The difference between the new
> and old assembly code is that the new one uses the mov instruction to move
> the scalar variable directly into the vector register.  The old assembly
> code first moves the scalar variable to the vector register outside the
> loop, and then uses the sel instruction.  Compared with the entire assembly
> code, the new assembly code has one instruction less.  In addition, I
> noticed that some instructions in the new assembly code are ahead of the
> `ble .L1` instruction.  I debugged and found that the change was made in
> the ce1 pass, which believes that moving them up is more beneficial to
> performance.
>
> In addition, for the int type, compared with the float type, the new
> assembly code will have one more `fmov s2, w2` instruction, so I can't
> judge whether the performance is better than before.  In fact, I mainly do
> RISC-V development work.
>
> This patch is an exploratory patch and has not been tested much.  I mainly
> want to hear your suggestions on whether this method is feasible and what
> potential problems there might be.
>
> Best,
> Lehua Ding
>
> ```c
> /* compiler options: -O3 -march=armv8.2-a+sve -S */
> void test1 (int *pred, float *x, float z, int n)
> {
>  for (int i = 0; i < n; i += 1)
>{
>  x[i] = pred[i] != 1 ? x[i] : z;
>}
> }
> ```
>
> The old assembly code looks like this (compiler explorer link:
> https://godbolt.org/z/hxTnEhaqY):
>
> ```asm
> test1:
>  cmp w2, 0
>  ble .L1
>  mov x3, 0
>  cntw x4
>  mov z0.s, s0
>  whilelo p0.s, wzr, w2
>  ptrue p2.b, all
> .L3:
>  ld1w z2.s, p0/z, [x0, x3, lsl 2]
>  ld1w z1.s, p0/z, [x1, x3, lsl 2]
>  cmpne p1.s, p2/z, z2.s, #1
>  sel z1.s, p1, z1.s, z0.s
>  st1w z1.s, p0, [x1, x3, lsl 2]
>  add x3, x3, x4
>  whilelo p0.s, w3, w2
>  b.any .L3
> .L1:
>  ret
> ```
>
> The new assembly code looks like this:
>
> ```asm
> test1:
>  whilelo p0.s, wzr, w2
>  mov x3, 0
>  cntw x4
>  ptrue p2.b, all
>  cmp w2, 0
>  ble .L1
> .L3:
>  ld1w z2.s, p0/z, [x0, x3, lsl 2]
>  ld1w z1.s, p0/z, [x1, x3, lsl 2]
>  cmpne p1.s, p2/z, z2.s, #1
>  mov z1.s, p1/m, s0
>  st1w z1.s, p0, [x1, x3, lsl 2]
>  add x3, x3, x4
>  whilelo p0.s, w3, w2
>  b.any .L3
> .L1:
>  ret
> ```
>
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-sve.md (@aarch64_sel_dup_vs): Add new
> pattern to capture the new operand order.
> * fwprop.cc (fwprop_propagation::profitable_p): Add new check.
> (reg_single_def_for_src_p): Add new function for src rtx.
> (forward_propagate_into): Change to new function call.
>
> ---
>  gcc/config/aarch64/aarch64-sve.md | 20 
>  gcc/fwprop.cc | 16 +++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index b8cc47ef5fc..84d8ed0924d 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -7636,6 +7636,26 @@
>[(set_attr "movprfx" "*,*,yes,yes,yes,yes")]
>  )
>  
> +;; Swap the order of operand 1 and operand 2 so that it matches the above pattern
> +(define_insn_and_split "@aarch64_sel_dup_vs"
> +  [(set (match_operand:SVE_ALL 0 "register_operand" "=?w, w, ??w, ?, ??, 
> ?")
> + (unspec:SVE_ALL
> +   [(match_operand: 3 "register_operand" "Upl, Upl, Upl, Upl, 
> Upl, Upl")
> +   (match_operand:SVE_ALL 

Re: [PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate rtx

2023-01-13 Thread juzhe.zhong
Hi, Richard.  Would you mind taking a look at this patch?
This is a proposal patch (we could add more testcases for ARM in the future),
but we want to know if this patch is a correct approach to achieve what we want.

In RVV (RISC-V Vector), we have a bunch of instructions:
vadd.vx/vsub.vx/vmul.vx, etc.
Such instructions allow the CPU to do operations between a vector and a scalar
directly, without any vector duplicate or broadcast instruction.
So this patch is quite important for RVV auto-vectorization support, since it
can reduce a lot of gimple IR patterns.
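
(For illustration only, not part of the patch: the kind of scalar-operand loop
where such a vector-scalar instruction avoids broadcasting the scalar first.)

```c
/* Illustrative example: with vector-scalar instructions such as vadd.vx,
   the scalar X can be used directly as an operand, instead of first being
   broadcast into a vector register (vmv.v.x) and then used in a
   vector-vector add (vadd.vv).  */
void
add_scalar (int *a, int x, int n)
{
  for (int i = 0; i < n; i++)
    a[i] += x;
}
```
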
I know GCC 13 is not the appropriate time for this patch; we hope this can be
done in GCC 14.

Thank you so much.


juzhe.zh...@rivai.ai
 
From: lehua.ding
Date: 2023-01-13 17:42
To: gcc-patches
CC: richard.sandiford; juzhe.zhong; Lehua Ding
Subject: [PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate 
rtx
From: Lehua Ding 
 
ps: Resend for adjusting the width of each line of text.
 
Hi,
 
When I was adding the new RISC-V auto-vectorization function, I found that
converting `vector-reg1 vop vector-reg2` to `scalar-reg3 vop vector-reg2`
is not very easy to handle where `vector-reg1` is a vec_duplicate_expr.
For example, the gimple IR below:
 
```gimple

vect_cst__51 = [vec_duplicate_expr] z_14(D);
 

vect_iftmp.13_53 = .LEN_COND_ADD(mask__40.9_47, vect__6.12_50, vect_cst__51, { 
0.0, ... }, curr_cnt_60);
```
 
I once wanted to add corresponding functions to gimple IR, such as adding
.LEN_COND_ADD_VS, and then convert .LEN_COND_ADD to .LEN_COND_ADD_VS in
match.pd. This method can be realized, but it would cause too many similar
internal functions to be added to gimple IR, which doesn't feel necessary.
Later, I tried to combine them in the combine pass but failed. Finally, I
thought of adding the ability to forward `(vec_duplicate reg)` in the fwprop
pass, so I have this patch.
 
Because the current upstream does not support the RISC-V automatic
vectorization function, I found an example on SVE that can also be optimized
and simply tried it. For the float type, one instruction can be removed, for
example with the C code below. The difference between the new and old assembly
code is that the new one uses the mov instruction to move the scalar variable
directly into the vector register. The old assembly code first moves the
scalar variable to the vector register outside the loop, and then uses the sel
instruction. Compared with the entire assembly code, the new assembly code has
one instruction less. In addition, I noticed that some instructions in the new
assembly code are ahead of the `ble .L1` instruction. I debugged and found
that the change was made in the ce1 pass, which believes that moving them up
is more beneficial to performance.
 
In addition, for the int type, compared with the float type, the new assembly
code will have one more `fmov s2, w2` instruction, so I can't judge whether
the performance is better than before. In fact, I mainly do RISC-V
development work.
 
This patch is an exploratory patch and has not been tested much. I mainly
want to hear your suggestions on whether this method is feasible and what
potential problems there might be.
 
Best,
Lehua Ding
 
```c
/* compiler options: -O3 -march=armv8.2-a+sve -S */
void test1 (int *pred, float *x, float z, int n)
{
 for (int i = 0; i < n; i += 1)
   {
 x[i] = pred[i] != 1 ? x[i] : z;
   }
}
```
 
The old assembly code looks like this (compiler explorer link:
https://godbolt.org/z/hxTnEhaqY):
 
```asm
test1:
 cmp w2, 0
 ble .L1
 mov x3, 0
 cntw x4
 mov z0.s, s0
 whilelo p0.s, wzr, w2
 ptrue p2.b, all
.L3:
 ld1w z2.s, p0/z, [x0, x3, lsl 2]
 ld1w z1.s, p0/z, [x1, x3, lsl 2]
 cmpne p1.s, p2/z, z2.s, #1
 sel z1.s, p1, z1.s, z0.s
 st1w z1.s, p0, [x1, x3, lsl 2]
 add x3, x3, x4
 whilelo p0.s, w3, w2
 b.any .L3
.L1:
 ret
```
 
The new assembly code looks like this:
 
```asm
test1:
 whilelo p0.s, wzr, w2
 mov x3, 0
 cntw x4
 ptrue p2.b, all
 cmp w2, 0
 ble .L1
.L3:
 ld1w z2.s, p0/z, [x0, x3, lsl 2]
 ld1w z1.s, p0/z, [x1, x3, lsl 2]
 cmpne p1.s, p2/z, z2.s, #1
 mov z1.s, p1/m, s0
 st1w z1.s, p0, [x1, x3, lsl 2]
 add x3, x3, x4
 whilelo p0.s, w3, w2
 b.any .L3
.L1:
 ret
```
 
 
gcc/ChangeLog:
 
* config/aarch64/aarch64-sve.md (@aarch64_sel_dup_vs): Add new
pattern to capture the new operand order.
* fwprop.cc (fwprop_propagation::profitable_p): Add new check.
(reg_single_def_for_src_p): Add new function for src rtx.
(forward_propagate_into): Change to new function call.
 
---
gcc/config/aarch64/aarch64-sve.md | 20 
gcc/fwprop.cc | 16 +++-
2 files changed, 35 insertions(+), 1 deletion(-)
 
diff --git 

[PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate rtx

2023-01-13 Thread lehua . ding
From: Lehua Ding 

ps: Resend for adjusting the width of each line of text.

Hi,

When I was adding the new RISC-V auto-vectorization function, I found that
converting `vector-reg1 vop vector-reg2` to `scalar-reg3 vop vector-reg2`
is not very easy to handle where `vector-reg1` is a vec_duplicate_expr.
For example, the gimple IR below:

```gimple

vect_cst__51 = [vec_duplicate_expr] z_14(D);


vect_iftmp.13_53 = .LEN_COND_ADD(mask__40.9_47, vect__6.12_50, vect_cst__51, { 
0.0, ... }, curr_cnt_60);
```

I once wanted to add corresponding functions to gimple IR, such as adding
.LEN_COND_ADD_VS, and then convert .LEN_COND_ADD to .LEN_COND_ADD_VS in
match.pd. This method can be realized, but it would cause too many similar
internal functions to be added to gimple IR, which doesn't feel necessary.
Later, I tried to combine them in the combine pass but failed. Finally, I
thought of adding the ability to forward `(vec_duplicate reg)` in the fwprop
pass, so I have this patch.

Because the current upstream does not support the RISC-V automatic
vectorization function, I found an example on SVE that can also be optimized
and simply tried it. For the float type, one instruction can be removed, for
example with the C code below. The difference between the new and old assembly
code is that the new one uses the mov instruction to move the scalar variable
directly into the vector register. The old assembly code first moves the
scalar variable to the vector register outside the loop, and then uses the sel
instruction. Compared with the entire assembly code, the new assembly code has
one instruction less. In addition, I noticed that some instructions in the new
assembly code are ahead of the `ble .L1` instruction. I debugged and found
that the change was made in the ce1 pass, which believes that moving them up
is more beneficial to performance.

In addition, for the int type, compared with the float type, the new assembly
code will have one more `fmov s2, w2` instruction, so I can't judge whether
the performance is better than before. In fact, I mainly do RISC-V
development work.

This patch is an exploratory patch and has not been tested much. I mainly
want to hear your suggestions on whether this method is feasible and what
potential problems there might be.

Best,
Lehua Ding

```c
/* compiler options: -O3 -march=armv8.2-a+sve -S */
void test1 (int *pred, float *x, float z, int n)
{
 for (int i = 0; i < n; i += 1)
   {
 x[i] = pred[i] != 1 ? x[i] : z;
   }
}
```

The old assembly code looks like this (compiler explorer link:
https://godbolt.org/z/hxTnEhaqY):

```asm
test1:
 cmp w2, 0
 ble .L1
 mov x3, 0
 cntw x4
 mov z0.s, s0
 whilelo p0.s, wzr, w2
 ptrue p2.b, all
.L3:
 ld1w z2.s, p0/z, [x0, x3, lsl 2]
 ld1w z1.s, p0/z, [x1, x3, lsl 2]
 cmpne p1.s, p2/z, z2.s, #1
 sel z1.s, p1, z1.s, z0.s
 st1w z1.s, p0, [x1, x3, lsl 2]
 add x3, x3, x4
 whilelo p0.s, w3, w2
 b.any .L3
.L1:
 ret
```

The new assembly code looks like this:

```asm
test1:
 whilelo p0.s, wzr, w2
 mov x3, 0
 cntw x4
 ptrue p2.b, all
 cmp w2, 0
 ble .L1
.L3:
 ld1w z2.s, p0/z, [x0, x3, lsl 2]
 ld1w z1.s, p0/z, [x1, x3, lsl 2]
 cmpne p1.s, p2/z, z2.s, #1
 mov z1.s, p1/m, s0
 st1w z1.s, p0, [x1, x3, lsl 2]
 add x3, x3, x4
 whilelo p0.s, w3, w2
 b.any .L3
.L1:
 ret
```


gcc/ChangeLog:

* config/aarch64/aarch64-sve.md (@aarch64_sel_dup_vs): Add new
pattern to capture the new operand order.
* fwprop.cc (fwprop_propagation::profitable_p): Add new check.
(reg_single_def_for_src_p): Add new function for src rtx.
(forward_propagate_into): Change to new function call.

---
 gcc/config/aarch64/aarch64-sve.md | 20 
 gcc/fwprop.cc | 16 +++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index b8cc47ef5fc..84d8ed0924d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -7636,6 +7636,26 @@
   [(set_attr "movprfx" "*,*,yes,yes,yes,yes")]
 )
 
+;; Swap the order of operand 1 and operand 2 so that it matches the above pattern
+(define_insn_and_split "@aarch64_sel_dup_vs"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=?w, w, ??w, ?, ??, 
?")
+   (unspec:SVE_ALL
+ [(match_operand: 3 "register_operand" "Upl, Upl, Upl, Upl, 
Upl, Upl")
+   (match_operand:SVE_ALL 1 "aarch64_simd_reg_or_zero" "0, 0, Dz, Dz, 
w, w")
+  (vec_duplicate:SVE_ALL
+ (match_operand: 2 "register_operand" "r, w, r, w, r, w"))]
+ UNSPEC_SEL))]
+  "TARGET_SVE"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+(unspec:SVE_ALL
+  [(match_dup 3)
+   

[PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate rtx

2023-01-13 Thread lehua . ding
From: Lehua Ding 

Hi,

When I was adding the new RISC-V auto-vectorization function, I found that
converting `vector-reg1 vop vector-reg2` to `scalar-reg3 vop vector-reg2` is
not very easy to handle where `vector-reg1` is a vec_duplicate_expr. For
example, the gimple IR below:

```gimple

vect_cst__51 = [vec_duplicate_expr] z_14(D);


vect_iftmp.13_53 = .LEN_COND_ADD(mask__40.9_47, vect__6.12_50, vect_cst__51, { 
0.0, ... }, curr_cnt_60);
```

I once wanted to add corresponding functions to gimple IR, such as adding
.LEN_COND_ADD_VS, and then convert .LEN_COND_ADD to .LEN_COND_ADD_VS in
match.pd. This method can be realized, but it would cause too many similar
internal functions to be added to gimple IR, which doesn't feel necessary.
Later, I tried to combine them in the combine pass but failed. Finally, I
thought of adding the ability to forward `(vec_duplicate reg)` in the fwprop
pass, so I have this patch.

Because the current upstream does not support the RISC-V automatic
vectorization function, I found an example on SVE that can also be optimized
and simply tried it. For the float type, one instruction can be removed, for
example with the C code below. The difference between the new and old assembly
code is that the new one uses the mov instruction to move the scalar variable
directly into the vector register. The old assembly code first moves the
scalar variable to the vector register outside the loop, and then uses the sel
instruction. Compared with the entire assembly code, the new assembly code has
one instruction less. In addition, I noticed that some instructions in the new
assembly code are ahead of the `ble .L1` instruction. I debugged and found
that the change was made in the ce1 pass, which believes that moving them up
is more beneficial to performance.

In addition, for the int type, compared with the float type, the new assembly
code will have one more `fmov s2, w2` instruction, so I can't judge whether
the performance is better than before. In fact, I mainly do RISC-V
development work.

This patch is an exploratory patch and has not been tested much. I mainly want
to hear your suggestions on whether this method is feasible and what potential
problems there might be.

```c
/* compiler options: -O3 -march=armv8.2-a+sve -S */
void test1 (int *pred, float *x, float z, int n)
{
 for (int i = 0; i < n; i += 1)
   {
 x[i] = pred[i] != 1 ? x[i] : z;
   }
}
```

The old assembly code looks like this (compiler explorer link:
https://godbolt.org/z/hxTnEhaqY):

```asm
test1:
 cmp w2, 0
 ble .L1
 mov x3, 0
 cntw x4
 mov z0.s, s0
 whilelo p0.s, wzr, w2
 ptrue p2.b, all
.L3:
 ld1w z2.s, p0/z, [x0, x3, lsl 2]
 ld1w z1.s, p0/z, [x1, x3, lsl 2]
 cmpne p1.s, p2/z, z2.s, #1
 sel z1.s, p1, z1.s, z0.s
 st1w z1.s, p0, [x1, x3, lsl 2]
 add x3, x3, x4
 whilelo p0.s, w3, w2
 b.any .L3
.L1:
 ret
```

The new assembly code looks like this:

```asm
test1:
 whilelo p0.s, wzr, w2
 mov x3, 0
 cntw x4
 ptrue p2.b, all
 cmp w2, 0
 ble .L1
.L3:
 ld1w z2.s, p0/z, [x0, x3, lsl 2]
 ld1w z1.s, p0/z, [x1, x3, lsl 2]
 cmpne p1.s, p2/z, z2.s, #1
 mov z1.s, p1/m, s0
 st1w z1.s, p0, [x1, x3, lsl 2]
 add x3, x3, x4
 whilelo p0.s, w3, w2
 b.any .L3
.L1:
 ret
```


gcc/ChangeLog:

* config/aarch64/aarch64-sve.md (@aarch64_sel_dup_vs): Add new
pattern to capture the new operand order.
* fwprop.cc (fwprop_propagation::profitable_p): Add new check.
(reg_single_def_for_src_p): Add new function for src rtx.
(forward_propagate_into): Change to new function call.

---
 gcc/config/aarch64/aarch64-sve.md | 20 
 gcc/fwprop.cc | 16 +++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index b8cc47ef5fc..84d8ed0924d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -7636,6 +7636,26 @@
   [(set_attr "movprfx" "*,*,yes,yes,yes,yes")]
 )
 
+;; Swap the order of operand 1 and operand 2 so that it matches the above pattern
+(define_insn_and_split "@aarch64_sel_dup_vs"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=?w, w, ??w, ?, ??, 
?")
+   (unspec:SVE_ALL
+ [(match_operand: 3 "register_operand" "Upl, Upl, Upl, Upl, 
Upl, Upl")
+   (match_operand:SVE_ALL 1 "aarch64_simd_reg_or_zero" "0, 0, Dz, Dz, 
w, w")
+  (vec_duplicate:SVE_ALL
+ (match_operand: 2 "register_operand" "r, w, r, w, r, w"))]
+ UNSPEC_SEL))]
+  "TARGET_SVE"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+(unspec:SVE_ALL
+  [(match_dup 3)
+   (vec_duplicate:SVE_ALL (match_dup 2))
+   (match_dup 1)]
+