Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
On 12/15/23 13:52, juzhe.zh...@rivai.ai wrote: > Do you mean : > > /* We need to use precomputed mask for such situation and such mask > can only be computed in compile-time known size modes. */ > bool indices_fit_selector_p > = GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, > 256); > if (!indices_fit_selector_p && !vec_len.is_constant ()) > return false; Yes and then reuse this in the if. Regards Robin
Re: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
Do you mean : /* We need to use precomputed mask for such situation and such mask can only be computed in compile-time known size modes. */ bool indices_fit_selector_p = GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 256); if (!indices_fit_selector_p && !vec_len.is_constant ()) return false; juzhe.zh...@rivai.ai From: Robin Dapp Date: 2023-12-15 20:44 To: juzhe.zh...@rivai.ai; gcc-patches CC: rdapp.gcc; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization > Oh. I think it should be renamed into not_fit. > > Is this following make sense to you ? > > /* We need to use precomputed mask for such situation and such mask > can only be computed in compile-time known size modes. */ > bool indices_not_fit_selector_p > = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode))); > if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 > && indices_not_fit_selector_p > && !vec_len.is_constant ()) > return false; Mhm, right, I don't think this makes it nicer overall. Maybe just like the following then: bool ..._p = GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 256); if (!..._p && !vec_len.is_constant ()) then later if (..._p) ... else ... Regards Robin
Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
> Oh. I think it should be renamed into not_fit. > > Is this following make sense to you ? > > /* We need to use precomputed mask for such situation and such mask > can only be computed in compile-time known size modes. */ > bool indices_not_fit_selector_p > = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode))); > if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 > && indices_not_fit_selector_p > && !vec_len.is_constant ()) > return false; Mhm, right, I don't think this makes it nicer overall. Maybe just like the following then: bool ..._p = GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 256); if (!..._p && !vec_len.is_constant ()) then later if (..._p) ... else ... Regards Robin
Re: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
Oh. I think it should be renamed into not_fit. Is this following make sense to you ? /* We need to use precomputed mask for such situation and such mask can only be computed in compile-time known size modes. */ bool indices_not_fit_selector_p = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode))); if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 && indices_not_fit_selector_p && !vec_len.is_constant ()) return false; juzhe.zh...@rivai.ai From: Robin Dapp Date: 2023-12-15 20:25 To: juzhe.zh...@rivai.ai; gcc-patches CC: rdapp.gcc; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization On 12/15/23 13:16, juzhe.zh...@rivai.ai wrote: > >>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE >>> (GET_MODE_INNER (vmode))); > No, I think it will make us miss some optimization. > > For example, for poly value [16,16] maybe_ge ([16,16], 65536) which makes us > missed merge optimization but > we definitely can do merge optimization. I didn't mean to skip the && !vec_len.is_constant (), that should stay. Just the first part of condition that can be re-used in the if as well (inverted). Regards Robin
Re: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
Do you mean like this ? /* We need to use precomputed mask for such situation and such mask can only be computed in compile-time known size modes. */ bool indices_fit_selector_p = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode))); if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8、 && indices_fit_selector_p && !vec_len.is_constant ()) return false; juzhe.zh...@rivai.ai From: Robin Dapp Date: 2023-12-15 20:25 To: juzhe.zh...@rivai.ai; gcc-patches CC: rdapp.gcc; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization On 12/15/23 13:16, juzhe.zh...@rivai.ai wrote: > >>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE >>> (GET_MODE_INNER (vmode))); > No, I think it will make us miss some optimization. > > For example, for poly value [16,16] maybe_ge ([16,16], 65536) which makes us > missed merge optimization but > we definitely can do merge optimization. I didn't mean to skip the && !vec_len.is_constant (), that should stay. Just the first part of condition that can be re-used in the if as well (inverted). Regards Robin
Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
On 12/15/23 13:16, juzhe.zh...@rivai.ai wrote: > >>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE >>> (GET_MODE_INNER (vmode))); > No, I think it will make us miss some optimization. > > For example, for poly value [16,16] maybe_ge ([16,16], 65536) which makes us > missed merge optimization but > we definitely can do merge optimization. I didn't mean to skip the && !vec_len.is_constant (), that should stay. Just the first part of condition that can be re-used in the if as well (inverted). Regards Robin
Re: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE >> (GET_MODE_INNER (vmode))); No, I think it will make us miss some optimization. For example, for poly value [16,16] maybe_ge ([16,16], 65536) which makes us missed merge optimization but we definitely can do merge optimization. >> Also add a comment that the non-constant case is handled by >> shuffle_decompress_patterns in case we have a HImode vector twice the >> size that can hold our indices. Ok. >> Comment should go inside the if branch. Ok. >>Also add this as comment: >>As the mask is a simple {0, 1, ...} pattern and the length is known we can >>store it in a scalar register and broadcast it to a mask register. Ok. >>I would have hoped that a simple >> v.quick_push (gen_int_mode (0b01010101, QImode)); >>suffices but that will probably clash if there are more than >>two npatterns. No, we definitely can not use this. more details you can see the current test vmerge-*.c . We have various patterns: E.g. 0, nunits + 1, nunits+ 2, ... it is 011 nunits, 1, 2 it 100. Many different kinds of patterns can be used vmerge optimization. juzhe.zh...@rivai.ai From: Robin Dapp Date: 2023-12-15 19:14 To: Juzhe-Zhong; gcc-patches CC: rdapp.gcc; kito.cheng; kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization Hi Juzhe, in general looks OK. > + /* We need to use precomputed mask for such situation and such mask > + can only be computed in compile-time known size modes. */ > + if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 && maybe_ge (vec_len, > 256) > + && !vec_len.is_constant ()) > +return false; > + We could make this a separate variable like: bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode))); Also add a comment that the non-constant case is handled by shuffle_decompress_patterns in case we have a HImode vector twice the size that can hold our indices. >/* MASK = SELECTOR < NUNTIS ? 1 : 0. */ Comment should go inside the if branch. > + if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, > 256)) > +} > + else > +{ > + /* For EEW8 and NUNITS may be larger than 255, we can't use vmsltu > + directly to generate the selector mask, instead, we can only use > + precomputed mask. I find that comment a bit misleading as it's not the vmsltu itself but rather that the indices cannot be held. > + > + E.g. selector = <0, 257, 2, 259> for EEW8 vector with NUNITS = 256, we > + don't have a QImode scalar register to hold larger than 255. */ We also cannot hold that in a vector QImode register, and, since there is no larger HI mode vector we cannot create a larger selector. Also add this as comment: As the mask is a simple {0, 1, ...} pattern and the length is known we can store it in a scalar register and broadcast it to a mask register. > + gcc_assert (vec_len.is_constant ()); > + int size = CEIL (GET_MODE_NUNITS (mask_mode).to_constant (), 8); > + machine_mode mode = get_vector_mode (QImode, size).require (); > + rtx tmp = gen_reg_rtx (mode); > + rvv_builder v (mode, 1, size); > + for (int i = 0; i < vec_len.to_constant () / 8; i++) > + { > + uint8_t value = 0; > + for (int j = 0; j < 8; j++) > + { > + int index = i * 8 + j; > + if (known_lt (d->perm[index], 256)) > + value |= 1 << j; > + } I would have hoped that a simple v.quick_push (gen_int_mode (0b01010101, QImode)); suffices but that will probably clash if there are more than two npatterns. Regards Robin
Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
Hi Juzhe, in general looks OK. > + /* We need to use precomputed mask for such situation and such mask > + can only be computed in compile-time known size modes. */ > + if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 && maybe_ge (vec_len, > 256) > + && !vec_len.is_constant ()) > +return false; > + We could make this a separate variable like: bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode))); Also add a comment that the non-constant case is handled by shuffle_decompress_patterns in case we have a HImode vector twice the size that can hold our indices. >/* MASK = SELECTOR < NUNTIS ? 1 : 0. */ Comment should go inside the if branch. > + if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, > 256)) > +} > + else > +{ > + /* For EEW8 and NUNITS may be larger than 255, we can't use vmsltu > + directly to generate the selector mask, instead, we can only use > + precomputed mask. I find that comment a bit misleading as it's not the vmsltu itself but rather that the indices cannot be held. > + > + E.g. selector = <0, 257, 2, 259> for EEW8 vector with NUNITS = 256, we > + don't have a QImode scalar register to hold larger than 255. */ We also cannot hold that in a vector QImode register, and, since there is no larger HI mode vector we cannot create a larger selector. Also add this as comment: As the mask is a simple {0, 1, ...} pattern and the length is known we can store it in a scalar register and broadcast it to a mask register. > + gcc_assert (vec_len.is_constant ()); > + int size = CEIL (GET_MODE_NUNITS (mask_mode).to_constant (), 8); > + machine_mode mode = get_vector_mode (QImode, size).require (); > + rtx tmp = gen_reg_rtx (mode); > + rvv_builder v (mode, 1, size); > + for (int i = 0; i < vec_len.to_constant () / 8; i++) > + { > + uint8_t value = 0; > + for (int j = 0; j < 8; j++) > + { > + int index = i * 8 + j; > + if (known_lt (d->perm[index], 256)) > + value |= 1 << j; > + } I would have hoped that a simple v.quick_push (gen_int_mode (0b01010101, QImode)); suffices but that will probably clash if there are more than two npatterns. Regards Robin
[PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization
This patch fixes the following FAILs in "full coverage" testing: Running target riscv-sim/-march=rv64gcv_zvl256b/-mabi=lp64d/-mcmodel=medlow/--param=riscv-autovec-lmul=m8/--param=riscv-autovec-preference=fixed-vlmax FAIL: gcc.dg/vect/vect-strided-mult-char-ls.c -flto -ffat-lto-objects execution test FAIL: gcc.dg/vect/vect-strided-mult-char-ls.c execution test FAIL: gcc.dg/vect/vect-strided-u8-i2.c -flto -ffat-lto-objects execution test FAIL: gcc.dg/vect/vect-strided-u8-i2.c execution test The root cause is vmerge optimization on this following IR: _45 = VEC_PERM_EXPR ; It's obvious we have many index > 255 in shuffle indice. Here we use vmerge optimizaiton which is available but incorrect codgen cause run fail. The bug codegen: vsetvli zero,a4,e8,m8,ta,ma vmsltu.vi v0,v0,0 -> it should be 256 instead of 0, but since it is EEW8 vector, 256 is not a available value that 8bit register can hold it. vmerge.vvm v8,v8,v16,v0 After this patch: vmv.v.x v0,a6 vmerge.vvm v8,v8,v16,v0 gcc/ChangeLog: * config/riscv/riscv-v.cc (shuffle_merge_patterns): Fix bug. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/bug-1.c: New test. --- gcc/config/riscv/riscv-v.cc | 49 --- .../gcc.target/riscv/rvv/autovec/bug-1.c | 39 +++ 2 files changed, 81 insertions(+), 7 deletions(-) create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/bug-1.c diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc index 680e2a0e03a..b8ba597682b 100644 --- a/gcc/config/riscv/riscv-v.cc +++ b/gcc/config/riscv/riscv-v.cc @@ -2987,20 +2987,55 @@ shuffle_merge_patterns (struct expand_vec_perm_d *d) && !d->perm.series_p (i, n_patterns, vec_len + i, n_patterns)) return false; + /* We need to use precomputed mask for such situation and such mask + can only be computed in compile-time known size modes. */ + if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 && maybe_ge (vec_len, 256) + && !vec_len.is_constant ()) +return false; + if (d->testing_p) return true; machine_mode mask_mode = get_mask_mode (vmode); rtx mask = gen_reg_rtx (mask_mode); - rtx sel = vec_perm_indices_to_rtx (sel_mode, d->perm); - /* MASK = SELECTOR < NUNTIS ? 1 : 0. */ - rtx x = gen_int_mode (vec_len, GET_MODE_INNER (sel_mode)); - insn_code icode = code_for_pred_cmp_scalar (sel_mode); - rtx cmp = gen_rtx_fmt_ee (LTU, mask_mode, sel, x); - rtx ops[] = {mask, cmp, sel, x}; - emit_vlmax_insn (icode, COMPARE_OP, ops); + if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 256)) +{ + rtx sel = vec_perm_indices_to_rtx (sel_mode, d->perm); + rtx x = gen_int_mode (vec_len, GET_MODE_INNER (sel_mode)); + insn_code icode = code_for_pred_cmp_scalar (sel_mode); + rtx cmp = gen_rtx_fmt_ee (LTU, mask_mode, sel, x); + rtx ops[] = {mask, cmp, sel, x}; + emit_vlmax_insn (icode, COMPARE_OP, ops); +} + else +{ + /* For EEW8 and NUNITS may be larger than 255, we can't use vmsltu +directly to generate the selector mask, instead, we can only use +precomputed mask. + +E.g. selector = <0, 257, 2, 259> for EEW8 vector with NUNITS = 256, we +don't have a QImode scalar register to hold larger than 255. */ + gcc_assert (vec_len.is_constant ()); + int size = CEIL (GET_MODE_NUNITS (mask_mode).to_constant (), 8); + machine_mode mode = get_vector_mode (QImode, size).require (); + rtx tmp = gen_reg_rtx (mode); + rvv_builder v (mode, 1, size); + for (int i = 0; i < vec_len.to_constant () / 8; i++) + { + uint8_t value = 0; + for (int j = 0; j < 8; j++) + { + int index = i * 8 + j; + if (known_lt (d->perm[index], 256)) + value |= 1 << j; + } + v.quick_push (gen_int_mode (value, QImode)); + } + emit_move_insn (tmp, v.build ()); + emit_move_insn (mask, gen_lowpart (mask_mode, tmp)); +} /* TARGET = MASK ? OP0 : OP1. */ /* swap op0 and op1 since the order is opposite to pred_merge. */ diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/bug-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/bug-1.c new file mode 100644 index 000..88059971503 --- /dev/null +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/bug-1.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax -fno-vect-cost-model -O3 -fdump-tree-optimized" } */ + +#include + +#define N 64 + +typedef struct +{ + unsigned char a; + unsigned char b; +} s; + +int +main1 (s *arr) +{ + s *ptr = arr; + s res[N]; + int i; + + for (i = 0; i < N; i++) +{ + res[i].a = ptr->b - ptr->a; + res[i].b = ptr->b + ptr->a; + ptr++; +} +/* check results: */