https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111317

--- Comment #2 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Pan Li <pa...@gcc.gnu.org>:

https://gcc.gnu.org/g:f6d787c231905063dc3b55ce7028e348b74719be

commit r14-6488-gf6d787c231905063dc3b55ce7028e348b74719be
Author: Juzhe-Zhong <juzhe.zh...@rivai.ai>
Date:   Wed Dec 13 17:21:07 2023 +0800

    Middle-end: Adjust decrement IV style partial vectorization COST model

    Hi, before this patch, the RVV codegen for a simple conversion case is:

    foo:
            ble     a2,zero,.L8
            addiw   a5,a2,-1
            li      a4,6
            bleu    a5,a4,.L6
            srliw   a3,a2,3
            slli    a3,a3,3
            add     a3,a3,a0
            mv      a5,a0
            mv      a4,a1
            vsetivli        zero,8,e16,m1,ta,ma
    .L4:
            vle8.v  v2,0(a5)
            addi    a5,a5,8
            vzext.vf2       v1,v2
            vse16.v v1,0(a4)
            addi    a4,a4,16
            bne     a3,a5,.L4
            andi    a5,a2,-8
            beq     a2,a5,.L10
    .L3:
            slli    a4,a5,32
            srli    a4,a4,32
            subw    a2,a2,a5
            slli    a2,a2,32
            slli    a5,a4,1
            srli    a2,a2,32
            add     a0,a0,a4
            add     a1,a1,a5
            vsetvli zero,a2,e16,m1,ta,ma
            vle8.v  v2,0(a0)
            vzext.vf2       v1,v2
            vse16.v v1,0(a1)
    .L8:
            ret
    .L10:
            ret
    .L6:
            li      a5,0
            j       .L3

    The vectorized code goes through the first loop:

            vsetivli        zero,8,e16,m1,ta,ma
    .L4:
            vle8.v  v2,0(a5)
            addi    a5,a5,8
            vzext.vf2       v1,v2
            vse16.v v1,0(a4)
            addi    a4,a4,16
            bne     a3,a5,.L4

    Each iteration processes 8 elements.

    For scalable vectorization on a CPU with VLEN >= 128 bits, this is fine
    when VLEN = 128. But as soon as VLEN > 128 bits, it wastes CPU resources:
    with VLEN = 256 bits, for example, only half of the vector units are
    working and the other half is idle.
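
    The arithmetic behind the "half idle" remark can be sketched as follows.
    This is only an illustration: the helper name utilization_percent is
    hypothetical, and it assumes a single LMUL=1 register and the fixed
    8 x e16 configuration set by the vsetivli above.

```c
#include <assert.h>

/* One iteration of the fixed-width loop handles 8 elements of 16 bits
   (vsetivli zero,8,e16,m1), i.e. 128 bits of vector data. */
enum { FIXED_ELEMS = 8, ELEM_BITS = 16 };

/* Percentage of one LMUL=1 vector register that the fixed 8-element
   loop fills on hardware with the given VLEN (in bits).
   Hypothetical helper, for illustration only. */
unsigned utilization_percent(unsigned vlen_bits)
{
    assert(vlen_bits >= 128);  /* assumption: VLEN is at least 128 bits */
    unsigned used_bits = FIXED_ELEMS * ELEM_BITS;
    return used_bits * 100u / vlen_bits;
}
```

    Under these assumptions the register is fully used at VLEN = 128 and
    half used at VLEN = 256.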

    After investigation, I realized that I forgot to adjust the COST for
    SELECT_VL. So, adjust the COST for SELECT_VL style length vectorization
    from 3 to 2, since after this patch:

    foo:
            ble     a2,zero,.L5
    .L3:
            vsetvli a5,a2,e16,m1,ta,ma     -----> SELECT_VL cost.
            vle8.v  v2,0(a0)
            slli    a4,a5,1                -----> additional shift of the SELECT_VL result for memory address calculation.
            vzext.vf2       v1,v2
            sub     a2,a2,a5
            vse16.v v1,0(a1)
            add     a0,a0,a5
            add     a1,a1,a4
            bne     a2,zero,.L3
    .L5:
            ret
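
    For readers who prefer C to assembly, the loop above behaves roughly
    like the model below. It is a sketch, not the compiler's output and not
    the test case: the names foo_scalar, select_vl and foo_select_vl are
    hypothetical, the shape of the source loop is inferred from the
    vle8.v / vzext.vf2 / vse16.v sequence, and select_vl simplifies vsetvli
    to min(remaining, VLMAX), which is one behaviour the ISA permits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the source loop: zero-extend each byte to 16 bits. */
void foo_scalar(const uint8_t *src, uint16_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Simplified model of `vsetvli a5,a2,e16,m1,ta,ma`:
   the number of elements to process in this iteration. */
size_t select_vl(size_t remaining, size_t vlmax)
{
    return remaining < vlmax ? remaining : vlmax;
}

/* Model of the post-patch loop. vlmax stands for VLEN / 16 (e16, m1).
   Returns the number of loop iterations executed. */
size_t foo_select_vl(const uint8_t *src, uint16_t *dst, int n, size_t vlmax)
{
    assert(vlmax > 0);
    if (n <= 0)                       /* ble a2,zero,.L5 */
        return 0;

    size_t remaining = (size_t)n;
    size_t iterations = 0;
    do {
        size_t vl = select_vl(remaining, vlmax);   /* SELECT_VL */
        for (size_t i = 0; i < vl; i++)            /* vle8.v, vzext.vf2, vse16.v */
            dst[i] = src[i];
        remaining -= vl;                           /* sub a2,a2,a5 */
        src += vl;                                 /* add a0,a0,a5 */
        dst += vl;   /* add a1,a1,a4: C scales by sizeof(uint16_t),
                        which is the `slli a4,a5,1` in the assembly */
        iterations++;
    } while (remaining != 0);                      /* bne a2,zero,.L3 */
    return iterations;
}
```

    In this model, 20 elements take two iterations when vlmax is 16
    (VLEN = 256) and three when vlmax is 8 (VLEN = 128), so a wider vector
    unit is used in full without a separate tail loop.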

    This patch is a simple fix that I previously forgot.

    Ok for trunk ?

    If not, I am going to adjust cost in backend cost model.

            PR target/111317

    gcc/ChangeLog:

            * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Adjust
            COST for decrement IV.

    gcc/testsuite/ChangeLog:

            * gcc.dg/vect/costmodel/riscv/rvv/pr111317.c: New test.