[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

cvs-commit at gcc dot gnu.org via Gcc-bugs Mon, 02 Nov 2020 22:29:49 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789


--- Comment #36 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Kewen Lin <li...@gcc.gnu.org>:

https://gcc.gnu.org/g:f5e18dd9c7dacc9671044fc669bd5c1b26b6bdba

commit r11-4637-gf5e18dd9c7dacc9671044fc669bd5c1b26b6bdba
Author: Kewen Lin <li...@gcc.gnu.org>
Date:   Tue Nov 3 02:51:47 2020 +0000

    pass: Run cleanup passes before SLP [PR96789]

    As discussed in PR96789, some scalar stmts can be eliminated by
    passes that run after SLP, but we still model their costs when
    trying to SLP, which can impact the vectorizer's decision.  One
    typical case is the one in PR96789 on target Power.

    As Richard suggested there, this patch introduces a pass called
    pre_slp_scalar_cleanup which runs some secondary cleanup passes,
    for now FRE and DSE.  It also introduces a new group of TODO
    flags called pending TODO flags.  Unlike normal TODO flags,
    pending TODO flags are passed down the pipeline until one of
    their consumers can perform the requested action.  Consumers
    should then clear the flags for the actions they have taken.

    Some compilation time statistics on all SPEC2017 INT benchmarks
    were collected on one Power9 machine for the option sets below:
      A1: -Ofast -funroll-loops
      A2: -O1
      A3: -O1 -funroll-loops
      A4: -O2
      A5: -O2 -funroll-loops

    The corresponding increase in compilation time is trivial:
      A1       A2       A3        A4        A5
      0.08%    0.00%    -0.38%    -0.10%    -0.05%

    Bootstrapped/regtested on powerpc64le-linux-gnu P8.

    gcc/ChangeLog:

            PR tree-optimization/96789
            * function.h (struct function): New member unsigned pending_TODOs.
            * passes.c (class pass_pre_slp_scalar_cleanup): New class.
            (make_pass_pre_slp_scalar_cleanup): New function.
            (pass_data_pre_slp_scalar_cleanup): New pass data.
            * passes.def (pass_pre_slp_scalar_cleanup): New pass, add
            pass_fre and pass_dse as its children.
            * timevar.def (TV_SCALAR_CLEANUP): New timevar.
            * tree-pass.h (PENDING_TODO_force_next_scalar_cleanup): New
            pending TODO flag.
            (make_pass_pre_slp_scalar_cleanup): New declare.
            * tree-ssa-loop-ivcanon.c (tree_unroll_loops_completely_1):
            Once any outermost loop gets unrolled, flag cfun pending_TODOs
            PENDING_TODO_force_next_scalar_cleanup on.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/96789
            * gcc.dg/tree-ssa/ssa-dse-28.c: Adjust.
            * gcc.dg/tree-ssa/ssa-dse-29.c: Likewise.
            * gcc.dg/vect/bb-slp-41.c: Likewise.
            * gcc.dg/tree-ssa/pr96789.c: New test.
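
The pending TODO flag handshake described in the commit message can be sketched as below. This is a minimal illustration, not the GCC implementation: struct function_sketch, note_outermost_loop_unrolled and pre_slp_scalar_cleanup_gate are hypothetical names, and the bit value chosen for the flag is an assumption. Only the flag name and the pending_TODOs member come from the ChangeLog.

```c
#include <assert.h>

/* Flag name taken from the ChangeLog; the bit value is an assumption.  */
#define PENDING_TODO_force_next_scalar_cleanup (1u << 1)

/* Hypothetical stand-in for GCC's struct function; only the new
   pending_TODOs member is modelled.  */
struct function_sketch
{
  unsigned pending_TODOs;
};

/* Producer side (hypothetical helper): once an outermost loop has been
   completely unrolled, request a later scalar cleanup.  The flag stays
   set until a consumer acts on it.  */
void
note_outermost_loop_unrolled (struct function_sketch *fn)
{
  assert (fn);
  fn->pending_TODOs |= PENDING_TODO_force_next_scalar_cleanup;
}

/* Consumer side (hypothetical helper): return 1 if the cleanup passes
   (FRE and DSE in the commit) should run, and clear the flag so the
   request is serviced only once.  */
int
pre_slp_scalar_cleanup_gate (struct function_sketch *fn)
{
  assert (fn);
  if (!(fn->pending_TODOs & PENDING_TODO_force_next_scalar_cleanup))
    return 0;
  fn->pending_TODOs &= ~PENDING_TODO_force_next_scalar_cleanup;
  return 1;
}
```

In this sketch a function that never had an outermost loop unrolled skips the extra cleanup entirely, which may be one reason the measured compilation time impact above stays small.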
