On Thu, 14 Jan 2021, Qing Zhao wrote:

Hi, 
More data on code size and compilation time with CPU2017:

********Compilation time data: the numbers are the compile-time slowdown
against the default “no”:

benchmarks       A/no     D/no

500.perlbench_r  5.19%    1.95%
502.gcc_r        0.46%   -0.23%
505.mcf_r        0.00%    0.00%
520.omnetpp_r    0.85%    0.00%
523.xalancbmk_r  0.79%   -0.40%
525.x264_r      -4.48%    0.00%
531.deepsjeng_r 16.67%   16.67%
541.leela_r      0.00%    0.00%
557.xz_r         0.00%    0.00%

507.cactuBSSN_r  1.16%    0.58%
508.namd_r       9.62%    8.65%
510.parest_r     0.48%    1.19%
511.povray_r     3.70%    3.70%
519.lbm_r        0.00%    0.00%
521.wrf_r        0.05%    0.02%
526.blender_r    0.33%    1.32%
527.cam4_r      -0.93%   -0.93%
538.imagick_r    1.32%    3.95%
544.nab_r        0.00%    0.00%

From the above data, it looks like the compile-time impact of
implementations A and D is almost the same.

********Code size data: the numbers are the code size increase against the
default “no”:
benchmarks       A/no     D/no

500.perlbench_r  2.84%    0.34%
502.gcc_r        2.59%    0.35%
505.mcf_r        3.55%    0.39%
520.omnetpp_r    0.54%    0.03%
523.xalancbmk_r  0.36%    0.39%
525.x264_r       1.39%    0.13%
531.deepsjeng_r  2.15%   -1.12%
541.leela_r      0.50%   -0.20%
557.xz_r         0.31%    0.13%

507.cactuBSSN_r  5.00%   -0.01%
508.namd_r       3.64%   -0.07%
510.parest_r     1.12%    0.33%
511.povray_r     4.18%    1.16%
519.lbm_r        8.83%    6.44%
521.wrf_r        0.08%    0.02%
526.blender_r    1.63%    0.45%
527.cam4_r       0.16%    0.06%
538.imagick_r    3.18%   -0.80%
544.nab_r        5.76%   -1.11%
Avg              2.52%    0.36%

From the above data, implementation D is almost always better than A for
code size, which is surprising to me; I am not sure of the reason for this.

D probably inhibits most interesting loop transforms (check SPEC FP
performance).  It will also most definitely disallow SRA which, when
an aggregate is not completely elided, tends to grow code.
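To make the SRA point concrete, here is a small hedged C sketch (the struct
and function names are invented for illustration, not taken from any
benchmark) of the kind of code where forced whole-aggregate initialization
blocks scalar replacement:

```c
#include <assert.h>

/* Illustrative only: a large aggregate of which a single field is used.
   Without forced initialization, SRA (scalar replacement of aggregates)
   can replace the whole struct with one scalar for 'id', and the
   aggregate never exists.  If the aggregate is zeroed as a unit up
   front (as approach D forces), it presumably must stay an aggregate,
   and the stores that fill the unused 128-byte array are emitted just
   to be thrown away -- the code growth referred to above.  */
struct big {
    double coeff[16];   /* never read or written by the user code */
    int    id;
};

int make_id(int seed) {
    struct big b;
    b.id = seed * 2;    /* the only member actually used */
    return b.id;
}
```

Under approach A the inserted zeroing is ordinary code, so SRA and dead-store
elimination can presumably still shrink `b` down to a single scalar.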

********Stack usage data: I added -fstack-usage to the compilation line
when compiling the CPU2017 benchmarks, so a *.su file recording the stack
size of each function was generated for each module.
Since there are a lot of such files, I picked one benchmark, 511.povray,
to check: it is the one with the largest runtime overhead when the
initialization is added (for both A and D).

I identified all the *.su files that differ between A and D and ran diff
on those files; it looks like the stack size is much higher with D than
with A, for example:

$ diff build_base_auto_init.D.0000/bbox.su build_base_auto_init.A.0000/bbox.su
5c5
< bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 160 static
---
> bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 96 static

$ diff build_base_auto_init.D.0000/image.su build_base_auto_init.A.0000/image.su
9c9
< image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624 static
---
> image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 272 static
….
It looks like implementation D has a larger stack size impact than A.

Do you have any insight into the reason for this?

D will keep all initialized aggregates as aggregates and live, which
means stack will be allocated for them.  With A the usual optimizations
to reduce stack usage can be applied.
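A hedged sketch of the stack-slot sharing that Richard refers to (function
name and sizes are invented; the mechanism is GCC's reuse of one slot for
locals whose lifetimes do not overlap):

```c
#include <assert.h>

/* Two large locals with disjoint lifetimes.  Normally the compiler
   gives them one shared stack slot, since 'a' is dead before 'b' is
   born, so the frame needs roughly one array's worth of stack.  If
   each array is kept live as a whole because of a forced
   whole-aggregate initialization, slot sharing is presumably defeated
   and the frame needs space for both -- the kind of jump visible in
   the .su diffs above (96 -> 160 and 272 -> 624 bytes).  */
long sum_two_phases(int n) {
    long total = 0;
    {
        long a[512];
        for (int i = 0; i < 512; i++) a[i] = i % (n + 1);
        for (int i = 0; i < 512; i++) total += a[i];
    }   /* 'a' is dead from here on */
    {
        long b[512];
        for (int i = 0; i < 512; i++) b[i] = i;
        for (int i = 0; i < 512; i++) total -= b[i];
    }
    return total;
}
```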

Let me know if you have any comments and suggestions.

First of all I would check whether the prototype implementations
work as expected.

Richard.


Thanks.

Qing
      On Jan 13, 2021, at 1:39 AM, Richard Biener <rguent...@suse.de>
      wrote:

      On Tue, 12 Jan 2021, Qing Zhao wrote:

            Hi, 

            Just check in to see whether you have any comments
            and suggestions on this:

            FYI, I have continued with the Approach D
            implementation since last week:

            D. Adding calls to .DEFERRED_INIT during
            gimplification, and expanding the
            .DEFERRED_INIT calls to real initialization
            during expand.  Adjusting the uninitialized
            pass to handle the new references to
            “.DEFERRED_INIT”.

            The remaining work for Approach D was:

            ** complete the implementation of
            -ftrivial-auto-var-init=pattern;
            ** complete the uninitialized-warning
            maintenance work for D.

            I have completed the uninitialized-warning
            maintenance work for D,
            and part of the -ftrivial-auto-var-init=pattern
            implementation is finished.

            The following work on Approach D remains:

              ** -ftrivial-auto-var-init=pattern for VLAs;
              ** add a new variable attribute,
            __attribute__((uninitialized)), to mark a
            variable as intentionally left uninitialized
            for performance reasons;
              ** add complete testing cases.
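A sketch of how the proposed opt-out attribute might be used (hedged: the
spelling mirrors the existing Clang `uninitialized` attribute; the macro
name is invented, and the guard makes the example compile even where the
attribute is not supported):

```c
#include <assert.h>

/* Fall back to a no-op where the attribute is unknown.  */
#if defined(__has_attribute)
#  if __has_attribute(uninitialized)
#    define NO_AUTO_INIT __attribute__((uninitialized))
#  endif
#endif
#ifndef NO_AUTO_INIT
#  define NO_AUTO_INIT /* attribute unavailable: no-op */
#endif

int fill(void) {
    /* Deliberately exempt from -ftrivial-auto-var-init: the buffer is
       fully written before any read, so zeroing it up front would be
       pure overhead.  */
    int buf[256] NO_AUTO_INIT;
    for (int i = 0; i < 256; i++) buf[i] = i;
    return buf[255];
}
```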


            Please let me know if you have any objection
            to my current decision to implement Approach D.


      Did you do any analysis on how stack usage and code size are
      changed 
      with approach D?  How does compile-time behave (we could gobble
      up
      lots of .DEFERRED_INIT calls I guess)?

      Richard.

            Thanks a lot for your help.

            Qing


                  On Jan 5, 2021, at 1:05 PM, Qing Zhao
                  via Gcc-patches
                  <gcc-patches@gcc.gnu.org> wrote:

                  Hi,

                  This is an update for our previous
                  discussion. 

                  1. I implemented the following two
                  different implementations in the latest
                  upstream gcc:

                   A. Adding real initialization during
                   gimplification, not maintaining the
                   uninitialized warnings.

                   D. Adding calls to .DEFERRED_INIT
                   during gimplification, and expanding
                   the .DEFERRED_INIT calls to real
                   initialization during expand.
                   Adjusting the uninitialized pass to
                   handle the new references to
                   “.DEFERRED_INIT”.

                   Note, in this initial implementation:
                   ** I ONLY implemented
                   -ftrivial-auto-var-init=zero; the
                   implementation of
                   -ftrivial-auto-var-init=pattern
                   is not done yet.  Therefore, the
                   performance data is only about
                   -ftrivial-auto-var-init=zero.
                   ** I added a temporary option,
                   -fauto-var-init-approach=A|B|C|D, to
                   choose implementation A or D for the
                   runtime performance study.
                   ** I didn’t finish the
                   uninitialized-warning maintenance
                   work for D.  (That might take more
                   time than I expected.)

                  2. I collected runtime data for CPU2017
                  on a x86 machine with this new gcc for
                  the following 3 cases:

                  no: default. (-g -O2 -march=native )
                  A:  default +
                   -ftrivial-auto-var-init=zero
                  -fauto-var-init-approach=A 
                  D:  default +
                   -ftrivial-auto-var-init=zero
                  -fauto-var-init-approach=D 

                  And then compute the slowdown data for
                  both A and D as following:

                   benchmarks       A/no     D/no

                   500.perlbench_r  1.25%    1.25%
                   502.gcc_r        0.68%    1.80%
                   505.mcf_r        0.68%    0.14%
                   520.omnetpp_r    4.83%    4.68%
                   523.xalancbmk_r  0.18%    1.96%
                   525.x264_r       1.55%    2.07%
                   531.deepsjeng_r 11.57%   11.85%
                   541.leela_r      0.64%    0.80%
                   557.xz_r        -0.41%   -0.41%

                   507.cactuBSSN_r  0.44%    0.44%
                   508.namd_r       0.34%    0.34%
                   510.parest_r     0.17%    0.25%
                   511.povray_r    56.57%   57.27%
                   519.lbm_r        0.00%    0.00%
                   521.wrf_r       -0.28%   -0.37%
                   526.blender_r   16.96%   17.71%
                   527.cam4_r       0.70%    0.53%
                   538.imagick_r    2.40%    2.40%
                   544.nab_r        0.00%   -0.65%

                   avg              5.17%    5.37%

                   From the above data, we can see that
                   in general the runtime slowdowns of
                   implementations A and D are similar
                   for individual benchmarks.

                   Several benchmarks show a significant
                   slowdown from the newly added
                   initialization for both A and D, for
                   example 511.povray_r, 526.blender_r,
                   and 531.deepsjeng_r.  I will try to
                   study a bit more which of the new
                   initializations introduced such
                   slowdowns.

                   From the study so far, I think that
                   Approach D should be good enough for
                   our final implementation.
                   So, I will try to finish Approach D
                   with the following remaining work:

                       ** complete the implementation of
                   -ftrivial-auto-var-init=pattern;
                       ** complete the
                   uninitialized-warning maintenance
                   work for D.


                  Let me know if you have any comments and
                  suggestions on my current and future
                  work.

                  Thanks a lot for your help.

                  Qing

                        On Dec 9, 2020, at 10:18 AM,
                        Qing Zhao via Gcc-patches
                        <gcc-patches@gcc.gnu.org>
                        wrote:

                        The following are the
                        approaches I will implement
                        and compare:

                        Our final goal is to keep
                        the uninitialized warning
                        and minimize the run-time
                        performance cost.

                        A. Adding real initialization
                        during gimplification, not
                        maintaining the uninitialized
                        warnings.
                        B. Adding real initialization
                        during gimplification, marking it
                        with “artificial_init”;
                          adjusting the uninitialized
                        pass, maintaining the annotation,
                        and making sure the real init is
                          not deleted as a fake init.
                        C. Marking the DECL of an
                        uninitialized auto variable as
                        “no_explicit_init” during
                        gimplification,
                           maintaining this
                        “no_explicit_init” bit until
                        after
                        pass_late_warn_uninitialized,
                        or until pass_expand, and
                           adding real initialization
                        for all DECLs that are marked
                        with “no_explicit_init”.
                        D. Adding .DEFERRED_INIT during
                        gimplification, and expanding
                        the .DEFERRED_INIT calls to
                          real initialization during
                        expand.  Adjusting the
                        uninitialized pass to handle the
                        new references to
                        “.DEFERRED_INIT”.
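As an editorial illustration of the difference between A and D (not GCC
source; the .DEFERRED_INIT call shown in the comment is a sketch of the
internal-function form, not its exact signature):

```c
#include <assert.h>
#include <string.h>

struct point { int x, y; };

/* Approach A, conceptually: the gimplifier emits a real store up
   front.  Later passes see an ordinary initialization, so the
   "is it used uninitialized?" question can no longer be asked --
   which is why A loses the warnings.  */
int use_a(void) {
    struct point p;
    memset(&p, 0, sizeof p);      /* real init inserted early */
    return p.x + p.y;
}

/* Approach D, conceptually: the gimplifier instead emits an opaque
   internal call, something like
       p = .DEFERRED_INIT (sizeof p, ZERO);
   which the uninitialized-warning passes are taught to look through,
   and which expand finally lowers to the same zeroing.  At the C
   level the final code is equivalent to: */
int use_d(void) {
    struct point p = {0};         /* what expand ultimately emits */
    return p.x + p.y;
}
```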


                        Of the above, Approach A will be
                        the one with the minimum run-time
                        cost and will be the baseline for
                        the performance comparison.

                        I will implement Approach D
                        first; it is expected to have the
                        most run-time overhead of B, C,
                        and D, but its implementation
                        should be the cleanest of the
                        three.  Let’s see how much
                        performance overhead this
                        approach adds.  If the data is
                        good, maybe we can avoid the
                        effort of implementing B and C.

                        If the performance of D is not
                        good, I will implement B or C at
                        that time.

                        Let me know if you have any
                        comments or suggestions.

                        Thanks.

                        Qing





      -- 
      Richard Biener <rguent...@suse.de>
      SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
      Nuernberg,
      Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)



