On Thu, 14 Jan 2021, Qing Zhao wrote:
Hi,

More data on code size and compilation time with CPU2017:

******** Compilation time data: the numbers are the slowdown against the default “no”:

benchmarks          A/no      D/no
500.perlbench_r     5.19%     1.95%
502.gcc_r           0.46%    -0.23%
505.mcf_r           0.00%     0.00%
520.omnetpp_r       0.85%     0.00%
523.xalancbmk_r     0.79%    -0.40%
525.x264_r         -4.48%     0.00%
531.deepsjeng_r    16.67%    16.67%
541.leela_r         0.00%     0.00%
557.xz_r            0.00%     0.00%
507.cactuBSSN_r     1.16%     0.58%
508.namd_r          9.62%     8.65%
510.parest_r        0.48%     1.19%
511.povray_r        3.70%     3.70%
519.lbm_r           0.00%     0.00%
521.wrf_r           0.05%     0.02%
526.blender_r       0.33%     1.32%
527.cam4_r         -0.93%    -0.93%
538.imagick_r       1.32%     3.95%
544.nab_r           0.00%     0.00%

From the above data, it looks like the compilation time impact of implementations A and D is almost the same.

******** Code size data: the numbers are the code size increase against the default “no”:

benchmarks          A/no      D/no
500.perlbench_r     2.84%     0.34%
502.gcc_r           2.59%     0.35%
505.mcf_r           3.55%     0.39%
520.omnetpp_r       0.54%     0.03%
523.xalancbmk_r     0.36%     0.39%
525.x264_r          1.39%     0.13%
531.deepsjeng_r     2.15%    -1.12%
541.leela_r         0.50%    -0.20%
557.xz_r            0.31%     0.13%
507.cactuBSSN_r     5.00%    -0.01%
508.namd_r          3.64%    -0.07%
510.parest_r        1.12%     0.33%
511.povray_r        4.18%     1.16%
519.lbm_r           8.83%     6.44%
521.wrf_r           0.08%     0.02%
526.blender_r       1.63%     0.45%
527.cam4_r          0.16%     0.06%
538.imagick_r       3.18%    -0.80%
544.nab_r           5.76%    -1.11%
Avg                 2.52%     0.36%

From the above data, implementation D is almost always better than A (523.xalancbmk_r is the one exception). This is surprising to me, and I am not sure what the reason is.
D probably inhibits most interesting loop transforms (check SPEC FP performance). It will also most definitely disallow SRA which, when an aggregate is not completely elided, tends to grow code.
******** Stack usage data: I added -fstack-usage to the compilation line when compiling the CPU2017 benchmarks, and *.su files were generated for each of the modules.

Since there are a lot of such files, and the stack size information is embedded in each of them, I just picked one benchmark, 511.povray, to check. It is the one that has the most runtime overhead when adding initialization (both A and D).

I identified all the *.su files that differ between A and D and ran a diff on them. It looks like the stack size is much higher with D than with A, for example:

$ diff build_base_auto_init.D.0000/bbox.su build_base_auto_init.A.0000/bbox.su
5c5
< bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 160 static
---
> bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 96 static

$ diff build_base_auto_init.D.0000/image.su build_base_auto_init.A.0000/image.su
9c9
< image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624 static
---
> image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 272 static
….

It looks like implementation D has more stack size impact than A. Do you have any insight into the reason for this?
D will keep all initialized aggregates as aggregates and live, which means stack will be allocated for them. With A the usual optimizations to reduce stack usage can be applied.
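The per-file diffs quoted above can be automated. The following is a minimal sketch, not part of the original measurements: it assumes each .su line ends with the byte count and a qualifier (as in the lines quoted in this thread), and the helper names `parse_su_text`, `stack_growth`, and `compare_build_dirs` are made up for illustration. The sample data is copied from the two diffs above.

```python
from pathlib import Path


def parse_su_text(text):
    """Map each function description to its stack size in bytes.

    Assumes every line ends with "<bytes> <qualifier>", as in the
    .su lines quoted in this thread.
    """
    usage = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # split off the last two fields
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip anything that does not look like a .su record
        name, size, _qualifier = parts
        usage[name] = int(size)
    return usage


def stack_growth(base, other):
    """Return (function, base_bytes, other_bytes) rows that differ,
    sorted by growth from base to other, largest first."""
    rows = [(f, base[f], other[f])
            for f in base.keys() & other.keys() if base[f] != other[f]]
    return sorted(rows, key=lambda r: r[2] - r[1], reverse=True)


def compare_build_dirs(dir_a, dir_d):
    """Hypothetical driver: compare same-named .su files in two build trees."""
    for su_a in sorted(Path(dir_a).rglob("*.su")):
        su_d = Path(dir_d) / su_a.relative_to(dir_a)
        if not su_d.exists():
            continue
        rows = stack_growth(parse_su_text(su_a.read_text()),
                            parse_su_text(su_d.read_text()))
        for func, a_bytes, d_bytes in rows:
            print(f"{su_a.name}: {func}: A={a_bytes} D={d_bytes}")


# Sample records taken from the diffs quoted in this thread.
SAMPLE_A = """\
bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 96 static
image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 272 static
"""
SAMPLE_D = """\
bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 160 static
image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624 static
"""

rows = stack_growth(parse_su_text(SAMPLE_A), parse_su_text(SAMPLE_D))
for func, a_bytes, d_bytes in rows:
    print(f"{func}: A={a_bytes} D={d_bytes} (+{d_bytes - a_bytes})")
```

On the real build trees one would presumably call `compare_build_dirs("build_base_auto_init.A.0000", "build_base_auto_init.D.0000")`; sorting by growth puts functions like pov::bump_map, where D keeps more aggregates live, at the top.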
Let me know if you have any comments and suggestions.
First of all I would check whether the prototype implementations work as expected.

Richard.
thanks.

Qing

On Jan 13, 2021, at 1:39 AM, Richard Biener <rguent...@suse.de> wrote:

On Tue, 12 Jan 2021, Qing Zhao wrote:

Hi,

Just checking in to see whether you have any comments and suggestions on this:

FYI, I have been continuing with the Approach D implementation since last week:

D. Adding calls to .DEFERRED_INIT during gimplification, expanding .DEFERRED_INIT to real initialization during expand. Adjusting the uninitialized pass for the new refs to “.DEFERRED_INIT”.

For the remaining work of Approach D:

** complete the implementation of -ftrivial-auto-var-init=pattern;
** complete the implementation of uninitialized warnings maintenance work for D.

I have completed the uninitialized warnings maintenance work for D, and finished part of the -ftrivial-auto-var-init=pattern implementation.

The following is the remaining work of Approach D:

** -ftrivial-auto-var-init=pattern for VLA;
** add a new attribute for variables: __attribute__((uninitialized)); the marked variable is intentionally left uninitialized for performance purposes;
** add complete test cases.

Please let me know if you have any objection to my current decision to implement approach D.

Did you do any analysis on how stack usage and code size are changed with approach D? How does compile-time behave (we could gobble up lots of .DEFERRED_INIT calls I guess)?

Richard.

Thanks a lot for your help.

Qing

On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:

Hi,

This is an update for our previous discussion.

1. I implemented the following two different implementations in the latest upstream gcc:

A. Adding real initialization during gimplification, not maintaining the uninitialized warnings.

D. Adding calls to .DEFERRED_INIT during gimplification, expanding .DEFERRED_INIT to real initialization during expand. Adjusting the uninitialized pass for the new refs to “.DEFERRED_INIT”.
Note, in this initial implementation,

** I ONLY implemented -ftrivial-auto-var-init=zero; the implementation of -ftrivial-auto-var-init=pattern is not done yet. Therefore, the performance data is only about -ftrivial-auto-var-init=zero.
** I added a temporary option -fauto-var-init-approach=A|B|C|D to choose implementation A or D for the runtime performance study.
** I didn’t finish the uninitialized warnings maintenance work for D. (That might take more time than I expected.)

2. I collected runtime data for CPU2017 on an x86 machine with this new gcc for the following 3 cases:

no: default (-g -O2 -march=native)
A:  default + -ftrivial-auto-var-init=zero -fauto-var-init-approach=A
D:  default + -ftrivial-auto-var-init=zero -fauto-var-init-approach=D

And then computed the slowdown data for both A and D as follows:

benchmarks          A/no      D/no
500.perlbench_r     1.25%     1.25%
502.gcc_r           0.68%     1.80%
505.mcf_r           0.68%     0.14%
520.omnetpp_r       4.83%     4.68%
523.xalancbmk_r     0.18%     1.96%
525.x264_r          1.55%     2.07%
531.deepsjeng_     11.57%    11.85%
541.leela_r         0.64%     0.80%
557.xz_            -0.41%    -0.41%
507.cactuBSSN_r     0.44%     0.44%
508.namd_r          0.34%     0.34%
510.parest_r        0.17%     0.25%
511.povray_r       56.57%    57.27%
519.lbm_r           0.00%     0.00%
521.wrf_r          -0.28%    -0.37%
526.blender_r      16.96%    17.71%
527.cam4_r          0.70%     0.53%
538.imagick_r       2.40%     2.40%
544.nab_r           0.00%    -0.65%
avg                 5.17%     5.37%

From the above data, we can see that in general, the runtime slowdown of implementations A and D is similar for individual benchmarks.

Several benchmarks have a significant slowdown with the newly added initialization for both A and D, for example 511.povray_r, 526.blender_, and 531.deepsjeng_r. I will try to study a little more which kinds of new initializations introduced such slowdowns.

From the study so far, I think that approach D should be good enough for our final implementation.
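For readers who want to re-derive the "avg" row, here is a minimal sketch. The thread does not state the exact formula, so `slowdown_percent` encodes an assumed definition (relative increase in run time over the "no" build), and the function name is made up for illustration. The two lists are the A/no and D/no columns of the runtime table above, in the order listed.

```python
from statistics import mean


def slowdown_percent(base_seconds, variant_seconds):
    """Assumed definition: relative run-time increase over the "no" build."""
    return (variant_seconds - base_seconds) * 100.0 / base_seconds


# A/no and D/no columns copied from the runtime table in this thread.
A_OVER_NO = [1.25, 0.68, 0.68, 4.83, 0.18, 1.55, 11.57, 0.64, -0.41, 0.44,
             0.34, 0.17, 56.57, 0.00, -0.28, 16.96, 0.70, 2.40, 0.00]
D_OVER_NO = [1.25, 1.80, 0.14, 4.68, 1.96, 2.07, 11.85, 0.80, -0.41, 0.44,
             0.34, 0.25, 57.27, 0.00, -0.37, 17.71, 0.53, 2.40, -0.65]

# The "avg" row is consistent with a plain arithmetic mean of each column.
avg_a = round(mean(A_OVER_NO), 2)
avg_d = round(mean(D_OVER_NO), 2)
print(f"avg A/no = {avg_a:.2f}%, avg D/no = {avg_d:.2f}%")
# prints: avg A/no = 5.17%, avg D/no = 5.37%
```

Note that an arithmetic mean is dominated by the few outliers (511.povray_r alone contributes about 3 percentage points to each average), which is worth keeping in mind when comparing the two averages.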
So, I will try to finish approach D with the following remaining work:

** complete the implementation of -ftrivial-auto-var-init=pattern;
** complete the implementation of uninitialized warnings maintenance work for D.

Let me know if you have any comments and suggestions on my current and future work.

Thanks a lot for your help.

Qing

On Dec 9, 2020, at 10:18 AM, Qing Zhao via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:

The following are the approaches I will implement and compare:

Our final goal is to keep the uninitialized warning and minimize the run-time performance cost.

A. Adding real initialization during gimplification, not maintaining the uninitialized warnings.

B. Adding real initialization during gimplification, marking it with “artificial_init”. Adjusting the uninitialized pass, maintaining the annotation, making sure the real init is not deleted from the fake init.

C. Marking the DECL for an uninitialized auto variable as “no_explicit_init” during gimplification, maintaining this “no_explicit_init” bit till after pass_late_warn_uninitialized, or till pass_expand, then adding real initialization for all DECLs that are marked with “no_explicit_init”.

D. Adding .DEFERRED_INIT during gimplification, expanding .DEFERRED_INIT to real initialization during expand. Adjusting the uninitialized pass for the new refs to “.DEFERRED_INIT”.

In the above, approach A will be the one that has the minimum run-time cost, and it will be the base for the performance comparison.

I will implement approach D next. This one is expected to have the most run-time overhead among the above list, but its implementation should be the cleanest among B, C, and D. Let’s see how much more performance overhead this approach has. If the data is good, maybe we can avoid the effort of implementing B and C.

If the performance of D is not good, I will implement B or C at that time.

Let me know if you have any comments or suggestions.

Thanks.
Qing

--
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)