Reply: Re: Is fcommon related with performance optimization logic?
Sorry for replying from another e-mail address; I had a network issue.

I tried the -fsection-anchors option, but it does not apply to the target.

Best regards,
Clark Zhao

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

From: 赵海峰 [mailto:zju@qq.com]
Sent: May 31, 2024 16:51
To: Zhaohaifeng(Clark,CIS-HCE)
Subject: Fw: Re: Is fcommon related with performance optimization logic?

---Original---
From: "David Brown" <david.br...@hesbynett.no>
Date: Thu, May 30, 2024 22:19 PM
To: "Andrew Pinski" <pins...@gmail.com>; "赵海峰" <zju@qq.com>
Cc: "gcc" <gcc@gcc.gnu.org>
Subject: Re: Is fcommon related with performance optimization logic?

On 30/05/2024 04:26, Andrew Pinski via Gcc wrote:
> On Wed, May 29, 2024 at 7:13 PM 赵海峰 via Gcc wrote:
>>
>> Dear Sir/Madam,
>>
>> We found that running on intel SPR UnixBench compiled with gcc 10.3 performs
>> worse than with gcc 8.5 for dhry2reg benchmark.
>>
>> I found it related with -fcommon option which is disabled in 10.3 by
>> default. Fcommon will make global variables addresses in special order in
>> bss section(watching by nm -n) whatever they are defined in source code.
>>
>> We are wondering if fcommon has some special performance optimization
>> process?
>>
>> (I also post the subject to gcc-help. Hope to get some suggestion in this
>> mail list. Sorry for bothering.)
>
> This was already filed as
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114532 . But someone
> needs to go in and do more analysis of what is going wrong. The
> biggest difference for x86_64 is how the variables are laid out and by
> who (the compiler or the linker). There is some notion that
> -fno-common increases the number of L1-dcache-load-misses and that
> points to the layout of the variable differences causing the
> difference. But nobody has gone and seen which variables are laid out
> differently and why. I am suspecting that small changes in the
> code/variables would cause layout differences which will cause the
> cache misses which can cause the performance which is almost all by
> accident.
> I suspect adding -fdata-sections will cause another performance
> difference here too. And there is not much GCC can do about this since
> data layout is "hard" to do to get the best performance always.
>

(I am most familiar with embedded systems with static linking, rather than dealing with GOT and other aspects of linking on big systems.)

I think -fno-common should allow -fsection-anchors to do a much better job. If symbols are put in the common section, the compiler does not know their relative position until link time. But if they are in bss or data sections (with or without -fdata-sections), it can at least use anchors to access data in the translation unit that defines the data objects.

David

> Thanks,
> Andrew Pinski
>
>>
>> Best regards.
>>
>> Clark Zhao
>
Reply: Re: Is fcommon related with performance optimization logic?
Thanks. The relevant UnixBench source code is as follows:

    unsigned long Run_Index;
    Rec_Pointer Ptr_Glob, Next_Ptr_Glob;
    int Int_Glob;
    Boolean Bool_Glob;
    char Ch_1_Glob, Ch_2_Glob;
    int Arr_1_Glob [50];
    int Arr_2_Glob [50] [50];
    Boolean Reg = true;
    long Begin_Time, End_Time, User_Time;
    float Microseconds, Dhrystones_Per_Second;

Some key results are as follows:

1. Using gcc 10.3, the variables are arranged in reverse declaration order, from the last (Dhrystones_Per_Second) to the first (Run_Index), both in the assembly and in the final binary.

    Address     Size    Type  Name
    0x004040c0  0x0008  B     stderr@GLIBC_2.2.5
    0x004040c8  0x0001  b     completed.0
    0x004040e0  0x0004  B     Dhrystones_Per_Second
    0x004040e4  0x0004  B     Microseconds
    0x004040e8  0x0008  B     User_Time
    0x004040f0  0x0008  B     End_Time
    0x004040f8  0x0008  B     Begin_Time
    0x00404100  0x0004  B     Reg
    0x00404120  0x2710  B     Arr_2_Glob
    0x00406840  0x00c8  B     Arr_1_Glob
    0x00406908  0x0001  B     Ch_2_Glob
    0x00406909  0x0001  B     Ch_1_Glob
    0x0040690c  0x0004  B     Bool_Glob
    0x00406910  0x0004  B     Int_Glob
    0x00406918  0x0008  B     Next_Ptr_Glob
    0x00406920  0x0008  B     Ptr_Glob
    0x00406928  0x0008  B     Run_Index

If we change the order of the variables in the source code, the order in the assembly and in the binary changes in the same way with gcc 10.3.

2. Using gcc 8.5, the variables are arranged as follows, both in the assembly and in the final binary:

    Address     Size    Type  Name
    0x004040c0  0x0008  B     stderr@GLIBC_2.2.5
    0x004040c8  0x0001  b     completed.0
    0x004040e0  0x0008  B     Begin_Time
    0x00404100  0x2710  B     Arr_2_Glob
    0x00406810  0x0001  B     Ch_2_Glob
    0x00406818  0x0008  B     Run_Index
    0x00406820  0x0004  B     Microseconds
    0x00406828  0x0008  B     Ptr_Glob
    0x00406830  0x0004  B     Dhrystones_Per_Second
    0x00406838  0x0008  B     End_Time
    0x00406840  0x0004  B     Int_Glob
    0x00406844  0x0004  B     Bool_Glob
    0x00406848  0x0008  B     User_Time
    0x00406850  0x0008  B     Next_Ptr_Glob
    0x00406860  0x00c8  B     Arr_1_Glob
    0x00406928  0x0001  B     Ch_1_Glob

If the order of the variables is changed in the source code, the order in the assembly and in the binary does NOT change with gcc 8.5. So the arrangement is already decided when the assembly is generated, and with -fcommon the variables are arranged according to some special logic.

3. If we change the source code by adding some int arrays between the variables, the performance with gcc 10.3 becomes similar to that with gcc 8.5. So it can be inferred that the way the variables are cached changes in this case, and that this has a large impact on the problem.

So the question is whether -fcommon has some intended performance optimization logic. If not, this may just be a random performance result. But the variable arrangement suggests that it follows some special logic.

Best regards,
Clark Zhao

From: 赵海峰 [mailto:zju@qq.com]
Sent: May 31, 2024 16:27
To: Zhaohaifeng(Clark,CIS-HCE)
Subject: Fw: Re: Is fcommon related with performance optimization logic?

---Original---
From: "Andrew Pinski" <pins...@gmail.com>
Date: Thu, May 30, 2024 10:27 AM
To: "赵海峰" <zju@qq.com>
Cc: "gcc" <gcc@gcc.gnu.org>
Subject: Re: Is fcommon related with performance optimization logic?

On Wed, May 29, 2024 at 7:13 PM 赵海峰 via Gcc wrote:
>
> Dear Sir/Madam,
>
> We found that running on intel SPR UnixBench compiled with gcc 10.3 performs
> worse than with gcc 8.5 for dhry2reg benchmark.
>
> I found it related with -fcommon option which is disabled in 10.3 by default.
> Fcommon will make global variables addresses in special order in bss
> section(watching by nm -n) whatever th
Re: Is fcommon related with performance optimization logic?
On 30/05/2024 04:26, Andrew Pinski via Gcc wrote:
> On Wed, May 29, 2024 at 7:13 PM 赵海峰 via Gcc wrote:
>>
>> Dear Sir/Madam,
>>
>> We found that running on intel SPR UnixBench compiled with gcc 10.3 performs
>> worse than with gcc 8.5 for dhry2reg benchmark.
>>
>> I found it related with -fcommon option which is disabled in 10.3 by
>> default. Fcommon will make global variables addresses in special order in
>> bss section(watching by nm -n) whatever they are defined in source code.
>>
>> We are wondering if fcommon has some special performance optimization
>> process?
>>
>> (I also post the subject to gcc-help. Hope to get some suggestion in this
>> mail list. Sorry for bothering.)
>
> This was already filed as
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114532 . But someone
> needs to go in and do more analysis of what is going wrong. The
> biggest difference for x86_64 is how the variables are laid out and by
> who (the compiler or the linker). There is some notion that
> -fno-common increases the number of L1-dcache-load-misses and that
> points to the layout of the variable differences causing the
> difference. But nobody has gone and seen which variables are laid out
> differently and why. I am suspecting that small changes in the
> code/variables would cause layout differences which will cause the
> cache misses which can cause the performance which is almost all by
> accident.
> I suspect adding -fdata-sections will cause another performance
> difference here too. And there is not much GCC can do about this since
> data layout is "hard" to do to get the best performance always.
>

(I am most familiar with embedded systems with static linking, rather than dealing with GOT and other aspects of linking on big systems.)

I think -fno-common should allow -fsection-anchors to do a much better job. If symbols are put in the common section, the compiler does not know their relative position until link time. But if they are in bss or data sections (with or without -fdata-sections), it can at least use anchors to access data in the translation unit that defines the data objects.

David

> Thanks,
> Andrew Pinski
>
>> Best regards.
>>
>> Clark Zhao
Re: Is fcommon related with performance optimization logic?
On Wed, May 29, 2024 at 7:13 PM 赵海峰 via Gcc wrote:
>
> Dear Sir/Madam,
>
> We found that running on intel SPR UnixBench compiled with gcc 10.3 performs
> worse than with gcc 8.5 for dhry2reg benchmark.
>
> I found it related with -fcommon option which is disabled in 10.3 by default.
> Fcommon will make global variables addresses in special order in bss
> section(watching by nm -n) whatever they are defined in source code.
>
> We are wondering if fcommon has some special performance optimization process?
>
> (I also post the subject to gcc-help. Hope to get some suggestion in this
> mail list. Sorry for bothering.)

This was already filed as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114532 . But someone needs to go in and do more analysis of what is going wrong.

The biggest difference for x86_64 is how the variables are laid out and by who (the compiler or the linker). There is some notion that -fno-common increases the number of L1-dcache-load-misses and that points to the layout of the variable differences causing the difference. But nobody has gone and seen which variables are laid out differently and why. I am suspecting that small changes in the code/variables would cause layout differences which will cause the cache misses which can cause the performance which is almost all by accident.

I suspect adding -fdata-sections will cause another performance difference here too. And there is not much GCC can do about this since data layout is "hard" to do to get the best performance always.

Thanks,
Andrew Pinski

> Best regards.
>
> Clark Zhao
Is fcommon related with performance optimization logic?
Dear Sir/Madam,

We found that, on Intel SPR, UnixBench compiled with gcc 10.3 performs worse than with gcc 8.5 on the dhry2reg benchmark.

I found that this is related to the -fcommon option, which is disabled by default in 10.3. With -fcommon, the global variables are placed in the bss section in a particular address order (as seen with nm -n), regardless of the order in which they are defined in the source code.

We are wondering whether -fcommon involves some special performance optimization.

(I have also posted this question to gcc-help. I hope to get some suggestions on this mailing list. Sorry for bothering you.)

Best regards,
Clark Zhao