https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860
--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Sorry for the late response.
perf profile for povray with LTO:
Compiled with 82d6d385f97 (commit before a2f4be3dae0):
20.03% pov::All_CSG_Intersect_Intersections
16.42% pov::All_Plane_Intersections
10.29% pov::All_Sphere_Intersections
10.10% pov::Intersect_BBox_Tree
Compiled with a2f4be3dae0:
19.51% pov::All_CSG_Intersect_Intersections
16.91% pov::All_Plane_Intersections
12.53% pov::All_Sphere_Intersections
9.81% pov::Intersect_BBox_Tree
I verified there are no code-gen differences for any of the above hot
functions.
Running size on povray_r_exe.out shows a slight code-size decrease of 344 bytes
for the text section:
Compiled with 82d6d385f97: 1101505
Compiled with a2f4be3dae0: 1101161
Curiously, there’s a meaningful difference for pov::All_Sphere_Intersections
(10.29% vs 12.53%), which seems to be caused by the following adrp instruction
(with no code-gen changes in All_Sphere_Intersections):
Compiled with 82d6d385f97:
18.07 │4aec44: adrp x0, 4e0000 <pov::SetCommandOption(POVMSData*, unsigned int, pov::shelldata*) [clone .isra.0]+0x1c0>
1.77 │4aec48: ldr d28, [x0, #2784]
Compiled with a2f4be3dae0:
28.93 │4aeae4: adrp x0, 4e0000 <pov::Warning(unsigned int, char const*, ...) [clone .constprop.0]+0x100>
1.27 │4aeae8: ldr d28, [x0, #2432]
This seems to come from the following condition in Intersect_Sphere (which gets
inlined into All_Sphere_Intersections):
if ((OCSquared >= Radius2) && (t_Closest_Approach < EPSILON))
As far as I can see, there’s no difference between the two adrp instructions
except their address (4aec44 vs 4aeae4). And as far as I know, adrp only
calculates a pc-relative page address (and does not load any data). To check
for possible icache misses I used the L1I_CACHE_REFILL counter, and it turns
out that there are 64% more L1 icache misses attributed to the above adrp
instruction with a2f4be3dae0 compared to 82d6d385f97, which may (partially)
explain the performance difference?
However, perf stat shows around 7% more L1 icache misses for the whole program
run with 82d6d385f97 compared to a2f4be3dae0, i.e. the opposite direction.
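To make the "only the address differs" point concrete, here is a minimal sketch of what adrp computes for the two addresses quoted above. The helper names (adrp_result, line_offset, page_delta) are hypothetical. The page delta of 50 is derived from the disassembly target (4e0000), not read from the instruction encoding, and the 64-byte cache line size is an assumption. It only shows that both instructions produce the same value while sitting at different offsets within a cache line. It does not explain the difference in misses.

```cpp
#include <cstdint>

// Architectural result of `adrp xd, label`: the 4 KiB page containing the
// instruction, plus a signed number of pages taken from the immediate.
// No memory is accessed.
constexpr std::uint64_t adrp_result(std::uint64_t pc, std::int64_t imm_pages) {
    return (pc & ~std::uint64_t{0xFFF}) +
           static_cast<std::uint64_t>(imm_pages * 4096);
}

// Byte offset of an address within its cache line.
// ASSUMPTION: 64-byte lines; adjust `line` if the target differs.
constexpr std::uint64_t line_offset(std::uint64_t addr, std::uint64_t line = 64) {
    return addr % line;
}

// Instruction addresses taken from the two perf annotate listings.
constexpr std::uint64_t pc_before = 0x4aec44;  // 82d6d385f97
constexpr std::uint64_t pc_after  = 0x4aeae4;  // a2f4be3dae0

// Derived, not decoded: (0x4e0000 - 0x4ae000) / 4096 = 50 pages.
constexpr std::int64_t page_delta = 50;

// Both instructions live in page 0x4ae000, so both yield 0x4e0000.
static_assert(adrp_result(pc_before, page_delta) == 0x4e0000, "same target page");
static_assert(adrp_result(pc_after,  page_delta) == 0x4e0000, "same target page");

// They differ only in placement: offset 0x04 vs 0x24 within a 64-byte line.
static_assert(line_offset(pc_before) == 0x04, "offset in line, before");
static_assert(line_offset(pc_after)  == 0x24, "offset in line, after");
```

The checks are static_asserts, so compiling the file (e.g. as part of any translation unit) is enough to confirm them.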
I could (repeatedly) reproduce the issue on two neoverse-v2 machines.
The full command line passed to the compiler was:
"-O3 -Wl,-z,muldefs -lm -fallow-argument-mismatch -fpermissive -fstack-arrays
-flto -march=native -mcpu=neoverse-v2"
Thanks,
Prathamesh