[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #13 from Vladimir Makarov --- (In reply to Tamar Christina from comment #12) > (In reply to Vladimir Makarov from comment #11) > > I just expressed my point of view to the bottom-up approach. If somebody > > implements any new RA approach which at least does not hurt credible > > benchmarks (e.g. SPEC) and improve some benchmarks and does not complicate > > existing RA too much, nobody will have legitimate arguments not to include > > the new code into GCC. > > > > I think that may be for some cases bottom-up approach could work better. > > Probably this is code for number crunching (with a lot of loop iterations). > > For some cases top-down approach works better for loops with smaller number > > iterations (e.g. most loops in GCC itself). > > > > I don't know much about the current RA implementation, but would it be > possible you think to have this be heuristics driven? or do you think it > would be too expensive to try multiple strategies and keep the one that > works best? Probably trying multiple strategies will be too expensive. Also compile-time evaluation of allocation cost is not that accurate, especially taking into account that the local register allocator can change global RA decisions and the changes are unpredictable. If the top-down algorithm works better for some cases, we could get some experience when it works and try to find heuristics to choose the right algorithm. I think it would be more realistic approach. It sounds like a case for ML but I am sure ML will not work. It is not a smooth function, besides ML create a lot of problems with GCC building (benchmark set, costly building, valdiation and using ML model etc). So heuristics built on understanding is a way to go. But I speculated too much because there is no bottom-up RA implementation with finding its advantages (if there are any) and its disadvantages.
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #12 from Tamar Christina --- (In reply to Vladimir Makarov from comment #11) > (In reply to Tamar Christina from comment #10) > > Hi Vlad, > > > > Just curious if you had a chance to think about an approach to this that > > would be acceptable. > > Sorry for not working on this issue more although I thought about the > problem for some time w/o finding any possible small code changes could > solve the problem. > Ah, hmm that's too bad but thanks for looking into it! > I just expressed my point of view to the bottom-up approach. If somebody > implements any new RA approach which at least does not hurt credible > benchmarks (e.g. SPEC) and improve some benchmarks and does not complicate > existing RA too much, nobody will have legitimate arguments not to include > the new code into GCC. > > I think that may be for some cases bottom-up approach could work better. > Probably this is code for number crunching (with a lot of loop iterations). > For some cases top-down approach works better for loops with smaller number > iterations (e.g. most loops in GCC itself). > I don't know much about the current RA implementation, but would it be possible you think to have this be heuristics driven? or do you think it would be too expensive to try multiple strategies and keep the one that works best?
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #11 from Vladimir Makarov --- (In reply to Tamar Christina from comment #10) > Hi Vlad, > > Just curious if you had a chance to think about an approach to this that > would be acceptable. Sorry for not working on this issue more although I thought about the problem for some time w/o finding any possible small code changes could solve the problem. I just expressed my point of view to the bottom-up approach. If somebody implements any new RA approach which at least does not hurt credible benchmarks (e.g. SPEC) and improve some benchmarks and does not complicate existing RA too much, nobody will have legitimate arguments not to include the new code into GCC. I think that may be for some cases bottom-up approach could work better. Probably this is code for number crunching (with a lot of loop iterations). For some cases top-down approach works better for loops with smaller number iterations (e.g. most loops in GCC itself). Simply, I tried already bottom-up approach and don't want to work on this again because if it does not work I will have to throw this work off again. I am mostly in maintenance mode for GCC RA, don't want to work on big RA changes. This work would be also very hard to sell to my management. If somebody implements bottom-up RA as an optional (or default if it is better on SPEC) algorithm and this improves some benchmarks, I will not object. The work I think would take several months for a person familiar with RA.
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 Tamar Christina changed: What|Removed |Added CC||tnfchris at gcc dot gnu.org --- Comment #10 from Tamar Christina --- Hi Vlad, Just curious if you had a chance to think about an approach to this that would be acceptable.
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #9 from Feng Xue --- (In reply to Feng Xue from comment #5) > > I would say that top-down algorithm behaves better than bottom-up one. I > > implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years > > ago and it behaved worse than the current one. I think adding an additional > Intuitively, it's better to give preference to regions with higher frequency > to let them pick up registers earlier and with more freedom, which matches > with bottom-up coloring strategy. > > > pass over regions probably would solve the problem but it will complicate > > already complicated RA (which is about 60K lines now). In any case I'll > > think about this problem. > > > > The problem you are mentioning in this code is quite insignificant (one > > memory access in the top most region). > But it will be significant inside multi-level loop. Actually, the problem is > exposed by a 8-level loop in a real Fortran application. > > > > > I also checked the attachments. What I see is GCC generates a better big > > loop body (a loop with high register pressure). GCC generated loop body has > > 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for > > LLVM. > As far as how to spilling live range at loop boundary is concerned, gcc is > not very efficient. To make loop body not be the focus, when we remove two > live-range variables in original case, we can get same amount of memory > accesses for gcc and llvm, both of which generates 9 register spills in loop > body. I attached modified case and assembly files. For the modified case, allocation of llvm and gcc remain nearly the same for loop body, but as to 'lv0', llvm gets better spill/split result over gcc.
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #8 from Feng Xue --- Created attachment 46294 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46294=edit asm file 2 generated by llvm
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #7 from Feng Xue --- Created attachment 46293 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46293=edit asm file 2 generated by gcc
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #6 from Feng Xue --- Created attachment 46292 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46292=edit test case with less live ranges
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #5 from Feng Xue --- > I would say that top-down algorithm behaves better than bottom-up one. I > implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years > ago and it behaved worse than the current one. I think adding an additional Intuitively, it's better to give preference to regions with higher frequency to let them pick up registers earlier and with more freedom, which matches with bottom-up coloring strategy. > pass over regions probably would solve the problem but it will complicate > already complicated RA (which is about 60K lines now). In any case I'll > think about this problem. > > The problem you are mentioning in this code is quite insignificant (one > memory access in the top most region). But it will be significant inside multi-level loop. Actually, the problem is exposed by a 8-level loop in a real Fortran application. > > I also checked the attachments. What I see is GCC generates a better big > loop body (a loop with high register pressure). GCC generated loop body has > 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for LLVM. As far as how to spilling live range at loop boundary is concerned, gcc is not very efficient. To make loop body not be the focus, when we remove two live-range variables in original case, we can get same amount of memory accesses for gcc and llvm, both of which generates 9 register spills in loop body. I attached modified case and assembly files.
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #4 from Vladimir Makarov --- (In reply to Feng Xue from comment #0) > Current regional RA uses a top-down allocation order, which may not properly > split a long live range that crosses sub-region with high register pressure. > > In the following graph, lv0 is live in whole outer region, and suppose inner > region is under high register pressure due to lots of live ranges inside it. > > According to RA algorithm, out region is processed firstly, lv0 is picked up > as spill candidate. And then turn to inner region, also the part of lv0 in > inner region is marked as being spilled. Finally result is that the whole > lv0 should be spilled. But if in area excluding inner region, there is with > low register pressure, we can get a better choice to place lv0 in register > instead of memory, and only spill/reload lv0 at boundary of > entry-into(A)/exist-from(B) inner region. In other word, inner region > boundary are split points for lv0. > > >| >outer | lv0 >region | __ split point >|/ > .--A---. > | | | > | | | | | > | inner | | lv1 | | > | region | | | | > | | | lv2 | | > | | | | | > | | | > '--B---' >|\__ >|split point >| > > Here is an example to show this. gcc produces bad spills as we point > out(-m64 -O3, for x86-64), but llvm generates better spill/reload as we > expect. > Thank you for your detailed report and proposed solutions. I would say that top-down algorithm behaves better than bottom-up one. I implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years ago and it behaved worse than the current one. I think adding an additional pass over regions probably would solve the problem but it will complicate already complicated RA (which is about 60K lines now). In any case I'll think about this problem. The problem you are mentioning in this code is quite insignificant (one memory access in the top most region). I also checked the attachments. What I see is GCC generates a better big loop body (a loop with high register pressure). GCC generated loop body has 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for LLVM. > int value[20]; > > int user0; > int user1; > int user2[100]; > int user3; > > int fncall(void); > > void foo(int cond) > { > int lv_out = value[0]; > int i; > > user0 = lv_out; /* Better to place lv_out in register. */ > > if (cond) { > int sum = 0; > int lv_in_1 = value[1]; > int lv_in_2 = value[2]; > int lv_in_3 = value[3]; > int lv_in_4 = value[4]; > int lv_in_5 = value[5]; > int lv_in_6 = value[6]; > int lv_in_7 = value[7]; > int lv_in_8 = value[8]; > int lv_in_9 = value[9]; > int lv_in_10 = value[10]; > int lv_in_11 = value[11]; > int lv_in_12 = value[12]; > int lv_in_13 = value[13]; > int lv_in_14 = value[14]; > int lv_in_15 = value[15]; > > /* Better to spill lv_out here */ > > for (i = 0; i < 1000; i++) { > sum += lv_in_1; > sum += lv_in_2; > sum += lv_in_3; > sum += lv_in_4; > sum += lv_in_5; > sum += lv_in_6; > sum += lv_in_7; > sum += lv_in_8; > sum += lv_in_9; > sum += lv_in_10; > sum += lv_in_11; > sum += lv_in_12; > sum += lv_in_13; > sum += lv_in_14; > sum += lv_in_15; > > fncall(); > > lv_in_1 ^= i; > lv_in_2 ^= i; > lv_in_3 ^= i; > lv_in_4 ^= i; > lv_in_5 ^= i; > lv_in_6 ^= i; > lv_in_7 ^= i; > lv_in_8 ^= i; > lv_in_9 ^= i; > lv_in_10 ^= i; > lv_in_11 ^= i; > lv_in_12 ^= i; > lv_in_13 ^= i; > lv_in_14 ^= i; > lv_in_15 ^= i; > } > > /* Better to reload lv_out here */ > > user1 = sum; > } > > for (i = 0; i < 100; i++) { > user2[i ^ 100] = lv_out; /* Better to place lv_out in register */ > } > > user3 = lv_out; /* Better to place lv_out in register */ > } > > > For top-down allocation, we can only adjust inner region allocation result, > but no way to refine decision that has been made on outside live-range, it > is an intrinsic weakness of the top-down algorithm. To fix that, we may need > to add a new
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #3 from Feng Xue --- Created attachment 46237 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46237=edit test case for aarch64 Add another case composed for aarch64.
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #2 from Feng Xue --- Created attachment 46219 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46219=edit asm file generated by llvm
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #1 from Feng Xue --- Created attachment 46218 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46218=edit asm file generated by gcc