Re: GCC Optimisation, Part 0: Introduction
On Fri, 29 Apr 2011, Laurynas Biveinis wrote:
>> Thanks Diego, please send me the form. I'll sign it as soon as my contributions require it.
>
> Don't wait; sign it right away - it might take a while to process it.

Hey all, I should have told you: I already sent the email, so I'm now waiting for the papers. Thank you for your interest :-)

Dimitris
Re: GCC Optimisation, Part 0: Introduction
On 04/27/2011 03:28 PM, Yuan Pengfei wrote:
>> Any other advice will be appreciated.
>
> I think you can look into llvm-clang. It compiles faster and uses much less memory than gcc.

It is also a completely different compiler. It doesn't make sense to compare the two, unless Dimitrios wants to rewrite GCC in 3 months. Joe's suggestion (take a look at the speedup areas page and investigate which bullets can still be a priority) and talking to Nicola, to help him with benchmarks and get up to speed with the code, are the best in this thread.

Regarding bitmaps, I have patches to use vector instructions. The compile-time difference was basically a wash, but I can give them to you if you are interested. Alternatively, the ebitmap.[ch] files in GCC are totally unused, and better in theory. The main hurdle is that ebitmaps are not growable. Still, they should be usable in many cases (or they can be made growable! :).

A nice exercise to get up to speed with the code is to make GCC record a set of bitmap operations from a real compilation, and then have a driver that uses that set of operations for benchmarking. This is the only sane way to extract the real performance of the data structure. It doesn't avoid the need for validation in the compiler, of course, where other effects can come into play (caching etc.), but it is a useful start.

From the list on the web page, the following items should be easier than the others:

* Split simulate_stmt into simulate_stmt and simulate_phi, because the caller of simulate_stmt knows whether it's got a statement or a PHI node.
* Get rid of EXPR_LIST and INSN_LIST.
* cxx_binding should be 16 bytes, not 20.
* Eliminate binfos for POD structs and other no-inheritance classes (not so easy).
* Put the string at the end of the IDENTIFIER_NODE using the trailing array hack (or possibly in the ht_identifier; see libcpp/include/symtab.h and libcpp/symtab.c).

Paolo
Re: GCC Optimisation, Part 0: Introduction
On Fri, Apr 29, 2011 at 9:18 AM, Paolo Bonzini bonz...@gnu.org wrote:
> Regarding bitmaps, I have patches to use vector instructions. The compile-time difference was basically a wash, but I can give them to you if you are interested. Alternatively, the ebitmap.[ch] files in GCC are totally unused, and better in theory. The main hurdle is that ebitmaps are not growable. Still, they should be usable in many cases (or they can be made growable! :).

If they are not growable, then they are a candidate to replace sbitmaps (or bitmaps, where sbitmaps are not used because the bitmap will be only sparsely populated). As usual, most of the time improving the container implementation isn't the best kind of optimization. Instead, exchanging the container type used, or the algorithms, will make more of a difference.

> A nice exercise to get up to speed with the code is to make GCC record a set of bitmap operations from a real compilation, and then have a driver that uses that set of operations for benchmarking. This is the only sane way to extract the real performance of the data structure. It doesn't avoid the need for validation in the compiler, of course, where other effects can come into play (caching etc.), but it is a useful start.
>
> From the list on the web page, the following items should be easier than the others:
>
> * Split simulate_stmt into simulate_stmt and simulate_phi, because the caller of simulate_stmt knows whether it's got a statement or a PHI node.
> * Get rid of EXPR_LIST and INSN_LIST.
> * cxx_binding should be 16 bytes, not 20.
> * Eliminate binfos for POD structs and other no-inheritance classes (not so easy).
> * Put the string at the end of the IDENTIFIER_NODE using the trailing array hack (or possibly in the ht_identifier; see libcpp/include/symtab.h and libcpp/symtab.c)

Not so easy, because C++ uses the TREE_TYPE of it and thus you'd lose the nice sharing of identifiers.

* Embed gimple_seq_node in gimple itself. This should simplify some iterators, avoid pointer indirection and eventually be more cache-friendly. No idea why it wasn't done this way from the start.

Richard.
Re: GCC Optimisation, Part 0: Introduction
Thank you all for your ideas; they are much appreciated. I will certainly investigate the areas you mentioned, so do keep the feedback coming. I will comment on them once I have a better understanding. I'd also like to get in sync with existing work, so that duplicate effort is avoided (for example, I'm already researching an alternative hash function to replace htab_hash_string(); I don't know if Nicola has done that).

For now I will be reading some tutorial slides I found from the HiPEAC and CGO conferences, so that I understand GCC's architecture. Slides are good but miss many things, so if you have video/audio transcripts of tutorials they would come in handy.

It is also a priority to standardise the way of measuring GCC's speed (which files to compile, with what options), and to dedicate some old machinery to that purpose. Perhaps this has already been done, and I can use existing setups? I found the following very useful pages:

http://gcc.opensuse.org/
http://gcc.gnu.org/wiki/PerformanceTesting

On another note, I can see that a similar project was accepted for the 2007 GSoC, titled "Speeding up GCC for fun and profit". Where can I find more info?

Finally, I'd appreciate any guidelines regarding submitting patches for review, or development in general. For example: is diff -u the proper format to use? Should I include the ChangeLog diff in all patches? Should they all be sent to the gcc-patches list even when the patch is an initial version, just for review? Etc.

Thanks,
Dimitris
Re: GCC Optimisation, Part 0: Introduction
On Fri, Apr 29, 2011 at 09:24, Dimitrios Apostolou ji...@gmx.net wrote:
> But for now I will be reading some tutorial slides I found from HiPEAC and CGO conferences, so that I understand GCC's architecture. Slides are good but are missing many things, so if you have video/audio transcripts of tutorials they would come in handy.

I don't think there are any such transcripts, sorry. I remember one year they had video going for the GCC summit, but I don't think it was ever released.

> It is also a priority to standardise the way of measuring gcc's speed (which files to compile with what options), and dedicate some old machinery to that purpose. Perhaps this has already been done, and I can use existing setups? I found the following very useful pages:
> http://gcc.opensuse.org/
> http://gcc.gnu.org/wiki/PerformanceTesting

Both are useful. My recommendation is to decide on a set of programs that you will use as your build-timing benchmark and operate on those. A good source of code may be a big project like KDE or OpenOffice.

> On another note, I can see that a similar project was accepted for the 2007 GSoC, titled "Speeding up GCC for fun and profit". Where can I find more info?

I don't know what came of that, actually. Eric?

> Finally I'd appreciate any guidelines regarding submitting patches for review or development in general. For example is diff -u the proper format to use, or should I include the ChangeLog diff in all patches, or should they all be sent to the gcc-patches list even if the patch is an initial version just for reviewing, etc.

You should follow the guidelines in http://gcc.gnu.org/contribute.html. The very first one is a copyright assignment. Have you started that process? If not, I will send you the form.

Diego.
Re: GCC Optimisation, Part 0: Introduction
>>>>> "Paolo" == Paolo Bonzini bonz...@gnu.org writes:

Paolo> * Put the string at the end of the IDENTIFIER_NODE using the trailing
Paolo> array hack (or possibly in the ht_identifier, see
Paolo> libcpp/include/symtab.h and libcpp/symtab.c)

I implemented this once:

http://gcc.gnu.org/ml/gcc-patches/2008-03/msg01293.html

It did not go in because a different approach (the one in the code now) was deemed clearer.

Tom
Re: GCC Optimisation, Part 0: Introduction
On Fri, 29 Apr 2011, Diego Novillo wrote:
> The very first one is a copyright assignment. Have you started that process? If not, I will send you the form.

Thanks Diego, please send me the form. I'll sign it as soon as my contributions require it.

Dimitris
Re: GCC Optimisation, Part 0: Introduction
On Fri, Apr 29, 2011 at 09:18:56AM +0200, Paolo Bonzini wrote:
> * Get rid of EXPR_LIST and INSN_LIST

This is reasonably difficult, though particular subprojects may be easy enough. Notable uses of EXPR_LIST:

- loop-iv.c
- the interface to TARGET_FUNCTION_VALUE
- the scheduler
- REG_NOTES
- var-tracking.c
- reload

Notable uses of INSN_LIST:

- the scheduler
- reload
- gcse.c

The biggest uses of each in the scheduler ought to be easy to deal with, but the scheduler manipulates the lists in peculiar ways.

> * cxx_binding should be 16 bytes, not 20.

Not your fault, but comments like this on SpeedupAreas are so opaque as to be useless. *Why* should cxx_binding be 16 bytes? Should we take the next member out and have a VEC someplace instead of chaining? Are we duplicating information in the members themselves? Etc.

-Nathan
Re: GCC Optimisation, Part 0: Introduction
On 04/29/2011 04:15 PM, Nathan Froyd wrote:
>> * cxx_binding should be 16 bytes, not 20.
>
> Not your fault, but comments like this on SpeedupAreas are so opaque as to be useless. *Why* should cxx_binding be 16 bytes? Should we take the next member out and have a VEC someplace instead of chaining? Are we duplicating information in the members themselves? Etc.

Sorry, you're right. It's about cache lines, I guess, and moving the bitfields into one of the pointers.

Paolo
Re: GCC Optimisation, Part 0: Introduction
On Fri, Apr 29, 2011 at 04:20:15PM +0200, Paolo Bonzini wrote:
> Sorry, you're right. It's about cache lines, I guess, and moving the bitfields into one of the pointers.

Gross. :)

-Nathan
Re: GCC Optimisation, Part 0: Introduction
> Thanks Diego, please send me the form. I'll sign it as soon as my contributions require it.

Don't wait; sign it right away - it might take a while to process it.

--
Laurynas
Re: GCC Optimisation, Part 0: Introduction
On Wed, Apr 27, 2011 at 10:29 PM, Basile Starynkevitch bas...@starynkevitch.net wrote:
>> Here are some areas I'll look closer at, as shown by some early profiling I performed:
>> * hash tables (both htab and symtab)
>
> There is probably a lot of tuning possible around GCC hash tables. However, I would imagine that tuned performance might depend on the particular hardware running the so-modified GCC. And C++ is becoming an acceptable programming language inside GCC, so replacing the current C hashtables with some fancier C++ collections might be considered (but I have no idea if this would be a performance win, perhaps not! It could be a readability win). Perhaps some hashtables are used to associate data with important structures (like gimple) which are difficult to change. If you have strong measurements that adding the information inside gimple instead of keeping a hashtable for it is a win, you might be able to remove some hashtables, but this is difficult, since GCC's essential data types have been very hand-optimized.

It seems to me that a hash table in GCC is more like a mapping (or dictionary, or associative array, or function Key -> Value) than a container. I think the main problem with a hash table is that the collision rate is unpredictable, so the lookup time is unpredictable. In the worst case, you will run the equality method on all the elements to find a slot.

Perhaps using a B+ tree to implement the mapping may be an alternative. With a hash table, you must implement hash and equality methods. With a B+ tree, you need only implement a compare method, and the time complexity of lookup is O(log(n)), which is much better.

I don't think the C++ STL map is necessarily better than a hash table or a B+ tree. For every (Key, Value) pair, it will instantiate a new class, rendering the resulting code size unnecessarily big.

>> * ggc_internal_alloc_stat() or maybe implementing proper memory management instead of garbage collection, for the hottest callers
>
> My opinion is that a garbage collector is (or can become) a proper memory manager. If you really want to switch back to manual memory management, please be sure to make changes which will be easy to revert. I believe that the current GCC garbage collector could be improved (but that is very hard work!)

Agree.

--
Chiheng Xu
Wuhan, China
Re: GCC Optimisation, Part 0: Introduction
On Thu, Apr 28, 2011 at 2:52 PM, Chiheng Xu chiheng...@gmail.com wrote:
> It seems to me that a hash table in GCC is more like a mapping (or dictionary, or associative array, or function Key -> Value) than a container. I think the main problem with a hash table is that the collision rate is unpredictable, so the lookup time is unpredictable. In the worst case, you will run the equality method on all the elements to find a slot.
>
> Perhaps using a B+ tree to implement the mapping may be an alternative. With a hash table, you must implement hash and equality methods. With a B+ tree, you need only implement a compare method, and the time complexity of lookup is O(log(n)), which is much better.

Well, with a good hash function your average lookup complexity is O(1), which is much better.

Richard.
Re: GCC Optimisation, Part 0: Introduction
On 4/28/2011 8:55 AM, Richard Guenther wrote:
>> With a B+ tree, the time complexity of lookup is O(log(n)), which is much better.
>
> Well, with a good hash function your average lookup complexity is O(1), which is much better.

I think the hash table is a much better choice than the B+ tree. You really are much more interested in average-case performance in a compiler than in worst-case performance, especially when the worst case will not happen in practice.
Re: GCC Optimisation, Part 0: Introduction
On Thu, Apr 28, 2011 at 9:13 PM, Robert Dewar de...@adacore.com wrote:
> I think the hash table is a much better choice than the B+ tree. You really are much more interested in average-case performance in a compiler than in worst-case performance, especially when the worst case will not happen in practice.

Basically, I agree with you. A hash table is better in most cases, but not always better, especially on an ultra-large key set.

O(1) is better than O(log(n)), but not much better. If you have 1 million elements, you need at most 20 comparisons. A comparison may be much cheaper than hashing, so 20 comparisons may be faster than one hashing.

--
Chiheng Xu
Wuhan, China
Re: GCC Optimisation, Part 0: Introduction
On Fri, Apr 29, 2011 at 8:07 AM, Yuan Pengfei cool...@qq.com wrote:
> B+ trees are more commonly used in file systems. In memory, I think the RB-tree is better. RB-tree vs. hash table is just like map vs. unordered_map.

Any balanced tree with O(log(n)) lookup complexity will do, including the splay tree.

--
Chiheng Xu
Wuhan, China
Re: GCC Optimisation, Part 0: Introduction
2011/4/27 Dimitrios Apostolou ji...@gmx.net:
> * ggc_internal_alloc_stat() or maybe implementing proper memory management instead of garbage collection, for the hottest callers

This one can easily take much more time than three months. I've been working in this area; right now I'm working on allocating RTL outside the GC. There are some things one can do in three months which will not necessarily result in speedups by themselves, but will enable future speedups:

- Add an object type tag to GC-allocated objects;
- Introduce an explicit marking stack instead of the current implicit execution-based stack.

Having these two, it should be possible to replace the GC implementation with something more advanced, for example an incremental GC.

--
Laurynas