Re: [libc-coord] Add new ABI '__memcmpeq()' to libc
Would it make sense to extend this proposal to include __strcmpeq() and
__strncmpeq()? Both are already available internally in GCC in the form of
BUILT_IN_STRCMP_EQ and BUILT_IN_STRNCMP_EQ (tree-ssa-strlen.c detects them
in handle_builtin_string_cmp() and builtins.c tries to inline them in
expand_builtin_memcmp()). However, they are currently restricted to cases
where the length of the string or the size of the array (of both arguments)
is known. A use case for strcmpeq() would be the comparison of
std::type_info objects (equality and inequality operators) in libstdc++.

On Tue, Sep 21, 2021 at 9:54 PM Noah Goldstein via Gcc wrote:
>
> On Fri, Sep 17, 2021 at 9:27 AM Florian Weimer via Libc-alpha <libc-al...@sourceware.org> wrote:
>
> > * Joseph Myers:
> >
> > > I was supposing a build-time decision (using GCC_GLIBC_VERSION_GTE_IFELSE
> > > to know if the glibc version on the target definitely has this
> > > function). But if we add a header declaration, you could check for
> > > __memcmpeq being declared (and so cover arbitrary C libraries, not just
> > > glibc, and avoid issues of needing to disable this logic for
> > > freestanding compilations, which would otherwise be an issue if a
> > > glibc-target toolchain is used for a freestanding kernel compilation).
> > > The case of people calling __builtin_memcmp (or declaring memcmp
> > > themselves) without string.h included probably isn't one it's
> > > important to optimize.
> >
> > The header-less case looks relevant to C++ and other language front
> > ends, though. So a GCC_GLIBC_VERSION_GTE_IFELSE check could still make
> > sense for them.
> >
> > (Dropping libc-coord.)
> >
> > Thanks,
> > Florian
>
> What are we going with?
> Should I go forward with the proposal in GLIBC?
> If I should go forward with it, should I include a def in string.h?
Re: Priority of builtins expansion strategies
On Tue, Jul 13, 2021 at 2:59 PM Richard Biener wrote:
>
> On Tue, Jul 13, 2021 at 2:19 PM Christoph Müllner via Gcc wrote:
> >
> > On Tue, Jul 13, 2021 at 2:11 AM Alexandre Oliva wrote:
> > >
> > > On Jul 12, 2021, Christoph Müllner wrote:
> > >
> > > > * Why does the generic by-pieces infrastructure have a higher priority
> > > > than the target-specific expansion via INSNs like setmem?
> > >
> > > by-pieces was not affected by the recent change, and IMHO it generally
> > > makes sense for it to have priority over setmem. It generates only
> > > straight-line code for constant-sized blocks. Even if you can beat that
> > > with some machine-specific logic, you'll probably end up generating
> > > equivalent code at least in some cases, and then, you probably want to
> > > carefully tune the settings that select one or the other, or disable
> > > by-pieces altogether.
> > >
> > > by-multiple-pieces, OTOH, is likely to be beaten by machine-specific
> > > looping constructs, if any are available, so setmem takes precedence.
> > >
> > > My testing involved bringing it ahead of the insns, to exercise the code
> > > more thoroughly even on x86*, but the submitted patch only used
> > > by-multiple-pieces as a fallback.
> >
> > Let me give you an example of what by-pieces does on RISC-V (RV64GC).
> > The following code...
> >
> > void* do_memset0_8 (void *p)
> > {
> >   return memset (p, 0, 8);
> > }
> >
> > void* do_memset0_15 (void *p)
> > {
> >   return memset (p, 0, 15);
> > }
> >
> > ...becomes (you can validate that with compiler explorer):
> >
> > do_memset0_8(void*):
> >         sb zero,0(a0)
> >         sb zero,1(a0)
> >         sb zero,2(a0)
> >         sb zero,3(a0)
> >         sb zero,4(a0)
> >         sb zero,5(a0)
> >         sb zero,6(a0)
> >         sb zero,7(a0)
> >         ret
> > do_memset0_15(void*):
> >         sb zero,0(a0)
> >         sb zero,1(a0)
> >         sb zero,2(a0)
> >         sb zero,3(a0)
> >         sb zero,4(a0)
> >         sb zero,5(a0)
> >         sb zero,6(a0)
> >         sb zero,7(a0)
> >         sb zero,8(a0)
> >         sb zero,9(a0)
> >         sb zero,10(a0)
> >         sb zero,11(a0)
> >         sb zero,12(a0)
> >         sb zero,13(a0)
> >         sb zero,14(a0)
> >         ret
> >
> > Here is what a setmemsi expansion in the backend can do (in case
> > unaligned access is cheap):
> >
> > 003c <do_memset0_8>:
> >   3c: 00053023  sd  zero,0(a0)
> >   40: 8082      ret
> >
> > 007e <do_memset0_15>:
> >   7e: 00053023  sd  zero,0(a0)
> >   82: 000533a3  sd  zero,7(a0)
> >   86: 8082      ret
> >
> > Is there a way to generate similar code with the by-pieces infrastructure?
>
> Sure - tell it unaligned access is cheap. See alignment_for_piecewise_move
> and how it uses slow_unaligned_access.

Thanks for the pointer. I already knew about slow_unaligned_access, but I
was not aware of overlap_op_by_pieces_p. Enabling both gives exactly the
same as above.

Thanks,
Christoph

> > > > * And if there are no particular reasons, would it be acceptable to
> > > > change the order?
> > >
> > > I suppose moving insns ahead of by-pieces might break careful tuning of
> > > multiple platforms, so I'd rather we did not make that change.
> >
> > Only platforms that have "setmemsi" implemented would be affected.
> > And those platforms (arm, frv, ft32, nds32, pa, rs6000, rx, visium)
> > have a carefully tuned implementation of the setmem expansion. I can't
> > imagine that these setmem expansions produce less optimal code than the
> > by-pieces infrastructure (which has less knowledge about the target).
> >
> > Thanks,
> > Christoph
Re: Priority of builtins expansion strategies
On Tue, Jul 13, 2021 at 2:11 AM Alexandre Oliva wrote:
>
> On Jul 12, 2021, Christoph Müllner wrote:
>
> > * Why does the generic by-pieces infrastructure have a higher priority
> > than the target-specific expansion via INSNs like setmem?
>
> by-pieces was not affected by the recent change, and IMHO it generally
> makes sense for it to have priority over setmem. It generates only
> straight-line code for constant-sized blocks. Even if you can beat that
> with some machine-specific logic, you'll probably end up generating
> equivalent code at least in some cases, and then, you probably want to
> carefully tune the settings that select one or the other, or disable
> by-pieces altogether.
>
> by-multiple-pieces, OTOH, is likely to be beaten by machine-specific
> looping constructs, if any are available, so setmem takes precedence.
>
> My testing involved bringing it ahead of the insns, to exercise the code
> more thoroughly even on x86*, but the submitted patch only used
> by-multiple-pieces as a fallback.

Let me give you an example of what by-pieces does on RISC-V (RV64GC).
The following code...

void* do_memset0_8 (void *p)
{
  return memset (p, 0, 8);
}

void* do_memset0_15 (void *p)
{
  return memset (p, 0, 15);
}

...becomes (you can validate that with compiler explorer):

do_memset0_8(void*):
        sb zero,0(a0)
        sb zero,1(a0)
        sb zero,2(a0)
        sb zero,3(a0)
        sb zero,4(a0)
        sb zero,5(a0)
        sb zero,6(a0)
        sb zero,7(a0)
        ret
do_memset0_15(void*):
        sb zero,0(a0)
        sb zero,1(a0)
        sb zero,2(a0)
        sb zero,3(a0)
        sb zero,4(a0)
        sb zero,5(a0)
        sb zero,6(a0)
        sb zero,7(a0)
        sb zero,8(a0)
        sb zero,9(a0)
        sb zero,10(a0)
        sb zero,11(a0)
        sb zero,12(a0)
        sb zero,13(a0)
        sb zero,14(a0)
        ret

Here is what a setmemsi expansion in the backend can do (in case
unaligned access is cheap):

003c <do_memset0_8>:
  3c: 00053023  sd  zero,0(a0)
  40: 8082      ret

007e <do_memset0_15>:
  7e: 00053023  sd  zero,0(a0)
  82: 000533a3  sd  zero,7(a0)
  86: 8082      ret

Is there a way to generate similar code with the by-pieces infrastructure?

> > * And if there are no particular reasons, would it be acceptable to
> > change the order?
>
> I suppose moving insns ahead of by-pieces might break careful tuning of
> multiple platforms, so I'd rather we did not make that change.

Only platforms that have "setmemsi" implemented would be affected.
And those platforms (arm, frv, ft32, nds32, pa, rs6000, rx, visium)
have a carefully tuned implementation of the setmem expansion. I can't
imagine that these setmem expansions produce less optimal code than the
by-pieces infrastructure (which has less knowledge about the target).

Thanks,
Christoph
Priority of builtins expansion strategies
Hi,

I'm working on some platform-specific optimizations for
memset/memcpy/strcpy/strncpy. However, I am having difficulties
understanding how my code should be integrated.

Initially, I got inspired by rs6000-string.c, where I see expansion code
for instructions like setmemsi or cmpstrsi. However, that expansion code
is not always called. Instead, the first strategy tried is the generic
by-pieces infrastructure.

To understand what I mean, let's have a look at memset
(expand_builtin_memset_args). The backend can provide a tailored code
sequence by expanding setmem. However, there is also a generic solution
available using the by-pieces infrastructure. The generic by-pieces
infrastructure has a higher priority than the target-specific setmem
expansion. However, the recently added by-multiple-pieces infrastructure
has a lower priority than setmem. See:
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/builtins.c;h=39ab139b7e1c06c98d2db1aef2b3a6095ffbec63;hb=HEAD#l7004

The same observation is true for most (all?) other uses of builtins.

The current priority requires me to duplicate the condition logic that
decides whether my optimization can be applied in the following places:

1) in TARGET_USE_BY_PIECES_INFRASTRUCTURE_P () to block by-pieces
2) in the setmem expansion to gate the optimization

As I would expect a target-specific mechanism to be preferred over a
generic one, my questions are:

* Why does the generic by-pieces infrastructure have a higher priority
  than the target-specific expansion via INSNs like setmem?
* And if there are no particular reasons, would it be acceptable to
  change the order?

Thanks,
Christoph
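For readers following along, the priority described above can be summarized as pseudocode. This is a simplification based on the cited builtins.c revision, not the literal GCC source; the function names (store_by_pieces, set_storage_via_setmem, try_store_by_multiple_pieces) are the ones the thread and the linked code refer to.

```
expand memset (dest, val, len):
  1. if targetm.use_by_pieces_infrastructure_p (len, align, SET_BY_PIECES, ...)
        emit straight-line stores via store_by_pieces
  2. else if the target's setmem<mode> pattern accepts the operands
        emit the target-specific expansion (set_storage_via_setmem)
  3. else if try_store_by_multiple_pieces succeeds
        emit a generic inline loop
  4. else
        emit a call to the libc memset
```

Step 1 outranking step 2 is exactly what forces the duplicated gating logic described in points 1) and 2) above.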
Re: LTO Dead Field Elimination
Hi Richard,

On 7/27/20 2:36 PM, Richard Biener wrote:
> On Fri, Jul 24, 2020 at 5:43 PM Erick Ochoa wrote:
>>
>> This patchset brings back struct reorg to GCC.
>>
>> We’ve been working on improving cache utilization recently and would
>> like to share our current implementation to receive some feedback on it.
>>
>> Essentially, we’ve implemented the following components:
>>
>> * Type-based escape analysis to determine if we can reorganize a type
>>   at link-time
>> * Dead-field elimination to remove unused fields of a struct at
>>   link-time
>>
>> The type-based escape analysis provides a list of types that are not
>> visible outside of the current linking unit (e.g. parameter types of
>> external functions).
>>
>> The dead-field elimination pass analyses non-escaping structs for fields
>> that are not used in the linking unit and thus can be removed. The
>> resulting struct has a smaller memory footprint, which allows for
>> higher cache utilization.
>>
>> As a side effect, some new infrastructure code has been written
>> (e.g. a type walker, which we were really missing in GCC), which can
>> of course be reused for other passes as well.
>>
>> We’ve prepared a patchset in the following branch:
>>
>>   refs/vendors/ARM/heads/arm-struct-reorg-wip
>
> Just had some time to peek into this. Ugh. The code doesn't look like
> GCC code looks :/ It doesn't help to have one set of files per C++ class
> (25!).

Any suggestions on how best to structure these? Are there coding
guidelines in the GCC project that can help us match the expectations?

> The code itself is undocumented - it's hard to understand what the purpose
> of all the Walker stuff is.
>
> You also didn't seem to know walk_tree () nor walk_gimple* ().

True, we were not aware of that code. Thanks for pointing it out. We
will have a look.
> Take as example - I figured to look for IPA pass entries, then I see
>
> +static void
> +collect_types ()
> +{
> +  GimpleTypeCollector collector;
> +  collector.walk ();
> +  collector.print_collected ();
> +  ptrset_t types = collector.get_pointer_set ();
> +  GimpleCaster caster (types);
> +  caster.walk ();
> +  if (flag_print_cast_analysis)
> +    caster.print_reasons ();
> +  ptrset_t casting = caster.get_sets ();
> +  fix_escaping_types_in_set (casting);
> +  GimpleAccesser accesser;
> +  accesser.walk ();
> +  if (flag_print_access_analysis)
> +    accesser.print_accesses ();
> +  record_field_map_t record_field_map = accesser.get_map ();
> +  TypeIncompleteEquality equality;
> +  bool has_fields_that_can_be_deleted = false;
> +  typedef std::set field_offsets_t;
>
> there's no comments (not even file-level) that explains how type escape
> is computed.
>
> Sorry, but this isn't even close to be coarsely reviewable.

Sad to hear. We'll work on the input that you provided and provide a new
version.

Thanks,
Christoph

>> We’ve also added a subsection to the GCC internals document to allow
>> other compiler devs to better understand our design and implementation.
>> A generated PDF can be found here:
>>
>> https://cloud.theobroma-systems.com/s/aWwxPiDJ3nCgc7F
>> https://cloud.theobroma-systems.com/s/aWwxPiDJ3nCgc7F/download
>>
>> page: 719
>>
>> We’ve been testing the pass against a range of in-tree tests and
>> real-life applications (e.g. all SPEC CPU2017 C benchmarks). For
>> testing, please see the testing subsection in the GCC internals
>> chapter we prepared.
>>
>> Currently we see the following limitations:
>>
>> * It is not a "true" IPA pass yet. That is, we can only succeed with
>>   -flto-partition=none.
>> * Currently it is not safe to use -fipa-sra.
>> * Brace constructors are not supported for now. We handle this gracefully.
>> * Only C as of now.
>> * Results of sizeof() and offsetof() are generated in the compiler
>>   frontend and thus can’t be changed later at link time. There are a
>>   couple of ideas to resolve this, but that’s currently unimplemented.
>>
>> At this point we’d like to thank the GCC community for their patient
>> help so far on the mailing list and in other channels. And we ask for
>> your support in terms of feedback, comments and testing.
>>
>> Thanks!