[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #25 from Adam Warner --- Documenting a workaround I've found for the unnecessary sign extension. I'm still perplexed at the improbability of this appearing to work! workaround_bsr_sign_extension.c: #include uint64_t bsr_u64(uint64_t a) { if (sizeof(unsigned long) == 8) return 63 - __builtin_clzl(a); if (sizeof(unsigned long long) == 8) return 63 - __builtin_clzll(a); } uint64_t bsr_u64_alt(uint64_t a) { if (sizeof(unsigned long) == 8) return UINT64_C(63) - __builtin_clzl(a); if (sizeof(unsigned long long) == 8) return UINT64_C(63) - __builtin_clzll(a); } int main(void) {return 0;} $ gcc -O3 workaround_bsr_sign_extension.c && objdump -d -m i386:x86-64:intel a.out|less Relevant output: 1140 : 1140: 48 0f bd c7 bsrrax,rdi 1144: 48 98 cdqe 1146: c3 ret 1147: 66 0f 1f 84 00 00 00nopWORD PTR [rax+rax*1+0x0] 114e: 00 00 1150 : 1150: 48 0f bd c7 bsrrax,rdi 1154: c3 ret In the alternative implementation the superfluous 32- to 64-bit sign extension instruction CDQE is no longer generated.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Andrew Pinski changed: What|Removed |Added CC||gpiez at web dot de --- Comment #24 from Andrew Pinski --- *** Bug 50168 has been marked as a duplicate of this bug. ***
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #23 from Alexander Monakov --- In libcpp, search_line_fast implementations suffer from this, and what is worse, attempts to workaround it by explicitly requesting zero extension don't work: char *foo(char *p, int x) { return p + (unsigned)__builtin_ctz(x); } The above code is deliberately asking for zero extension, and yet various optimizations in GCC transform it back to costlier form with sign extension. (FWIW, LLVM gets this right)
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #22 from Steinar H. Gunderson --- Yes; I had a project recently where I won 15% overall performance (!) by using an asm statement with bsf instead of ffsll(). (This skipped both the sign extension and the check for zero; my argument could never be zero, due to a loop condition, but GCC didn't manage to remove the check nevertheless.)
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Dávid Bolvanský changed: What|Removed |Added CC||david.bolvansky at gmail dot com --- Comment #21 from Dávid Bolvanský --- long long foo (int x) { return __builtin_ctz (x); } still produces foo(int): xor eax, eax tzcnt eax, edi cdqe ret instead of foo(int):# @foo(int) tzcnt eax, edi ret
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #20 from Jakub Jelinek jakub at gcc dot gnu.org --- Author: jakub Date: Sat Jul 6 09:34:17 2013 New Revision: 200731 URL: http://gcc.gnu.org/viewcvs?rev=200731root=gccview=rev Log: PR target/29776 * fold-const.c (tree_call_nonnegative_warnv_p): Return true for BUILT_IN_C{LZ,LRSB}*. * tree.h (CASE_INT_FN): Add FN##IMAX case. * tree-vrp.c (extract_range_basic): Handle BUILT_IN_{FFS,PARITY,POPCOUNT,C{LZ,TZ,LRSB}}*. For BUILT_IN_CONSTANT_P if argument isn't (D) of PARM_DECL, fall thru to code calling set_value*. * builtins.c (expand_builtin): Remove *IMAX cases. (fold_builtin_bitop): For BUILT_IN_CLRSB* return NULL_TREE if width is bigger than 2*HWI. * libgcc2.c (__floattisf): Avoid undefined signed overflow. * gcc.dg/tree-ssa/vrp89.c: New test. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/vrp89.c Modified: trunk/gcc/ChangeLog trunk/gcc/builtins.c trunk/gcc/fold-const.c trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-vrp.c trunk/gcc/tree.h trunk/libgcc/ChangeLog trunk/libgcc/libgcc2.c
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #15 from Ondrej Bilka neleai at seznam dot cz --- On Thu, Jul 04, 2013 at 07:46:07PM +, glisse at gcc dot gnu.org wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #14 from Marc Glisse glisse at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #12) Not all CPUs that have defined behavior for 0 define it to the precision though, and even on i?86 it is undefined even when using lzcnt/tzcnt on older CPUs. Even the libgcc routines provide various return values for 0 depending on target (look for COUNT_LEADING_ZEROS_0). So it is not an option to redefine the builtins, they are undefined behavior for 0 and that can't change. I didn't mean to document the value for 0 or have other optimizations create calls to this builtin relying on this property. I just meant that if we already have a call to clz and the argument happens to be 0 (which is undefined), we could optimize based on the assumption that the result is the precision, that may break less code than your current patch. Otherwise we might as well assert (well, tell it to VRP) that the argument is non-zero. But you are probably right that trying to work around the undefined value for 0 is wrong, another builtin should be used instead. Wait, when user triggers undefined behaviour anything is possible. So we can change return value and be still correct, rigth?
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #16 from Jakub Jelinek jakub at gcc dot gnu.org --- Defining the value at this point would mean that we'd need to generate slower/larger code on various targets (including i386/x86_64 when not targetting the really most recent CPUs only) and possibly create new libgcc entrypoints, at least for targets where the current library routine doesn't return the value that would be newly defined. That would slow down users that know they aren't passing 0 to the builtins (all correct ones). If you want builtins with defined behavior for 0, the only way is to add new ones, but the question is what the return value should be for 0. Or __builtin_ffs can be used, but that is 0 for 0 and otherwise 1 + __builtin_ctz, so if you wanted ctz from that, you'd subtract one, which would imply that for 0 this ctz variant would return -1. BTW, the patch broke regtest on x86_64, because libgcc contained code like: void foo (__int128_t u) { if ((long long) u == u) return something; long long hi = u 64; if (hi 0) hi = -hi; int count = __builtin_clzll (hi); if (count == 0) return somethingelse; ... } VRP figured out that count can't be 0 in a valid program (which is true), so I had to fix it up to do hi = -(unsigned long long) hi; so that it doesn't trigger undefined behavior for the minimum value of u.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #17 from Jakub Jelinek jakub at gcc dot gnu.org --- Unfortunately looking at longlong.h, the header on some targets uses __builtin_clz* or __builtin_ctz* even with defining COUNT_LEADING_ZEROS_0 to some value (e.g. on alpha, arm, avr, coldfire, mips). libgcc itself doesn't use COUNT_LEADING_ZEROS_0 macro, nor does glibc, but wonder if other projects don't use it. Code that happened to work (albeit, according to gcc documentation (but not longlong.h) undefined) could break with my patch. So, either we'd need to change longlong.h not to define COUNT_LEADING_ZEROS_0 when count_leading_zeros is implemented using the builtin, or we'd need to be more conservative in the VRP patch. The problem is what is more conservative. The C?Z_DEFINED_VALUE_AT_ZERO macros aren't unfortunately immediately usable, because they are well defined just for some integer modes (the ones for which there is some hw insn or at least optab), while the builtin might be for some other mode (larger or smaller). If the builtin is for smaller mode and HW supports only wider insns, generally clz is expanded as clzlarger + (largerbitsize - thisbitsize), ctz just as doing larger ctz and not adjusting in any way. So perhaps VRP could check whether there is an optab for the mode of the builtin and if yes, look at its C.Z_DEFINED_VALUE_AT_ZERO, otherwise assume undefined. That would let the longlong.h work, supposedly it uses the builtins only for the modes which are in hw (otherwise the libgcc implementation using those macros would be not working). Because trying to handle all the cases would be a mess, looking at CLZ_DEFINED_VALUE_AT_ZERO when it is defined it is usually number of bits in the mode and even the various extra clz expansions (both the widening which is expanded as wider clz + difference of the wider and narrower mode bitsize and doubleword narrowing which is expanded as runtime non-zero check on the first word and either clz of the first word or bitsize difference + clz of second word) probably honor it, but for ctz some targets use bitsize, some targets use -1, but e.g. when widening we'd happily return bitsize of the larger mode and not adjust, or when ctz is expanded using clz it can be again different. So, thoughts on this?
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #18 from Jakub Jelinek jakub at gcc dot gnu.org --- Or yet another option is just to always use [0, prec] VR for CLZ (and lower it for defined range, as the patch does right now), i.e. for CLZ change maxi = prec - 1; to maxi = prec; and drop if (maxi == prec) maxi = prec - 1;, looking for all targets where the value for 0 is defined for CLZ it is mode size, and it ought to behave well even with the widening/narrowing expansions. And for CTZ just leave 0 as undefined, because there really are just too many possibilities what you can get, and it isn't relevant to longlong.h either, because there is no COUNT_TRAILING_ZEROS_0 macro.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #19 from Jakub Jelinek jakub at gcc dot gnu.org --- Also, perhaps expand_builtin_unop for the int bitops builtin (ffs/parity/popcount/clz/ctz/clrsb) optabs, if target returned by expand_unop is wider than target_mode, perhaps we could set SUBREG_PROMOTED_VAR_P and !SUBREG_PROMOTED_UNSIGNED_P. When the builtins are expanded using libcall, the libcalls return int and so does the builtin, thus the target_mode should then be equal to what we have, and otherwise I'd think the result should be already extended, as in target should be the return value, not some value with target value only in the low bits. But that would require verification for all targets.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Jakub Jelinek jakub at gcc dot gnu.org changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #8 from Jakub Jelinek jakub at gcc dot gnu.org --- Created attachment 30453 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=30453action=edit gcc49-pr29776.patch Untested VRP patch. There is SSA_NAME_RANGE_INFO support being reviewed currently which will make VRP info persistent and some zero/sign-extension pass that makes use of that info, perhaps that will take care of the rest. I'm not 100% sure about CLZ/CTZ in the patch, because it could return any value for argument of 0, but as we document it as undefined behavior, perhaps it is fine. With more code we could do even better than this and define from argument's VR smaller VR_RANGE than this patch does, say for ffs if 0 isn't in the VR_RANGE of the argument, the minimum value of the builtin won't be 0, but 1, and from maximum value we could take floor_log of the maximum value plus 1 as the maximum of the resulting range. Ditto for popcount, for parity the current [0, 1] is sufficient, for clz we could use precision minus floor_log of the maximum value as the minimum of resulting vr, for ctz derive from it maximum. Dunno if it is worth it though.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Jakub Jelinek jakub at gcc dot gnu.org changed: What|Removed |Added Attachment #30453|0 |1 is obsolete|| --- Comment #9 from Jakub Jelinek jakub at gcc dot gnu.org --- Created attachment 30454 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=30454action=edit gcc49-pr29776.patch It wasn't actually that hard to improve the VRs in some cases.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #10 from dean at arctic dot org --- On Thu, 4 Jul 2013, jakub at gcc dot gnu.org wrote: I'm not 100% sure about CLZ/CTZ in the patch, because it could return any value for argument of 0, but as we document it as undefined behavior, perhaps it is fine. this is unfortunate really -- the newer LZCNT/TZCNT instructions on x86 explicitly return 8*sizeof(type) for input of zero because the undefined behaviour of BSF/BSR posed a lot of problems in inner loops. it might be useful to add explicit builtins for this behaviour. thanks for taking a look at the bug :) -dean
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #11 from Marc Glisse glisse at gcc dot gnu.org --- (In reply to dean from comment #10) this is unfortunate really -- the newer LZCNT/TZCNT instructions on x86 explicitly return 8*sizeof(type) for input of zero Same as the power cntlz* instructions. And in both cases we implement intrinsics as a call to __builtin_clz*. So it would make sense to assume this behavior in the optimizations (if it is undefined behavior, picking this behavior won't break more things than assuming it never happens).
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #12 from Jakub Jelinek jakub at gcc dot gnu.org --- Not all CPUs that have defined behavior for 0 define it to the precision though, and even on i?86 it is undefined even when using lzcnt/tzcnt on older CPUs. Even the libgcc routines provide various return values for 0 depending on target (look for COUNT_LEADING_ZEROS_0). So it is not an option to redefine the builtins, they are undefined behavior for 0 and that can't change. There is __builtin_ffs that provides defined behavior for 0.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #13 from joseph at codesourcery dot com joseph at codesourcery dot com --- For the RTL operations and optabs, we have CLZ_DEFINED_VALUE_AT_ZERO and CTZ_DEFINED_VALUE_AT_ZERO, but as noted in the documentation they do not refer to definedness for the built-in functions. (That's similar to SHIFT_COUNT_TRUNCATED / TARGET_SHIFT_TRUNCATION_MASK, which again describe semantics at the RTL level but do not affect the undefinedness of out-of-range shifts in C/C++ source code or GIMPLE.)
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #14 from Marc Glisse glisse at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #12) Not all CPUs that have defined behavior for 0 define it to the precision though, and even on i?86 it is undefined even when using lzcnt/tzcnt on older CPUs. Even the libgcc routines provide various return values for 0 depending on target (look for COUNT_LEADING_ZEROS_0). So it is not an option to redefine the builtins, they are undefined behavior for 0 and that can't change. I didn't mean to document the value for 0 or have other optimizations create calls to this builtin relying on this property. I just meant that if we already have a call to clz and the argument happens to be 0 (which is undefined), we could optimize based on the assumption that the result is the precision, that may break less code than your current patch. Otherwise we might as well assert (well, tell it to VRP) that the argument is non-zero. But you are probably right that trying to work around the undefined value for 0 is wrong, another builtin should be used instead. There is __builtin_ffs that provides defined behavior for 0. So, should that be used for the x86 and power intrinsics? Or a new __builtin_clz0?
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 sgunderson at bigfoot dot com changed: What|Removed |Added CC||sgunderson at bigfoot dot com --- Comment #6 from sgunderson at bigfoot dot com --- Without knowing anything about the GCC internals here, I could perhaps also point out that GCC should know that these have limited range. As a trivial example: int foo(int x) { int z = __builtin_ctz(x); if (z 2000) { return 1; } else { return 0; } } There's no way this function can return anything but 0, and VRP should probably be taught that. (I wonder if this would fix the unneccessary sign extension too?)
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #7 from sgunderson at bigfoot dot com --- Wait, sorry, someone's already pointed that out. Ignore me, then... I can at least confirm it still happens with GCC 4.8.1.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #4 from Uros Bizjak ubizjak at gmail dot com 2012-07-30 11:36:23 UTC --- Perhaps REE can be taught about ctz giving a non-negative result. Maybe we need some VRP pass to remove these extensions. Please note an example from (duplicate) PR54115, where we generate: #include stdint.h uint64_t foo(long x){ return __builtin_ctzl(x); } foo: bsfq %rdi, %rax cltq this is DImode - DImode operation, followed by sign-extend from SImode partial reg. Your examples in comment #3: bsfl %esi, %esi movslq %esi, %rsi can be fixed by slapping any_extend:DI to CTZ pattern, to consume either ZERO_EXTEND or SIGN_EXTEND of the value (it doesn't matter for ranges [0..(some small number)]). The DImode example above can be fixed by adding SUBREG to all patterns. But, I think that there is more optimal implementation than exploding the number of bit manipulating patterns.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 --- Comment #5 from Uros Bizjak ubizjak at gmail dot com 2012-07-30 12:48:19 UTC --- Created attachment 27895 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=27895 Target patch that handles ctz extensions x86 target patch that teaches gcc about extensions for ctz. Patched gcc generates no extension for: long long foo (int x) { return __builtin_ctz (x); } and int foo (long long x) { return __builtin_ctzll (x); } (and their unsigned variants) on x86.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Andrew Pinski pinskia at gcc dot gnu.org changed: What|Removed |Added CC||neleai at seznam dot cz --- Comment #2 from Andrew Pinski pinskia at gcc dot gnu.org 2012-07-29 08:15:14 UTC --- *** Bug 54115 has been marked as a duplicate of this bug. ***
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed|2011-12-24 00:00:00 |2012-07-29 0:00 CC||uros at gcc dot gnu.org --- Comment #3 from Steven Bosscher steven at gcc dot gnu.org 2012-07-29 13:23:30 UTC --- Tested with trunk r189929 on x86-64 (generic). #define MODE signed // or unsigned unsigned mask[8]; unsigned foo(unsigned y, unsigned char x) { MODE int c = (MODE) __builtin_ctz(x); unsigned int m = mask[c]; return y m; } With MODE==signed: movzbl %sil, %esi movl%edi, %eax rep; bsfl %esi, %esi movslq %esi, %rsi andlmask(,%rsi,4), %eax ret With MODE==unsigned: movzbl %sil, %esi movl%edi, %eax rep; bsfl %esi, %esi movl%esi, %esi andlmask(,%rsi,4), %eax ret The movl %esi, %esi is a zeroextendsidi: #(insn:TI 9 2 10 2 (parallel [ #(set (reg:SI 4 si [orig:60 D.1716 ] [60]) #(ctz:SI (reg:SI 4 si [orig:68 D.1715 ] [68]))) #(clobber (reg:CC 17 flags)) #]) t.c:6 666 {ctzsi2} # (expr_list:REG_UNUSED (reg:CC 17 flags) #(nil))) rep; bsfl %esi, %esi # 9 ctzsi2 [length = 4] #(insn:TI 10 9 12 2 (set (reg:DI 4 si [orig:70 D.1716 ] [70]) #(zero_extend:DI (reg:SI 4 si [orig:60 D.1716 ] [60]))) t.c:7 113 {*zero_extendsidi2_rex64} # (nil)) movl%esi, %esi # 10*zero_extendsidi2_rex64/1 [length = 2] Perhaps REE can be taught about ctz giving a non-negative result.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Andrew Pinski pinskia at gcc dot gnu.org changed: What|Removed |Added Keywords||missed-optimization Status|UNCONFIRMED |NEW Last reconfirmed||2011-12-24 Ever Confirmed|0 |1 --- Comment #1 from Andrew Pinski pinskia at gcc dot gnu.org 2011-12-24 06:46:16 UTC --- Trying 9 - 10: Failed to match this instruction: (set (reg:DI 69 [ D.1710 ]) (sign_extend:DI (ctz:SI (reg:SI 67 [ x ] Should be changed to zero_extend and then that could be matched really.