Re: Add value range support into memcpy/memset expansion
On Tue, 19 Nov 2013, Jan Hubicka wrote:

> Hi,
> this patch fixes two issues with the memcpy testcase - it silences the
> warning and updates the template as suggested by Uros in the PR.
>
> The testcase still fails on i386.  This is because we end up with:
>
> ;; Function t (t, funcdef_no=0, decl_uid=1763, symbol_order=2)
>
> t (unsigned int c)
> {
>   void * b.0_4;
>   void * a.1_5;
>
>   <bb 2>:
>   if (c_2(D) <= 9)
>     goto <bb 3>;
>   else
>     goto <bb 4>;
>
>   <bb 3>:
>   b.0_4 = b;
>   a.1_5 = a;
>   memcpy (a.1_5, b.0_4, c_2(D));
>
>   <bb 4>:
>   return;
> }
>
> and we have no useful value range on c_2 because the ASSERT_EXPR was
> removed, while in the 64-bit version there is a cast in bb 3 that
> preserves the info.  Solving this is an independent (and I guess not
> terribly easy) problem.

Hmm, I thought Jakub fixed this already (with the checking whether there
are any uses of c_2(D) before the conditional)?  Or is this a different
case?

Richard.

> Regtested x86_64-linux, will commit it shortly.
>
> [patch snipped]

--
Richard Biener <rguent...@suse.de>
SUSE / SUSE Labs
Re: Add value range support into memcpy/memset expansion
On Tue, Nov 19, 2013 at 09:50:56AM +0100, Richard Biener wrote:
> > this patch fixes two issues with the memcpy testcase - it silences the
> > warning and updates the template as suggested by Uros in the PR.
> > [...]
> > and we have no useful value range on c_2 because the ASSERT_EXPR was
> > removed, while in the 64-bit version there is a cast in bb 3 that
> > preserves the info.  Solving this is an independent (and I guess not
> > terribly easy) problem.
>
> Hmm, I thought Jakub fixed this already (with the checking whether there
> are any uses of c_2(D) before the conditional)?  Or is this a different
> case?

It was a different case.  I've only handled the case where you have

  if (c >= 10) __builtin_unreachable ();

and c doesn't have any immediate uses before this form of assertion.
In Honza's testcase there is no __builtin_unreachable, but instead a
return at that point; changing the value range of c_2(D) in his case
would be far more controversial.  For __builtin_unreachable () it means
that if you pass c_2(D) >= 10 to the function and the if (c >= 10) test
is reached, it will be invalid (from this POV that transformation isn't
100% valid either, because there is no proof that if you call the
function, that condition stmt will be executed).  While in the above
case a c_2(D) of 10 or more is completely valid.
Perhaps better might be for __builtin_unreachable () to just add an
internal pass-through call (perhaps doing nothing at all) that would
preserve the range and non-zero bits info.  We could perhaps allow some
extra optimizations on it (like, if we propagate anything other than
some SSA_NAME to the first argument, we just remove the call and
propagate further); the question is what else we should do so that it
doesn't inhibit some optimizations unnecessarily.  But if the user cared
enough to insert a __builtin_unreachable () assertion, perhaps it might
be worth preserving that info.  That said, for Honza's case keeping
around some internal pass-through call might be far more expensive.

	Jakub
Re: Add value range support into memcpy/memset expansion
On Sun, Nov 17, 2013 at 3:38 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Hi,
> this is the version I committed.  It also adds a testcase and enables
> the support in the i386 backend.
>
> Honza
>
> 	* doc/md.texi (setmem, movstr): Update documentation.
> 	* builtins.c (determine_block_size): New function.
> 	(expand_builtin_memcpy): Use it and pass it to
> 	emit_block_move_hints.
> 	(expand_builtin_memset_args): Use it and pass it to
> 	set_storage_via_setmem.
> 	* expr.c (emit_block_move_via_movmem): Add min_size/max_size
> 	parameters; update call to expander.
> 	(emit_block_move_hints): Add min_size/max_size parameters.
> 	(clear_storage_hints): Likewise.
> 	(set_storage_via_setmem): Likewise.
> 	(clear_storage): Update.
> 	* expr.h (emit_block_move_hints, clear_storage_hints,
> 	set_storage_via_setmem): Update prototype.
> 	* i386.c (ix86_expand_set_or_movmem): Add bounds; export.
> 	(ix86_expand_movmem, ix86_expand_setmem): Remove.
> 	* i386.md (movmem, setmem): Pass parameters.
> 	* testsuite/gcc.target/i386/memcpy-2.c: New testcase.

The new testcase fails for me on x86 and x86-64:

FAIL: gcc.target/i386/memcpy-2.c scan-assembler-not memcpy
FAIL: gcc.target/i386/memcpy-2.c (test for excess errors)

--
H.J.
Re: Add value range support into memcpy/memset expansion
On Mon, Nov 18, 2013 at 6:19 AM, H.J. Lu <hjl.to...@gmail.com> wrote:
> On Sun, Nov 17, 2013 at 3:38 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> > Hi,
> > this is the version I committed.  It also adds a testcase and enables
> > the support in the i386 backend.
> > [ChangeLog snipped]
> The new testcase fails for me on x86 and x86-64:
>
> FAIL: gcc.target/i386/memcpy-2.c scan-assembler-not memcpy
> FAIL: gcc.target/i386/memcpy-2.c (test for excess errors)

I got

[hjl@gnu-6 gcc]$ /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc/
/export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c
-fno-diagnostics-show-caret -fdiagnostics-color=never -O2
-ffat-lto-objects -ffat-lto-objects -S -m32 -o memcpy-2.s
/export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c: In function ‘t’:
/export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c:10:5: warning:
incompatible implicit declaration of built-in function ‘memcpy’ [enabled by default]
[hjl@gnu-6 gcc]$ cat memcpy-2.s
	.file	"memcpy-2.c"
	.text
	.p2align 4,,15
	.globl	t
	.type	t, @function
t:
.LFB0:
	.cfi_startproc
	subl	$28, %esp
	.cfi_def_cfa_offset 32
	movl	32(%esp), %eax
	cmpl	$9, %eax
	ja	.L3
	movl	%eax, 8(%esp)
	movl	b, %eax
	movl	%eax, 4(%esp)
	movl	a, %eax
	movl	%eax, (%esp)
	call	memcpy
.L3:
	addl	$28, %esp
	.cfi_def_cfa_offset 4
	ret
	.cfi_endproc

/* { dg-final { scan-assembler-not "memcpy" } } */ will also match

	.file	"memcpy-2.c"

--
H.J.
Re: Add value range support into memcpy/memset expansion
> I opened:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59175

Thanks, seems I mixed up the testcase file.  Will fix that soon.

Honza
Re: Add value range support into memcpy/memset expansion
On Mon, Nov 18, 2013 at 9:16 AM, H.J. Lu <hjl.to...@gmail.com> wrote:
> On Mon, Nov 18, 2013 at 6:19 AM, H.J. Lu <hjl.to...@gmail.com> wrote:
> > On Sun, Nov 17, 2013 at 3:38 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> > > this is the version I committed.  It also adds a testcase and
> > > enables the support in the i386 backend.
> > > [ChangeLog snipped]
> > The new testcase fails for me on x86 and x86-64:
> >
> > FAIL: gcc.target/i386/memcpy-2.c scan-assembler-not memcpy
> > FAIL: gcc.target/i386/memcpy-2.c (test for excess errors)
>
> [compiler output snipped; the /* { dg-final { scan-assembler-not
> "memcpy" } } */ directive will also match the .file "memcpy-2.c" line]

I opened:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59175

--
H.J.
Re: Add value range support into memcpy/memset expansion
Hi,
this patch fixes two issues with the memcpy testcase - it silences the
warning and updates the template as suggested by Uros in the PR.

The testcase still fails on i386.  This is because we end up with:

;; Function t (t, funcdef_no=0, decl_uid=1763, symbol_order=2)

t (unsigned int c)
{
  void * b.0_4;
  void * a.1_5;

  <bb 2>:
  if (c_2(D) <= 9)
    goto <bb 3>;
  else
    goto <bb 4>;

  <bb 3>:
  b.0_4 = b;
  a.1_5 = a;
  memcpy (a.1_5, b.0_4, c_2(D));

  <bb 4>:
  return;
}

and we have no useful value range on c_2 because the ASSERT_EXPR was
removed, while in the 64-bit version there is a cast in bb 3 that
preserves the info.  Solving this is an independent (and I guess not
terribly easy) problem.

Regtested x86_64-linux, will commit it shortly.

Index: ChangeLog
===================================================================
--- ChangeLog	(revision 204984)
+++ ChangeLog	(working copy)
@@ -1,3 +1,10 @@
+2013-11-18  Jan Hubicka  <j...@suse.cz>
+	    Uros Bizjak  <ubiz...@gmail.com>
+
+	PR middle-end/59175
+	* gcc.target/i386/memcpy-2.c: Fix template;
+	add +1 so the testcase passes at 32bit.
+
 2013-11-18  Dominique d'Humieres  <domi...@lps.ens.fr>
 
 	* c-c++-common/cilk-plus/PS/reduction-3.c: Use stdlib.h.
Index: gcc.target/i386/memcpy-2.c
===================================================================
--- gcc.target/i386/memcpy-2.c	(revision 204984)
+++ gcc.target/i386/memcpy-2.c	(working copy)
@@ -1,11 +1,11 @@
 /* { dg-do compile } */
 /* { dg-options "-O2" } */
-/* Memcpy should be inlined because block size is known.  */
-/* { dg-final { scan-assembler-not "memcpy" } } */
 void *a;
 void *b;
 t(unsigned int c)
 {
   if (c < 10)
-    memcpy (a,b,c);
+    __builtin_memcpy (a,b,c+1);
 }
+/* Memcpy should be inlined because block size is known.  */
+/* { dg-final { scan-assembler-not "(jmp|call)\[\\t \]*memcpy" } } */
Re: Add value range support into memcpy/memset expansion
Hi,
this is the version I committed.  It also adds a testcase and enables
the support in the i386 backend.

Honza

	* doc/md.texi (setmem, movstr): Update documentation.
	* builtins.c (determine_block_size): New function.
	(expand_builtin_memcpy): Use it and pass it to
	emit_block_move_hints.
	(expand_builtin_memset_args): Use it and pass it to
	set_storage_via_setmem.
	* expr.c (emit_block_move_via_movmem): Add min_size/max_size
	parameters; update call to expander.
	(emit_block_move_hints): Add min_size/max_size parameters.
	(clear_storage_hints): Likewise.
	(set_storage_via_setmem): Likewise.
	(clear_storage): Update.
	* expr.h (emit_block_move_hints, clear_storage_hints,
	set_storage_via_setmem): Update prototype.
	* i386.c (ix86_expand_set_or_movmem): Add bounds; export.
	(ix86_expand_movmem, ix86_expand_setmem): Remove.
	* i386.md (movmem, setmem): Pass parameters.
	* testsuite/gcc.target/i386/memcpy-2.c: New testcase.

Index: doc/md.texi
===================================================================
--- doc/md.texi	(revision 204899)
+++ doc/md.texi	(working copy)
@@ -5328,6 +5328,9 @@ destination and source strings are opera
 the expansion of this pattern should store in operand 0 the address in
 which the @code{NUL} terminator was stored in the destination string.
 
+This pattern also has several optional operands that are the same as in
+@code{setmem}.
+
 @cindex @code{setmem@var{m}} instruction pattern
 @item @samp{setmem@var{m}}
 Block set instruction.  The destination string is the first operand,
@@ -5347,6 +5350,8 @@ respectively.  The expected alignment di
 in a way that the blocks are not required to be aligned according to it in
 all cases.  This expected alignment is also in bytes, just like operand 4.
 Expected size, when unknown, is set to @code{(const_int -1)}.
+Operand 7 is the minimal size of the block and operand 8 is the
+maximal size of the block (NULL if it cannot be represented as CONST_INT).
 
 The use for multiple @code{setmem@var{m}} is as for @code{movmem@var{m}}.
Index: builtins.c
===================================================================
--- builtins.c	(revision 204899)
+++ builtins.c	(working copy)
@@ -3095,6 +3095,51 @@ builtin_memcpy_read_str (void *data, HOS
   return c_readstr (str + offset, mode);
 }
 
+/* LEN specifies the length of the block of the memcpy/memset operation.
+   Figure out its range and put it into MIN_SIZE/MAX_SIZE.  */
+
+static void
+determine_block_size (tree len, rtx len_rtx,
+		      unsigned HOST_WIDE_INT *min_size,
+		      unsigned HOST_WIDE_INT *max_size)
+{
+  if (CONST_INT_P (len_rtx))
+    {
+      *min_size = *max_size = UINTVAL (len_rtx);
+      return;
+    }
+  else
+    {
+      double_int min, max;
+      if (TREE_CODE (len) == SSA_NAME
+	  && get_range_info (len, &min, &max) == VR_RANGE)
+	{
+	  if (min.fits_uhwi ())
+	    *min_size = min.to_uhwi ();
+	  else
+	    *min_size = 0;
+	  if (max.fits_uhwi ())
+	    *max_size = max.to_uhwi ();
+	  else
+	    *max_size = (HOST_WIDE_INT) -1;
+	}
+      else
+	{
+	  if (host_integerp (TYPE_MIN_VALUE (TREE_TYPE (len)), 1))
+	    *min_size = tree_low_cst (TYPE_MIN_VALUE (TREE_TYPE (len)), 1);
+	  else
+	    *min_size = 0;
+	  if (host_integerp (TYPE_MAX_VALUE (TREE_TYPE (len)), 1))
+	    *max_size = tree_low_cst (TYPE_MAX_VALUE (TREE_TYPE (len)), 1);
+	  else
+	    *max_size = GET_MODE_MASK (GET_MODE (len_rtx));
+	}
+    }
+  gcc_checking_assert (*max_size <=
+		       (unsigned HOST_WIDE_INT)
+		       GET_MODE_MASK (GET_MODE (len_rtx)));
+}
+
 /* Expand a call EXP to the memcpy builtin.
    Return NULL_RTX if we failed, the caller should emit a normal call,
    otherwise try to get the result in TARGET, if convenient (and in
@@ -3117,6 +3162,8 @@ expand_builtin_memcpy (tree exp, rtx tar
   rtx dest_mem, src_mem, dest_addr, len_rtx;
   HOST_WIDE_INT expected_size = -1;
   unsigned int expected_align = 0;
+  unsigned HOST_WIDE_INT min_size;
+  unsigned HOST_WIDE_INT max_size;
 
   /* If DEST is not a pointer type, call the normal function.  */
   if (dest_align == 0)
@@ -3136,6 +3183,7 @@ expand_builtin_memcpy (tree exp, rtx tar
   dest_mem = get_memory_rtx (dest, len);
   set_mem_align (dest_mem, dest_align);
   len_rtx = expand_normal (len);
+  determine_block_size (len, len_rtx, &min_size, &max_size);
   src_str = c_getstr (src);
 
   /* If SRC is a string constant and block move would be done
@@ -3164,7 +3212,8 @@ expand_builtin_memcpy (tree exp, rtx tar
       dest_addr = emit_block_move_hints (dest_mem, src_mem, len_rtx,
					 CALL_EXPR_TAILCALL
Re: Add value range support into memcpy/memset expansion
Hi,
this patch makes it possible to access value range info from
setmem/movstr that I plan to use in the i386 memcpy/memset expansion
code.  It is all quite straightforward, except that I need to deal with
cases where the max size does not fit in a HOST_WIDE_INT, where I use
the maximal value as a marker.  It is then translated as a NULL pointer
to the expander, which is a bit inconsistent with other places that use
-1 as the marker of an unknown value.

I also think we lose some cases because of TER replacing the SSA_NAME
by something else, but it seems to work in quite many cases.  This can
probably be tracked incrementally by disabling TER here or finally
getting away from expanding calls via the generic route.

Bootstrapped/regtested x86_64-linux, OK?

Honza

	* doc/md.texi (setmem, movstr): Update documentation.
	* builtins.c (determine_block_size): New function.
	(expand_builtin_memcpy): Use it and pass it to
	emit_block_move_hints.
	(expand_builtin_memset_args): Use it and pass it to
	set_storage_via_setmem.
	* expr.c (emit_block_move_via_movmem): Add min_size/max_size
	parameters; update call to expander.
	(emit_block_move_hints): Add min_size/max_size parameters.
	(clear_storage_hints): Likewise.
	(set_storage_via_setmem): Likewise.
	(clear_storage): Update.
	* expr.h (emit_block_move_hints, clear_storage_hints,
	set_storage_via_setmem): Update prototype.

Ping...  http://gcc.gnu.org/ml/gcc-patches/2013-09/msg02011.html

Honza
Re: Add value range support into memcpy/memset expansion
On Fri, 15 Nov 2013, Jan Hubicka wrote:

> Hi,
> this patch makes it possible to access value range info from
> setmem/movstr that I plan to use in the i386 memcpy/memset expansion
> code.  [...]
>
> Bootstrapped/regtested x86_64-linux, OK?
>
> [ChangeLog snipped]
>
> Ping...  http://gcc.gnu.org/ml/gcc-patches/2013-09/msg02011.html

Ok - please make sure it still works after all the header re-org ;)

Thanks,
Richard.
Re: Add value range support into memcpy/memset expansion
> > > We now produce:
> > > 	movq	b(%rip), %rsi
> > > 	movq	a(%rip), %rcx
> > > 	movq	(%rsi), %rax	<- first 8 bytes are moved
> > > 	leaq	8(%rcx), %rdi
> > > 	andq	$-8, %rdi	<- dest is aligned
> > > 	movq	%rax, (%rcx)
> > > 	movq	132(%rsi), %rax	<- last 8 bytes are moved
> > > 	movq	%rax, 132(%rcx)
> > > 	subq	%rdi, %rcx	<- alignment is subtracted from count
> > > 	subq	%rcx, %rsi	<- source is aligned
> >
> > This (source aligned) is not always true, but nevertheless the
> > sequence is very tight.

Yep, sure, it is aligned only if source-dest is aligned, but that is the
best we can ask for.

> > > Unfortunately the following testcase:
> > >
> > > char *p,*q;
> > > t(int a)
> > > {
> > >   if (a < 100)
> > >     memcpy(q,p,a);
> > > }
> > >
> > > won't get inlined.  This is because a is known to be smaller than
> > > 100, which results in an anti-range after conversion to size_t.
> > > This anti-range allows very large values (above INT_MAX) and thus
> > > we do not know the block size.  I am not sure if the sane range
> > > can be recovered somehow.  If not, maybe this is common enough to
> > > add support for a probable upper bound parameter to the template.
> > > Do we know if there is real code that intentionally does that,
> > > other than security flaws as a result of an improperly done range
> > > check?

I do not think so.

> > I think by default GCC should assume the memcpy size range is
> > (0, 100) here with perhaps an option to override it.

Indeed, this is what I was suggesting.  Problem is what to pass down to
the expanders as a value range.  We would need to update the
documentation of the expanders to say that the ranges are just highly
probable - and I do not want to do that, since I want to use the ranges
for move_by_pieces, too.  So I think we will have to introduce two upper
bound parameters - one sure and the other very likely - if there is no
other solution.  We play similar tricks in the niter code.

Honza
Re: Add value range support into memcpy/memset expansion
Hi Jan,

> I also think the misaligned move trick can/should be performed by
> move_by_pieces and we ought to consider sane use of SSE - the current
> vector_loop with unrolling factor of 4 seems a bit extreme.  At least
> Bulldozer is happy with 2, and I would expect SSE moves to be
> especially useful for moving blocks with known size where they are not
> used at all.
>
> Currently I disabled misaligned move prologues/epilogues for Michael's
> vector loop path since they end up longer than the traditional code
> (that uses a loop for the epilogue).

Prologues could use this technique even with vector_loop, as they
actually don't have a loop.  As for epilogues - have you tried to use
misaligned vector moves (movdqu)?  It looks to me that we need approx.
the same amount of instructions in vector-loop and in usual-loop
epilogues, if we use vector instructions in the vector-loop epilogue.

Comments are welcome.

BTW, maybe we could generalize expand_small_movmem a bit and make a
separate expanding strategy out of it?  It would expand a memmov with no
loop (and probably no alignment prologue) - just with the biggest
available moves.  Also, a cost model could be added here to make
decisions on when we actually want to align the moves.  Here is a couple
of examples of that:

memcpy (a, b, 32); // alignment is unknown
will expand to
	movdqu	a, %xmm0
	movdqu	a+16, %xmm1
	movdqu	%xmm0, b
	movdqu	%xmm1, b+16

memcpy (a, b, 32); // alignment is known and equals 64 bit
will expand to
a)
	movdqu	a, %xmm0
	movdqu	a+16, %xmm1
	movdqu	%xmm0, b
	movdqu	%xmm1, b+16
or
b)
	movq	a, %xmm0
	movdqa	a+8, %xmm1
	movq	a+24, %xmm2
	movq	%xmm0, b
	movdqa	%xmm1, b+8
	movq	%xmm2, b+24

We would compute the total cost of both variants and choose the best -
for the computation we need just the cost of aligned and misaligned
moves.  This strategy is actually pretty similar to move_by_pieces, but
as it has much more target-specific information, it would be able to
produce much more effective code.
And one more question - while working on vector_loop for memset I tried
to merge many of the movmem and setmem routines (e.g. instead of
expand_movmem_prologue and expand_setmem_prologue I made a single
routine expand_movmem_or_setmem_prologue).  What do you think, is it a
good idea?  It reduces code size in i386.c, but slightly complicates the
code.  I'll send a patch shortly, as soon as the testing completes.

Thanks,
Michael
Re: Add value range support into memcpy/memset expansion
> Hi Jan,
>
> > I also think the misaligned move trick can/should be performed by
> > move_by_pieces and we ought to consider sane use of SSE - the
> > current vector_loop with unrolling factor of 4 seems a bit extreme.
> > At least Bulldozer is happy with 2, and I would expect SSE moves to
> > be especially useful for moving blocks with known size where they
> > are not used at all.
> >
> > Currently I disabled misaligned move prologues/epilogues for
> > Michael's vector loop path since they end up longer than the
> > traditional code (that uses a loop for the epilogue).
>
> Prologues could use this technique even with vector_loop, as they
> actually don't have a loop.

Where the new prologues lose is the fact that we need to handle all
sizes smaller than SIZE_NEEDED.  This is 64 bytes, which leads to a
variant for 32..64, 16..32, 8..16, 4..8 and the tail.  It is quite a lot
of code.  When the block is known to be greater than 64, this is also a
win, but my current patch does not fine-tune this yet.

Similarly, misaligned moves are a win when the size is known, alignment
is not performed and the normal fixed-size epilogue needs more than one
move, or when the alignment is known but the offset is non-zero.  It
will need a bit of tweaking to handle all the paths well - it is the
usual problem with the stringops: they get way too complex as the number
of factors increases.

It is why I think we may consider a vector loop with fewer than 4
unrollings.  The AMD optimization manual recommends two for Bulldozer...
Is there a difference between 4 and 2 for Atom?  To be honest I am not
quite sure where the constant of 4 comes from.  I think I introduced it
long time ago for K8 where it apparently got some extra % of
performance.  It is used for larger blocks only for PPro.  AMD chips
prefer it for small blocks, apparently because they prefer a loop-less
sequence.

> As for epilogues - have you tried to use misaligned vector moves
> (movdqu)?  It looks to me that we need approx. the same amount of
> instructions in vector-loop and in usual-loop epilogues, if we use
> vector instructions in the vector-loop epilogue.
Yes, the code is in place for this.  You can just remove the check for
size_needed being smaller than 32 and it will produce the movdqu
sequence for you (I tested it on the vector loop testcase in the
testsuite).  The code will also use SSE for unrolled_loop prologue
expansion, at least for memcpy (for memset it does not have a broadcast
value, so it should skip it).

> Comments are welcome.
>
> BTW, maybe we could generalize expand_small_movmem a bit and make a
> separate expanding strategy out of it?  It would expand a memmov with
> no loop (and probably no alignment prologue) - just with the biggest
> available moves.  Also, a cost model could be added here to make
> decisions on when we actually want to align the moves.
> [examples snipped]
>
> We would compute the total cost of both variants and choose the best -
> for the computation we need just the cost of aligned and misaligned
> moves.  This strategy is actually pretty similar to move_by_pieces,
> but as it has much more target-specific information, it would be able
> to produce much more effective code.

I was actually thinking more along the lines of teaching move_by_pieces
to do the tricks.  It seems there is not that much x86-specific
knowledge in here, and other architectures will also benefit from it.
We can also enable it when the value range is close enough.  I plan to
look into it today or tomorrow - revisit your old patch to
move_by_pieces and see how much extra API I need to get move_by_pieces
to do what expand_small_movmem does.

> And one more question - while working on vector_loop for memset I
> tried to merge many of the movmem and setmem routines (e.g. instead of
> expand_movmem_prologue and expand_setmem_prologue I made a single
> routine expand_movmem_or_setmem_prologue).  What do you think, is it a
> good idea?  It reduces code size in i386.c, but slightly complicates
> the code.  I'll send a patch shortly, as soon as the testing
> completes.

I would like to see it.  I am not too thrilled by the duplication.  My
original motivation was to keep the number of code paths through the
expanders under control.  We already have many of them (and it is easy
to get wrong code), as different variants of prologues/epilogues and
main loops are not exactly the same, and thus the code is not as modular
as I would like.  I am not sure if adding differences between memset and
memmove is not going to add too many extra cases to think of.  Maybe
not, like for the misaligned
Re: Add value range support into memcpy/memset expansion
> Nice extension.  Test cases would be great to have.

For those you need the i386 changes to actually use the info.  I will
post that after some cleanup and additional testing.

Honza
Re: Add value range support into memcpy/memset expansion
> > Nice extension.  Test cases would be great to have.
>
> For those you need the i386 changes to actually use the info.  I will
> post that after some cleanup and additional testing.

Hi,
since I already caught your attention, here is the target specific part
for comments.

This patch implements memcpy/memset prologues and epilogues as suggested
by Ondrej Bilka.  His glibc implementation uses an IMO very smart trick
with a single misaligned move to copy the first N and last N bytes of
the block.  The remainder of the block is then copied by the usual loop
that gets aligned to the proper address.  This leads to a partial memory
stall, but that is handled well by modern x86 chips.

For example, in the following testcase:

char *a;
char *b;
t1()
{
  memcpy (a,b,140);
}

we now produce:

	movq	b(%rip), %rsi
	movq	a(%rip), %rcx
	movq	(%rsi), %rax	<- first 8 bytes are moved
	leaq	8(%rcx), %rdi
	andq	$-8, %rdi	<- dest is aligned
	movq	%rax, (%rcx)
	movq	132(%rsi), %rax	<- last 8 bytes are moved
	movq	%rax, 132(%rcx)
	subq	%rdi, %rcx	<- alignment is subtracted from count
	subq	%rcx, %rsi	<- source is aligned
	addl	$140, %ecx	<- normal copying of 8 byte chunks
	shrl	$3, %ecx
	rep; movsq
	ret

instead of:

	movq	a(%rip), %rdi
	movq	b(%rip), %rsi
	movl	$140, %eax
	testb	$1, %dil
	jne	.L28
	testb	$2, %dil
	jne	.L29
.L3:
	testb	$4, %dil
	jne	.L30
.L4:
	movl	%eax, %ecx
	xorl	%edx, %edx
	shrl	$3, %ecx
	testb	$4, %al
	rep movsq
	je	.L5
	movl	(%rsi), %edx
	movl	%edx, (%rdi)
	movl	$4, %edx
.L5:
	testb	$2, %al
	je	.L6
	movzwl	(%rsi,%rdx), %ecx
	movw	%cx, (%rdi,%rdx)
	addq	$2, %rdx
.L6:
	testb	$1, %al
	je	.L8
	movzbl	(%rsi,%rdx), %eax
	movb	%al, (%rdi,%rdx)
.L8:
	rep ret
	.p2align 4,,10
	.p2align 3
.L28:
	movzbl	(%rsi), %eax
	addq	$1, %rsi
	movb	%al, (%rdi)
	addq	$1, %rdi
	movl	$139, %eax
	testb	$2, %dil
	je	.L3
	.p2align 4,,10
	.p2align 3
.L29:
	movzwl	(%rsi), %edx
	subl	$2, %eax
	addq	$2, %rsi
	movw	%dx, (%rdi)
	addq	$2, %rdi
	testb	$4, %dil
	je	.L4
	.p2align 4,,10
	.p2align 3
.L30:
	movl	(%rsi), %edx
	subl	$4, %eax
	addq	$4, %rsi
	movl	%edx, (%rdi)
	addq	$4, %rdi
	jmp	.L4

With the proposed value range code we can now take advantage of it even
for non-constant moves.  Somewhat artificial testcase:

char *p,*q;
t(unsigned int a)
{
  if (a > 8 && a < 100)
    memcpy(q,p,a);
}

still generates pretty much the same code (while the
-minline-all-stringops code on mainline is just horrible):

	leal	-9(%rdi), %edx
	movl	%edi, %eax
	cmpl	$90, %edx
	jbe	.L5
	rep; ret
	.p2align 4,,10
	.p2align 3
.L5:
	movq	p(%rip), %rsi
	movq	q(%rip), %rcx
	movq	(%rsi), %rdx
	movq	%rdx, (%rcx)
	movl	%edi, %edx
	movq	-8(%rsi,%rdx), %rdi
	movq	%rdi, -8(%rcx,%rdx)
	leaq	8(%rcx), %rdi
	andq	$-8, %rdi
	subq	%rdi, %rcx
	subq	%rcx, %rsi
	addl	%eax, %ecx
	shrl	$3, %ecx
	rep; movsq
	ret

Of course it is quite common to know only an upper bound on the block.
In this case we need to generate a prologue for the first few bytes:

char *p,*q;
t(unsigned int a)
{
  if (a < 100)
    memcpy(q,p,a);
}

t:
.LFB0:
	.cfi_startproc
	cmpl	$99, %edi
	jbe	.L15
.L7:
	rep; ret
	.p2align 4,,10
	.p2align 3
.L15:
	cmpl	$8, %edi
	movq	q(%rip), %rdx
	movq	p(%rip), %rsi
	jae	.L3
	testb	$4, %dil
	jne	.L16
	testl	%edi, %edi
	je	.L7
	movzbl	(%rsi), %eax
	testb	$2, %dil
	movb	%al, (%rdx)
	je	.L7
	movl	%edi, %edi
	movzwl	-2(%rsi,%rdi), %eax
	movw	%ax, -2(%rdx,%rdi)
	ret
	.p2align 4,,10
	.p2align 3
.L3:
	movq	(%rsi), %rax
	movq	%rax, (%rdx)
	movl	%edi, %eax
	movq	-8(%rsi,%rax), %rcx
	movq	%rcx, -8(%rdx,%rax)
	leaq	8(%rdx), %rax
	andq	$-8, %rax
	subq	%rax, %rdx
	addl	%edx, %edi
	subq	%rdx, %rsi
	shrl	$3, %edi
	movl	%edi, %ecx
	movq	%rax, %rdi
	rep; movsq
	ret
	.p2align 4,,10
	.p2align 3
.L16:
	movl	(%rsi), %eax
	movl	%edi, %edi
	movl	%eax, (%rdx)
	movl	-4(%rsi,%rdi), %eax
	movl
Re: Add value range support into memcpy/memset expansion
On Sat, Sep 28, 2013 at 3:05 PM, Jan Hubicka hubi...@ucw.cz wrote:

[... quoted patch description and 140-byte testcase trimmed; see the mail above ...]

>         subq    %rdi, %rcx       - alignment is subtracted from count
>         subq    %rcx, %rsi       - source is aligned

This (source aligned) is not always true, but nevertheless the sequence is very tight.

>         addl    $140, %ecx       - normal copying of 8 byte chunks
>         shrl    $3, %ecx
>         rep; movsq
>         ret
>
> Of course it is quite common to know only an upper bound on the block.
> In this case we need to generate a prologue for the first few bytes:

[... quoted testcase and its assembly trimmed; same as in the mail above ...]

> Mainline would output a libcall here (because the size is unknown to it), and
> with inlining of all stringops it winds up as 210 bytes of code instead of the
> 142 bytes above.
>
> Unfortunately the following testcase:
>
> char *p,*q;
> t(int a)
> {
>   if (a < 100)
>     memcpy(q,p,a);
> }
>
> won't get inlined.  This is because a is known to be smaller than 100, which
> results in an anti-range after the conversion to size_t.  This anti-range
> allows very large values (above INT_MAX) and thus we do not know the block
> size.  I am not sure if the sane range can be recovered somehow.  If not,
> maybe this is common enough to add support for a probable upper bound
> parameter to the template.

Do we know if there is real code that intentionally does that, other than security flaws resulting from an improperly done range check? I think by default GCC should assume the memcpy size range is (0, 100) here, with perhaps an option to override it.

thanks,
David

Use of value ranges makes it harder to choose the proper algorithm, since the average size is no longer known. For the moment I take a simple average of the lower and upper bound, but this is wrong.
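The anti-range problem discussed above comes from the signed guard: after `a < 100`, `a` may still be negative, and negative values convert to huge size_t values, so no upper bound survives. A user-level workaround, sketched here by the editor and not part of the patch, is to do the comparison in unsigned arithmetic so the range [0, 99] survives the conversion:

```c
#include <assert.h>
#include <string.h>

/* Standalone illustration; p and q point at caller-provided buffers. */
char *p, *q;

/* Signed guard: the range [INT_MIN, 99] becomes an anti-range after the
   implicit conversion to size_t, so the expander sees no upper bound.  */
void t_signed(int a)
{
    if (a < 100)
        memcpy(q, p, a);
}

/* Unsigned guard: negative values are excluded up front (they compare
   as huge unsigned values), so the range [0, 99] reaches the expander
   and the bounded inline expansion applies.  */
void t_unsigned(int a)
{
    if ((unsigned int)a < 100)
        memcpy(q, p, a);
}
```

Whether GCC of this era actually derives the range through the cast is a separate question; the sketch only shows why the signed form cannot have one.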
A libcall starts to win only for pretty large blocks (over 4GB, definitely), so it makes sense to inline functions with the range 0...4096 even though the cost tables tell us to expand a libcall for everything bigger than 140 bytes: if blocks are small we will get a noticeable win, and if blocks are big, we won't lose much.

I am considering assigning value ranges to the algorithms, too, for more sane choices in decide_alg.

I also think the misaligned move trick can/should be performed by move_by_pieces, and we ought to consider sane use of SSE - the current vector_loop with an unrolling factor of 4 seems a bit extreme. At least Bulldozer is happy with 2, and I would expect SSE moves to be especially useful for moving blocks with known size, where they are not used at all. Currently I disabled misaligned move prologues/epilogues for Michael's vector
Add value range support into memcpy/memset expansion
Hi,
this patch makes it possible to access value range info from setmem/movstr that I plan to use in the i386 memcpy/memset expansion code. It is all quite straightforward except that I need to deal with cases where the max size does not fit in a HOST_WIDE_INT, where I use the maximal value as a marker. It is then translated as a NULL pointer to the expander, which is a bit inconsistent with other places that use -1 as the marker of an unknown value.

I also think we lose some cases because of TER replacing the SSA_NAME by something else, but it seems to work in quite many cases. This can probably be tracked incrementally by disabling TER here or by finally getting away from expanding calls via the generic route.

Bootstrapped/regtested x86_64-linux, OK?

Honza

	* doc/md.texi (setmem, movstr): Update documentation.
	* builtins.c (determine_block_size): New function.
	(expand_builtin_memcpy): Use it and pass it to
	emit_block_move_hints.
	(expand_builtin_memset_args): Use it and pass it to
	set_storage_via_setmem.
	* expr.c (emit_block_move_via_movmem): Add min_size/max_size
	parameters; update call to expander.
	(emit_block_move_hints): Add min_size/max_size parameters.
	(clear_storage_hints): Likewise.
	(set_storage_via_setmem): Likewise.
	(clear_storage): Update.
	* expr.h (emit_block_move_hints, clear_storage_hints,
	set_storage_via_setmem): Update prototype.

Index: doc/md.texi
===================================================================
--- doc/md.texi	(revision 202968)
+++ doc/md.texi	(working copy)
@@ -5198,6 +5198,9 @@ destination and source strings are opera
 the expansion of this pattern should store in operand 0 the address
 in which the @code{NUL} terminator was stored in the destination
 string.
+This pattern also has several optional operands that are the same as in
+@code{setmem}.
+
 @cindex @code{setmem@var{m}} instruction pattern
 @item @samp{setmem@var{m}}
 Block set instruction.  The destination string is the first operand,
@@ -5217,6 +5220,8 @@ respectively.  The expected alignment di
 in a way that the blocks are not required to be aligned according to it in
 all cases. This expected alignment is also in bytes, just like operand 4.
 Expected size, when unknown, is set to @code{(const_int -1)}.
+Operand 7 is the minimal size of the block and operand 8 is the
+maximal size of the block (NULL if it can not be represented as CONST_INT).
 
 The use for multiple @code{setmem@var{m}} is as for @code{movmem@var{m}}.
 
Index: builtins.c
===================================================================
--- builtins.c	(revision 202968)
+++ builtins.c	(working copy)
@@ -3070,6 +3070,51 @@ builtin_memcpy_read_str (void *data, HOS
   return c_readstr (str + offset, mode);
 }
 
+/* LEN specifies the length of the block of the memcpy/memset operation.
+   Figure out its range and put it into MIN_SIZE/MAX_SIZE.  */
+
+static void
+determine_block_size (tree len, rtx len_rtx,
+		      unsigned HOST_WIDE_INT *min_size,
+		      unsigned HOST_WIDE_INT *max_size)
+{
+  if (CONST_INT_P (len_rtx))
+    {
+      *min_size = *max_size = UINTVAL (len_rtx);
+      return;
+    }
+  else
+    {
+      double_int min, max;
+      if (TREE_CODE (len) == SSA_NAME
+	  && get_range_info (len, &min, &max) == VR_RANGE)
+	{
+	  if (min.fits_uhwi ())
+	    *min_size = min.to_uhwi ();
+	  else
+	    *min_size = 0;
+	  if (max.fits_uhwi ())
+	    *max_size = max.to_uhwi ();
+	  else
+	    *max_size = (HOST_WIDE_INT)-1;
+	}
+      else
+	{
+	  if (host_integerp (TYPE_MIN_VALUE (TREE_TYPE (len)), 1))
+	    *min_size = tree_low_cst (TYPE_MIN_VALUE (TREE_TYPE (len)), 1);
+	  else
+	    *min_size = 0;
+	  if (host_integerp (TYPE_MAX_VALUE (TREE_TYPE (len)), 1))
+	    *max_size = tree_low_cst (TYPE_MAX_VALUE (TREE_TYPE (len)), 1);
+	  else
+	    *max_size = GET_MODE_MASK (GET_MODE (len_rtx));
+	}
+    }
+  gcc_checking_assert (*max_size <=
+		       (unsigned HOST_WIDE_INT)
+		       GET_MODE_MASK (GET_MODE (len_rtx)));
+}
+
 /* Expand a call EXP to the memcpy builtin.
    Return NULL_RTX if we failed, the caller should emit a normal call,
    otherwise try to get the result in TARGET, if convenient (and in
@@ -3092,6 +3137,8 @@ expand_builtin_memcpy (tree exp, rtx tar
   rtx dest_mem, src_mem, dest_addr, len_rtx;
   HOST_WIDE_INT expected_size = -1;
   unsigned int expected_align = 0;
+  unsigned HOST_WIDE_INT min_size;
+  unsigned HOST_WIDE_INT max_size;
 
   /* If DEST is not a pointer type, call the normal function.  */
   if (dest_align == 0)
@@ -3111,6 +3158,7 @@ expand_builtin_memcpy (tree exp, rtx tar
   dest_mem = get_memory_rtx (dest, len);
   set_mem_align (dest_mem, dest_align);
   len_rtx = expand_normal (len);
+
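For illustration, the fallback logic of determine_block_size in the patch above can be modeled in plain C. This is an editor's sketch: `block_size_bounds`, `have_range`, and `mode_mask` are hypothetical stand-ins for `get_range_info` and `GET_MODE_MASK`, and the double_int clamping is simplified to 64-bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model (sketch): if a VRP range is known, use it, clamping
   the upper bound so it never exceeds what the length's machine mode
   can represent; otherwise fall back to the 0 .. mode-mask bounds,
   mirroring the gcc_checking_assert in the patch.  */
static void block_size_bounds(int have_range, uint64_t lo, uint64_t hi,
                              uint64_t mode_mask,
                              uint64_t *min_size, uint64_t *max_size)
{
    if (have_range) {
        *min_size = lo;
        *max_size = hi < mode_mask ? hi : mode_mask;
    } else {
        *min_size = 0;
        *max_size = mode_mask;
    }
}
```

For the `if (a > 8 && a < 100)` testcase earlier in the thread, this would yield the bounds [9, 99]; with no range info, it degenerates to [0, mode_mask], i.e. "size unknown".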
Re: Add value range support into memcpy/memset expansion
Nice extension.  Test cases would be great to have.

thanks,
David

On Fri, Sep 27, 2013 at 7:50 AM, Jan Hubicka hubi...@ucw.cz wrote:
[... full quote of the patch mail above trimmed ...]