Re: Add value range support into memcpy/memset expansion

2013-11-19 Thread Richard Biener
On Tue, 19 Nov 2013, Jan Hubicka wrote:

 Hi,
 this patch fixes two issues with memcpy testcase - silences warning and 
 updates
 the template as suggested by Uros in the PR.  The testcase still fails on 
 i386.
 This is because we end up with:
 ;; Function t (t, funcdef_no=0, decl_uid=1763, symbol_order=2)
 
 t (unsigned int c)
 {
   void * b.0_4;
   void * a.1_5;
 
   bb 2:
  if (c_2(D) <= 9)
 goto bb 3;
   else
 goto bb 4;
 
   bb 3:
   b.0_4 = b;
   a.1_5 = a;
   memcpy (a.1_5, b.0_4, c_2(D));
 
   bb 4:
   return;
 
 }
 and we have no useful value range on c_2 because assert_expr was removed,
 while in 64bit version there is a cast in bb 3 that preserves the info.
 Solving this is an independent (and I guess not terribly easy) problem.

Hmm, I thought Jakub fixed this already (with the checking whether
there are any uses of c_2(D) before the conditional)?  Or is this
a different case?

Richard.

 Regtested x86_64-linux, will commit it shortly.
 
 Index: ChangeLog
 ===
 --- ChangeLog (revision 204984)
 +++ ChangeLog (working copy)
 @@ -1,3 +1,10 @@
 +2013-11-18  Jan Hubicka  j...@suse.cz
 + Uros Bizjak  ubiz...@gmail.com
 +
 + PR middle-end/59175
 + * gcc.target/i386/memcpy-2.c: Fix template;
 + add +1 so the testcase passes at 32bit.
 +
  2013-11-18  Dominique d'Humieres  domi...@lps.ens.fr
  
   * c-c++-common/cilk-plus/PS/reduction-3.c: Use stdlib.h.
 Index: gcc.target/i386/memcpy-2.c
 ===
 --- gcc.target/i386/memcpy-2.c(revision 204984)
 +++ gcc.target/i386/memcpy-2.c(working copy)
 @@ -1,11 +1,11 @@
  /* { dg-do compile } */
  /* { dg-options -O2 } */
 -/* Memcpy should be inlined because block size is known.  */
 -/* { dg-final { scan-assembler-not memcpy } } */
  void *a;
  void *b;
  t(unsigned int c)
  {
   if (c < 10)
 -memcpy (a,b,c);
 +__builtin_memcpy (a,b,c+1);
  }
 +/* Memcpy should be inlined because block size is known.  */
 +/* { dg-final { scan-assembler-not (jmp|call)\[\\t \]*memcpy } } */
 
 

-- 
Richard Biener rguent...@suse.de
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imendorffer


Re: Add value range support into memcpy/memset expansion

2013-11-19 Thread Jakub Jelinek
On Tue, Nov 19, 2013 at 09:50:56AM +0100, Richard Biener wrote:
  this patch fixes two issues with memcpy testcase - silences warning and 
  updates
  the template as suggested by Uros in the PR.  The testcase still fails on 
  i386.
  This is because we end up with:
  ;; Function t (t, funcdef_no=0, decl_uid=1763, symbol_order=2)
  
  t (unsigned int c)
  {
void * b.0_4;
void * a.1_5;
  
bb 2:
  if (c_2(D) <= 9)
  goto bb 3;
else
  goto bb 4;
  
bb 3:
b.0_4 = b;
a.1_5 = a;
memcpy (a.1_5, b.0_4, c_2(D));
  
bb 4:
return;
  
  }
  and we have no useful value range on c_2 because assert_expr was removed,
  while in 64bit version there is a cast in bb 3 that preserves the info.
  Solving this is an independent (and I guess not terribly easy) problem.
 
 Hmm, I thought Jakub fixed this already (with the checking whether
 there are any uses of c_2(D) before the conditional)?  Or is this
 a different case?

It was a different case, I've only handled the case where you have
if (c >= 10) __builtin_unreachable ();
and c doesn't have any immediate uses before this form of assertion.
In Honza's testcase there is no __builtin_unreachable, but instead a
return at that point; changing the value range of c_2(D) in his case
would be far more controversial - for __builtin_unreachable () it means
that if you pass c_2(D) >= 10 to the function and the if (c >= 10) test is
reached, it will be invalid (from this POV that transformation isn't
100% valid either, because there is no proof that if you call the function,
then that condition stmt will be executed).  While in the above case
c_2(D) >= 10 is completely valid.

Perhaps better might be for __builtin_unreachable () to just add
an internal pass-through call there (perhaps doing nothing at all)
that would just preserve the range and non-zero bits info.  We could perhaps
allow some extra optimizations on it (like, if we propagate to the first
argument anything other than some SSA_NAME, we just remove the call and
propagate it further); the question is what else we should do so that it
doesn't inhibit some optimizations unnecessarily.  But if the user
cared enough to insert an __builtin_unreachable () assertion, perhaps it
might be worth preserving that info.

That said, for Honza's case keeping around some internal pass through call
might be even far more expensive.

Jakub


Re: Add value range support into memcpy/memset expansion

2013-11-18 Thread H.J. Lu
On Sun, Nov 17, 2013 at 3:38 PM, Jan Hubicka hubi...@ucw.cz wrote:
 Hi,
 this is the version I committed. It also adds a testcase and enables the support 
 in i386 backend.

 Honza

 * doc/md.texi (setmem, movstr): Update documentation.
 * builtins.c (determine_block_size): New function.
 (expand_builtin_memcpy): Use it and pass it to
 emit_block_move_hints.
 (expand_builtin_memset_args): Use it and pass it to
 set_storage_via_setmem.
 * expr.c (emit_block_move_via_movmem): Add min_size/max_size 
 parameters;
 update call to expander.
 (emit_block_move_hints): Add min_size/max_size parameters.
 (clear_storage_hints): Likewise.
 (set_storage_via_setmem): Likewise.
 (clear_storage): Update.
 * expr.h (emit_block_move_hints, clear_storage_hints,
 set_storage_via_setmem): Update prototype.
 * i386.c (ix86_expand_set_or_movmem): Add bounds; export.
 (ix86_expand_movmem, ix86_expand_setmem): Remove.
 (ix86_expand_movmem, ix86_expand_setmem): Remove.
 * i386.md (movmem, setmem): Pass parameters.

 * testsuite/gcc.target/i386/memcpy-2.c: New testcase.

The new testcase fails for me on x86 and x86-64:

FAIL: gcc.target/i386/memcpy-2.c scan-assembler-not memcpy
FAIL: gcc.target/i386/memcpy-2.c (test for excess errors)

-- 
H.J.


Re: Add value range support into memcpy/memset expansion

2013-11-18 Thread H.J. Lu
On Mon, Nov 18, 2013 at 6:19 AM, H.J. Lu hjl.to...@gmail.com wrote:
 On Sun, Nov 17, 2013 at 3:38 PM, Jan Hubicka hubi...@ucw.cz wrote:
 Hi,
 this is the version I committed. It also adds a testcase and enables the support 
 in i386 backend.

 Honza

 * doc/md.texi (setmem, movstr): Update documentation.
 * builtins.c (determine_block_size): New function.
 (expand_builtin_memcpy): Use it and pass it to
 emit_block_move_hints.
 (expand_builtin_memset_args): Use it and pass it to
 set_storage_via_setmem.
 * expr.c (emit_block_move_via_movmem): Add min_size/max_size 
 parameters;
 update call to expander.
 (emit_block_move_hints): Add min_size/max_size parameters.
 (clear_storage_hints): Likewise.
 (set_storage_via_setmem): Likewise.
 (clear_storage): Update.
 * expr.h (emit_block_move_hints, clear_storage_hints,
 set_storage_via_setmem): Update prototype.
 * i386.c (ix86_expand_set_or_movmem): Add bounds; export.
 (ix86_expand_movmem, ix86_expand_setmem): Remove.
 (ix86_expand_movmem, ix86_expand_setmem): Remove.
 * i386.md (movmem, setmem): Pass parameters.

 * testsuite/gcc.target/i386/memcpy-2.c: New testcase.

 The new testcase fails for me on x86 and x86-64:

 FAIL: gcc.target/i386/memcpy-2.c scan-assembler-not memcpy
 FAIL: gcc.target/i386/memcpy-2.c (test for excess errors)


I got

[hjl@gnu-6 gcc]$ /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc/
/export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c
-fno-diagnostics-show-caret -fdiagnostics-color=never  -O2
-ffat-lto-objects -ffat-lto-objects -S  -m32 -o memcpy-2.s
/export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c:
In function ‘t’:
/export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c:10:5:
warning: incompatible implicit declaration of built-in function
‘memcpy’ [enabled by default]
[hjl@gnu-6 gcc]$ cat memcpy-2.s
	.file	"memcpy-2.c"
	.text
	.p2align 4,,15
	.globl	t
	.type	t, @function
t:
.LFB0:
	.cfi_startproc
	subl	$28, %esp
	.cfi_def_cfa_offset 32
	movl	32(%esp), %eax
	cmpl	$9, %eax
	ja	.L3
	movl	%eax, 8(%esp)
	movl	b, %eax
	movl	%eax, 4(%esp)
	movl	a, %eax
	movl	%eax, (%esp)
	call	memcpy
.L3:
	addl	$28, %esp
	.cfi_def_cfa_offset 4
	ret
	.cfi_endproc


/* { dg-final { scan-assembler-not memcpy } } */

will also match

	.file	"memcpy-2.c"

-- 
H.J.


Re: Add value range support into memcpy/memset expansion

2013-11-18 Thread Jan Hubicka
 
 I opened:
 
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59175

Thanks,
seems I mixed up the testcase file.  Will fix that soon.

Honza
 
 -- 
 H.J.


Re: Add value range support into memcpy/memset expansion

2013-11-18 Thread H.J. Lu
On Mon, Nov 18, 2013 at 9:16 AM, H.J. Lu hjl.to...@gmail.com wrote:
 On Mon, Nov 18, 2013 at 6:19 AM, H.J. Lu hjl.to...@gmail.com wrote:
 On Sun, Nov 17, 2013 at 3:38 PM, Jan Hubicka hubi...@ucw.cz wrote:
 Hi,
 this is the version I committed. It also adds a testcase and enables the support 
 in i386 backend.

 Honza

 * doc/md.texi (setmem, movstr): Update documentation.
 * builtins.c (determine_block_size): New function.
 (expand_builtin_memcpy): Use it and pass it to
 emit_block_move_hints.
 (expand_builtin_memset_args): Use it and pass it to
 set_storage_via_setmem.
 * expr.c (emit_block_move_via_movmem): Add min_size/max_size 
 parameters;
 update call to expander.
 (emit_block_move_hints): Add min_size/max_size parameters.
 (clear_storage_hints): Likewise.
 (set_storage_via_setmem): Likewise.
 (clear_storage): Update.
 * expr.h (emit_block_move_hints, clear_storage_hints,
 set_storage_via_setmem): Update prototype.
 * i386.c (ix86_expand_set_or_movmem): Add bounds; export.
 (ix86_expand_movmem, ix86_expand_setmem): Remove.
 (ix86_expand_movmem, ix86_expand_setmem): Remove.
 * i386.md (movmem, setmem): Pass parameters.

 * testsuite/gcc.target/i386/memcpy-2.c: New testcase.

 The new testcase fails for me on x86 and x86-64:

 FAIL: gcc.target/i386/memcpy-2.c scan-assembler-not memcpy
 FAIL: gcc.target/i386/memcpy-2.c (test for excess errors)


 I got

 [hjl@gnu-6 gcc]$ /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
 -B/export/build/gnu/gcc/build-x86_64-linux/gcc/
 /export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c
 -fno-diagnostics-show-caret -fdiagnostics-color=never  -O2
 -ffat-lto-objects -ffat-lto-objects -S  -m32 -o memcpy-2.s
 /export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c:
 In function ‘t’:
 /export/gnu/import/git/gcc/gcc/testsuite/gcc.target/i386/memcpy-2.c:10:5:
 warning: incompatible implicit declaration of built-in function
 ‘memcpy’ [enabled by default]
 [hjl@gnu-6 gcc]$ cat memcpy-2.s
 	.file	"memcpy-2.c"
 	.text
 	.p2align 4,,15
 	.globl	t
 	.type	t, @function
 t:
 .LFB0:
 	.cfi_startproc
 	subl	$28, %esp
 	.cfi_def_cfa_offset 32
 	movl	32(%esp), %eax
 	cmpl	$9, %eax
 	ja	.L3
 	movl	%eax, 8(%esp)
 	movl	b, %eax
 	movl	%eax, 4(%esp)
 	movl	a, %eax
 	movl	%eax, (%esp)
 	call	memcpy
 .L3:
 	addl	$28, %esp
 	.cfi_def_cfa_offset 4
 	ret
 	.cfi_endproc


 /* { dg-final { scan-assembler-not memcpy } } */

 will also match

 	.file	"memcpy-2.c"

 --
 H.J.

I opened:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59175

-- 
H.J.


Re: Add value range support into memcpy/memset expansion

2013-11-18 Thread Jan Hubicka
Hi,
this patch fixes two issues with memcpy testcase - silences warning and updates
the template as suggested by Uros in the PR.  The testcase still fails on i386.
This is because we end up with:
;; Function t (t, funcdef_no=0, decl_uid=1763, symbol_order=2)

t (unsigned int c)
{
  void * b.0_4;
  void * a.1_5;

  bb 2:
  if (c_2(D) <= 9)
goto bb 3;
  else
goto bb 4;

  bb 3:
  b.0_4 = b;
  a.1_5 = a;
  memcpy (a.1_5, b.0_4, c_2(D));

  bb 4:
  return;

}
and we have no useful value range on c_2 because assert_expr was removed,
while in 64bit version there is a cast in bb 3 that preserves the info.
Solving this is an independent (and I guess not terribly easy) problem.

Regtested x86_64-linux, will commit it shortly.

Index: ChangeLog
===
--- ChangeLog   (revision 204984)
+++ ChangeLog   (working copy)
@@ -1,3 +1,10 @@
+2013-11-18  Jan Hubicka  j...@suse.cz
+   Uros Bizjak  ubiz...@gmail.com
+
+   PR middle-end/59175
+   * gcc.target/i386/memcpy-2.c: Fix template;
+   add +1 so the testcase passes at 32bit.
+
 2013-11-18  Dominique d'Humieres  domi...@lps.ens.fr
 
* c-c++-common/cilk-plus/PS/reduction-3.c: Use stdlib.h.
Index: gcc.target/i386/memcpy-2.c
===
--- gcc.target/i386/memcpy-2.c  (revision 204984)
+++ gcc.target/i386/memcpy-2.c  (working copy)
@@ -1,11 +1,11 @@
 /* { dg-do compile } */
 /* { dg-options -O2 } */
-/* Memcpy should be inlined because block size is known.  */
-/* { dg-final { scan-assembler-not memcpy } } */
 void *a;
 void *b;
 t(unsigned int c)
 {
   if (c < 10)
-memcpy (a,b,c);
+__builtin_memcpy (a,b,c+1);
 }
+/* Memcpy should be inlined because block size is known.  */
+/* { dg-final { scan-assembler-not (jmp|call)\[\\t \]*memcpy } } */


Re: Add value range support into memcpy/memset expansion

2013-11-17 Thread Jan Hubicka
Hi,
this is the version I committed. It also adds a testcase and enables the support in 
i386 backend.

Honza

* doc/md.texi (setmem, movstr): Update documentation.
* builtins.c (determine_block_size): New function.
(expand_builtin_memcpy): Use it and pass it to
emit_block_move_hints.
(expand_builtin_memset_args): Use it and pass it to
set_storage_via_setmem.
* expr.c (emit_block_move_via_movmem): Add min_size/max_size parameters;
update call to expander.
(emit_block_move_hints): Add min_size/max_size parameters.
(clear_storage_hints): Likewise.
(set_storage_via_setmem): Likewise.
(clear_storage): Update.
* expr.h (emit_block_move_hints, clear_storage_hints,
set_storage_via_setmem): Update prototype.
* i386.c (ix86_expand_set_or_movmem): Add bounds; export.
(ix86_expand_movmem, ix86_expand_setmem): Remove.
(ix86_expand_movmem, ix86_expand_setmem): Remove.
* i386.md (movmem, setmem): Pass parameters.

* testsuite/gcc.target/i386/memcpy-2.c: New testcase.
Index: doc/md.texi
===
--- doc/md.texi (revision 204899)
+++ doc/md.texi (working copy)
@@ -5328,6 +5328,9 @@ destination and source strings are opera
 the expansion of this pattern should store in operand 0 the address in
 which the @code{NUL} terminator was stored in the destination string.
 
+This pattern also has several optional operands that are the same as in
+@code{setmem}.
+
 @cindex @code{setmem@var{m}} instruction pattern
 @item @samp{setmem@var{m}}
 Block set instruction.  The destination string is the first operand,
@@ -5347,6 +5350,8 @@ respectively.  The expected alignment di
 in a way that the blocks are not required to be aligned according to it in
 all cases. This expected alignment is also in bytes, just like operand 4.
 Expected size, when unknown, is set to @code{(const_int -1)}.
+Operand 7 is the minimal size of the block and operand 8 is the
+maximal size of the block (NULL if it cannot be represented as CONST_INT).
 
 The use for multiple @code{setmem@var{m}} is as for @code{movmem@var{m}}.
 
Index: builtins.c
===
--- builtins.c  (revision 204899)
+++ builtins.c  (working copy)
@@ -3095,6 +3095,51 @@ builtin_memcpy_read_str (void *data, HOS
   return c_readstr (str + offset, mode);
 }
 
+/* LEN specify length of the block of memcpy/memset operation.
+   Figure out its range and put it into MIN_SIZE/MAX_SIZE.  */
+
+static void
+determine_block_size (tree len, rtx len_rtx,
+ unsigned HOST_WIDE_INT *min_size,
+ unsigned HOST_WIDE_INT *max_size)
+{
+  if (CONST_INT_P (len_rtx))
+{
+  *min_size = *max_size = UINTVAL (len_rtx);
+  return;
+}
+  else
+{
+  double_int min, max;
+  if (TREE_CODE (len) == SSA_NAME
+      && get_range_info (len, &min, &max) == VR_RANGE)
+   {
+ if (min.fits_uhwi ())
+   *min_size = min.to_uhwi ();
+ else
+   *min_size = 0;
+ if (max.fits_uhwi ())
+   *max_size = max.to_uhwi ();
+ else
+   *max_size = (HOST_WIDE_INT)-1;
+   }
+  else
+   {
+ if (host_integerp (TYPE_MIN_VALUE (TREE_TYPE (len)), 1))
+   *min_size = tree_low_cst (TYPE_MIN_VALUE (TREE_TYPE (len)), 1);
+ else
+   *min_size = 0;
+ if (host_integerp (TYPE_MAX_VALUE (TREE_TYPE (len)), 1))
+   *max_size = tree_low_cst (TYPE_MAX_VALUE (TREE_TYPE (len)), 1);
+ else
+   *max_size = GET_MODE_MASK (GET_MODE (len_rtx));
+   }
+}
+  gcc_checking_assert (*max_size <=
+                       (unsigned HOST_WIDE_INT)
+                       GET_MODE_MASK (GET_MODE (len_rtx)));
+}
+
 /* Expand a call EXP to the memcpy builtin.
Return NULL_RTX if we failed, the caller should emit a normal call,
otherwise try to get the result in TARGET, if convenient (and in
@@ -3117,6 +3162,8 @@ expand_builtin_memcpy (tree exp, rtx tar
   rtx dest_mem, src_mem, dest_addr, len_rtx;
   HOST_WIDE_INT expected_size = -1;
   unsigned int expected_align = 0;
+  unsigned HOST_WIDE_INT min_size;
+  unsigned HOST_WIDE_INT max_size;
 
   /* If DEST is not a pointer type, call the normal function.  */
   if (dest_align == 0)
@@ -3136,6 +3183,7 @@ expand_builtin_memcpy (tree exp, rtx tar
   dest_mem = get_memory_rtx (dest, len);
   set_mem_align (dest_mem, dest_align);
   len_rtx = expand_normal (len);
+  determine_block_size (len, len_rtx, min_size, max_size);
   src_str = c_getstr (src);
 
   /* If SRC is a string constant and block move would be done
@@ -3164,7 +3212,8 @@ expand_builtin_memcpy (tree exp, rtx tar
   dest_addr = emit_block_move_hints (dest_mem, src_mem, len_rtx,
 CALL_EXPR_TAILCALL 

Re: Add value range support into memcpy/memset expansion

2013-11-15 Thread Jan Hubicka
 Hi,
 this patch makes it possible to access value range info from setmem/movstr 
 that
 I plan to use in i386 memcpy/memset expansion code.  It is all quite
 straightforward except that I need to deal with cases where the max size does
 not fit in HOST_WIDE_INT, where I use the maximal value as a marker.  It is
 then translated as a NULL pointer to the expander, which is a bit inconsistent
 with other places that use -1 as the marker of an unknown value.
 
 I also think we lose some cases because of TER replacing out the SSA_NAME by
 something else, but it seems to work in quite many cases. This can be probably
 tracked incrementally by disabling TER here or finally getting away from
 expanding calls via the generic route.
 
 Bootstrapped/regtested x86_64-linux, OK?
 
 Honza
 
   * doc/md.texi (setmem, movstr): Update documentation.
   * builtins.c (determine_block_size): New function.
   (expand_builtin_memcpy): Use it and pass it to
   emit_block_move_hints.
   (expand_builtin_memset_args): Use it and pass it to
   set_storage_via_setmem.
   * expr.c (emit_block_move_via_movmem): Add min_size/max_size parameters;
   update call to expander.
   (emit_block_move_hints): Add min_size/max_size parameters.
   (clear_storage_hints): Likewise.
   (set_storage_via_setmem): Likewise.
   (clear_storage): Update.
   * expr.h (emit_block_move_hints, clear_storage_hints,
   set_storage_via_setmem): Update prototype.

Ping...
http://gcc.gnu.org/ml/gcc-patches/2013-09/msg02011.html

Honza


Re: Add value range support into memcpy/memset expansion

2013-11-15 Thread Richard Biener
On Fri, 15 Nov 2013, Jan Hubicka wrote:

  Hi,
  this patch makes it possible to access value range info from setmem/movstr 
  that
  I plan to use in i386 memcpy/memset expansion code.  It is all quite
  straightforward except that I need to deal with cases where the max size does
  not fit in HOST_WIDE_INT, where I use the maximal value as a marker.  It is
  then translated as a NULL pointer to the expander, which is a bit
  inconsistent with other places that use -1 as the marker of an unknown value.
  
  I also think we lose some cases because of TER replacing out the SSA_NAME by
  something else, but it seems to work in quite many cases. This can be 
  probably
  tracked incrementally by disabling TER here or finally getting away from
  expanding calls via the generic route.
  
  Bootstrapped/regtested x86_64-linux, OK?
  
  Honza
  
  * doc/md.texi (setmem, movstr): Update documentation.
  * builtins.c (determine_block_size): New function.
  (expand_builtin_memcpy): Use it and pass it to
  emit_block_move_hints.
  (expand_builtin_memset_args): Use it and pass it to
  set_storage_via_setmem.
  * expr.c (emit_block_move_via_movmem): Add min_size/max_size parameters;
  update call to expander.
  (emit_block_move_hints): Add min_size/max_size parameters.
  (clear_storage_hints): Likewise.
  (set_storage_via_setmem): Likewise.
  (clear_storage): Update.
  * expr.h (emit_block_move_hints, clear_storage_hints,
  set_storage_via_setmem): Update prototype.
 
 Ping...
 http://gcc.gnu.org/ml/gcc-patches/2013-09/msg02011.html

Ok - please make sure it still works after all the header re-org ;)

Thanks,
Richard.


Re: Add value range support into memcpy/memset expansion

2013-09-29 Thread Jan Hubicka
  We now produce:
  	movq	b(%rip), %rsi
  	movq	a(%rip), %rcx
  	movq	(%rsi), %rax	- first 8 bytes are moved
  	leaq	8(%rcx), %rdi
  	andq	$-8, %rdi	- dest is aligned
  	movq	%rax, (%rcx)
  	movq	132(%rsi), %rax	- last 8 bytes are moved
  	movq	%rax, 132(%rcx)
  	subq	%rdi, %rcx	- alignment is subtracted from count
 
  	subq	%rcx, %rsi	- source is aligned
 
 This (source aligned) is not always true, but nevertheless the
 sequence is very tight.

Yep, sure, it is aligned only if source-dest is aligned, but that is the best we
can ask for.
  Unforutnately the following testcase:
  char *p,*q;
  t(int a)
  {
  if (a < 100)
  memcpy(q,p,a);
 
  }
  Won't get inlined.  This is because A is known to be smaller than 100, which
  results in an anti-range after conversion to size_t.  This anti-range allows
  very large values (above INT_MAX) and thus we do not know the block size.
  I am not sure if the sane range can be recovered somehow.  If not, maybe
  this is common enough to add support for probable upper bound parameter to
  the template.
 
 Do we know if there is real code that intentionally does that other
 than security flaws as result of improperly done range check?

I do not think so.
 
 I think by default GCC should assume the memcpy size range is (0, 100)
 here with perhaps an option to override it.

Indeed, this is what I was suggesting.  The problem is what to pass down to the
expanders as a value range.

We either need to update the documentation of the expanders to say the ranges
are just highly probable - and I do not want to do that, since I want to use
the ranges for move_by_pieces, too.  So I think we will have to introduce two
upper bound parameters - one sure and the other very likely - if there is no
other solution.

We play similar tricks in niter code.

Honza


Re: Add value range support into memcpy/memset expansion

2013-09-29 Thread Michael V. Zolotukhin
Hi Jan,

 I also think the misaligned move trick can/should be performed by
 move_by_pieces and we ought to consider sane use of SSE - current vector_loop
 with unrolling factor of 4 seems bit extreme.  At least buldozer is happy with
 2 and I would expect SSE moves to be especially useful for moving blocks with
 known size where they are not used at all.
 
 Currently I disabled misaligned move prologues/epilogues for Michael's vector
 loop path since they ends up longer than the traditional code (that use loop
 for epilogue)
Prologues could use this technique even with vector_loop, as they actually
don't have a loop.
As for epilogues - have you tried to use misaligned vector_moves (movdqu)?  It
looks to me that we need approx. the same amount of instructions in vector-loop
and in usual-loop epilogues, if we use vector-instructions in vector-loop
epilogue.

 Comments are welcome.
BTW, maybe we could generalize expand_small_movmem a bit and make a separate
expanding strategy out of it?  It will expand a memmov with no loop (and
probably no alignment prologue) - just with the biggest available moves.  Also,
a cost model could be added here to make decisions on when we actually want to
align the moves.  Here is a couple of examples of that:

memcpy (a, b, 32); // alignment is unknown
will expand to
  movdqu a, %xmm0
  movdqu a+16, %xmm1
  movdqu %xmm0, b
  movdqu %xmm1, b+16
memcpy (a, b, 32); // alignment is known and equals 64bit
will expand to
a)
  movdqu a, %xmm0
  movdqu a+16, %xmm1
  movdqu %xmm0, b
  movdqu %xmm1, b+16
or b)
  movq	a,   %xmm0
  movdqa	a+8, %xmm1
  movq	a+24,%xmm2
  movq	%xmm0, b
  movdqa	%xmm1, b+8
  movq	%xmm2, b+24

We would compute the total cost of both variants and choose the best - for the
computation we need just the cost of aligned and misaligned moves.

This strategy is actually pretty similar to move_by_pieces, but as it has much
more target-specific information, it would be able to produce much more
efficient code.

And one more question - while working on vector_loop for memset I tried to merge
many of the movmem and setmem routines (e.g. instead of expand_movmem_prologue and
expand_setmem_prologue I made a single routine
expand_movmem_or_setmem_prologue).  What do you think, is it a good idea?  It
reduces code size in i386.c, but slightly complicates the code.  I'll send a
patch shortly, as soon as the testing completes.

Thanks, Michael


Re: Add value range support into memcpy/memset expansion

2013-09-29 Thread Jan Hubicka
 Hi Jan,
 
  I also think the misaligned move trick can/should be performed by
  move_by_pieces and we ought to consider sane use of SSE - current 
  vector_loop
  with unrolling factor of 4 seems bit extreme.  At least buldozer is happy 
  with
  2 and I would expect SSE moves to be especially useful for moving blocks 
  with
  known size where they are not used at all.
  
  Currently I disabled misaligned move prologues/epilogues for Michael's 
  vector
  loop path since they ends up longer than the traditional code (that use loop
  for epilogue)
 Prologues could use this techniques even with vector_loop, as they actually
 don't have a loop.

Where the new prologues lose is the fact that we need to handle all sizes
smaller than SIZE_NEEDED.  This is 64 bytes, which leads to a variant for
32..64, 16..32, 8..16, 4..8 and the tail.  It is quite a lot of code.

When the block is known to be greater than 64, this is also a win, but my
current patch does not fine-tune this yet.
Similarly, misaligned moves are a win when the size is known, alignment is not
performed and the normal fixed-size epilogue needs more than one move, or when
the alignment is known but the offset is non-zero.
It will need a bit of tweaking to handle all the paths well - it is the usual
problem with stringops: they get way too complex as the number of factors
increases.

That is why I think we may consider a vector loop with fewer than 4 unrollings.
The AMD optimization manual recommends two for Bulldozer... Is there a
difference between 4 and 2 for Atom?

To be honest, I am not quite sure where the constant of 4 comes from.  I think
I introduced it a long time ago for K8, where it apparently got some extra % of
performance.

It is used for larger blocks only on PPro.  AMD chips prefer it for small
blocks, apparently because they prefer a loop-less sequence.

 As for epilogues - have you tried to use misaligned vector_moves (movdqu)?  It
 looks to me that we need approx. the same amount of instructions in 
 vector-loop
 and in usual-loop epilogues, if we use vector-instructions in vector-loop
 epilogue.

Yes, the code is in place for this.  You can just remove the check for
size_needed being smaller than 32 and it will produce the movdqu sequence for
you (I tested it on the vector loop testcase in the testsuite).

The code will also use SSE for unrolled_loop prologue expansion at least for
memcpy (for memset it does not have broadcast value so it should skip it).

 
  Comments are welcome.
 BTW, maybe we could generalize expand_small_movmem a bit and make a separate
 expanding strategy out of it?  It will expand a memmov with no loop (and
 probably no alignment prologue) - just with the biggest available moves.  
 Also,
 a cost model could be added here to make decisions on when we actually want to
 align the moves.  Here is a couple of examples of that:
 
 memcpy (a, b, 32); // alignment is unknown
 will expand to
   movdqu a, %xmm0
   movdqu a+16, %xmm1
   movdqu %xmm0, b
   movdqu %xmm1, b+16
 memcpy (a, b, 32); // alignment is known and equals 64bit
 will expand to
 a)
   movdqu a, %xmm0
   movdqu a+16, %xmm1
   movdqu %xmm0, b
   movdqu %xmm1, b+16
 or b)
   movq	a,   %xmm0
   movdqa	a+8, %xmm1
   movq	a+24,%xmm2
   movq	%xmm0, b
   movdqa	%xmm1, b+8
   movq	%xmm2, b+24
 
 We would compute the total cost of both variant and choose the best - for
 computation we need just a cost of aligned and misaligned moves.
 
 This strategy is actually pretty similar to move_by_pieces, but as it have 
 much
 more target-specific information, it would be able to produce much more
 effective code.

I was actually thinking more along the lines of teaching move_by_pieces to do
the tricks.  It seems there is not that much x86-specific knowledge in here,
and other architectures will also benefit from it.  We can also enable it when
the value range is close enough.

I plan to look into it today or tomorrow - revisit your old patch to
move_by_pieces and see how much extra API I need to get move_by_pieces to do
what expand_small_movmem does.
 
 And one more question - in a work on vector_loop for memset I tried to merge
 many of movmem and setmem routines (e.g. instead of expand_movmem_prologue and
 expand_setmem_prologue I made a single routine
 expand_movmem_or_setmem_prologue).  What do you think, is it a good idea?  It
 reduces code size in i386.c, but slightly complicates the code.  I'll send a
 patch shortly, as soon as the testing completes.

I would like to see it.  I am not too thrilled by the duplication.  My original
motivation for that was to keep the number of code paths through the expanders
under control.  We already have many of them (and it is easy to get wrong code)
as the different variants of prologues/epilogues and main loops are not exactly
the same, and thus the code is not as modular as I would like.  I am not sure
whether adding differences between memset and memmove is going to add too many
extra cases to think of.  Maybe not, like for the misaligned 

Re: Add value range support into memcpy/memset expansion

2013-09-28 Thread Jan Hubicka
 Nice extension. Test cases would be great to have.
For those you need the i386 changes to actually use the info.  I will post that
after some cleanup and additional testing.

Honza


Re: Add value range support into memcpy/memset expansion

2013-09-28 Thread Jan Hubicka
  Nice extension. Test cases would be great to have.
 For those you need the i386 changes to actually use the info.  I will post that
 after some cleanup and additional testing.

Hi,
since I already caught your attention, here is the target specific part for
comments.

this patch implements memcpy/memset prologues and epilogues as suggested by
Ondrej Bilka.  His glibc implementation uses an IMO very smart trick with a
single misaligned move to copy the first N and the last N bytes of the block.
The remainder of the block is then copied by the usual loop that gets aligned
to the proper address.

This leads to partial memory stall, but that is handled well by modern x86
chips.

For example in the following testcase:
char *a;
char *b;
t1()
{
  memcpy (a,b,140);
}

We now produce:
	movq	b(%rip), %rsi
	movq	a(%rip), %rcx
	movq	(%rsi), %rax	- first 8 bytes are moved
	leaq	8(%rcx), %rdi
	andq	$-8, %rdi	- dest is aligned
	movq	%rax, (%rcx)
	movq	132(%rsi), %rax	- last 8 bytes are moved
	movq	%rax, 132(%rcx)
	subq	%rdi, %rcx	- alignment is subtracted from count
	subq	%rcx, %rsi	- source is aligned
	addl	$140, %ecx	- normal copying of 8 byte chunks
	shrl	$3, %ecx
	rep; movsq
	ret

Instead of:
	movq	a(%rip), %rdi
	movq	b(%rip), %rsi
	movl	$140, %eax
	testb	$1, %dil
	jne	.L28
	testb	$2, %dil
	jne	.L29
.L3:
	testb	$4, %dil
	jne	.L30
.L4:
	movl	%eax, %ecx
	xorl	%edx, %edx
	shrl	$3, %ecx
	testb	$4, %al
	rep movsq
	je	.L5
	movl	(%rsi), %edx
	movl	%edx, (%rdi)
	movl	$4, %edx
.L5:
	testb	$2, %al
	je	.L6
	movzwl	(%rsi,%rdx), %ecx
	movw	%cx, (%rdi,%rdx)
	addq	$2, %rdx
.L6:
	testb	$1, %al
	je	.L8
	movzbl	(%rsi,%rdx), %eax
	movb	%al, (%rdi,%rdx)
.L8:
	rep
	ret
	.p2align 4,,10
	.p2align 3
.L28:
	movzbl	(%rsi), %eax
	addq	$1, %rsi
	movb	%al, (%rdi)
	addq	$1, %rdi
	movl	$139, %eax
	testb	$2, %dil
	je	.L3
	.p2align 4,,10
	.p2align 3
.L29:
	movzwl	(%rsi), %edx
	subl	$2, %eax
	addq	$2, %rsi
	movw	%dx, (%rdi)
	addq	$2, %rdi
	testb	$4, %dil
	je	.L4
	.p2align 4,,10
	.p2align 3
.L30:
	movl	(%rsi), %edx
	subl	$4, %eax
	addq	$4, %rsi
	movl	%edx, (%rdi)
	addq	$4, %rdi
	jmp	.L4

With the proposed value range code we can now take advantage of it even for
non-constant moves.  Somewhat artificial testcase:

char *p,*q;
t(unsigned int a)
{
  if (a > 8 && a < 100)
memcpy(q,p,a);

}

This still generates pretty much the same code (while the -minline-all-stringops
code on mainline is just horrible):
leal    -9(%rdi), %edx
movl    %edi, %eax
cmpl    $90, %edx
jbe .L5
rep; ret
.p2align 4,,10
.p2align 3
.L5:
movq    p(%rip), %rsi
movq    q(%rip), %rcx
movq    (%rsi), %rdx
movq    %rdx, (%rcx)
movl    %edi, %edx
movq    -8(%rsi,%rdx), %rdi
movq    %rdi, -8(%rcx,%rdx)
leaq    8(%rcx), %rdi
andq    $-8, %rdi
subq    %rdi, %rcx
subq    %rcx, %rsi
addl    %eax, %ecx
shrl    $3, %ecx
rep; movsq
ret

Of course it is quite common to know only an upper bound on the block.  In this
case we need to generate a prologue for the first few bytes:
char *p,*q;
t(unsigned int a)
{
  if (a < 100)
memcpy(q,p,a);

}
t:
.LFB0:
.cfi_startproc
cmpl    $99, %edi
jbe .L15
.L7:
rep; ret
.p2align 4,,10
.p2align 3
.L15:
cmpl    $8, %edi
movq    q(%rip), %rdx
movq    p(%rip), %rsi
jae .L3
testb   $4, %dil
jne .L16
testl   %edi, %edi
je  .L7
movzbl  (%rsi), %eax
testb   $2, %dil
movb    %al, (%rdx)
je  .L7
movl    %edi, %edi
movzwl  -2(%rsi,%rdi), %eax
movw    %ax, -2(%rdx,%rdi)
ret
.p2align 4,,10
.p2align 3
.L3:
movq    (%rsi), %rax
movq    %rax, (%rdx)
movl    %edi, %eax
movq    -8(%rsi,%rax), %rcx
movq    %rcx, -8(%rdx,%rax)
leaq    8(%rdx), %rax
andq    $-8, %rax
subq    %rax, %rdx
addl    %edx, %edi
subq    %rdx, %rsi
shrl    $3, %edi
movl    %edi, %ecx
movq    %rax, %rdi
rep; movsq
ret
.p2align 4,,10
.p2align 3
.L16:
movl    (%rsi), %eax
movl    %edi, %edi
movl    %eax, (%rdx)
movl    -4(%rsi,%rdi), %eax
movl  

Re: Add value range support into memcpy/memset expansion

2013-09-28 Thread Xinliang David Li
On Sat, Sep 28, 2013 at 3:05 PM, Jan Hubicka hubi...@ucw.cz wrote:
  Nice extension. Test cases would be great to have.
 For those you need the i386 changes to actually use the info.  I will post that
 after some cleanup and additional testing.

 Hi,
 since I already caught your attention, here is the target specific part for
 comments.

 This patch implements memcpy/memset prologues and epilogues as suggested by
 Ondrej Bilka.  His glibc implementation uses an IMO very smart trick: a single
 misaligned move copies the first N and the last N bytes of the block.  The
 remainder of the block is then copied by the usual loop, which gets aligned to
 the proper address.

 This leads to partial memory stall, but that is handled well by modern x86
 chips.

 For example in the following testcase:
 char *a;
 char *b;
 t1()
 {
   memcpy (a,b,140);
 }

 We now produce:
 movq    b(%rip), %rsi
 movq    a(%rip), %rcx
 movq    (%rsi), %rax     <- first 8 bytes are moved
 leaq    8(%rcx), %rdi
 andq    $-8, %rdi        <- dest is aligned
 movq    %rax, (%rcx)
 movq    132(%rsi), %rax  <- last 8 bytes are moved
 movq    %rax, 132(%rcx)
 subq    %rdi, %rcx       <- alignment is subtracted from count

 subq    %rcx, %rsi       <- source is aligned

This (source aligned) is not always true, but nevertheless the
sequence is very tight.

 addl    $140, %ecx       <- normal copying of 8 byte chunks
 shrl    $3, %ecx
 rep; movsq
 ret

 Of course it is quite common to know only an upper bound on the block.  In this
 case we need to generate a prologue for the first few bytes:
 char *p,*q;
 t(unsigned int a)
 {
   if (a < 100)
 memcpy(q,p,a);

 }
 t:
 .LFB0:
 .cfi_startproc
 cmpl    $99, %edi
 jbe .L15
 .L7:
 rep; ret
 .p2align 4,,10
 .p2align 3
 .L15:
 cmpl    $8, %edi
 movq    q(%rip), %rdx
 movq    p(%rip), %rsi
 jae .L3
 testb   $4, %dil
 jne .L16
 testl   %edi, %edi
 je  .L7
 movzbl  (%rsi), %eax
 testb   $2, %dil
 movb    %al, (%rdx)
 je  .L7
 movl    %edi, %edi
 movzwl  -2(%rsi,%rdi), %eax
 movw    %ax, -2(%rdx,%rdi)
 ret
 .p2align 4,,10
 .p2align 3
 .L3:
 movq    (%rsi), %rax
 movq    %rax, (%rdx)
 movl    %edi, %eax
 movq    -8(%rsi,%rax), %rcx
 movq    %rcx, -8(%rdx,%rax)
 leaq    8(%rdx), %rax
 andq    $-8, %rax
 subq    %rax, %rdx
 addl    %edx, %edi
 subq    %rdx, %rsi
 shrl    $3, %edi
 movl    %edi, %ecx
 movq    %rax, %rdi
 rep; movsq
 ret
 .p2align 4,,10
 .p2align 3
 .L16:
 movl    (%rsi), %eax
 movl    %edi, %edi
 movl    %eax, (%rdx)
 movl    -4(%rsi,%rdi), %eax
 movl    %eax, -4(%rdx,%rdi)
 ret
 .cfi_endproc
 .LFE0:

 Mainline would output a libcall here (because the size is unknown to it), and
 with inlining of all stringops it winds up with 210 bytes of code instead of
 the 142 bytes above.

 Unfortunately the following testcase:
 char *p,*q;
 t(int a)
 {
   if (a < 100)
 memcpy(q,p,a);

 }
 Won't get inlined.  This is because A is only known to be smaller than 100,
 which results in an anti-range after conversion to size_t.  This anti-range
 allows very large values (above INT_MAX), and thus we do not know the block
 size.  I am not sure whether the sane range can be recovered somehow.  If not,
 maybe this case is common enough to add support for a probable-upper-bound
 parameter to the template.
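
To make the conversion issue above concrete, here is a minimal hedged
illustration; the helper is hypothetical and simply performs the implicit
int-to-size_t conversion from the quoted testcase:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper showing the conversion memcpy's prototype applies
   to the signed argument: for `a < 100`, negative values map to huge
   size_t values, so the known range becomes an anti-range with no
   useful upper bound.  */
static size_t
block_size_as_seen_by_memcpy (int a)
{
  return (size_t) a;  /* -1 becomes SIZE_MAX */
}
```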

Do we know if there is real code that intentionally does that, other
than security flaws resulting from an improperly done range check?

I think by default GCC should assume the memcpy size range is (0, 100)
here with perhaps an option to override it.

thanks,

David


 Use of value ranges makes it harder to choose the proper algorithm, since the
 average size is no longer known.  For the moment I take the simple average of
 the lower and upper bound, but this is wrong.
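
The midpoint heuristic mentioned above could be sketched as follows; this is a
hypothetical illustration, not the patch's actual code, and the function name
is mine:

```c
#include <assert.h>

/* Hedged sketch of the expected-size guess described above: with only a
   [min, max] range known, take the midpoint of the bounds.  Computed as
   min + (max - min) / 2 to avoid overflow of min + max.  As the author
   notes, this is wrong when real block sizes are skewed toward one end
   of the range.  */
static unsigned long long
expected_block_size (unsigned long long min_size, unsigned long long max_size)
{
  return min_size + (max_size - min_size) / 2;
}
```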

 A libcall starts to win only for pretty large blocks (over 4GB, definitely),
 so it makes sense to inline functions with range 0...4096 even though the cost
 tables tell us to expand a libcall for everything bigger than 140 bytes: if
 blocks are small we will get a noticeable win, and if blocks are big, we won't
 lose much.

 I am considering assigning value ranges to the algorithms, too, for more sane
 choices in decide_alg.

 I also think the misaligned move trick can/should be performed by
 move_by_pieces, and we ought to consider sane use of SSE - the current
 vector_loop with an unrolling factor of 4 seems a bit extreme.  At least
 Bulldozer is happy with 2, and I would expect SSE moves to be especially
 useful for moving blocks of known size, where they are currently not used
 at all.

 Currently I disabled misaligned move prologues/epilogues for Michael's vector
 

Add value range support into memcpy/memset expansion

2013-09-27 Thread Jan Hubicka
Hi,
this patch makes it possible to access value range info from setmem/movstr that
I plan to use in the i386 memcpy/memset expansion code.  It is all quite
straightforward, except that I need to deal with cases where the max size does
not fit in a HOST_WIDE_INT; there I use the maximal value as a marker.  It is
then translated as a NULL pointer to the expander, which is a bit inconsistent
with other places that use -1 as the marker of an unknown value.

I also think we lose some cases because of TER replacing out the SSA_NAME by
something else, but it seems to work in quite many cases.  This can probably be
addressed incrementally by disabling TER here, or by finally getting away from
expanding calls via the generic route.

Bootstrapped/regtested x86_64-linux, OK?

Honza

* doc/md.texi (setmem, movstr): Update documentation.
* builtins.c (determine_block_size): New function.
(expand_builtin_memcpy): Use it and pass it to
emit_block_move_hints.
(expand_builtin_memset_args): Use it and pass it to
set_storage_via_setmem.
* expr.c (emit_block_move_via_movmem): Add min_size/max_size parameters;
update call to expander.
(emit_block_move_hints): Add min_size/max_size parameters.
(clear_storage_hints): Likewise.
(set_storage_via_setmem): Likewise.
(clear_storage): Update.
* expr.h (emit_block_move_hints, clear_storage_hints,
set_storage_via_setmem): Update prototype.

Index: doc/md.texi
===
--- doc/md.texi (revision 202968)
+++ doc/md.texi (working copy)
@@ -5198,6 +5198,9 @@ destination and source strings are opera
 the expansion of this pattern should store in operand 0 the address in
 which the @code{NUL} terminator was stored in the destination string.
 
+This pattern also has several optional operands that are the same as in
+@code{setmem}.
+
 @cindex @code{setmem@var{m}} instruction pattern
 @item @samp{setmem@var{m}}
 Block set instruction.  The destination string is the first operand,
@@ -5217,6 +5220,8 @@ respectively.  The expected alignment di
 in a way that the blocks are not required to be aligned according to it in
 all cases. This expected alignment is also in bytes, just like operand 4.
 Expected size, when unknown, is set to @code{(const_int -1)}.
+Operand 7 is the minimal size of the block and operand 8 is the
+maximal size of the block (NULL if it cannot be represented as CONST_INT).
 
 The use for multiple @code{setmem@var{m}} is as for @code{movmem@var{m}}.
 
Index: builtins.c
===
--- builtins.c  (revision 202968)
+++ builtins.c  (working copy)
@@ -3070,6 +3070,51 @@ builtin_memcpy_read_str (void *data, HOS
   return c_readstr (str + offset, mode);
 }
 
+/* LEN specifies the length of the block of the memcpy/memset operation.
+   Figure out its range and put it into MIN_SIZE/MAX_SIZE.  */
+
+static void
+determine_block_size (tree len, rtx len_rtx,
+ unsigned HOST_WIDE_INT *min_size,
+ unsigned HOST_WIDE_INT *max_size)
+{
+  if (CONST_INT_P (len_rtx))
+{
+  *min_size = *max_size = UINTVAL (len_rtx);
+  return;
+}
+  else
+{
+  double_int min, max;
+  if (TREE_CODE (len) == SSA_NAME
+      && get_range_info (len, &min, &max) == VR_RANGE)
+   {
+ if (min.fits_uhwi ())
+   *min_size = min.to_uhwi ();
+ else
+   *min_size = 0;
+ if (max.fits_uhwi ())
+   *max_size = max.to_uhwi ();
+ else
+   *max_size = (HOST_WIDE_INT)-1;
+   }
+  else
+   {
+ if (host_integerp (TYPE_MIN_VALUE (TREE_TYPE (len)), 1))
+   *min_size = tree_low_cst (TYPE_MIN_VALUE (TREE_TYPE (len)), 1);
+ else
+   *min_size = 0;
+ if (host_integerp (TYPE_MAX_VALUE (TREE_TYPE (len)), 1))
+   *max_size = tree_low_cst (TYPE_MAX_VALUE (TREE_TYPE (len)), 1);
+ else
+   *max_size = GET_MODE_MASK (GET_MODE (len_rtx));
+   }
+}
+  gcc_checking_assert (*max_size <=
+		       (unsigned HOST_WIDE_INT)
+			  GET_MODE_MASK (GET_MODE (len_rtx)));
+}
+
 /* Expand a call EXP to the memcpy builtin.
Return NULL_RTX if we failed, the caller should emit a normal call,
otherwise try to get the result in TARGET, if convenient (and in
@@ -3092,6 +3137,8 @@ expand_builtin_memcpy (tree exp, rtx tar
   rtx dest_mem, src_mem, dest_addr, len_rtx;
   HOST_WIDE_INT expected_size = -1;
   unsigned int expected_align = 0;
+  unsigned HOST_WIDE_INT min_size;
+  unsigned HOST_WIDE_INT max_size;
 
   /* If DEST is not a pointer type, call the normal function.  */
   if (dest_align == 0)
@@ -3111,6 +3158,7 @@ expand_builtin_memcpy (tree exp, rtx tar
   dest_mem = get_memory_rtx (dest, len);
   set_mem_align (dest_mem, dest_align);
   len_rtx = expand_normal (len);
+  

Re: Add value range support into memcpy/memset expansion

2013-09-27 Thread Xinliang David Li
Nice extension. Test cases would be great to have.

thanks,

David

On Fri, Sep 27, 2013 at 7:50 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Hi,
 this patch makes it possible to access value range info from setmem/movstr
 that I plan to use in the i386 memcpy/memset expansion code.  It is all quite
 straightforward, except that I need to deal with cases where the max size does
 not fit in a HOST_WIDE_INT; there I use the maximal value as a marker.  It is
 then translated as a NULL pointer to the expander, which is a bit inconsistent
 with other places that use -1 as the marker of an unknown value.

 I also think we lose some cases because of TER replacing out the SSA_NAME by
 something else, but it seems to work in quite many cases.  This can probably be
 addressed incrementally by disabling TER here, or by finally getting away from
 expanding calls via the generic route.

 Bootstrapped/regtested x86_64-linux, OK?

 Honza

 * doc/md.texi (setmem, movstr): Update documentation.
 * builtins.c (determine_block_size): New function.
 (expand_builtin_memcpy): Use it and pass it to
 emit_block_move_hints.
 (expand_builtin_memset_args): Use it and pass it to
 set_storage_via_setmem.
 * expr.c (emit_block_move_via_movmem): Add min_size/max_size 
 parameters;
 update call to expander.
 (emit_block_move_hints): Add min_size/max_size parameters.
 (clear_storage_hints): Likewise.
 (set_storage_via_setmem): Likewise.
 (clear_storage): Update.
 * expr.h (emit_block_move_hints, clear_storage_hints,
 set_storage_via_setmem): Update prototype.

 Index: doc/md.texi
 ===
 --- doc/md.texi (revision 202968)
 +++ doc/md.texi (working copy)
 @@ -5198,6 +5198,9 @@ destination and source strings are opera
  the expansion of this pattern should store in operand 0 the address in
  which the @code{NUL} terminator was stored in the destination string.

 +This pattern also has several optional operands that are the same as in
 +@code{setmem}.
 +
  @cindex @code{setmem@var{m}} instruction pattern
  @item @samp{setmem@var{m}}
  Block set instruction.  The destination string is the first operand,
 @@ -5217,6 +5220,8 @@ respectively.  The expected alignment di
  in a way that the blocks are not required to be aligned according to it in
  all cases. This expected alignment is also in bytes, just like operand 4.
  Expected size, when unknown, is set to @code{(const_int -1)}.
 +Operand 7 is the minimal size of the block and operand 8 is the
 +maximal size of the block (NULL if it cannot be represented as CONST_INT).

  The use for multiple @code{setmem@var{m}} is as for @code{movmem@var{m}}.

 Index: builtins.c
 ===
 --- builtins.c  (revision 202968)
 +++ builtins.c  (working copy)
 @@ -3070,6 +3070,51 @@ builtin_memcpy_read_str (void *data, HOS
return c_readstr (str + offset, mode);
  }

 +/* LEN specifies the length of the block of the memcpy/memset operation.
 +   Figure out its range and put it into MIN_SIZE/MAX_SIZE.  */
 +
 +static void
 +determine_block_size (tree len, rtx len_rtx,
 + unsigned HOST_WIDE_INT *min_size,
 + unsigned HOST_WIDE_INT *max_size)
 +{
 +  if (CONST_INT_P (len_rtx))
 +{
 +  *min_size = *max_size = UINTVAL (len_rtx);
 +  return;
 +}
 +  else
 +{
 +  double_int min, max;
 +  if (TREE_CODE (len) == SSA_NAME
 +      && get_range_info (len, &min, &max) == VR_RANGE)
 +   {
 + if (min.fits_uhwi ())
 +   *min_size = min.to_uhwi ();
 + else
 +   *min_size = 0;
 + if (max.fits_uhwi ())
 +   *max_size = max.to_uhwi ();
 + else
 +   *max_size = (HOST_WIDE_INT)-1;
 +   }
 +  else
 +   {
 + if (host_integerp (TYPE_MIN_VALUE (TREE_TYPE (len)), 1))
 +   *min_size = tree_low_cst (TYPE_MIN_VALUE (TREE_TYPE (len)), 1);
 + else
 +   *min_size = 0;
 + if (host_integerp (TYPE_MAX_VALUE (TREE_TYPE (len)), 1))
 +   *max_size = tree_low_cst (TYPE_MAX_VALUE (TREE_TYPE (len)), 1);
 + else
 +   *max_size = GET_MODE_MASK (GET_MODE (len_rtx));
 +   }
 +}
 +  gcc_checking_assert (*max_size <=
 +  (unsigned HOST_WIDE_INT)
 + GET_MODE_MASK (GET_MODE (len_rtx)));
 +}
 +
  /* Expand a call EXP to the memcpy builtin.
 Return NULL_RTX if we failed, the caller should emit a normal call,
 otherwise try to get the result in TARGET, if convenient (and in
 @@ -3092,6 +3137,8 @@ expand_builtin_memcpy (tree exp, rtx tar
rtx dest_mem, src_mem, dest_addr, len_rtx;
HOST_WIDE_INT expected_size = -1;
unsigned int expected_align = 0;
 +  unsigned HOST_WIDE_INT min_size;
 +  unsigned HOST_WIDE_INT max_size;

/* If DEST is not a pointer type, call the normal function.  */