Re: [PATCH] Optimise memset on i386

2010-07-28 Thread Vladimir 'φ-coder/phcoder' Serbinenko
On 07/23/2010 02:14 PM, Colin Watson wrote:
 On Fri, Jun 25, 2010 at 08:27:20PM +0200, Vladimir 'φ-coder/phcoder' 
 Serbinenko wrote:
   
 +void *
 +grub_memset (void *s, int c, grub_size_t n)
 +{
 +  unsigned char *p = (unsigned char *) s;
 +
 +  while (n--)
 +*p++ = (unsigned char) c;
 +
 +  return s;
 +}
   
 Attached is a possible generic implementation. Not even compile-tested.
 Could you test it and compare disasm of this version on i386 and glibc
 i386 version. Perhaps they are equivalent. If they are for maintainance
 reasons it's better to always use generic one.
 
 Thanks for this, and sorry for my delay.

 I haven't compared the disassembly, but once I fixed up your code it
 performed within about 10% of the optimised version I posted before, so
 about 640ms original, 160ms with i386-specific memset, and 176ms with
 your generic memset.  Even though that isn't quite equivalent, I don't
 think that 16ms is going to be a major problem, and so I would suggest
 going with the generic version.  Please check the following.

 2010-07-23  Vladimir Serbinenko  phco...@gmail.com
 2010-07-23  Colin Watson  cjwat...@ubuntu.com

   * kern/misc.c (grub_memset): Optimise to reduce cache stalls.

   
Go ahead.
 === modified file 'kern/misc.c'
 --- kern/misc.c   2010-07-02 17:35:07 +
 +++ kern/misc.c   2010-07-23 11:05:32 +
 @@ -518,12 +518,39 @@ grub_strndup (const char *s, grub_size_t
  }
  
  void *
 -grub_memset (void *s, int c, grub_size_t n)
 +grub_memset (void *s, int c, grub_size_t len)
  {
 -  unsigned char *p = (unsigned char *) s;
 +  void *p = s;
 +  grub_uint8_t pattern8 = c;
  
 -  while (n--)
 -*p++ = (unsigned char) c;
 +  if (len = 3 * sizeof (unsigned long))
 +{
 +  unsigned long patternl = 0;
 +  grub_size_t i;
 +
 +  for (i = 0; i  sizeof (unsigned long); i++)
 + patternl |= ((unsigned long) pattern8)  (8 * i);
 +
 +  while (len  0  (((grub_addr_t) p)  (sizeof (unsigned long) - 1)))
 + {
 +   *(grub_uint8_t *) p = pattern8;
 +   p = (grub_uint8_t *) p + 1;
 +   len--;
 + }
 +  while (len = sizeof (unsigned long))
 + {
 +   *(unsigned long *) p = patternl;
 +   p = (unsigned long *) p + 1;
 +   len -= sizeof (unsigned long);
 + }
 +}
 +
 +  while (len  0)
 +{
 +  *(grub_uint8_t *) p = pattern8;
 +  p = (grub_uint8_t *) p + 1;
 +  len--;
 +}
  
return s;
  }

   


-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko




signature.asc
Description: OpenPGP digital signature
___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


Re: [PATCH] Optimise memset on i386

2010-07-24 Thread Colin Watson
On Fri, Jul 23, 2010 at 10:56:24AM -0500, richardvo...@gmail.com wrote:
 [snip]
 
  +  unsigned long patternl = 0;
  +  grub_size_t i;
  +
  +  for (i = 0; i  sizeof (unsigned long); i++)
  +   patternl |= ((unsigned long) pattern8)  (8 * i);
  +
 
 might I suggest:
 
 unsigned long patternl = pattern8;
 patternl |= patternl  8;
 patternl |= patternl  16;
 patternl |= patternl  32;
 patternl |= patternl  64;
 
 O(lg N) instead of O(N), no loop, no branches, and the compiler should be
 smart enough to optimize away the last two lines on systems with narrower
 long.

I no longer have the system on which I benchmarked this.  However, since
N is always either 4 or 8 on current targets, this can only amount to
micro-optimisation which I don't think can possibly matter much; we're
talking a handful of cycles at most.  Do we really need to spend time
bikeshedding this?  The important thing is taking only a cache stall per
long rather than a cache stall per byte; anything else is likely to be
noise.

-- 
Colin Watson   [cjwat...@ubuntu.com]

___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


Re: [PATCH] Optimise memset on i386

2010-07-23 Thread Colin Watson
On Fri, Jun 25, 2010 at 08:27:20PM +0200, Vladimir 'φ-coder/phcoder' Serbinenko 
wrote:
  +void *
  +grub_memset (void *s, int c, grub_size_t n)
  +{
  +  unsigned char *p = (unsigned char *) s;
  +
  +  while (n--)
  +*p++ = (unsigned char) c;
  +
  +  return s;
  +}
 
 Attached is a possible generic implementation. Not even compile-tested.
 Could you test it and compare disasm of this version on i386 and glibc
 i386 version. Perhaps they are equivalent. If they are for maintainance
 reasons it's better to always use generic one.

Thanks for this, and sorry for my delay.

I haven't compared the disassembly, but once I fixed up your code it
performed within about 10% of the optimised version I posted before, so
about 640ms original, 160ms with i386-specific memset, and 176ms with
your generic memset.  Even though that isn't quite equivalent, I don't
think that 16ms is going to be a major problem, and so I would suggest
going with the generic version.  Please check the following.

2010-07-23  Vladimir Serbinenko  phco...@gmail.com
2010-07-23  Colin Watson  cjwat...@ubuntu.com

* kern/misc.c (grub_memset): Optimise to reduce cache stalls.

=== modified file 'kern/misc.c'
--- kern/misc.c 2010-07-02 17:35:07 +
+++ kern/misc.c 2010-07-23 11:05:32 +
@@ -518,12 +518,39 @@ grub_strndup (const char *s, grub_size_t
 }
 
 void *
-grub_memset (void *s, int c, grub_size_t n)
+grub_memset (void *s, int c, grub_size_t len)
 {
-  unsigned char *p = (unsigned char *) s;
+  void *p = s;
+  grub_uint8_t pattern8 = c;
 
-  while (n--)
-*p++ = (unsigned char) c;
+  if (len = 3 * sizeof (unsigned long))
+{
+  unsigned long patternl = 0;
+  grub_size_t i;
+
+  for (i = 0; i  sizeof (unsigned long); i++)
+   patternl |= ((unsigned long) pattern8)  (8 * i);
+
+  while (len  0  (((grub_addr_t) p)  (sizeof (unsigned long) - 1)))
+   {
+ *(grub_uint8_t *) p = pattern8;
+ p = (grub_uint8_t *) p + 1;
+ len--;
+   }
+  while (len = sizeof (unsigned long))
+   {
+ *(unsigned long *) p = patternl;
+ p = (unsigned long *) p + 1;
+ len -= sizeof (unsigned long);
+   }
+}
+
+  while (len  0)
+{
+  *(grub_uint8_t *) p = pattern8;
+  p = (grub_uint8_t *) p + 1;
+  len--;
+}
 
   return s;
 }

-- 
Colin Watson   [cjwat...@ubuntu.com]

___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


Re: [PATCH] Optimise memset on i386

2010-07-23 Thread richardvo...@gmail.com
[snip]

 +  unsigned long patternl = 0;
 +  grub_size_t i;
 +
 +  for (i = 0; i  sizeof (unsigned long); i++)
 +   patternl |= ((unsigned long) pattern8)  (8 * i);
 +


might I suggest:

unsigned long patternl = pattern8;
patternl |= patternl  8;
patternl |= patternl  16;
patternl |= patternl  32;
patternl |= patternl  64;

O(lg N) instead of O(N), no loop, no branches, and the compiler should be
smart enough to optimize away the last two lines on systems with narrower
long.
___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


Re: [PATCH] Optimise memset on i386

2010-07-23 Thread Christian Franke

richardvoigt wrote:


might I suggest:

unsigned long patternl = pattern8;
patternl |= patternl  8;
patternl |= patternl  16;
patternl |= patternl  32;
patternl |= patternl  64;

O(lg N) instead of O(N), no loop, no branches, and the compiler should 
be smart enough to optimize away the last two lines on systems with 
narrower long.


The latter is unfortunately not the case. At least gcc 4.5.0 prints a 
warning but still produces code.


$ cat EOF f.c
unsigned long f(unsigned long x)
{
  x |= x  32;
  x |= x  64;
  return x;
}

$ gcc -O3 -S f.c
x.c: In function ‘f’:
x.c:3: warning: left shift count = width of type
x.c:4: warning: left shift count = width of type

$ cat f.s
...
pushl   %ebp
movl$32, %ecx
movl%esp, %ebp
movl8(%ebp), %eax
popl%ebp
movl%eax, %edx
sall%cl, %edx
movl$64, %ecx
orl %eax, %edx
movl%edx, %eax
sall%cl, %eax
orl %edx, %eax
ret


--
Regards,
Christian Franke


___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


Re: [PATCH] Optimise memset on i386

2010-07-23 Thread richardvo...@gmail.com
On Fri, Jul 23, 2010 at 12:34 PM, Christian Franke 
christian.fra...@t-online.de wrote:

 richardvoigt wrote:


 might I suggest:

 unsigned long patternl = pattern8;
 patternl |= patternl  8;
 patternl |= patternl  16;
 patternl |= patternl  32;
 patternl |= patternl  64;

 O(lg N) instead of O(N), no loop, no branches, and the compiler should be
 smart enough to optimize away the last two lines on systems with narrower
 long.


 The latter is unfortunately not the case. At least gcc 4.5.0 prints a
 warning but still produces code.


Ok then, a compile-time conditional should fix that.

unsigned long patternl = pattern8;
if (8  sizeof patternl) patternl |= patternl  8;
if (16  sizeof patternl) patternl |= patternl  16;
if (32  sizeof patternl) patternl |= patternl  32;
if (64  sizeof patternl) patternl |= patternl  64;

I'm pretty confident in gcc's ability to optimize this version.  I hope.
___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


Re: [PATCH] Optimise memset on i386

2010-07-23 Thread richardvo...@gmail.com
On Fri, Jul 23, 2010 at 2:48 PM, richardvo...@gmail.com 
richardvo...@gmail.com wrote:



 On Fri, Jul 23, 2010 at 12:34 PM, Christian Franke 
 christian.fra...@t-online.de wrote:

 richardvoigt wrote:


 might I suggest:

 unsigned long patternl = pattern8;
 patternl |= patternl  8;
 patternl |= patternl  16;
 patternl |= patternl  32;
 patternl |= patternl  64;

 O(lg N) instead of O(N), no loop, no branches, and the compiler should be
 smart enough to optimize away the last two lines on systems with narrower
 long.


 The latter is unfortunately not the case. At least gcc 4.5.0 prints a
 warning but still produces code.


 Ok then, a compile-time conditional should fix that.



Thanks to Seth for pointing out my obvious bug.

unsigned long patternl = pattern8;
if (1  sizeof patternl) patternl |= patternl  8;
if (2  sizeof patternl) patternl |= patternl  16;
if (4  sizeof patternl) patternl |= patternl  32;
if (8  sizeof patternl) patternl |= patternl  64;



 I'm pretty confident in gcc's ability to optimize this version.  I hope.

___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


Re: [PATCH] Optimise memset on i386

2010-06-25 Thread Vladimir 'φ-coder/phcoder' Serbinenko
On 06/23/2010 11:38 PM, Colin Watson wrote:
 With this approach, one of the most noticeable time sinks is that
 setting a graphical video mode (I'm using the VBE backend) takes ages:
 1.6 seconds, which is a substantial percentage of this project's total
 boot time.  It turns out that most of this is spent initialising
 double-buffering: doublebuf_pageflipping_init calls
 grub_video_fb_create_render_target_from_pointer twice, and each call
 takes a little over 600 milliseconds.  Now,
 grub_video_fb_create_render_target_from_pointer is basically just a big
 grub_memset to clear framebuffer memory, so this equates to under two
 frames per second.  What's going on?

 It turns out that write caching is disabled on video memory when GRUB is
 running, so we take a cache stall on every single write, and it's
 apparently hard to enable caching without implementing MTRRs.  People
 who know more about this than I do tell me that this can get
 unpleasantly CPU-specific at times, although I still hold out some hope
 that it's possible in GRUB.

   
On non-device memory GRUB should take advantage of cache. On MIPS
enabling/disabling cache is done by using a different address. So we
have all infrastructure necessary for differentiating
cacheable/non-cacheable is present. Enabling cache on video memory is
however more of a trouble. One of the reasons is that cache nmishandling
produces difficult bugs.
 However, there's a way to substantially speed things up without that.
 The naïve implementation of grub_memset writes a byte at a time, and for
 that matter on i386 it compiles to a poorly-optimised loop rather than
 using REP STOS or similar.  grub_memset is an inner loop practically by
 definition, and it's worth optimising.  We can fix both of these
 weaknesses by importing the optimised memset from GNU libc: since it
 writes four bytes at a time except (sometimes) at the start and end, it
 should take about a quarter the number of cache stalls.  And, indeed,
 measurement bears this out: instead of taking over 600 milliseconds per
 call to grub_video_fb_create_render_target_from_pointer (I think it was
 actually 630 or so, though I neglected to write that down), GRUB now
 takes about 160 milliseconds per call.  Much better!

 The optimised memset is LGPLv2.1 or later, and I've preserved that
 notice, but as far as I know this should be fine for use in GRUB; it can
 be upgraded to LGPLv3, and that's just GPLv3 with some additional
 permissions.  It's already assigned to the FSF due to being in glibc.

   
It's ok to use this code but be sure to mention its origin. It's also ok
to keep its license unless big divergeance is to be expected.

Did you test it on x86_64?
 +void *
 +grub_memset (void *s, int c, grub_size_t n)
 +{
 +  unsigned char *p = (unsigned char *) s;
 +
 +  while (n--)
 +*p++ = (unsigned char) c;
 +
 +  return s;
 +}
   
This can be optimised the same way as i386 part, just replace stos with
a loop over iterator with a pointer aligned on its size.
 Thanks,

   


-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko




signature.asc
Description: OpenPGP digital signature
___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


Re: [PATCH] Optimise memset on i386

2010-06-25 Thread Vladimir 'φ-coder/phcoder' Serbinenko

 +void *
 +grub_memset (void *s, int c, grub_size_t n)
 +{
 +  unsigned char *p = (unsigned char *) s;
 +
 +  while (n--)
 +*p++ = (unsigned char) c;
 +
 +  return s;
 +}
   
Attached is a possible generic implementation. Not even compile-tested.
Could you test it and compare disasm of this version on i386 and glibc
i386 version. Perhaps they are equivalent. If they are for maintainance
reasons it's better to always use generic one.
 +
 +#ifndef APPLE_CC
 +void *memset (void *s, int c, grub_size_t n)
 +  __attribute__ ((alias (grub_memset)));
 +#else
 +void *memset (void *s, int c, grub_size_t n)
 +{
 +  return grub_memset (s, c, n);
 +}
 +#endif

 Thanks,

   


-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko


void
grub_memset (void *p, int c, grub_size_t len)
{
  grub_uint8_t pattern8 = c;
  if (len = 3 * sizeof (unsigned long))
{
  unsigned long patternl = 0;
  int i;
  for (i = 0; i  sizeof (unsigned long); i++)
	patternl |= ((unsigned long) pattern8)  (8 * i);
  while (len  0  (((grub_addr_t) p)  (sizeof (unsigned long) - 1)))
	{
	  *(grub_uint8_t *) p = pattern8;
	  p = (grub_uint8_t *) p + 1;
	}
  while (len = sizeof (unsigned long))
	{
	  *(unsigned long *) p = patternl;
	  p = (unsigned long *) p + 1;
	}
}
  while (len  0)
{
  *(grub_uint8_t *) p = pattern8;
  p = (grub_uint8_t *) p + 1;
}
}


signature.asc
Description: OpenPGP digital signature
___
Grub-devel mailing list
Grub-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/grub-devel


[PATCH] Optimise memset on i386

2010-06-23 Thread Colin Watson
I've been working on a project involving fast boot speeds, which are
timed from post-BIOS (this is an axiom of the project and not something
I get to change).  As such I've been working on optimising GRUB, and
have been digging into it with this general debugging pattern (probably
not optimal, I'm just showing it as an example of my approach in case
anyone else is trying to do similar things):

#include grub/time.h
#include grub/env.h

void
grub_function (void)
{
  char *value, *valueend;

  value = grub_malloc (16384);
  *value = '\0';
  valueend = value;

  grub_snprintf (valueend, 100, %lld, grub_get_time_ms ());
  valueend = grub_strchr (valueend, '\0');

  /* do slow stuff */

  grub_snprintf (valueend, 100, %lld, grub_get_time_ms ());
  valueend = grub_strchr (valueend, '\0');

  /* do more slow stuff */

  grub_snprintf (valueend, 100, %lld, grub_get_time_ms ());
  valueend = grub_strchr (valueend, '\0');
  
  grub_env_set (identifying_name, value);
  grub_free (value);
}

This seems lightweight enough that it doesn't interfere much with
timings, and the approach of stuffing things into an environment
variable is useful because you can get millisecond-precision timings
even for video initialisation.

With this approach, one of the most noticeable time sinks is that
setting a graphical video mode (I'm using the VBE backend) takes ages:
1.6 seconds, which is a substantial percentage of this project's total
boot time.  It turns out that most of this is spent initialising
double-buffering: doublebuf_pageflipping_init calls
grub_video_fb_create_render_target_from_pointer twice, and each call
takes a little over 600 milliseconds.  Now,
grub_video_fb_create_render_target_from_pointer is basically just a big
grub_memset to clear framebuffer memory, so this equates to under two
frames per second.  What's going on?

It turns out that write caching is disabled on video memory when GRUB is
running, so we take a cache stall on every single write, and it's
apparently hard to enable caching without implementing MTRRs.  People
who know more about this than I do tell me that this can get
unpleasantly CPU-specific at times, although I still hold out some hope
that it's possible in GRUB.

However, there's a way to substantially speed things up without that.
The naïve implementation of grub_memset writes a byte at a time, and for
that matter on i386 it compiles to a poorly-optimised loop rather than
using REP STOS or similar.  grub_memset is an inner loop practically by
definition, and it's worth optimising.  We can fix both of these
weaknesses by importing the optimised memset from GNU libc: since it
writes four bytes at a time except (sometimes) at the start and end, it
should take about a quarter the number of cache stalls.  And, indeed,
measurement bears this out: instead of taking over 600 milliseconds per
call to grub_video_fb_create_render_target_from_pointer (I think it was
actually 630 or so, though I neglected to write that down), GRUB now
takes about 160 milliseconds per call.  Much better!

The optimised memset is LGPLv2.1 or later, and I've preserved that
notice, but as far as I know this should be fine for use in GRUB; it can
be upgraded to LGPLv3, and that's just GPLv3 with some additional
permissions.  It's already assigned to the FSF due to being in glibc.

2010-06-23  Colin Watson  cjwat...@ubuntu.com

* conf/any-emu.rmk (kernel_img_SOURCES): Add kern/string.c.
* conf/common.rmk (grub_mkdevicemap_SOURCES): Likewise.
(grub_probe_SOURCES): Likewise.
(grub_fstest_SOURCES): Likewise.
(grub_script_check_SOURCES): Likewise.
(grub_editenv_SOURCES): Likewise.
* conf/mips-qemu-mips.rmk (kernel_img_SOURCES): Likewise.
* conf/mips-yeeloong.rmk (kernel_img_SOURCES): Likewise.
* conf/powerpc-ieee1275.rmk (kernel_img_SOURCES): Likewise.
* conf/sparc64-ieee1275.rmk (kernel_img_SOURCES): Likewise.
(grub_setup_SOURCES): Likewise.
* conf/tests.rmk (example_unit_test_SOURCES): Likewise.

* conf/i386-coreboot.rmk (kernel_img_SOURCES): Add
kern/i386/string.c.
* conf/i386-ieee1275.rmk (kernel_img_SOURCES): Likewise.
* conf/i386-multiboot.rmk (kernel_img_SOURCES): Likewise.
* conf/i386-pc.rmk (kernel_img_SOURCES): Likewise.
(grub_setup_SOURCES): Likewise.
* conf/i386-qemu.rmk (kernel_img_SOURCES): Likewise.
* conf/x86-efi.rmk (kernel_img_SOURCES): Likewise.

* kern/i386/string.c: New file.
* kern/misc.c (grub_memset): Move to ...
* kern/string.c (grub_memset): ... here.
* kern/misc.c (memset): Move to ...
* kern/string.c (memset): ... here.

=== modified file 'conf/any-emu.rmk'
--- conf/any-emu.rmk2010-06-11 20:31:16 +
+++ conf/any-emu.rmk2010-06-23 16:55:35 +
@@ -6,7 +6,7 @@ kernel_img_SOURCES = kern/device.c kern/
kern/err.c kern/list.c kern/command.c   \
kern/corecmd.c kern/file.c