Re: [PATCH] Optimise memset on i386
On 07/23/2010 02:14 PM, Colin Watson wrote: On Fri, Jun 25, 2010 at 08:27:20PM +0200, Vladimir 'φ-coder/phcoder' Serbinenko wrote: +void * +grub_memset (void *s, int c, grub_size_t n) +{ + unsigned char *p = (unsigned char *) s; + + while (n--) +*p++ = (unsigned char) c; + + return s; +} Attached is a possible generic implementation. Not even compile-tested. Could you test it and compare disasm of this version on i386 and glibc i386 version. Perhaps they are equivalent. If they are for maintainance reasons it's better to always use generic one. Thanks for this, and sorry for my delay. I haven't compared the disassembly, but once I fixed up your code it performed within about 10% of the optimised version I posted before, so about 640ms original, 160ms with i386-specific memset, and 176ms with your generic memset. Even though that isn't quite equivalent, I don't think that 16ms is going to be a major problem, and so I would suggest going with the generic version. Please check the following. 2010-07-23 Vladimir Serbinenko phco...@gmail.com 2010-07-23 Colin Watson cjwat...@ubuntu.com * kern/misc.c (grub_memset): Optimise to reduce cache stalls. Go ahead. === modified file 'kern/misc.c' --- kern/misc.c 2010-07-02 17:35:07 + +++ kern/misc.c 2010-07-23 11:05:32 + @@ -518,12 +518,39 @@ grub_strndup (const char *s, grub_size_t } void * -grub_memset (void *s, int c, grub_size_t n) +grub_memset (void *s, int c, grub_size_t len) { - unsigned char *p = (unsigned char *) s; + void *p = s; + grub_uint8_t pattern8 = c; - while (n--) -*p++ = (unsigned char) c; + if (len = 3 * sizeof (unsigned long)) +{ + unsigned long patternl = 0; + grub_size_t i; + + for (i = 0; i sizeof (unsigned long); i++) + patternl |= ((unsigned long) pattern8) (8 * i); + + while (len 0 (((grub_addr_t) p) (sizeof (unsigned long) - 1))) + { + *(grub_uint8_t *) p = pattern8; + p = (grub_uint8_t *) p + 1; + len--; + } + while (len = sizeof (unsigned long)) + { + *(unsigned long *) p = patternl; + p = (unsigned long *) p + 1; + len -= sizeof (unsigned long); + } +} + + while (len 0) +{ + *(grub_uint8_t *) p = pattern8; + p = (grub_uint8_t *) p + 1; + len--; +} return s; } -- Regards Vladimir 'φ-coder/phcoder' Serbinenko signature.asc Description: OpenPGP digital signature ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
Re: [PATCH] Optimise memset on i386
On Fri, Jul 23, 2010 at 10:56:24AM -0500, richardvo...@gmail.com wrote: [snip] + unsigned long patternl = 0; + grub_size_t i; + + for (i = 0; i sizeof (unsigned long); i++) + patternl |= ((unsigned long) pattern8) (8 * i); + might I suggest: unsigned long patternl = pattern8; patternl |= patternl 8; patternl |= patternl 16; patternl |= patternl 32; patternl |= patternl 64; O(lg N) instead of O(N), no loop, no branches, and the compiler should be smart enough to optimize away the last two lines on systems with narrower long. I no longer have the system on which I benchmarked this. However, since N is always either 4 or 8 on current targets, this can only amount to micro-optimisation which I don't think can possibly matter much; we're talking a handful of cycles at most. Do we really need to spend time bikeshedding this? The important thing is taking only a cache stall per long rather than a cache stall per byte; anything else is likely to be noise. -- Colin Watson [cjwat...@ubuntu.com] ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
Re: [PATCH] Optimise memset on i386
On Fri, Jun 25, 2010 at 08:27:20PM +0200, Vladimir 'φ-coder/phcoder' Serbinenko wrote: +void * +grub_memset (void *s, int c, grub_size_t n) +{ + unsigned char *p = (unsigned char *) s; + + while (n--) +*p++ = (unsigned char) c; + + return s; +} Attached is a possible generic implementation. Not even compile-tested. Could you test it and compare disasm of this version on i386 and glibc i386 version. Perhaps they are equivalent. If they are for maintainance reasons it's better to always use generic one. Thanks for this, and sorry for my delay. I haven't compared the disassembly, but once I fixed up your code it performed within about 10% of the optimised version I posted before, so about 640ms original, 160ms with i386-specific memset, and 176ms with your generic memset. Even though that isn't quite equivalent, I don't think that 16ms is going to be a major problem, and so I would suggest going with the generic version. Please check the following. 2010-07-23 Vladimir Serbinenko phco...@gmail.com 2010-07-23 Colin Watson cjwat...@ubuntu.com * kern/misc.c (grub_memset): Optimise to reduce cache stalls. === modified file 'kern/misc.c' --- kern/misc.c 2010-07-02 17:35:07 + +++ kern/misc.c 2010-07-23 11:05:32 + @@ -518,12 +518,39 @@ grub_strndup (const char *s, grub_size_t } void * -grub_memset (void *s, int c, grub_size_t n) +grub_memset (void *s, int c, grub_size_t len) { - unsigned char *p = (unsigned char *) s; + void *p = s; + grub_uint8_t pattern8 = c; - while (n--) -*p++ = (unsigned char) c; + if (len = 3 * sizeof (unsigned long)) +{ + unsigned long patternl = 0; + grub_size_t i; + + for (i = 0; i sizeof (unsigned long); i++) + patternl |= ((unsigned long) pattern8) (8 * i); + + while (len 0 (((grub_addr_t) p) (sizeof (unsigned long) - 1))) + { + *(grub_uint8_t *) p = pattern8; + p = (grub_uint8_t *) p + 1; + len--; + } + while (len = sizeof (unsigned long)) + { + *(unsigned long *) p = patternl; + p = (unsigned long *) p + 1; + len -= sizeof (unsigned long); + } +} + + while (len 0) +{ + *(grub_uint8_t *) p = pattern8; + p = (grub_uint8_t *) p + 1; + len--; +} return s; } -- Colin Watson [cjwat...@ubuntu.com] ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
Re: [PATCH] Optimise memset on i386
[snip] + unsigned long patternl = 0; + grub_size_t i; + + for (i = 0; i sizeof (unsigned long); i++) + patternl |= ((unsigned long) pattern8) (8 * i); + might I suggest: unsigned long patternl = pattern8; patternl |= patternl 8; patternl |= patternl 16; patternl |= patternl 32; patternl |= patternl 64; O(lg N) instead of O(N), no loop, no branches, and the compiler should be smart enough to optimize away the last two lines on systems with narrower long. ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
Re: [PATCH] Optimise memset on i386
richardvoigt wrote: might I suggest: unsigned long patternl = pattern8; patternl |= patternl 8; patternl |= patternl 16; patternl |= patternl 32; patternl |= patternl 64; O(lg N) instead of O(N), no loop, no branches, and the compiler should be smart enough to optimize away the last two lines on systems with narrower long. The latter is unfortunately not the case. At least gcc 4.5.0 prints a warning but still produces code. $ cat EOF f.c unsigned long f(unsigned long x) { x |= x 32; x |= x 64; return x; } $ gcc -O3 -S f.c x.c: In function ‘f’: x.c:3: warning: left shift count = width of type x.c:4: warning: left shift count = width of type $ cat f.s ... pushl %ebp movl$32, %ecx movl%esp, %ebp movl8(%ebp), %eax popl%ebp movl%eax, %edx sall%cl, %edx movl$64, %ecx orl %eax, %edx movl%edx, %eax sall%cl, %eax orl %edx, %eax ret -- Regards, Christian Franke ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
Re: [PATCH] Optimise memset on i386
On Fri, Jul 23, 2010 at 12:34 PM, Christian Franke christian.fra...@t-online.de wrote: richardvoigt wrote: might I suggest: unsigned long patternl = pattern8; patternl |= patternl 8; patternl |= patternl 16; patternl |= patternl 32; patternl |= patternl 64; O(lg N) instead of O(N), no loop, no branches, and the compiler should be smart enough to optimize away the last two lines on systems with narrower long. The latter is unfortunately not the case. At least gcc 4.5.0 prints a warning but still produces code. Ok then, a compile-time conditional should fix that. unsigned long patternl = pattern8; if (8 sizeof patternl) patternl |= patternl 8; if (16 sizeof patternl) patternl |= patternl 16; if (32 sizeof patternl) patternl |= patternl 32; if (64 sizeof patternl) patternl |= patternl 64; I'm pretty confident in gcc's ability to optimize this version. I hope. ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
Re: [PATCH] Optimise memset on i386
On Fri, Jul 23, 2010 at 2:48 PM, richardvo...@gmail.com richardvo...@gmail.com wrote: On Fri, Jul 23, 2010 at 12:34 PM, Christian Franke christian.fra...@t-online.de wrote: richardvoigt wrote: might I suggest: unsigned long patternl = pattern8; patternl |= patternl 8; patternl |= patternl 16; patternl |= patternl 32; patternl |= patternl 64; O(lg N) instead of O(N), no loop, no branches, and the compiler should be smart enough to optimize away the last two lines on systems with narrower long. The latter is unfortunately not the case. At least gcc 4.5.0 prints a warning but still produces code. Ok then, a compile-time conditional should fix that. Thanks to Seth for pointing out my obvious bug. unsigned long patternl = pattern8; if (1 sizeof patternl) patternl |= patternl 8; if (2 sizeof patternl) patternl |= patternl 16; if (4 sizeof patternl) patternl |= patternl 32; if (8 sizeof patternl) patternl |= patternl 64; I'm pretty confident in gcc's ability to optimize this version. I hope. ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
Re: [PATCH] Optimise memset on i386
On 06/23/2010 11:38 PM, Colin Watson wrote: With this approach, one of the most noticeable time sinks is that setting a graphical video mode (I'm using the VBE backend) takes ages: 1.6 seconds, which is a substantial percentage of this project's total boot time. It turns out that most of this is spent initialising double-buffering: doublebuf_pageflipping_init calls grub_video_fb_create_render_target_from_pointer twice, and each call takes a little over 600 milliseconds. Now, grub_video_fb_create_render_target_from_pointer is basically just a big grub_memset to clear framebuffer memory, so this equates to under two frames per second. What's going on? It turns out that write caching is disabled on video memory when GRUB is running, so we take a cache stall on every single write, and it's apparently hard to enable caching without implementing MTRRs. People who know more about this than I do tell me that this can get unpleasantly CPU-specific at times, although I still hold out some hope that it's possible in GRUB. On non-device memory GRUB should take advantage of cache. On MIPS enabling/disabling cache is done by using a different address. So we have all infrastructure necessary for differentiating cacheable/non-cacheable is present. Enabling cache on video memory is however more of a trouble. One of the reasons is that cache nmishandling produces difficult bugs. However, there's a way to substantially speed things up without that. The naïve implementation of grub_memset writes a byte at a time, and for that matter on i386 it compiles to a poorly-optimised loop rather than using REP STOS or similar. grub_memset is an inner loop practically by definition, and it's worth optimising. We can fix both of these weaknesses by importing the optimised memset from GNU libc: since it writes four bytes at a time except (sometimes) at the start and end, it should take about a quarter the number of cache stalls. And, indeed, measurement bears this out: instead of taking over 600 milliseconds per call to grub_video_fb_create_render_target_from_pointer (I think it was actually 630 or so, though I neglected to write that down), GRUB now takes about 160 milliseconds per call. Much better! The optimised memset is LGPLv2.1 or later, and I've preserved that notice, but as far as I know this should be fine for use in GRUB; it can be upgraded to LGPLv3, and that's just GPLv3 with some additional permissions. It's already assigned to the FSF due to being in glibc. It's ok to use this code but be sure to mention its origin. It's also ok to keep its license unless big divergeance is to be expected. Did you test it on x86_64? +void * +grub_memset (void *s, int c, grub_size_t n) +{ + unsigned char *p = (unsigned char *) s; + + while (n--) +*p++ = (unsigned char) c; + + return s; +} This can be optimised the same way as i386 part, just replace stos with a loop over iterator with a pointer aligned on its size. Thanks, -- Regards Vladimir 'φ-coder/phcoder' Serbinenko signature.asc Description: OpenPGP digital signature ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
Re: [PATCH] Optimise memset on i386
+void * +grub_memset (void *s, int c, grub_size_t n) +{ + unsigned char *p = (unsigned char *) s; + + while (n--) +*p++ = (unsigned char) c; + + return s; +} Attached is a possible generic implementation. Not even compile-tested. Could you test it and compare disasm of this version on i386 and glibc i386 version. Perhaps they are equivalent. If they are for maintainance reasons it's better to always use generic one. + +#ifndef APPLE_CC +void *memset (void *s, int c, grub_size_t n) + __attribute__ ((alias (grub_memset))); +#else +void *memset (void *s, int c, grub_size_t n) +{ + return grub_memset (s, c, n); +} +#endif Thanks, -- Regards Vladimir 'φ-coder/phcoder' Serbinenko void grub_memset (void *p, int c, grub_size_t len) { grub_uint8_t pattern8 = c; if (len = 3 * sizeof (unsigned long)) { unsigned long patternl = 0; int i; for (i = 0; i sizeof (unsigned long); i++) patternl |= ((unsigned long) pattern8) (8 * i); while (len 0 (((grub_addr_t) p) (sizeof (unsigned long) - 1))) { *(grub_uint8_t *) p = pattern8; p = (grub_uint8_t *) p + 1; } while (len = sizeof (unsigned long)) { *(unsigned long *) p = patternl; p = (unsigned long *) p + 1; } } while (len 0) { *(grub_uint8_t *) p = pattern8; p = (grub_uint8_t *) p + 1; } } signature.asc Description: OpenPGP digital signature ___ Grub-devel mailing list Grub-devel@gnu.org http://lists.gnu.org/mailman/listinfo/grub-devel
[PATCH] Optimise memset on i386
I've been working on a project involving fast boot speeds, which are timed from post-BIOS (this is an axiom of the project and not something I get to change). As such I've been working on optimising GRUB, and have been digging into it with this general debugging pattern (probably not optimal, I'm just showing it as an example of my approach in case anyone else is trying to do similar things): #include grub/time.h #include grub/env.h void grub_function (void) { char *value, *valueend; value = grub_malloc (16384); *value = '\0'; valueend = value; grub_snprintf (valueend, 100, %lld, grub_get_time_ms ()); valueend = grub_strchr (valueend, '\0'); /* do slow stuff */ grub_snprintf (valueend, 100, %lld, grub_get_time_ms ()); valueend = grub_strchr (valueend, '\0'); /* do more slow stuff */ grub_snprintf (valueend, 100, %lld, grub_get_time_ms ()); valueend = grub_strchr (valueend, '\0'); grub_env_set (identifying_name, value); grub_free (value); } This seems lightweight enough that it doesn't interfere much with timings, and the approach of stuffing things into an environment variable is useful because you can get millisecond-precision timings even for video initialisation. With this approach, one of the most noticeable time sinks is that setting a graphical video mode (I'm using the VBE backend) takes ages: 1.6 seconds, which is a substantial percentage of this project's total boot time. It turns out that most of this is spent initialising double-buffering: doublebuf_pageflipping_init calls grub_video_fb_create_render_target_from_pointer twice, and each call takes a little over 600 milliseconds. Now, grub_video_fb_create_render_target_from_pointer is basically just a big grub_memset to clear framebuffer memory, so this equates to under two frames per second. What's going on? It turns out that write caching is disabled on video memory when GRUB is running, so we take a cache stall on every single write, and it's apparently hard to enable caching without implementing MTRRs. People who know more about this than I do tell me that this can get unpleasantly CPU-specific at times, although I still hold out some hope that it's possible in GRUB. However, there's a way to substantially speed things up without that. The naïve implementation of grub_memset writes a byte at a time, and for that matter on i386 it compiles to a poorly-optimised loop rather than using REP STOS or similar. grub_memset is an inner loop practically by definition, and it's worth optimising. We can fix both of these weaknesses by importing the optimised memset from GNU libc: since it writes four bytes at a time except (sometimes) at the start and end, it should take about a quarter the number of cache stalls. And, indeed, measurement bears this out: instead of taking over 600 milliseconds per call to grub_video_fb_create_render_target_from_pointer (I think it was actually 630 or so, though I neglected to write that down), GRUB now takes about 160 milliseconds per call. Much better! The optimised memset is LGPLv2.1 or later, and I've preserved that notice, but as far as I know this should be fine for use in GRUB; it can be upgraded to LGPLv3, and that's just GPLv3 with some additional permissions. It's already assigned to the FSF due to being in glibc. 2010-06-23 Colin Watson cjwat...@ubuntu.com * conf/any-emu.rmk (kernel_img_SOURCES): Add kern/string.c. * conf/common.rmk (grub_mkdevicemap_SOURCES): Likewise. (grub_probe_SOURCES): Likewise. (grub_fstest_SOURCES): Likewise. (grub_script_check_SOURCES): Likewise. (grub_editenv_SOURCES): Likewise. * conf/mips-qemu-mips.rmk (kernel_img_SOURCES): Likewise. * conf/mips-yeeloong.rmk (kernel_img_SOURCES): Likewise. * conf/powerpc-ieee1275.rmk (kernel_img_SOURCES): Likewise. * conf/sparc64-ieee1275.rmk (kernel_img_SOURCES): Likewise. (grub_setup_SOURCES): Likewise. * conf/tests.rmk (example_unit_test_SOURCES): Likewise. * conf/i386-coreboot.rmk (kernel_img_SOURCES): Add kern/i386/string.c. * conf/i386-ieee1275.rmk (kernel_img_SOURCES): Likewise. * conf/i386-multiboot.rmk (kernel_img_SOURCES): Likewise. * conf/i386-pc.rmk (kernel_img_SOURCES): Likewise. (grub_setup_SOURCES): Likewise. * conf/i386-qemu.rmk (kernel_img_SOURCES): Likewise. * conf/x86-efi.rmk (kernel_img_SOURCES): Likewise. * kern/i386/string.c: New file. * kern/misc.c (grub_memset): Move to ... * kern/string.c (grub_memset): ... here. * kern/misc.c (memset): Move to ... * kern/string.c (memset): ... here. === modified file 'conf/any-emu.rmk' --- conf/any-emu.rmk2010-06-11 20:31:16 + +++ conf/any-emu.rmk2010-06-23 16:55:35 + @@ -6,7 +6,7 @@ kernel_img_SOURCES = kern/device.c kern/ kern/err.c kern/list.c kern/command.c \ kern/corecmd.c kern/file.c