Hi, On Tue, 30 Aug 2005, Knut Petersen wrote:
> > Probably you can make it even faster by avoiding the multiplication, like > > > > unsigned int offset = 0; > > for (i = 0; i < image.height; i++) { > > dst[offset] = src[i]; > > offset += pitch; > > } > > > > More than two decades ago I learned to avoid mul and imul. Use shifts, add and > lea instead, > that was the credo those days. The name of the game was CP/M 80/86, a86, d86 > and ddt ;-) > > But let´s get serious again. > > Your proposed change of the patch results in a 21 ms performance decrease on > my system. > Yes, I do know that this is hard to believe. I tested a similar variation > before, and the results > were even worse. Could you try the patch below, for a few extra cycles you might want to make it an inline function. bye, Roman Index: linux-2.6/drivers/video/fbmem.c =================================================================== --- linux-2.6.orig/drivers/video/fbmem.c 2005-08-30 01:55:18.000000000 +0200 +++ linux-2.6/drivers/video/fbmem.c 2005-08-30 21:10:25.705462837 +0200 @@ -82,11 +82,11 @@ void fb_pad_aligned_buffer(u8 *dst, u32 { int i, j; + d_pitch -= s_pitch; for (i = height; i--; ) { /* s_pitch is a few bytes at the most, memcpy is suboptimal */ for (j = 0; j < s_pitch; j++) - dst[j] = src[j]; - src += s_pitch; + *dst++ = *src++; dst += d_pitch; } }