[PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund
This is a much simpler and efficent impl. of the IP checksum.
It is a dry port from Quagga and I have not tested it.
---
 lib/checksum.c |   44 +++-
 1 files changed, 11 insertions(+), 33 deletions(-)

diff --git a/lib/checksum.c b/lib/checksum.c
index 33cb386..b836bdb 100644
--- a/lib/checksum.c
+++ b/lib/checksum.c
@@ -15,25 +15,10 @@
 #include nest/bird.h
 #include checksum.h
 
-static u16 /* One-complement addition */
-add16(u16 sum, u16 x)
-{
-  u16 z = sum + x;
-  return z + (z  sum);
-}
-
-static u32
-add32(u32 sum, u32 x)
-{
-  u32 z = sum + x;
-  return z + (z  sum);
-}
-
 static u16
 ipsum_calc_block(u16 *x, unsigned len, u16 sum)
 {
-  int rest;
-  u32 tmp, *xx;
+  u32 tmp;
 
   /*
*  A few simple facts about the IP checksum (see RFC 1071 for detailed
@@ -51,23 +36,16 @@ ipsum_calc_block(u16 *x, unsigned len, u16 sum)
   if (!len)
 return sum;
   len = 1;
-  if ((unsigned long) x  2)   /* Align to 32-bit boundary */
-{
-  sum = add16(sum, *x++);
-  len--;
-}
-  rest = len  1;
-  len = 1;
-  tmp = 0;
-  xx = (u32 *) x;
-  while (len)
-{
-  tmp = add32(tmp, *xx++);
-  len--;
-}
-  sum = add16(sum, add16(tmp  0x, tmp  16U));
-  if (rest)
-sum = add16(sum, *(u16 *) xx);
+  tmp = sum;
+  for(x--; len; --len)
+  tmp += *++x;
+  /*
+   * Add back carry outs from top 16 bits to low 16 bits.
+   */
+  tmp  = (tmp  16) + (tmp  0x);/* add high-16 to low-16 */
+  tmp += (tmp  16); /* add carry */
+  sum = ~tmp;  /* ones-complement, then truncate to 16 bits */
+
   return sum;
 }
 
-- 
1.6.4.4



Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Martin Mares
Hello!

 This is a much simpler and efficent impl. of the IP checksum.
 It is a dry port from Quagga and I have not tested it.

Are you convinced that your version is more efficient? The original version
processes 32 bits at a time, while your code does only 16 bits at a time. It 
might
be worth the saved branch, but do you have any benchmark proving that it
helps?

Have a nice fortnight
-- 
Martin `MJ' Mares  m...@ucw.cz   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Hex dump:  Where witches put used curses...


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund
Martin Mares m...@ucw.cz wrote on 2010/04/23 15:03:34:

 Hello!

  This is a much simpler and efficent impl. of the IP checksum.
  It is a dry port from Quagga and I have not tested it.

 Are you convinced that your version is more efficient? The original version
 processes 32 bits at a time, while your code does only 16 bits at a time. It 
 might
 be worth the saved branch, but do you have any benchmark proving that it
 helps?

Fairly, I once had the same idea for Quagga but found all those extra tests and
additions were much slower(I benched it). Just look at the number of extra ops 
one
has to do in the current code.
If you want to do it faster you have to go ASM. It would be easy to
add support for that too but it can wait.

 Jocke



Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Martin Mares
Hello!

 Fairly, I once had the same idea for Quagga but found all those extra tests 
 and
 additions were much slower(I benched it). Just look at the number of extra 
 ops one
 has to do in the current code.
 If you want to do it faster you have to go ASM. It would be easy to
 add support for that too but it can wait.

My primary reaction was If something isn't broken, don't fix it. I.e.,
unless you have good reasons for rewriting a piece of code, don't do that.

Your version is more readable and I would be in favour of accepting it,
but I would still like to see at least a very simple benchmark which shows
that it is not significantly slower.

(Anyway, it might be interesting to run OProfile on Bird to see which parts
of the code are real hot spots and where we should focus more on maintainibility
than on speed.)

Have a nice fortnight
-- 
Martin `MJ' Mares  m...@ucw.cz   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
To understand a program you must become both the machine and the program.


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Martin Mares
Hello!

 But you can't get rid of:
  z + (z  sum)
 which is the real bottleneck. Perhaps this doesn't cost much
 on high end CPUs but it sure does on embedded CPUs
 
Why should it be?  It can be compiled as a sequence of add with carry
instructions, can't it?

Have a nice fortnight
-- 
Martin `MJ' Mares  m...@ucw.cz   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
A student who changes the course of history is probably taking an exam.


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund

 Hello!

  But you can't get rid of:
   z + (z  sum)
  which is the real bottleneck. Perhaps this doesn't cost much
  on high end CPUs but it sure does on embedded CPUs

 Why should it be?  It can be compiled as a sequence of add with carry
 instructions, can't it?

Yes, but have you seen gcc do that? I havn't, perhaps gcc has become smarter
recently?

 Jocke



Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund
Joakim Tjernlund/Transmode wrote on 2010/04/23 16:14:58:

 
  Hello!
 
   But you can't get rid of:
z + (z  sum)
   which is the real bottleneck. Perhaps this doesn't cost much
   on high end CPUs but it sure does on embedded CPUs
 
  Why should it be?  It can be compiled as a sequence of add with carry
  instructions, can't it?
 Yes, but have you seen gcc do that? I havn't, perhaps gcc has become smarter
 recently?

  Jocke

Just tried this and it didn't with gcc 3.4.3 on PowerPC
Some arch does not have an add with carry insn(MIPS?)

  Jocke

unsigned long
add32(unsigned long sum, unsigned long x)
{
  unsigned long z = sum + x;
  return z + (z  sum);
}

/* gcc -O3 -S gives:

.file   addc.c
.gnu_attribute 4, 2
.gnu_attribute 8, 1
.section.text
.align 2
.globl add32
.type   add32, @function
add32:
add 4,4,3
subfc 3,3,4
subfe 3,3,3
subf 3,3,4
blr
.size   add32, .-add32
.ident  GCC: (Gentoo 4.3.4 p1.1, pie-10.1.5) 4.3.4
.section.note.GNU-stack,,@progbits
 */



Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Martin Mares
Hello!

 Just tried this and it didn't with gcc 3.4.3 on PowerPC

It would be better to let gcc unroll the loop (if it is critical for
performance, it should be unrolled anyway) and use a newer version of gcc.

 Some arch does not have an add with carry insn(MIPS?)

Well, first of all, we should really get a clue about how much time is spent
in the checksum function.

Have a nice fortnight
-- 
Martin `MJ' Mares  m...@ucw.cz   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
I think it's a new feature.  Don't tell anyone it was an accident. :-) -- 
Larry Wall


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Ondrej Filip
On 23.4.2010 18:32, Ondrej Zajicek wrote:
 On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:
 Hello!

 Fairly, I once had the same idea for Quagga but found all those extra tests 
 and
 additions were much slower(I benched it). Just look at the number of extra 
 ops one
 has to do in the current code.
 If you want to do it faster you have to go ASM. It would be easy to
 add support for that too but it can wait.

 My primary reaction was If something isn't broken, don't fix it. I.e.,
 unless you have good reasons for rewriting a piece of code, don't do that.

 Your version is more readable and I would be in favour of accepting it,
 but I would still like to see at least a very simple benchmark which shows
 that it is not significantly slower.
 
 I was curious enough to do some benchmarks and got these results:
 
 Intel Atom:   suggested code ~ 1.2* faster
 AMD Geode:no diference
 MIPS ADM5120: old code ~ 1.2* faster
 
 So there isn't really difference in performance of both
 implementations. Even on slow embedded AMD Geode CPU, it gives
 ~ 180 MB/s.
 

Hmm, so there is no major reason for change. I would really support
dont-fix-whats-not-broken approach. And also we should keep different
code from Quagga to have heterogeneity.

Ondrej






Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund

Ondrej Filip fe...@network.cz wrote on 2010/04/23 18:41:57:

 On 23.4.2010 18:32, Ondrej Zajicek wrote:
  On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:
  Hello!
 
  Fairly, I once had the same idea for Quagga but found all those extra 
  tests and
  additions were much slower(I benched it). Just look at the number of 
  extra ops one
  has to do in the current code.
  If you want to do it faster you have to go ASM. It would be easy to
  add support for that too but it can wait.
 
  My primary reaction was If something isn't broken, don't fix it. I.e.,
  unless you have good reasons for rewriting a piece of code, don't do that.
 
  Your version is more readable and I would be in favour of accepting it,
  but I would still like to see at least a very simple benchmark which shows
  that it is not significantly slower.
 
  I was curious enough to do some benchmarks and got these results:
 
  Intel Atom:   suggested code ~ 1.2* faster
  AMD Geode:   no diference
  MIPS ADM5120:   old code ~ 1.2* faster
 
  So there isn't really difference in performance of both
  implementations. Even on slow embedded AMD Geode CPU, it gives
  ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot


 Hmm, so there is no major reason for change. I would really support
 dont-fix-whats-not-broken approach. And also we should keep different

So it is smaller, faster and easier to read and you still
want to keep the old code? Very non Open source like I
must say.

 code from Quagga to have heterogeneity.

What kind of argument is that?
Sorry, but this all feels like NIH syndrome.


  Ondrej




Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Martin Mares
Hello!

   So there isn't really difference in performance of both
   implementations. Even on slow embedded AMD Geode CPU, it gives
   ~ 180 MB/s.
 
 No difference? what does 1.2 mean? to me this means 20% which is a lot

Yes, but according to Santiago's benchmarks, your code is sometimes 20%
faster, sometimes 20% slower. It does not seem like a reason for change.

If you have any benchmark showing that BIRD spends a substantial amount
of time in this function, let's optimize it properly, even if it means
having several versions for different CPU's. Otherwise, let us stick
with the current code and concentrate our effort on places which matter.

The difference in code size and in readability is really tiny.

Have a nice fortnight
-- 
Martin `MJ' Mares  m...@ucw.cz   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Why is abbreviation such a long word?


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Ondrej Filip
On 23.4.2010 19:01, Joakim Tjernlund wrote:
 
 Ondrej Filip fe...@network.cz wrote on 2010/04/23 18:41:57:

 On 23.4.2010 18:32, Ondrej Zajicek wrote:
 On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:
 Hello!

 My primary reaction was If something isn't broken, don't fix it. I.e.,
 unless you have good reasons for rewriting a piece of code, don't do that.

 Your version is more readable and I would be in favour of accepting it,
 but I would still like to see at least a very simple benchmark which shows
 that it is not significantly slower.

 I was curious enough to do some benchmarks and got these results:

 Intel Atom:   suggested code ~ 1.2* faster
 AMD Geode:   no diference
 MIPS ADM5120:   old code ~ 1.2* faster

 So there isn't really difference in performance of both
 implementations. Even on slow embedded AMD Geode CPU, it gives
 ~ 180 MB/s.
 
 No difference? what does 1.2 mean? to me this means 20% which is a lot

And the code is 20% slower on MIPS. So I do not see the point. Anyway,
20% in a not often used operation does not have to be even visible.

 Hmm, so there is no major reason for change. I would really support
 dont-fix-whats-not-broken approach. And also we should keep different
 
 So it is smaller, faster and easier to read and you still
 want to keep the old code? Very non Open source like I
 must say.

It was not proven that this code is faster.

 code from Quagga to have heterogeneity.
 
 What kind of argument is that?

Well, let me explain you a few things. It had happened many times,
that some wrong (mainly BGP) routing announcement were exported into
Internet and all routers of one type (often Quaggas) has crashed. BIRD
is used on some very important places like world largest IXPs as a
route server (RS). Each IXP usually has at least two different RSs (if
possible to avoid implementation dependent bugs. The second RS is
usually OpenBGPd of Quagga. So generally speaking, I do not want to
use any code from those routing daemons.

 Sorry, but this all feels like NIH syndrome.

Again, BIRD is used on some very important places and therefore we are
very conservative in accepting new patches. But I don't think we have
NIH syndrome. We have been accepting foreign patches since beginning
and Ondrej Zajicek is a good example. :-) He had sent me some new
patches and later we started a regular cooperation and he became a key
developer.

Ondrej


 

  Ondrej
 



Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund
Martin Mares m...@ucw.cz wrote on 2010/04/23 19:23:18:

 Hello!

So there isn't really difference in performance of both
implementations. Even on slow embedded AMD Geode CPU, it gives
~ 180 MB/s.
 
  No difference? what does 1.2 mean? to me this means 20% which is a lot

 Yes, but according to Santiago's benchmarks, your code is sometimes 20%
 faster, sometimes 20% slower. It does not seem like a reason for change.

uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are
you sure that is not a typo?


 If you have any benchmark showing that BIRD spends a substantial amount
 of time in this function, let's optimize it properly, even if it means
 having several versions for different CPU's. Otherwise, let us stick
 with the current code and concentrate our effort on places which matter.

Peformance matter, especially when the network grows. Is this the way
BIRD development works? Only work on stuff that currently feels important
is acceptet?

 Jocke



Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Martin Mares
Hello!

 Peformance matter, especially when the network grows. Is this the way
 BIRD development works? Only work on stuff that currently feels important
 is acceptet?

Yes, performance matters. This is why performance optimizations in BIRD
have to be justified by real numbers, not by hand-waving.

Have a nice fortnight
-- 
Martin `MJ' Mares  m...@ucw.cz   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
God is real, unless declared integer.


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund

Ondrej Filip fe...@network.cz wrote on 2010/04/23 19:28:27:

 On 23.4.2010 19:01, Joakim Tjernlund wrote:
 
  Ondrej Filip fe...@network.cz wrote on 2010/04/23 18:41:57:
 
  On 23.4.2010 18:32, Ondrej Zajicek wrote:
  On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:
  Hello!
 
  My primary reaction was If something isn't broken, don't fix it. I.e.,
  unless you have good reasons for rewriting a piece of code, don't do 
  that.
 
  Your version is more readable and I would be in favour of accepting it,
  but I would still like to see at least a very simple benchmark which 
  shows
  that it is not significantly slower.
 
  I was curious enough to do some benchmarks and got these results:
 
  Intel Atom:   suggested code ~ 1.2* faster
  AMD Geode:   no diference
  MIPS ADM5120:   old code ~ 1.2* faster
 
  So there isn't really difference in performance of both
  implementations. Even on slow embedded AMD Geode CPU, it gives
  ~ 180 MB/s.
 
  No difference? what does 1.2 mean? to me this means 20% which is a lot

 And the code is 20% slower on MIPS. So I do not see the point. Anyway,
 20% in a not often used operation does not have to be even visible.

  Hmm, so there is no major reason for change. I would really support
  dont-fix-whats-not-broken approach. And also we should keep different
 
  So it is smaller, faster and easier to read and you still
  want to keep the old code? Very non Open source like I
  must say.

 It was not proven that this code is faster.

  code from Quagga to have heterogeneity.
 
  What kind of argument is that?

 Well, let me explain you a few things. It had happened many times,
 that some wrong (mainly BGP) routing announcement were exported into
 Internet and all routers of one type (often Quaggas) has crashed. BIRD
 is used on some very important places like world largest IXPs as a
 route server (RS). Each IXP usually has at least two different RSs (if
 possible to avoid implementation dependent bugs. The second RS is
 usually OpenBGPd of Quagga. So generally speaking, I do not want to
 use any code from those routing daemons.

  Sorry, but this all feels like NIH syndrome.

 Again, BIRD is used on some very important places and therefore we are
 very conservative in accepting new patches. But I don't think we have
 NIH syndrome. We have been accepting foreign patches since beginning
 and Ondrej Zajicek is a good example. :-) He had sent me some new
 patches and later we started a regular cooperation and he became a key
 developer.

So basically you are saying that outsiders like my self aren't welcome
because BIRD is so important to some IXPs that you don't want to
take any chanches? I had hoped that the possible changes I would need
to do could be fed back into BIRD so I didn't have to maintain
them myslef forever.

 Jocke



Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Ondrej Filip
 Again, BIRD is used on some very important places and therefore we are
 very conservative in accepting new patches. But I don't think we have
 NIH syndrome. We have been accepting foreign patches since beginning
 and Ondrej Zajicek is a good example. :-) He had sent me some new
 patches and later we started a regular cooperation and he became a key
 developer.
 
 So basically you are saying that outsiders like my self aren't welcome
 because BIRD is so important to some IXPs that you don't want to
 take any chanches? I had hoped that the possible changes I would need
 to do could be fed back into BIRD so I didn't have to maintain
 them myslef forever.

Dear Jocke, no, I didn't say this.

And again, all ideas are welcome. Thank you for yours.

Ondrej


 
  Jocke



Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Martin Mares
Hello!

 So basically you are saying that outsiders like my self aren't welcome
 because BIRD is so important to some IXPs that you don't want to
 take any chanches?

Certainly not. However, it means that the criteria for accepting patches
are somewhat stricter than in many other projects.

All ideas are of course welcome (if this discussion leads to skipping
all byte order conversions in OSPF on big-endian machines, I will be
happy), but please be prepared that being an obvious improvement is
often a highly subjective trait, so the other developers will sometimes
want to see a proof that the patch does indeed produce better results
than the status quo.

In my opinion, one of the key qualities of a good engineer is to prefer
exact observation and measurement of reality over personal feelings.

Have a nice fortnight
-- 
Martin `MJ' Mares  m...@ucw.cz   http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
System going down at 5 pm to install scheduler bug.


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Ondrej Zajicek
On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote:
 So there isn't really difference in performance of both
 implementations. Even on slow embedded AMD Geode CPU, it gives
 ~ 180 MB/s.
  
   No difference? what does 1.2 mean? to me this means 20% which is a lot
 
  Yes, but according to Santiago's benchmarks, your code is sometimes 20%
  faster, sometimes 20% slower. It does not seem like a reason for change.
 
 uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, 
 are
 you sure that is not a typo?

Yes i am sure, i checked it several times.

-- 
Elen sila lumenn' omentielvo

Ondrej 'SanTiago' Zajicek (email: santi...@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
To err is human -- to blame it on a computer is even more so.


signature.asc
Description: Digital signature


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Ondrej Zajicek
On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote:
 Martin Mares m...@ucw.cz wrote on 2010/04/23 19:23:18:
 
  Hello!
 
 So there isn't really difference in performance of both
 implementations. Even on slow embedded AMD Geode CPU, it gives
 ~ 180 MB/s.
  
   No difference? what does 1.2 mean? to me this means 20% which is a lot
 
  Yes, but according to Santiago's benchmarks, your code is sometimes 20%
  faster, sometimes 20% slower. It does not seem like a reason for change.
 
 uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, 
 are
 you sure that is not a typo?

FYI, code z = sum + x, z + (z  sum) was compiled to:

addu$2,$3,$2
sltu$3,$2,$3
addu$3,$2,$3

Therefore, doing half number of iterations outweights in that case.

BTW, it was compiled by GCC 3.4.6

-- 
Elen sila lumenn' omentielvo

Ondrej 'SanTiago' Zajicek (email: santi...@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
To err is human -- to blame it on a computer is even more so.


signature.asc
Description: Digital signature


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund
Ondrej Zajicek santi...@crfreenet.org wrote on 2010/04/23 21:39:06:

 On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote:
  Martin Mares m...@ucw.cz wrote on 2010/04/23 19:23:18:
  
   Hello!
  
  So there isn't really difference in performance of both
  implementations. Even on slow embedded AMD Geode CPU, it gives
  ~ 180 MB/s.
   
No difference? what does 1.2 mean? to me this means 20% which is a lot
  
   Yes, but according to Santiago's benchmarks, your code is sometimes 20%
   faster, sometimes 20% slower. It does not seem like a reason for change.
 
  uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, 
  are
  you sure that is not a typo?

 FYI, code z = sum + x, z + (z  sum) was compiled to:

 addu$2,$3,$2
 sltu$3,$2,$3
 addu$3,$2,$3

OK, MIPS has always been a strange platform to me.
So I had to test myself again:
x84 Core 2 duo, 3.1 MHz:
New code:
   64 byte buffer:   5899 +/-2.3%
  128 byte buffer:   5570 +/-3.1%
  256 byte buffer:   5797 +/-0.3%
  512 byte buffer:   5501 +/-1.1%
 1024 byte buffer:   5357 +/-1.5%
 2048 byte buffer:   5277 +/-0.6%
 4096 byte buffer:   5249 +/-1.2%
 8192 byte buffer:   5245 +/-2.1%
16384 byte buffer:   5221 +/-1.6%

Old code:
   64 byte buffer:   7237 +/-0.4%
  128 byte buffer:   6505 +/-1.7%
  256 byte buffer:   6075 +/-1.6%
  512 byte buffer:   6120 +/-1.6%
 1024 byte buffer:   5773 +/-8.2%
 2048 byte buffer:   5790 +/-2.0%
 4096 byte buffer:   5474 +/-0.7%
 8192 byte buffer:   5679 +/-47.1%
16384 byte buffer:   5339 +/-1.3%

PowerPC MPC 8321, 266 Mhz

New Code:
   64 byte buffer:  68349 +/-8.0%
  128 byte buffer:  58271 +/-8.7%
  256 byte buffer:  52945 +/-8.4%
  512 byte buffer:  50535 +/-8.6%
 1024 byte buffer:  49288 +/-9.6%
 2048 byte buffer:  48984 +/-10.3%
 4096 byte buffer:  48345 +/-8.6%
 8192 byte buffer:  48127 +/-8.4%

Old Code:
   64 byte buffer:  68349 +/-8.0%
  128 byte buffer:  58271 +/-8.7%
  256 byte buffer:  52945 +/-8.4%
  512 byte buffer:  50535 +/-8.6%
 1024 byte buffer:  49288 +/-9.6%
 2048 byte buffer:  48984 +/-10.3%
 4096 byte buffer:  48345 +/-8.6%
 8192 byte buffer:  48127 +/-8.4%

Just for fun, replace add32 with
static inline
unsigned long
add32(unsigned long sum, unsigned long x)
{
asm (addc %0, %0, %1: =r(sum) : r (x));
return sum;
}
MPC 8321 with asm addc:
   64 byte buffer:  52007 +/-8.7%
  128 byte buffer:  41986 +/-9.9%
  256 byte buffer:  37160 +/-11.4%
  512 byte buffer:  34593 +/-10.3%
 1024 byte buffer:  33265 +/-10.4%
 2048 byte buffer:  32648 +/-11.4%
 4096 byte buffer:  32843 +/-14.1%
 8192 byte buffer:  32223 +/-12.5%


So the new code is better on both platforms and the asm addc on ppc is
very fast.

Test prog attached.

 Jocke

(See attached file: crc32test.c)

crc32test.c
Description: Binary data


Re: [PATCH] ipsum_calc_block: Optimize size and speed

2010-04-23 Thread Joakim Tjernlund

Martin Mares m...@ucw.cz wrote on 2010/04/23 23:02:29:

 Hello!

  How did you come to the conclusion the the current code was better than
  the previous version? Seems like hand waving to me.

 Did I claim anywhere that the old code is better?  I only pointed out

You did when you commited it and I asked twice how you came to that conlusion.
Pretty much all recent tests shows otherwise.

 the lack of arguments about the new code being better, which is a reason
 to stay with the old, tested code.

  I told told you I had benched the add carry in C before and it wasn't
  any better(worse actually).

 Actually, back in the ages when I wrote the old checksum function, I have
 checked that it performs better than a trivial implementation, and now you
 claim otherwise, so I naturally want to see new data which show that modern
 hardware behaves differently.

  Santiago benched it too and it was better
  or just as good as before. Only the MIPS had a regression.

 If I recall his results correctly, he has performed three tests:

 In the 1st one, your code was 20% faster.
 In the 2nd one, it was of the same speed.
 In the 3rd one, it was 20% slower.

 Maybe I wear different glasses from yours, but I clearly see that on average,
 there is no improvement.

You have ignored my tests which also show an improvement.


  So what now? what more proof do you need?

 First of all, I want at least a rudimentary proof that IT MATTERS AT ALL.
 We are spending lots of time talking about a minor (20%) speedup in a small
 chunk of code, without having any clue about how often it gets called and what
 fraction of the total time is really spent there.

20% is a lot and I not going to bench it any further. if you are
not happy with the results so far, nothing I can do will change that.