[PATCH] ipsum_calc_block: Optimize size and speed
This is a much simpler and efficent impl. of the IP checksum. It is a dry port from Quagga and I have not tested it. --- lib/checksum.c | 44 +++- 1 files changed, 11 insertions(+), 33 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 33cb386..b836bdb 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -15,25 +15,10 @@ #include nest/bird.h #include checksum.h -static u16 /* One-complement addition */ -add16(u16 sum, u16 x) -{ - u16 z = sum + x; - return z + (z sum); -} - -static u32 -add32(u32 sum, u32 x) -{ - u32 z = sum + x; - return z + (z sum); -} - static u16 ipsum_calc_block(u16 *x, unsigned len, u16 sum) { - int rest; - u32 tmp, *xx; + u32 tmp; /* * A few simple facts about the IP checksum (see RFC 1071 for detailed @@ -51,23 +36,16 @@ ipsum_calc_block(u16 *x, unsigned len, u16 sum) if (!len) return sum; len = 1; - if ((unsigned long) x 2) /* Align to 32-bit boundary */ -{ - sum = add16(sum, *x++); - len--; -} - rest = len 1; - len = 1; - tmp = 0; - xx = (u32 *) x; - while (len) -{ - tmp = add32(tmp, *xx++); - len--; -} - sum = add16(sum, add16(tmp 0x, tmp 16U)); - if (rest) -sum = add16(sum, *(u16 *) xx); + tmp = sum; + for(x--; len; --len) + tmp += *++x; + /* + * Add back carry outs from top 16 bits to low 16 bits. + */ + tmp = (tmp 16) + (tmp 0x);/* add high-16 to low-16 */ + tmp += (tmp 16); /* add carry */ + sum = ~tmp; /* ones-complement, then truncate to 16 bits */ + return sum; } -- 1.6.4.4
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Hello! This is a much simpler and efficent impl. of the IP checksum. It is a dry port from Quagga and I have not tested it. Are you convinced that your version is more efficient? The original version processes 32 bits at a time, while your code does only 16 bits at a time. It might be worth the saved branch, but do you have any benchmark proving that it helps? Have a nice fortnight -- Martin `MJ' Mares m...@ucw.cz http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth Hex dump: Where witches put used curses...
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Martin Mares m...@ucw.cz wrote on 2010/04/23 15:03:34: Hello! This is a much simpler and efficent impl. of the IP checksum. It is a dry port from Quagga and I have not tested it. Are you convinced that your version is more efficient? The original version processes 32 bits at a time, while your code does only 16 bits at a time. It might be worth the saved branch, but do you have any benchmark proving that it helps? Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait. Jocke
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Hello! Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait. My primary reaction was If something isn't broken, don't fix it. I.e., unless you have good reasons for rewriting a piece of code, don't do that. Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower. (Anyway, it might be interesting to run OProfile on Bird to see which parts of the code are real hot spots and where we should focus more on maintainibility than on speed.) Have a nice fortnight -- Martin `MJ' Mares m...@ucw.cz http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth To understand a program you must become both the machine and the program.
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Hello! But you can't get rid of: z + (z sum) which is the real bottleneck. Perhaps this doesn't cost much on high end CPUs but it sure does on embedded CPUs Why should it be? It can be compiled as a sequence of add with carry instructions, can't it? Have a nice fortnight -- Martin `MJ' Mares m...@ucw.cz http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth A student who changes the course of history is probably taking an exam.
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Hello! But you can't get rid of: z + (z sum) which is the real bottleneck. Perhaps this doesn't cost much on high end CPUs but it sure does on embedded CPUs Why should it be? It can be compiled as a sequence of add with carry instructions, can't it? Yes, but have you seen gcc do that? I havn't, perhaps gcc has become smarter recently? Jocke
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Joakim Tjernlund/Transmode wrote on 2010/04/23 16:14:58: Hello! But you can't get rid of: z + (z sum) which is the real bottleneck. Perhaps this doesn't cost much on high end CPUs but it sure does on embedded CPUs Why should it be? It can be compiled as a sequence of add with carry instructions, can't it? Yes, but have you seen gcc do that? I havn't, perhaps gcc has become smarter recently? Jocke Just tried this and it didn't with gcc 3.4.3 on PowerPC Some arch does not have an add with carry insn(MIPS?) Jocke unsigned long add32(unsigned long sum, unsigned long x) { unsigned long z = sum + x; return z + (z sum); } /* gcc -O3 -S gives: .file addc.c .gnu_attribute 4, 2 .gnu_attribute 8, 1 .section.text .align 2 .globl add32 .type add32, @function add32: add 4,4,3 subfc 3,3,4 subfe 3,3,3 subf 3,3,4 blr .size add32, .-add32 .ident GCC: (Gentoo 4.3.4 p1.1, pie-10.1.5) 4.3.4 .section.note.GNU-stack,,@progbits */
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Hello! Just tried this and it didn't with gcc 3.4.3 on PowerPC It would be better to let gcc unroll the loop (if it is critical for performance, it should be unrolled anyway) and use a newer version of gcc. Some arch does not have an add with carry insn(MIPS?) Well, first of all, we should really get a clue about how much time is spent in the checksum function. Have a nice fortnight -- Martin `MJ' Mares m...@ucw.cz http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth I think it's a new feature. Don't tell anyone it was an accident. :-) -- Larry Wall
Re: [PATCH] ipsum_calc_block: Optimize size and speed
On 23.4.2010 18:32, Ondrej Zajicek wrote: On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote: Hello! Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait. My primary reaction was If something isn't broken, don't fix it. I.e., unless you have good reasons for rewriting a piece of code, don't do that. Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower. I was curious enough to do some benchmarks and got these results: Intel Atom: suggested code ~ 1.2* faster AMD Geode:no diference MIPS ADM5120: old code ~ 1.2* faster So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. Hmm, so there is no major reason for change. I would really support dont-fix-whats-not-broken approach. And also we should keep different code from Quagga to have heterogeneity. Ondrej
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Ondrej Filip fe...@network.cz wrote on 2010/04/23 18:41:57: On 23.4.2010 18:32, Ondrej Zajicek wrote: On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote: Hello! Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait. My primary reaction was If something isn't broken, don't fix it. I.e., unless you have good reasons for rewriting a piece of code, don't do that. Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower. I was curious enough to do some benchmarks and got these results: Intel Atom: suggested code ~ 1.2* faster AMD Geode: no diference MIPS ADM5120: old code ~ 1.2* faster So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. No difference? what does 1.2 mean? to me this means 20% which is a lot Hmm, so there is no major reason for change. I would really support dont-fix-whats-not-broken approach. And also we should keep different So it is smaller, faster and easier to read and you still want to keep the old code? Very non Open source like I must say. code from Quagga to have heterogeneity. What kind of argument is that? Sorry, but this all feels like NIH syndrome. Ondrej
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Hello! So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. No difference? what does 1.2 mean? to me this means 20% which is a lot Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change. If you have any benchmark showing that BIRD spends a substantial amount of time in this function, let's optimize it properly, even if it means having several versions for different CPU's. Otherwise, let us stick with the current code and concentrate our effort on places which matter. The difference in code size and in readability is really tiny. Have a nice fortnight -- Martin `MJ' Mares m...@ucw.cz http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth Why is abbreviation such a long word?
Re: [PATCH] ipsum_calc_block: Optimize size and speed
On 23.4.2010 19:01, Joakim Tjernlund wrote: Ondrej Filip fe...@network.cz wrote on 2010/04/23 18:41:57: On 23.4.2010 18:32, Ondrej Zajicek wrote: On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote: Hello! My primary reaction was If something isn't broken, don't fix it. I.e., unless you have good reasons for rewriting a piece of code, don't do that. Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower. I was curious enough to do some benchmarks and got these results: Intel Atom: suggested code ~ 1.2* faster AMD Geode: no diference MIPS ADM5120: old code ~ 1.2* faster So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. No difference? what does 1.2 mean? to me this means 20% which is a lot And the code is 20% slower on MIPS. So I do not see the point. Anyway, 20% in a not often used operation does not have to be even visible. Hmm, so there is no major reason for change. I would really support dont-fix-whats-not-broken approach. And also we should keep different So it is smaller, faster and easier to read and you still want to keep the old code? Very non Open source like I must say. It was not proven that this code is faster. code from Quagga to have heterogeneity. What kind of argument is that? Well, let me explain you a few things. It had happened many times, that some wrong (mainly BGP) routing announcement were exported into Internet and all routers of one type (often Quaggas) has crashed. BIRD is used on some very important places like world largest IXPs as a route server (RS). Each IXP usually has at least two different RSs (if possible to avoid implementation dependent bugs. The second RS is usually OpenBGPd of Quagga. So generally speaking, I do not want to use any code from those routing daemons. Sorry, but this all feels like NIH syndrome. Again, BIRD is used on some very important places and therefore we are very conservative in accepting new patches. But I don't think we have NIH syndrome. We have been accepting foreign patches since beginning and Ondrej Zajicek is a good example. :-) He had sent me some new patches and later we started a regular cooperation and he became a key developer. Ondrej Ondrej
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Martin Mares m...@ucw.cz wrote on 2010/04/23 19:23:18: Hello! So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. No difference? what does 1.2 mean? to me this means 20% which is a lot Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change. uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are you sure that is not a typo? If you have any benchmark showing that BIRD spends a substantial amount of time in this function, let's optimize it properly, even if it means having several versions for different CPU's. Otherwise, let us stick with the current code and concentrate our effort on places which matter. Peformance matter, especially when the network grows. Is this the way BIRD development works? Only work on stuff that currently feels important is acceptet? Jocke
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Hello! Peformance matter, especially when the network grows. Is this the way BIRD development works? Only work on stuff that currently feels important is acceptet? Yes, performance matters. This is why performance optimizations in BIRD have to be justified by real numbers, not by hand-waving. Have a nice fortnight -- Martin `MJ' Mares m...@ucw.cz http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth God is real, unless declared integer.
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Ondrej Filip fe...@network.cz wrote on 2010/04/23 19:28:27: On 23.4.2010 19:01, Joakim Tjernlund wrote: Ondrej Filip fe...@network.cz wrote on 2010/04/23 18:41:57: On 23.4.2010 18:32, Ondrej Zajicek wrote: On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote: Hello! My primary reaction was If something isn't broken, don't fix it. I.e., unless you have good reasons for rewriting a piece of code, don't do that. Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower. I was curious enough to do some benchmarks and got these results: Intel Atom: suggested code ~ 1.2* faster AMD Geode: no diference MIPS ADM5120: old code ~ 1.2* faster So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. No difference? what does 1.2 mean? to me this means 20% which is a lot And the code is 20% slower on MIPS. So I do not see the point. Anyway, 20% in a not often used operation does not have to be even visible. Hmm, so there is no major reason for change. I would really support dont-fix-whats-not-broken approach. And also we should keep different So it is smaller, faster and easier to read and you still want to keep the old code? Very non Open source like I must say. It was not proven that this code is faster. code from Quagga to have heterogeneity. What kind of argument is that? Well, let me explain you a few things. It had happened many times, that some wrong (mainly BGP) routing announcement were exported into Internet and all routers of one type (often Quaggas) has crashed. BIRD is used on some very important places like world largest IXPs as a route server (RS). Each IXP usually has at least two different RSs (if possible to avoid implementation dependent bugs. The second RS is usually OpenBGPd of Quagga. So generally speaking, I do not want to use any code from those routing daemons. Sorry, but this all feels like NIH syndrome. Again, BIRD is used on some very important places and therefore we are very conservative in accepting new patches. But I don't think we have NIH syndrome. We have been accepting foreign patches since beginning and Ondrej Zajicek is a good example. :-) He had sent me some new patches and later we started a regular cooperation and he became a key developer. So basically you are saying that outsiders like my self aren't welcome because BIRD is so important to some IXPs that you don't want to take any chanches? I had hoped that the possible changes I would need to do could be fed back into BIRD so I didn't have to maintain them myslef forever. Jocke
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Again, BIRD is used on some very important places and therefore we are very conservative in accepting new patches. But I don't think we have NIH syndrome. We have been accepting foreign patches since beginning and Ondrej Zajicek is a good example. :-) He had sent me some new patches and later we started a regular cooperation and he became a key developer. So basically you are saying that outsiders like my self aren't welcome because BIRD is so important to some IXPs that you don't want to take any chanches? I had hoped that the possible changes I would need to do could be fed back into BIRD so I didn't have to maintain them myslef forever. Dear Jocke, no, I didn't say this. And again, all ideas are welcome. Thank you for yours. Ondrej Jocke
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Hello! So basically you are saying that outsiders like my self aren't welcome because BIRD is so important to some IXPs that you don't want to take any chanches? Certainly not. However, it means that the criteria for accepting patches are somewhat stricter than in many other projects. All ideas are of course welcome (if this discussion leads to skipping all byte order conversions in OSPF on big-endian machines, I will be happy), but please be prepared that being an obvious improvement is often a highly subjective trait, so the other developers will sometimes want to see a proof that the patch does indeed produce better results than the status quo. In my opinion, one of the key qualities of a good engineer is to prefer exact observation and measurement of reality over personal feelings. Have a nice fortnight -- Martin `MJ' Mares m...@ucw.cz http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth System going down at 5 pm to install scheduler bug.
Re: [PATCH] ipsum_calc_block: Optimize size and speed
On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote: So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. No difference? what does 1.2 mean? to me this means 20% which is a lot Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change. uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are you sure that is not a typo? Yes i am sure, i checked it several times. -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santi...@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) To err is human -- to blame it on a computer is even more so. signature.asc Description: Digital signature
Re: [PATCH] ipsum_calc_block: Optimize size and speed
On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote: Martin Mares m...@ucw.cz wrote on 2010/04/23 19:23:18: Hello! So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. No difference? what does 1.2 mean? to me this means 20% which is a lot Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change. uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are you sure that is not a typo? FYI, code z = sum + x, z + (z sum) was compiled to: addu$2,$3,$2 sltu$3,$2,$3 addu$3,$2,$3 Therefore, doing half number of iterations outweights in that case. BTW, it was compiled by GCC 3.4.6 -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santi...@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) To err is human -- to blame it on a computer is even more so. signature.asc Description: Digital signature
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Ondrej Zajicek santi...@crfreenet.org wrote on 2010/04/23 21:39:06: On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote: Martin Mares m...@ucw.cz wrote on 2010/04/23 19:23:18: Hello! So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. No difference? what does 1.2 mean? to me this means 20% which is a lot Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change. uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are you sure that is not a typo? FYI, code z = sum + x, z + (z sum) was compiled to: addu$2,$3,$2 sltu$3,$2,$3 addu$3,$2,$3 OK, MIPS has always been a strange platform to me. So I had to test myself again: x84 Core 2 duo, 3.1 MHz: New code: 64 byte buffer: 5899 +/-2.3% 128 byte buffer: 5570 +/-3.1% 256 byte buffer: 5797 +/-0.3% 512 byte buffer: 5501 +/-1.1% 1024 byte buffer: 5357 +/-1.5% 2048 byte buffer: 5277 +/-0.6% 4096 byte buffer: 5249 +/-1.2% 8192 byte buffer: 5245 +/-2.1% 16384 byte buffer: 5221 +/-1.6% Old code: 64 byte buffer: 7237 +/-0.4% 128 byte buffer: 6505 +/-1.7% 256 byte buffer: 6075 +/-1.6% 512 byte buffer: 6120 +/-1.6% 1024 byte buffer: 5773 +/-8.2% 2048 byte buffer: 5790 +/-2.0% 4096 byte buffer: 5474 +/-0.7% 8192 byte buffer: 5679 +/-47.1% 16384 byte buffer: 5339 +/-1.3% PowerPC MPC 8321, 266 Mhz New Code: 64 byte buffer: 68349 +/-8.0% 128 byte buffer: 58271 +/-8.7% 256 byte buffer: 52945 +/-8.4% 512 byte buffer: 50535 +/-8.6% 1024 byte buffer: 49288 +/-9.6% 2048 byte buffer: 48984 +/-10.3% 4096 byte buffer: 48345 +/-8.6% 8192 byte buffer: 48127 +/-8.4% Old Code: 64 byte buffer: 68349 +/-8.0% 128 byte buffer: 58271 +/-8.7% 256 byte buffer: 52945 +/-8.4% 512 byte buffer: 50535 +/-8.6% 1024 byte buffer: 49288 +/-9.6% 2048 byte buffer: 48984 +/-10.3% 4096 byte buffer: 48345 +/-8.6% 8192 byte buffer: 48127 +/-8.4% Just for fun, replace add32 with static inline unsigned long add32(unsigned long sum, unsigned long x) { asm (addc %0, %0, %1: =r(sum) : r (x)); return sum; } MPC 8321 with asm addc: 64 byte buffer: 52007 +/-8.7% 128 byte buffer: 41986 +/-9.9% 256 byte buffer: 37160 +/-11.4% 512 byte buffer: 34593 +/-10.3% 1024 byte buffer: 33265 +/-10.4% 2048 byte buffer: 32648 +/-11.4% 4096 byte buffer: 32843 +/-14.1% 8192 byte buffer: 32223 +/-12.5% So the new code is better on both platforms and the asm addc on ppc is very fast. Test prog attached. Jocke (See attached file: crc32test.c) crc32test.c Description: Binary data
Re: [PATCH] ipsum_calc_block: Optimize size and speed
Martin Mares m...@ucw.cz wrote on 2010/04/23 23:02:29: Hello! How did you come to the conclusion the the current code was better than the previous version? Seems like hand waving to me. Did I claim anywhere that the old code is better? I only pointed out You did when you commited it and I asked twice how you came to that conlusion. Pretty much all recent tests shows otherwise. the lack of arguments about the new code being better, which is a reason to stay with the old, tested code. I told told you I had benched the add carry in C before and it wasn't any better(worse actually). Actually, back in the ages when I wrote the old checksum function, I have checked that it performs better than a trivial implementation, and now you claim otherwise, so I naturally want to see new data which show that modern hardware behaves differently. Santiago benched it too and it was better or just as good as before. Only the MIPS had a regression. If I recall his results correctly, he has performed three tests: In the 1st one, your code was 20% faster. In the 2nd one, it was of the same speed. In the 3rd one, it was 20% slower. Maybe I wear different glasses from yours, but I clearly see that on average, there is no improvement. You have ignored my tests which also show an improvement. So what now? what more proof do you need? First of all, I want at least a rudimentary proof that IT MATTERS AT ALL. We are spending lots of time talking about a minor (20%) speedup in a small chunk of code, without having any clue about how often it gets called and what fraction of the total time is really spent there. 20% is a lot and I not going to bench it any further. if you are not happy with the results so far, nothing I can do will change that.