On Fri, 2007-02-16 at 18:43 +0200, Peter wrote:
> On Fri, 16 Feb 2007, Gilboa Davara wrote:
>
> > On Thu, 2007-02-15 at 19:23 +0200, Peter wrote:
> >> On Thu, 15 Feb 2007, Gilboa Davara wrote:
> >>
> >>> Small example.
> >>> About two years ago I got bored, and decided to implement binary trees
> >>> in (x86) Assembly.
> >>> The end result was between 2-10 times faster than GCC (-O2/-O3)
> >>> generated code. (Depending on the size of the tree.)
> >>> The main reason being the lack of a 3-way comparison in C.
> >>> (above/below/equal)
> >>
> >> And assembly lacks it too.
> >
> > ????????!!!?
> >
> > cmp %eax, %ebx
> > jb label_below
> > ja label_above
> > <equal code>
>
> Each jump is equivalent to a cache line flush.
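(To put the three-way comparison in context: a minimal sketch of what the
inner search loop looks like in plain C - the node/key names here are
illustrative, not taken from my actual code:)

#include <stddef.h>

struct node {
    unsigned long key;
    struct node *left, *right;
};

/* Plain C spells out two comparisons per node; the hand-written asm
 * does a single cmp followed by jb/ja and a fall-through. */
struct node *find(struct node *n, unsigned long key)
{
    while (n != NULL) {
        if (key < n->key)          /* "below" -> jb */
            n = n->left;
        else if (key > n->key)     /* "above" -> ja */
            n = n->right;
        else                       /* "equal" -> fall through */
            return n;
    }
    return NULL;
}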
(Before I begin, my code targets x86_64 [AMD Opteron, Xeon 5xxx] and i386
[P4 Xeon] - nothing else.)

- I'm talking about short (+127/-128 byte) jumps.
- As far as I remember, the AMD Opteron's L1I cache line size is 64 bytes,
  the P4/Xeon's is 128 bytes and the Core2's is 64 bytes.
- Both the AMD and the Core2 use aggressive pre-fetching that will usually
  result in multiple adjacent instruction cache lines being fetched.

In short, it is very likely that as long as you keep your in-line assembly
code -small-, you will fit your code nicely inside the L1 I-cache.

> >> But in C you can get creative with compound
> >> statements:
> >>
> >> int x,y;
> >> register int t;
> >>
> >> (t = x - y) && (((t < 0) && below()) || above()) || equal();
> >
> > .. Which will only work if below/above/equal are made of short
> > statements, which is a very problematic prerequisite.
>
> inline int below(your,optional,arguments);
>
> will work fine. So will:
>
> #define below(a,b,c) (z=a+b+c)

Been there, done that.
As I said, under both Windows and Linux the asm code yielded (much) better
performance.

> > In my case I needed to store some additional information in each leaf -
> > making each step a compound statement by itself. (Which, in turn,
> > rendered your compound less effective.)
>
> Don't be so sure about that. A compound statement can be optimized very
> well.

.. Which will make it as readable as the asm code - or far worse...

> >> which wastes 1 register variable. Still, there is no guarantee that
> >> this generates faster code than an optimizing compiler (and gcc is not
> >> known as one of the best optimizing compilers). Rewriting the above
> >> using binary operators and masks may be even faster.
> >
> > The same code was also tested under Visual Studio 2K3 and showed the
> > same results. The assembly code was considerably faster than the
> > VS-generated binary.
>
> Assembly is not portable and it is a *** to debug.

No argument there.
(Though if you make your compound code complex enough, it'll make the asm
code look far more debuggable by comparison.)

> Yes, you can make it run faster. It's fun for the 1st few days, after
> that you need to change something or port it to an NSLU2 and things stop
> being nice very fast. Especially if someone else needs to compile your
> code.

As I said above, I usually use -small- (<20 lines) blocks of in-line
assembly code. Other than that, I'm fanatical about documentation.
(Mostly because I have a very small brain and it takes me 5 minutes to
forget why I trashed rax.)

> >> Atomic code execution should not require assembly because segment
> >> locking can be done using C (even if that C is inline assembly for
> >> some applications).
> >
> > A. I -was- talking about in-line assembly.
> > B. How can I implement "lock btX/inc/dec/sub/add" in pure C?
> > (Let alone use the resulting flags. [setXX])
> >
> > BTW, another valid excuse for using assembly (at least in the
> > register-barren world known as i386) is the ability to trash the base
> > pointer. (Every register counts.)
>
> Again, why are you assuming x86 assembly is the target? It could be ARM
> or MIPS or PPC.

If I'm writing multi-platform code, I'll keep in-line assembly to a
minimum. (Or none at all.)
Contrary to popular belief, I'm not that mad ;)

> Optimizing x86 makes sense for extreme driver writing,
> kernel code and such.

But it pays my mortgage ;)

> Otherwise it makes little sense on a platform that
> doubles its MIPS speed every 2 years.
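(To illustrate the "lock inc/dec plus setXX" point above - a rough sketch
of the kind of GCC extended inline asm I mean, for i386/x86_64; the
function name is made up, not taken from my actual code:)

/* Atomically decrement *v and return non-zero if it reached zero.
 * The sete captures ZF set by the locked decrement. */
static inline int atomic_dec_and_test(volatile int *v)
{
    unsigned char zero;

    __asm__ __volatile__(
        "lock; decl %0\n\t"   /* atomically decrement the counter  */
        "sete %1"             /* ZF -> 1 if the result is 0        */
        : "+m" (*v), "=q" (zero)
        :
        : "memory", "cc");

    return zero;
}

(Newer GCC releases [4.1 and up] also provide __sync_fetch_and_add() and
friends, which are more portable - but you don't get direct access to the
flags.)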
> lock exists only on x86 and it
> exists because x86 is a brainf***d architecture that allows 'long
> instructions' (once upon a time known as microcode) to be interrupted in
> the middle. I assure you that this is a very unique feature among CPUs.

True. But my target is i386/x86_64. (With the rare SPARC/POWER from time
to time.)

> Think about it, it's the only popular CPU that can be proud of being
> theoretically able to throw an EINTR *inside* a machine code
> instruction. Modifying BP + small mistake = crash. Oops.

Naaah. Stack frame? Who needs it ;)
Seriously, IA32 is brain-dead - no argument there.
But this brain-dead architecture managed to capture most of the computer
market - and unlike Windows, it does have technical merits. (E.g. IA64 vs.
IA32.)
My favorite architecture was the Digital Alpha, but that's water under the
bridge now...

> > Peter

- Gilboa