On Fri, 2007-02-16 at 18:43 +0200, Peter wrote:
> On Fri, 16 Feb 2007, Gilboa Davara wrote:
> 
> > On Thu, 2007-02-15 at 19:23 +0200, Peter wrote:
> >> On Thu, 15 Feb 2007, Gilboa Davara wrote:
> >>
> >>> Small example.
> > >>> About two years ago I got bored, and decided to implement binary trees in
> > >>> (x86) Assembly.
> > >>> The end result was between 2-10 times faster than GCC (-O2/-O3)
> > >>> generated code. (Depending on the size of the tree)
> >>> The main reason being the lack of a 3 way comparison in C.
> >>> (above/below/equal)
> >>
> >> And assembly lacks it too.
> >
> > ????????!!!?
> >
> > cmp %eax, %ebx
> > jb label_below
> > ja label_above
> > <equal code>
> 
> Each jump is equivalent to a cache line flush.

(Before I begin, my code targets x86_64 [AMD Opteron, Xeon 5xxx] and
i386  [P4 Xeon] - nothing else.)
- I'm talking about short (+127/-128 byte) jumps.
- As far as I remember:
        AMD Opteron's L1I cache line size is 64 bytes.
        P4/Xeon is 128 bytes.
        Core2 is 64 bytes.
- Now, both the AMD and the Core2 use aggressive pre-fetching that will
usually pull in multiple adjacent instruction cache lines.

In short, it is very likely that as long as you keep your in-line
assembly code -small-, it will fit nicely inside the L1 I-cache.
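
For illustration, a minimal sketch (the helper name and the exact
constraints are mine, not code from the actual tree) of wrapping such a
short-jump three-way compare in GCC in-line assembly - the whole block
is a couple of dozen bytes, well inside a single I-cache line:

static inline int cmp3(unsigned int a, unsigned int b)
{
        int r;

        __asm__ (
                "cmpl   %2, %1\n\t"             /* a - b, set the flags  */
                "jb     1f\n\t"                 /* short jump: a below b */
                "ja     2f\n\t"                 /* short jump: a above b */
                "xorl   %0, %0\n\t"             /* equal -> 0            */
                "jmp    3f\n"
                "1:     movl $-1, %0\n\t"       /* below -> -1           */
                "jmp    3f\n"
                "2:     movl $1, %0\n"          /* above -> +1           */
                "3:"
                : "=r" (r)
                : "r" (a), "r" (b)
                : "cc");
        return r;
}

(The -1/0/+1 return is only for the sketch; in the real thing each of
the three paths would branch straight into the tree-walk code.)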

> 
> >> But in C you can get creative with compound
> >> statements:
> >> int x,y;
> >> register int t;
> >>
> >> (t = x - y) && (((t < 0) && below()) || above()) || equal();
> >
> > .. Which will only work if the below/above/equal are made of short
> > statements which is a very problematic pre-requisite.
> 
> inline int below(your,optional,arguments);
> 
> will work fine. So will:
> 
> #define below(a,b,c) (z=a+b+c)

Been there, done that.
As I said, under both Windows and Linux the asm code yielded (much)
better performance.
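
(For reference only - a rough sketch, not the actual code, which also
stored extra data per leaf - this is the portable shape of such a lookup
step. The compiler may well end up comparing the key twice per node;
that missing 3-way test is what the asm version sidesteps.)

struct node {
        unsigned int    key;
        struct node     *left, *right;
};

static struct node *find(struct node *n, unsigned int key)
{
        while (n) {
                if (key < n->key)
                        n = n->left;            /* below */
                else if (key > n->key)
                        n = n->right;           /* above */
                else
                        return n;               /* equal */
        }
        return NULL;
}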

> 
> > In my case I needed to store some additional information in each leaf -
> > making each step a compound statement by itself. (which in turn
> > rendered your compound less effective)
> 
> Don't be so sure about that. A compound statement can be optimized very 
> well.

.. Which will make it as readable as the asm code - or far worse...

> 
> >> which wastes 1 register variable. Still, there is no guarantee that this
> >> generates faster code than an optimizing compiler (and gcc is not known
> >> among the best optimizing compilers). Rewriting above using binary
> >> operators and masks may be even faster.
> >
> > The same code was also tested under Visual Studio 2K3 and showed the 
> > same results. The assembly code was considerably faster than the VS
> > generated binary.
> 
> Assembly is not portable and it is a *** to debug.

No argument there.
(Though if you make your compound code complex enough, it'll make the
asm code far more debuggable)

> Yes, you can make it run faster. It's fun for the 1st few days, after that 
> you need to change 
> something or port it to a NSLU2 and things stop being nice very fast. 
> Especially if someone else needs to compile your code.


As I said above, I usually use -small- (<20 lines) blocks of in-line
assembly code.
Other than that, I'm fanatical about documentation. (Mostly because I
have a very small brain and it takes me 5 minutes to forget why I
trashed rax)

> 
> >> Atomic code execution should not require assembly because segment
> >> locking can be done using C (even if that C is inline assembly for
> >> some applications).
> >
> > A. I -was- talking about in-line assembly.
> > B. How can I implement "lock btX/inc/dec/sub/add" in pure C?
> > (Let alone using the resulting flags. [setXX])
> >
> > BTW, another valid excuse to using assembly (at least in
> > register-barren-world-known-as-i386) is the ability to trash the base
> > pointer. (every register counts.)
> 
> Again, why are you assuming x86 assembly is the target ? It could be ARM 
> or MIPS or PPC. 

If I'm writing multi-platform code, I'll keep in-line assembly to a
minimum (or avoid it altogether).
Contrary to popular belief, I'm not that mad ;)
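
(To answer my own "lock inc/dec" question above - a minimal sketch, the
helper name is mine, in the spirit of the kernel's atomic helpers: an
atomic decrement that hands the result straight back from ZF via setz,
something plain C cannot express [GCC's newer __sync builtins are the
closest portable thing]:)

static inline int atomic_dec_and_test(volatile int *counter)
{
        unsigned char zero;

        __asm__ __volatile__ (
                "lock; decl %0\n\t"     /* atomic decrement in memory */
                "setz   %1"             /* ZF -> 1 if we reached zero */
                : "+m" (*counter), "=q" (zero)
                :
                : "cc", "memory");
        return zero;
}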

> Optimizing x86 makes sense for extreme driver writing, 
> kernel code and such.

But it pays my mortgage ;)

>  Otherwise it makes little sense on a platform that 
> doubles its MIPS speed every 2 years. lock exists only on x86 and it 
> exists because x86 is a brainf***d architecture that allows 'long 
> instructions' (once upon a time known as microcode) to be interrupted in 
> the middle. I assure you that this is a very unique feature among CPUs. 

True.
But my target is i386/x86_64. (With the rare SPARC/POWER from time to
time)

> Think about it, it's the only popular CPU that can be proud of being 
> theoretically able to throw an EINTR *inside* a machine code 
> instruction. Modifying BP + small mistake = crash. Oops.

Naaah.
Stack frame? who needs it ;)

Seriously, the IA32 is brain-dead - no arguments there.
But this brain-dead architecture managed to capture most of the computer
market - and unlike Windows, it does have technical merits. (E.g. IA64
vs. IA32).
My favorite architecture was the Digital Alpha, but it's water under the
bridge now...

> 
> Peter

- Gilboa

