On 10/31/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:

On Oct 23, 2006, at 6:44 PM, David Balmain wrote:
>> The cost of a double-dereference to dispatch a method is
>> insignificant.  However, the syntax is godawful...
>>
>>       len = instream->m->length_i(instream);
>>
>> ... which is why Dave has macro-fied it:
>>
>>       #define is_length(mis) mis->m->length_i(mis)
>>
>> How about if we implement every method call in Lucy that way?
>
> Sounds good to me. I would have implemented all of Ferret's classes
> this way except that I was naively worrying about the performance
> detriment of the extra pointer dereference. I thought the extra
> speed might be worth the cost in memory.

> I've pondered this a little more, and I think you have a point.  The
> extra deref is a cost in this design, and for a very tight loop, it
> could make a difference.  In general, pointer dereferencing is such a
> cheap operation it's not something worth worrying about.  But this is
> a major design decision.  We're not talking about a localized area of
> code.  This is everywhere, so it's important to choose correctly.

I really think pointer dereferencing is such a cheap operation that it
isn't worth worrying about; it seems to have made very little
difference to Ferret's performance.

> Maybe we can treat the i/o classes as a special case, since I think
> that's where the majority of our function calls occur.  Potentially,
> we could just have the Streams' macros alias functions directly...
>
>      #define InStream_Write_VInt(instream, value) \
>              InStream_write_vint(instream, value)
>
> ... which is akin to declaring InStream_Write_VInt a "final" method.
> (I assume that at the compiler level, that's how "final" methods are
> implemented.)
>
> That's an option if Lucy adopts the same architecture that Ferret and
> KinoSearch have: no subclasses for InStream and OutStream.

> If the function calls are all macrofied, it's easy enough to switch
> up the macro definition and see if there's any effect.  My guess is
> that there will be a small but measurable difference.

I think the difference will be very small indeed. But as you say,
since the function calls are macrofied, it really is easy to swap one
macro definition for the other. The question is which way to go first.
I think we should keep it consistent across the board and leave this
optimization for later. But that is my personal preference, and it
certainly isn't a dealbreaker for me.

> This being C, we can also copy a function pointer out of the vtable
> and use it directly if we identify a really tight loop somewhere as a
> bottleneck.
>
>      InStream_read_vint_t read_vint = instream->_->read_vint;
>      for (i = 0; i < a_lot; i++) {
>          *foo++ = read_vint(instream);
>      }
>
> However, I suspect that if we treat the i/o classes as special cases
> and call their functions directly everywhere, we'll have addressed
> 90% of the bottlenecks.

True, the i/o methods, particularly the vint i/o methods, are the
major bottleneck. I spent a lot of time optimizing write_vint and
read_vint. They're ugly, but they got me a massive 20% performance
gain. Also, adding a skip_vints method, instead of just calling
read_vint in a loop as Lucene does, gave me another 10%.

> Anyway, I changed the Streams to test
> the difference and I mean to refactor the rest of the classes like
> this when I have time.

> Below my sig you'll find the output of a simple benchmarking app
> "speed.c" written by a guy named Michael Lee, run on my G4 laptop and
> then also on a Pentium 4.  It shows the relative costs of some simple
> operations.  You can download it from
> <http://www.vmlinux.org/jocke/programming/www.ontek.com/mikey/speed.c>.
> His original 1997 article on C optimization (which has a broken link
> to speed.c) is at
> <http://www.prism.uvsq.fr/~cedb/local_copies/lee.html>.
>
> Since that article is nearly a decade old, it doesn't spend a lot of
> time on CPU caching, which supposedly is becoming more important
> these days.  I don't have any practical experience doing those kinds
> of optimizations, though.  Maybe we could stuff all the vtables in
> one giant struct and hope that with all our function pointers jammed
> up against one another we'd get less cache thrash.  But I think
> that's chasing our tail.

Unless we can make this really transparent, I don't think it's worth
it. I'm not exactly sure, but maybe just declaring all of the structs
next to each other would give the same benefit. I don't know enough
about C compilers to say for sure.

> The bottom line for me is that I'd prefer that the method calls be
> macrofied because it's going to clean up the source quite a bit.  The
> best way to optimize is to find a better algorithm, and simpler,
> cleaner, smaller source code makes refactoring easier.

This is a must for me too. I couldn't agree more.

--
Dave Balmain
http://www.davebalmain.com/
