[freenet-dev] Locally scalable Freenet design proposal

Cory Nelson Wed, 16 Jul 2008 07:08:30 -0700

Hey guys,

I know my criticism of Freenet probably makes me a bit unpopular with
you, but I hope you at least know I stick around because I like what
Freenet stands for and am just frustrated from wanting it to get
better.  If I had the time I would spend some of it learning Java so
I could send patches, but I do not, so I'll propose some of what I'd
like to see here.


The biggest reason I have no friends who use Freenet, and am therefore
stuck on opennet, is that it takes up too much CPU.  For people who
run demanding apps like games, Freenet simply isn't low-profile enough
to keep running without a really powerful box.  And really, Freenet
doesn't actually do a whole lot so it shouldn't ever *need* a powerful
box.  Having built highly scalable, performant daemons in C/C++/C#, I
hope I can give some general advice without knowing Java.

Some key ideas for highly scalable software:

a) Use as few threads as possible.  This is important for two reasons:

    aa) If a thread locks and goes into a context switch before
unlocking, all the other threads waiting for the lock are stuck doing
nothing until the other thread wakes up to unlock.  The more threads
there are, the more chances there are of this happening.  Not doing
work sucks, but this sucks even more:  most synchronization
primitives will spin for a short time in user mode in the hope that
the thread holding the lock finishes quickly, avoiding a context
switch.  This usually greatly improves performance.  Long waits
defeat that optimization: the primitive has to fall back to
allocating a kernel object to wait on, which is significantly slower
and increases memory usage.  This is a *huge* scalability killer.

    ab) Let's say that a context switch involves bringing in at least
256 bytes worth of cache lines: thread state, CPU state, stack space,
etc.  With Freenet having 250 threads open -- which for me is about
average with an empty download queue and FMS running -- this means
roughly 64KB (250 x 256 bytes) is being constantly shuffled into
cache.  This is detrimental to performance of the entire system,
especially for low-end systems with 512KB or less cache.

b) Data that is used frequently by different threads should be kept
out of the same cache line (modern cache line size is 64 bytes).  When
two cores write to the same line, even to unrelated variables, the
CPU's cache coherency logic has to bounce the line between them
(false sharing).  I've seen this alone slow an app down by as much as
20% on a Core2 Quad.

c) Work with buffers aligned for the operating system's memory
manager.  Windows locks a full page in memory when you use one for
I/O, so try to keep I/O buffers aligned to 4KB addresses and size.  I
realize Java doesn't give you any way to control the address, so
hopefully it will do the right thing here.
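
To make b) and c) concrete, here is a rough C++ sketch, not anything
taken from Freenet.  It assumes the 64-byte cache lines and 4KB pages
mentioned above (real code should ask the OS), and the names
PaddedCounter, count_in_parallel, IoBuffer and make_io_buffer are made
up for illustration:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>

// Assumptions taken from the text: 64-byte cache lines, 4KB pages.
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kPageSize = 4096;
constexpr unsigned kThreads = 4;

// b) alignas pads each counter out to its own cache line, so threads
// bumping neighbouring counters never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

// Each thread increments only its own counter; returns the grand total.
long count_in_parallel(long increments_per_thread) {
    PaddedCounter counters[kThreads];
    std::thread pool[kThreads];
    for (unsigned t = 0; t < kThreads; ++t) {
        pool[t] = std::thread([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    long total = 0;
    for (unsigned t = 0; t < kThreads; ++t) {
        pool[t].join();
        total += counters[t].value.load();
    }
    return total;
}

// c) Round a requested size up to a whole number of pages.
constexpr std::size_t round_up_to_page(std::size_t bytes) {
    return (bytes + kPageSize - 1) / kPageSize * kPageSize;
}

struct IoBuffer {
    std::unique_ptr<unsigned char[]> storage;  // owns the raw allocation
    unsigned char* data;                       // page-aligned start
    std::size_t size;                          // multiple of kPageSize
};

// Over-allocate by one page, then slide forward to the next boundary.
IoBuffer make_io_buffer(std::size_t bytes) {
    IoBuffer buf;
    buf.size = round_up_to_page(bytes);
    std::size_t space = buf.size + kPageSize;
    buf.storage.reset(new unsigned char[space]);
    void* p = buf.storage.get();
    buf.data = static_cast<unsigned char*>(
        std::align(kPageSize, buf.size, p, space));
    assert(buf.data != nullptr);  // cannot fail: one page of slack
    return buf;
}

// True if the pointer sits exactly on a page boundary.
bool is_page_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kPageSize == 0;
}
```

Dropping the alignas from PaddedCounter packs the counters into one
cache line, which is the case b) warns about.  Timing the two
variants is the easiest way to see how much it costs on a given box.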

The best architecture I have found for scalability involves a
"completion queue".  This is basically just a queue of callbacks that
a thread constantly pops from.  I/O is all done asynchronously, and
pushes a callback onto the queue when the I/O is complete.  The
callback can then initiate more I/O, hash stuff, or do whatever it
needs to.  Locks must always be released before a callback finishes,
or deadlocks would occur.  This gives clean code that is basically a
chain of methods:

void main() {
   queue q;

   begin_request();
   for(;;) {
      callback = q.pop();
      callback();
   }
}

void begin_request() {
   // on_send will be called once the send has finished.
   begin_send(buffer, len, on_send);
}

void on_send(int error, int transferred_bytes) {
   if(error) ...
   else {
      buffer += transferred_bytes;
      len -= transferred_bytes;

      if(len) begin_send(buffer, len, on_send);
      else ...
   }
}

The design is quite different from blocking code, but I would argue it
is just as simple, and can even result in much cleaner code due to
splitting an otherwise huge method into smaller operation-centric
methods which are more easily understood.
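
For anyone who wants to run the idea, here is a small single-threaded
C++ sketch of the same chain.  There is no real socket: begin_send is
a hypothetical stand-in that "sends" at most 4 bytes per call into a
string named wire, so that on_send really does have to re-issue the
send.  Request, wire and send_all are names invented for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Queue of ready-to-run completions; the event loop pops from it.
using Callback = std::function<void()>;
std::queue<Callback> completion_queue;

std::string wire;  // stand-in for the socket: bytes "sent" so far

using SendHandler = std::function<void(int error, std::size_t transferred)>;

// Hypothetical async send: accepts at most 4 bytes per call (a short
// write), then queues the handler instead of calling it directly.
void begin_send(const char* buffer, std::size_t len, SendHandler on_done) {
    const std::size_t n = std::min<std::size_t>(len, 4);
    wire.append(buffer, n);
    completion_queue.push([on_done, n] { on_done(0, n); });
}

// State for one in-flight request.
struct Request {
    const char* buffer;
    std::size_t len;
    bool done;
};

// Completion handler: advance past what was sent, re-issue if needed.
void on_send(Request& req, int error, std::size_t transferred) {
    if (error) { req.done = true; return; }
    req.buffer += transferred;
    req.len -= transferred;
    if (req.len) {
        begin_send(req.buffer, req.len,
                   [&req](int e, std::size_t n) { on_send(req, e, n); });
    } else {
        req.done = true;
    }
}

// Pushes `message` through the loop and returns what reached the wire.
std::string send_all(const std::string& message) {
    wire.clear();
    Request req{message.data(), message.size(), false};
    begin_send(req.buffer, req.len,
               [&req](int e, std::size_t n) { on_send(req, e, n); });
    while (!completion_queue.empty()) {  // same loop as main() above
        Callback cb = std::move(completion_queue.front());
        completion_queue.pop();
        cb();
    }
    assert(req.done);
    return wire;
}
```

Calling send_all("hello, freenet") takes four trips through the loop
and returns the full string.  A real daemon would block in pop()
waiting for the OS to report completions rather than exit when the
queue is empty.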

This would open up two design choices:
a) a single thread per logical processor, each running the loop in
main() above, which can scale fantastically.
b) a single thread, period.  All locks could be removed from the code,
greatly simplifying it and removing any chance of deadlock, priority
inversion, etc.
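
A sketch of option a) in C++, under the assumption that a plain
mutex-plus-condition-variable queue is good enough to show the shape
(a production queue would more likely sit on top of the OS's own
completion mechanism).  CompletionQueue and run_callbacks are
illustrative names:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class CompletionQueue {
public:
    void push(std::function<void()> cb) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            callbacks_.push(std::move(cb));
        }
        ready_.notify_one();
    }

    // Blocks until a callback is available; returns false once the
    // queue has been closed and fully drained.
    bool pop(std::function<void()>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !callbacks_.empty(); });
        if (callbacks_.empty()) return false;
        out = std::move(callbacks_.front());
        callbacks_.pop();
        return true;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> callbacks_;
    bool closed_ = false;
};

// One worker per logical processor, each running the pop/call loop.
// Returns how many callbacks actually ran.
int run_callbacks(int count) {
    CompletionQueue q;
    std::atomic<int> ran{0};

    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 1;  // the call may return 0 if unknown

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&q] {
            std::function<void()> cb;
            while (q.pop(cb)) cb();  // queue lock is released before cb runs
        });
    }
    for (int i = 0; i < count; ++i) q.push([&ran] { ++ran; });
    q.close();
    for (auto& t : pool) t.join();
    return ran.load();
}
```

Option b) is the same loop with a single worker.  At that point the
callbacks never run concurrently, so any state they share needs no
locking of its own.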

I have no idea if Java is capable of this.  Toad seemed to indicate it
isn't, but I thought I'd share my ideas anyway.  I've heard Apache's
MINA mentioned a few times, maybe that can help.

-- 
Cory Nelson
