> I'm about to start welding in support for multiple interpreters, running
> both serially and simultaneously, into Parrot. (With provisions for
> starting and coordinating interpreters from other interpreters)
>
> This is just a heads-up, since it probably means platforms without POSIX
> thread support will need to provide some workarounds. (I'll be putting
> together a generic low-level thread interface with stub routines/#defines
> to make things easier) FWIW, I do *not* plan on supporting POSIX d4
> threads--final draft or nothin'. (Or, rather, final draft or someone else
> writes the wrappers...)
>
> Whether threads of some sort will be required for Parrot's up in the air--I
> want to wait and see what sort of performance impact there is before making
> that decision.
>
>                                       Dan
>

Well, the memory allocator is most definitely affected by having
threading enabled at compile time.  The default GNU (glibc) memory allocator
assumes threading, as I'm sure the Solaris one does, so traditional
builds (a la Red Hat) are not going to be affected (assuming there is a
"use-the-OS-malloc" configuration flag like in perl5).  The simple
allocators (like perl5's) just use one big lock, which adds a fixed
overhead whether or not threads are in use (while hurting
scalability).  But glibc's allocator, and the Solaris
magazine architecture (proposed in the earlier email), require a
redesign which adds a tremendous amount of overhead for single
threading (though they scale nicely).

I've been avidly reading up on memory allocators and writing my own
version of a magazine allocator.  I deviate substantially from the
Sun paper: they suggest just using sbrk and mmap for anything below
the slab (since they're trusting that the vmem architecture in the
kernel is fast enough).  This clearly isn't acceptable on non-Solaris
machines, which may be horribly slow at mmap.  Additionally, I'm
taking advantage of the per-thread interpreter, which avoids having
to do ANY locking in the simplest case of small allocs/frees
(sketched below).  More complex cases, such as larger allocs which
are too sensitive to unused memory, trade locking contention for free
space.  The largest allocs (as with most memory schemes) are handled
with simple low-level access (sbrk/mmap).
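
To make that concrete, here's a minimal sketch of the lock-free fast
path I'm describing.  All of the names (Interp, Magazine, mag_alloc,
the depot functions) are hypothetical, and the depot slow path is
stubbed out with plain malloc/free:

    #include <stddef.h>
    #include <stdlib.h>

    #define MAG_ROUNDS   64   /* objects cached per magazine */
    #define SIZE_CLASSES  8   /* small size classes: 8, 16, ... 64 bytes */

    typedef struct Magazine {
        void *rounds[MAG_ROUNDS]; /* cached free objects of one size class */
        int   top;                /* number of objects currently cached */
    } Magazine;

    typedef struct Interp {
        Magazine small_mags[SIZE_CLASSES]; /* owned by exactly one thread */
        /* ... the rest of the per-thread interpreter ... */
    } Interp;

    /* Slow-path stand-ins: a real depot would refill/drain whole
     * magazines under a lock; these just fall through to malloc/free. */
    static void *depot_refill(size_t sc) { return malloc((sc + 1) * 8); }
    static void  depot_return(void *obj) { free(obj); }

    /* The magazine belongs to this thread's interpreter alone, so the
     * common case needs no locking whatsoever. */
    static void *
    mag_alloc(Interp *interp, size_t size_class)
    {
        Magazine *mag = &interp->small_mags[size_class];
        if (mag->top > 0)
            return mag->rounds[--mag->top]; /* a handful of instructions */
        return depot_refill(size_class);    /* locked in a real depot */
    }

    static void
    mag_free(Interp *interp, void *obj, size_t size_class)
    {
        Magazine *mag = &interp->small_mags[size_class];
        if (mag->top < MAG_ROUNDS) {
            mag->rounds[mag->top++] = obj;  /* again, lock-free */
            return;
        }
        depot_return(obj);                  /* locked in a real depot */
    }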

My current incarnation is very modular, which means that there are lots
of function calls in the worst case.  Once I debug it fully, I'll look
into consolidating everything into one large function, which should
avoid some redundant locking and speed things up further.  The bad
part about Sun's paper is that their benchmarks are hogwash.
One of them shows performance when you use alloc/dealloc pairs of a
fixed size.  This is the optimal case for their allocation scheme,
since it only involves a handful of assembly instructions on highly
CPU-local cached memory.  My benchmark takes a rather large array and
randomly chooses cells within the array.  If the cell value is null,
it makes a randomly sized allocation; if it's filled, it frees it.
This at least simulates multi-sized, multi-lifetimed operations.  My
multi-threaded simulator simply uses n thread-local arrays, each of
size max_array_size/n.
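
For reference, the single-threaded version of the benchmark looks
roughly like this (the array size, iteration count, and size cap are
invented for illustration; the multi-threaded simulator runs one such
loop per thread over its own smaller array):

    #include <stdlib.h>

    #define CELLS      100000     /* the "rather large array" */
    #define ITERATIONS 10000000
    #define MAX_SIZE   256        /* cap on random allocation sizes */

    int
    main(void)
    {
        void **cells = calloc(CELLS, sizeof(void *));
        long   i;

        srand(42);
        for (i = 0; i < ITERATIONS; i++) {
            long n = rand() % CELLS;       /* pick a random cell */
            if (cells[n] == NULL)
                cells[n] = malloc(rand() % MAX_SIZE + 1); /* random size */
            else {
                free(cells[n]);            /* random lifetime */
                cells[n] = NULL;
            }
        }
        for (i = 0; i < CELLS; i++)        /* release whatever survived */
            free(cells[i]);
        free(cells);
        return 0;
    }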


Assuming that we utilize one interpreter per thread, there will be a
significantly reduced need for locking (and for CPU-cache sharing on
multi-CPU architectures).  However, once we "require" multithreading,
there is the temptation of handing off functionality like garbage
collection to its own thread.  MT is almost always slower (more total
CPU time) and more memory-hungry than ST (though obviously less real
time can pass in the rarer case of multiple CPUs).  Additionally, I'd
argue that most apps are single-threaded in design, and wouldn't
consider threaded functionality worth the loss of performance.

Perl5 runs measurably slower when built for threading (even when
running only a single thread), and I believe that's mostly due to the
extra pthread function calls.  Granted, Solaris has a nice and fast
proprietary MT library, but that doesn't do the general
(platform-independent) case any good.

I think it's fair to assume no additional logic/locking occurs
within the op-codes (instead, locks are relegated to specialized API
functions that an op-code might eventually call).  So depending on how
the GC (and memory allocation in general) handles MT compilation, we
shouldn't see too bad a slow-down.  But I would still like the option
of a non-MT-compiled core for "hell-bent execution", which I've
regularly found to be useful (operating on multi-megabyte data sets
which can take half an hour or more to process).  A sketch of how
that option might look follows below.
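
Should that non-MT option materialize, I'd expect it to look like the
usual configure-time lock macros, so a threadless build compiles the
locking away entirely.  PARROT_HAS_THREADS and the macro names here
are my own invention:

    #ifdef PARROT_HAS_THREADS
    #  include <pthread.h>
    #  define LOCK_DECL(l)  pthread_mutex_t l
    #  define LOCK_INIT(l)  pthread_mutex_init(&(l), NULL)
    #  define LOCK(l)       pthread_mutex_lock(&(l))
    #  define UNLOCK(l)     pthread_mutex_unlock(&(l))
    #else
    #  define LOCK_DECL(l)  int l        /* keeps declarations parsing */
    #  define LOCK_INIT(l)  ((void)0)
    #  define LOCK(l)       ((void)0)    /* compiles to nothing at all */
    #  define UNLOCK(l)     ((void)0)
    #endif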

-Michael
