Re: Threads: Time to get the terminology straight

2004-01-05 Thread Dan Sugalski
At 2:43 AM + 1/5/04, Nigel Sandever wrote:
05/01/04 01:22:32, Sam Vilain <[EMAIL PROTECTED]> wrote:

[STUFF] :)

In another post you mention Intel hyperthreading.
Essentially, duplicate sets of registers within a single CPU.
Do these need to apply lock on every machine level entity that
they access?
 No.
Why not?

Because they can only modify an entity if it is loaded into a register
and the logic behind hyperthreading won't allow both register sets
to load the same entity concurrently.
This seems generally common for SMT-core CPUs (though some of them 
may allow multiple references in different threads on-chip and just 
keep the data synced up--I've not looked). Unfortunately, it's 
untenable for Parrot. The hardware gets to throw massive amounts of 
parallelism at the problem, a luxury we, alas, don't have. (I'm 
pretty sure they also play some serious register-remapping games and 
deferred & speculative execution tricks to avoid deadlocking)

To do this would require that every time a PMC register was loaded, 
all the PMC registers of every other active interpreter in the 
interpreter group be scanned. Kind of a pricey operation.
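
Something like this hypothetical sketch (invented names, not real
Parrot structures) is what every PMC-register load would cost:

    /* Invented names -- illustrative only, not Parrot code. */
    #define NUM_PREGS 32

    typedef struct PMC PMC;
    typedef struct Interp Interp;
    struct Interp {
        Interp *next;                /* others in the interpreter group */
        PMC    *pmc_reg[NUM_PREGS];
    };

    /* What every PMC-register load would have to do: */
    int pmc_loaded_elsewhere(const Interp *self, const PMC *pmc) {
        for (const Interp *i = self->next; i; i = i->next)
            for (int r = 0; r < NUM_PREGS; r++)
                if (i->pmc_reg[r] == pmc)
                    return 1;   /* conflict found */
        return 0;   /* O(interpreters x registers), on *every* load */
    }
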
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Threads: Time to get the terminology straight

2004-01-05 Thread Nigel Sandever

05/01/04 04:51:20, Sam Vilain <[EMAIL PROTECTED]> wrote:

>On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;
>
>  > I accept that it may not be possible on all platforms, and it may
>  > be too expensive on some others. It may even be undesirable in the
>  > context of Parrot, but I have seen no argument that goes to
>  > invalidate the underlying premise.
>
>I think you missed this:
>
>LT> Different VMs can run on different CPUs. Why should we make atomic
>LT> instructions out of these? We have a JIT runtime performing at 1
>LT> Parrot instruction per CPU instruction for native integers. Why
>LT> should we slow down that by a magnitude of many tenths?
>
>LT> We have to lock shared data, then you have to pay the penalty, but
>LT> not for each piece of code.
.
So far, I have only suggested using the mechanism in conjunction with
PMCs and PMC registers. 

You can't add an in-use flag to a native integer. But then, native integers
are not a part of the VHLLs (Perl/Python/Ruby). They are a constituent part 
of scalars, but they use a different register set and different opcodes.
Copying the integer value of a scalar into an I register would require
locking the scalar's PMC. Once the value is in the I register, operations
performed on it would not need to be synchronised. Once the resultant is
calculated, it needs to be moved back to the PMC and the lock cleared.
There should be no need to interlock on most opcodes dealing with the
I and N register sets.
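
As a rough sketch (all names invented; nothing here is real Parrot
code):

    /* Invented names -- a sketch, not Parrot code. */
    typedef struct { volatile int in_use; long int_val; } ScalarPMC;

    extern void set_in_use(ScalarPMC *p);    /* may block on the   */
    extern void clear_in_use(ScalarPMC *p);  /* shared mutex       */

    long double_plus_one(ScalarPMC *p) {
        set_in_use(p);
        long i0 = p->int_val;    /* copy into an "I register"      */
        clear_in_use(p);

        i0 = i0 * 2 + 1;         /* unsynchronised native ops      */

        set_in_use(p);
        p->int_val = i0;         /* move the resultant back        */
        clear_in_use(p);
        return i0;
    }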

The S registers are a different kettle of fish, and I haven't worked through 
the implications for these. My gut feeling is that the C-style strings pointed 
at by S registers would be protected by the in-use flag set on the PMCs of
the scalars from which they are derived. 

This means that when a single PMC opcode results in a sequence of non-PMC
operations, other shared threads would be blocked from operating
until the sequence of non-PMC ops in the first shared thread was complete.
But ONLY if they attempt access to the same PMC. 

If they are processing PMC or non-PMC operations that do not involve the 
in-use PMC, then they will not be blocked and will be scheduled for their 
timeslices in the normal way.

>
>and this:
>
>LT> I think, that you are missing multiprocessor systems totally.
>

If the mutex mechanism that is used to block the shared threads
is SMP-, NUMA-, AMP- etc. safe, then the mechanism I describe is also 
safe in these environments.

>You are effectively excluding true parallelism by blocking other
>processors from executing Parrot ops while one has the lock. 
.
The block only occurs *IF* concurrent operations on the same data
are attempted.

> You may
>as well skip the thread libraries altogether and multi-thread the ops
>in a runloop like Ruby does.
>
>But let's carry the argument through, restricting it to UP systems,
>with hyperthreading switched off, and running Win32.  Is it even true
>that masking interrupts is enough on these systems?
.
No masking of interrupts is involved anywhere! 
I don't know where the idea arises, but it wasn't from me.

>
>Win32 `Critical Sections' must be giving the scheduler hints not to
>run other pending threads whilst a critical section is running.  Maybe
>it uses the CPU sti/cli flags for that, to avoid the overhead of
>setting a memory word somewhere (bad enough) or calling the system
>(crippling).  In that case, setting STI/CLI might only incur a ~50%
>performance penalty for integer operations.
.
I don't have access to the sources, but I do know that when one 
thread has entered a critical section, all other threads and processes
continue to be scheduled in the normal way, except those that also try
to enter the critical section. 

Scheduling is only disabled for those threads that *ask* to be so
blocked, and no others, whether within the process or in other
processes. How the mechanism works I can only speculate, but no
CLI/STI instructions are involved.

 
When the first thread enters the critsec, a flag is set in the 
critsec memory.

When a second thread attempts to enter the critsec, a flag is 
set in the corresponding scheduler table to indicate that it should 
not be scheduled again until the flag is cleared. 

When the first thread leaves the critsec, the flag in the critsec
memory is cleared and the flag(s) in the scheduler tables for any 
thread(s) blocking on the critsec are also cleared. 

Whichever of the blocked threads is next scheduled acquires the
critsec, sets the flag in the critsec memory, and the process repeats.
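
In skeletal C, the protocol I'm describing would look something like
this. This is pure speculation -- NOT the real Win32 implementation,
just the shape of it:

    #include <windows.h>

    typedef struct { LONG owned; HANDLE wake; } CritSec;

    void cs_init(CritSec *cs) {
        cs->owned = 0;
        cs->wake  = CreateEvent(NULL, FALSE, FALSE, NULL); /* auto-reset */
    }

    void cs_enter(CritSec *cs) {
        /* atomically test-and-set the flag; if it was already set,
           sleep (unscheduled) until the owner wakes us, then retry */
        while (InterlockedExchange(&cs->owned, 1) == 1)
            WaitForSingleObject(cs->wake, INFINITE);
    }

    void cs_leave(CritSec *cs) {
        InterlockedExchange(&cs->owned, 0);   /* clear the flag     */
        SetEvent(cs->wake);                   /* release one waiter */
    }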



No masking of interrupts is involved.

>
>but then there's this:
>
>  NS> Other internal housekeeping operations, memory allocation, garbage
>  NS> collection etc. are performed as "sysopcodes", performed by the VMI
>  NS> within the auspices of the critical section, and thus secured.
>
>UG> there may be times when a GC run needs to be initiated DURING a VM
>UG> operation. if the op requires an immediate large chunk of ram it
>UG> can trigger a GC pass or allocation request. you can't for

Re: Threads: Time to get the terminology straight

2004-01-05 Thread Nigel Sandever

05/01/04 04:34:15, Luke Palmer <[EMAIL PROTECTED]> wrote:

 LP> I think you're saying that each thread gets its own register set and
 LP> register stacks (the call chain is somewhat hidden within the register
 LP> stacks).  Dan has been saying we'll do this all along.
 .
That has been my interpretation, but there are others that appear to be
contradicting that view.

 LP> But you're also saying that a particular PMC can only be loaded into one
 LP> register at a time, in any thread.  So if thread A has a string in its
 LP> P17, then if thread B tries to load that string into its, say, P22, it
 LP> will block until thread A releases it.  Is this correct?
 .
Yes!

 LP> This is a novel idea.  It reduces lock acquisition/release from every
 LP> opcode to only the C opcodes.  That's a win.
 .
That is the intent.

 LP> Sadly, it greatly increases the chances of deadlocks.  Take this
 LP> example:
.
my ($somestr, $otherstr);
sub A {
    $somestr .= "foo";
    $otherstr .= "bar";
}
sub B {
    $otherstr .= "foo";
    $somestr .= "bar";
}
Thread->new(\&A);
Thread->new(\&B);
.
 LP> This is a fairly trivial example, and it should work smoothly if we're
 LP> automatically locking for the user.

 LP> But consider your scheme:  A loads $somestr into its P17, and performs
 LP> the concatenation.   B loads $otherstr into its P17, and performs the
 LP> concatenation.  A tries to load $otherstr into its P18, but blocks
 LP> because it's in B's P17.  B then tries to load $somestr into its P18,
 LP> but blocks because it's in A's P17.  Deadlock.

 LP> Did I accurately portray your scheme?  If not, could you explain what
 LP> yours does in terms of this example?
.
Where your description falls down is in considering each sub as an opcode.
I *think* that each statement within the sub is an opcode.

Therefore the sequence of opcodes in A (simplified) is:

A sets the in-use flag on $somestr when it loads it into P17.
If the in-use flag was already set, it acquires the mutex.
A executes the CONCAT opcode against $somestr, 
constant 'foo' as operand.
Releases the mutex.
A then clears the in-use flag for $somestr.

A then sets the in-use flag for $otherstr.
If the in-use flag was already set, it acquires the mutex.
A executes the CONCAT opcode against P18, 
constant 'bar' as operand.
Releases the mutex.
A then clears the in-use flag for $otherstr.

And the sequence of opcodes in B (simplified) is:

B sets the in-use flag on $otherstr when it loads it into P17.
If the in-use flag was already set, it acquires the mutex.
B executes the CONCAT opcode against $otherstr,
constant 'foo' as operand.
Releases the mutex.
B then clears the in-use flag for $otherstr.

B then sets the in-use flag for $somestr.
If the in-use flag was already set, it acquires the mutex.
B executes the CONCAT opcode against P18,
constant 'bar' as operand.
Releases the mutex.
B then clears the in-use flag for $somestr.

The deadlock doesn't occur (if I'm right) because however you overlap
those two sequences, only one of the two in-use flags is set at any
given time. Blocking only occurs if the two threads attempt 
simultaneous operations on the same PMC. When this occurs, the second
thread blocks until the first completes its operation, whereupon it 
immediately releases the mutex and clears the in-use flag.

I.e. (in this example) only one in-use flag per thread will be set 
at any given time, so the possibility of deadlock cannot arise.
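
In sketch form (invented helper names, not Parrot API), every PMC
opcode follows this shape; the key property is that the in-use flag
spans exactly ONE opcode, never two, so no thread ever holds one
flag while waiting for another:

    /* Invented names -- a sketch of the per-opcode protocol. */
    typedef struct PMC PMC;
    typedef struct STRING STRING;

    extern int  test_and_set_in_use(PMC *p);  /* 1 if already set */
    extern void wait_on_shared_mutex(void);
    extern void clear_in_use(PMC *p);
    extern void do_concat(PMC *p, STRING *s);

    void concat_op(PMC *target, STRING *operand) {
        while (test_and_set_in_use(target))
            wait_on_shared_mutex();     /* block until flag clears */
        do_concat(target, operand);     /* the atomic VM operation */
        clear_in_use(target);           /* wake any waiter         */
    }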

If the situation requires a single opcode to manipulate two PMCs, then
other threads will be prevented from accessing either of those PMCs
until the single opcode on the first thread clears both in-use flags.

It may be possible to set up a scenario where loops of dual-PMC, single
opcodes could deadlock. I haven't convinced myself of this one way or the
other. But notionally, opcodes with dual PMC operands should be rare,
if they exist at all. Mostly, Perl-level statements that involve multiple
PMCs will resolve down to sequences of opcodes that each involve a single
PMC and intermediate results from previous opcodes held in non-PMC
registers.

NOTE: That last paragraph is *highly* speculative on my part. It's how 
I think it should work, not how it does or must. 


 LP> Luke

Does it work? If it does, can it be implemented, cross-platform, within
Dan's constraints? I do not know the answer to this. I'm trying to
work that out now. I would have preferred to have had categorical answers
before bringing it here, but time ran out, and my knowledge of Parrot and
other platforms lets me down.

Cheers, Nigel.



 End of forwarded message 





Re: Threads: Time to get the terminology straight

2004-01-04 Thread Sam Vilain
On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;

  > I accept that it may not be possible on all platforms, and it may
  > be too expensive on some others. It may even be undesirable in the
  > context of Parrot, but I have seen no argument that goes to
  > invalidate the underlying premise.

I think you missed this:

LT> Different VMs can run on different CPUs. Why should we make atomic
LT> instructions out of these? We have a JIT runtime performing at 1
LT> Parrot instruction per CPU instruction for native integers. Why
LT> should we slow down that by a magnitude of many tenths?

LT> We have to lock shared data, then you have to pay the penalty, but
LT> not for each piece of code.

and this:

LT> I think, that you are missing multiprocessor systems totally.

You are effectively excluding true parallelism by blocking other
processors from executing Parrot ops while one has the lock.  You may
as well skip the thread libraries altogether and multi-thread the ops
in a runloop like Ruby does.

But let's carry the argument through, restricting it to UP systems,
with hyperthreading switched off, and running Win32.  Is it even true
that masking interrupts is enough on these systems?

Win32 `Critical Sections' must be giving the scheduler hints not to
run other pending threads whilst a critical section is running.  Maybe
it uses the CPU sti/cli flags for that, to avoid the overhead of
setting a memory word somewhere (bad enough) or calling the system
(crippling).  In that case, setting STI/CLI might only incur a ~50%
performance penalty for integer operations.

but then there's this:

  NS> Other internal housekeeping operations, memory allocation, garbage
  NS> collection etc. are performed as "sysopcodes", performed by the VMI
  NS> within the auspices of the critical section, and thus secured.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation. if the op requires an immediate large chunk of ram it
UG> can trigger a GC pass or allocation request. you can't force those
UG> things to only happen between normal ops (which is what making
UG> them into ops does). so GC and allocation both need to be able to
UG> lock all shared things in their interpreter (and not just do a
UG> process global lock) so those things won't be modified by the
UG> other threads that share them.

I *think* this means that even if we *could* use critical sections for
each op, where this works and isn't terribly inefficient, GC throws a
spanner in the works.  This could perhaps be worked around.

In any case, it won't work on the fastest known threading
implementations (Solaris, Linux NPTL, etc), as they won't know to
block all the other threads in a given process just because one of
them set a CPU flag cycles before it was pre-empted.

So, in summary - it won't work on MP, and on UP, it couldn't possibly
be as overhead-free as the other solutions.

Clear as mud ?  :-)

[back to processors]
> Do these need to apply lock on every machine level entity that
> they access?

Yes, but the only resource that matters here is memory.  Locking
*does* take place inside the processor, but the locks are all close
enough to be inspected in under a cycle.  And misses incur a penalty
of several cycles - maybe dozens, depending on who has the memory
locked.

Registers are also "locked" by virtue of the fact that the
out-of-order execution and pipelining logic will not schedule/allow an
instruction to proceed until its data is ready.  Any CPU with
pipelining has this problem.

There is an interesting comparison to be drawn between the JIT
assembly happening inside the processor from the bytecode being
executed (x86) into a RISC core machine language (µ-ops) on
hyperthreading systems, and Parrot's compiling PASM to native machine
code.  In each case it is the µ-ops that are ordered to maximize
performance and fed into the execution units.

On a hyperthreading processor, it has the luxury of knowing how long
it will take to check the necessary locks for each instruction,
probably under a cycle, so that µ-ops may scream along.

With Parrot, it might have to contact another host over an ethernet
controller to acquire a lock (eg, threads running in an OpenMOSIX
cluster).  This cannot happen for every instruction!
-- 
Sam Vilain, [EMAIL PROTECTED]

  The golden rule is that there are no golden rules
GEORGE BERNARD SHAW




Re: Threads: Time to get the terminology straight

2004-01-04 Thread Luke Palmer
Nigel Sandever writes:
> Whilst the state of higher level objects, that the machine level 
> objects are a part of, may have their state corrupted by two 
> threads modifying things concurrently, the state of the threads
> (register sets + stack) themselves cannot be corrupted. 

I'm going to ask for some clarification here.

I think you're saying that each thread gets its own register set and
register stacks (the call chain is somewhat hidden within the register
stacks).  Dan has been saying we'll do this all along.

But you're also saying that a particular PMC can only be loaded into one
register at a time, in any thread.  So if thread A has a string in its
P17, then if thread B tries to load that string into its, say, P22, it
will block until thread A releases it.  Is this correct?

This is a novel idea.  It reduces lock acquisition/release from every
opcode to only the C opcodes.  That's a win.

Sadly, it greatly increases the chances of deadlocks.  Take this
example:

my ($somestr, $otherstr);
sub A {
    $somestr .= "foo";
    $otherstr .= "bar";
}
sub B {
    $otherstr .= "foo";
    $somestr .= "bar";
}
Thread->new(\&A);
Thread->new(\&B);

This is a fairly trivial example, and it should work smoothly if we're
automatically locking for the user. 

But consider your scheme:  A loads $somestr into its P17, and performs
the concatenation.   B loads $otherstr into its P17, and performs the
concatenation.  A tries to load $otherstr into its P18, but blocks
because it's in B's P17.  B then tries to load $somestr into its P18,
but blocks because it's in A's P17.  Deadlock.

Did I accurately portray your scheme?  If not, could you explain what
yours does in terms of this example?

Luke

> Regards, Nigel


Re: Threads: Time to get the terminology straight

2004-01-04 Thread Nigel Sandever
05/01/04 01:22:32, Sam Vilain <[EMAIL PROTECTED]> wrote:

[STUFF] :)

In another post you mention Intel hyperthreading. 
Essentially, duplicate sets of registers within a single CPU.

Do these need to apply lock on every machine level entity that
they access?
 No. 

Why not?

Because they can only modify an entity if it is loaded into a register
and the logic behind hyperthreading won't allow both register sets 
to load the same entity concurrently. 

(I know this is a gross simplification of the interactions 
between the on-board logic and L1/L2 caching!)

--- Not an advert or glorification of Intel. Just an example -

Hyper-Threading Technology provides thread-level-parallelism (TLP) 
on each processor resulting in increased utilization of processor 
execution resources. As a result, resource utilization yields higher 
processing throughput. Hyper-Threading Technology is a form of 
simultaneous multi-threading technology (SMT) where multiple 
threads of software applications can be run simultaneously on one
processor. 

This is achieved by duplicating the *architectural state* on each 
processor, while *sharing one set of processor execution resources*.
--

The last paragraph is the salient one as far as I am concerned.

The basic premise of my original proposal was that multi-threaded,
machine level applications don't have to interlock on machine level 
entities, because each operation they perform is atomic. 

Whilst the state of higher level objects, that the machine level 
objects are a part of, may have their state corrupted by two 
threads modifying things concurrently, the state of the threads
(register sets + stack) themselves cannot be corrupted. 

This is because they have their own internally consistent state,
that only changes atomically, and that is completely separated,
each from the other. They only share common data (code is data
to the CPU, just as bytecode is data to a VM).

So, if you are going to emulate a (hyper)threaded CPU in a 
register-based virtual machine interpreter, and allow for
concurrent threads of execution within that VMI,
then one way of ensuring that the internal state of the 
VMI is never corrupted would be to have each thread keep
its own copy of the *architectural state* of the VM, whilst
sharing *one set of processor execution resources*.

For this to work, you would need to achieve the same opcode
atomicity at the VMI level. Interlocking the threads so that
one shared thread cannot start an opcode until another shared 
thread has completed gives this atomicity. The penalty is that
if the interlocking is done for every opcode, then shared 
threads end up with very long virtual timeslices. To prevent 
that being the case (most of the time), the interlocking should
only come into effect *if* concurrent access to a VM-level 
entity is imminent. 

As the VMI cannot access (modify) the state of a VM-level
entity (PMC) until it has loaded it into a VM register, the
interlocking need only come into effect *if* the entity
whose reference is being loaded into a PMC register is 
currently in use by (another) thread. 

The state of a PMC's in-useness can be flagged by a single bit
in its header. This can be detected by a shared thread when
the reference to it is loaded into the PMC register, and 
when the bit is found set, that shared thread waits on the single,
shared mutex before proceeding.

It is only when the combination of atomised VM opcodes
and lightweight in-use detection come together that the
need for a mutex per entity can be avoided.

If the mutex used is capable of handling SMP, NUMA,
clusters etc., then the mechanism will work. 

If a lightweight bit-test-and-set opcode isn't available,
then a heavyweight equivalent could be used, though the
advantages would be reduced.
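
By way of illustration only -- using C11 atomics as a modern
shorthand, with an invented flag bit; nothing here is Parrot's
actual header layout -- the lightweight path might be:

    #include <stdatomic.h>

    #define PMC_IN_USE 0x01u                /* invented header bit */

    /* Returns 0 if we acquired the in-use bit cheaply; 1 means it
       was already set, so fall back to waiting on the shared mutex. */
    static int pmc_test_and_set(_Atomic unsigned *flags) {
        return (atomic_fetch_or(flags, PMC_IN_USE) & PMC_IN_USE) != 0;
    }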


>Sam Vilain, [EMAIL PROTECTED]

I hope that clarifies my thinking and how I arrived at it.

I accept that it may not be possible on all platforms, and
it may be too expensive on some others. It may even be 
undesirable in the context of Parrot, but I have seen no
argument that goes to invalidate the underlying premise.

Regards, Nigel





Re: Threads: Time to get the terminology straight

2004-01-04 Thread Sam Vilain
On Mon, 05 Jan 2004 12:58, Nigel Sandever wrote;

  > Everything else, were my attempts at solving the requirements of
  > synchronisation that this would require, whilst minimising the
  > cost of that synchronisation, by avoiding the need for a mutex on
  > every shared entity, and the costs of attempting to acquire a mutex
  > except when two SHARED THREADS attempted concurrent access to a
  > shared entity.

This paragraph sounds like you're trying to solve an intractable problem.
Try posting some pseudocode to explain what you mean.

But it has given me an idea that could minimise the number of locks,
ie not require a mutex on each PMC, just each shared PMC.  10 points
to the person who finds a flaw in this approach :)

  Each object you share, could create a new (virtual) shared memory
  segment with its own semaphore.  This virtual memory segment is
  considered its own COW domain (ie, its own thread to the GC);
  references inserted back to non-shared memory will pull the
  structures into that virtual COW thread.

  Access to the entire structure is controlled via a multiple
  reader/single writer lock (close to what a semaphore is IIRC); locks
  for a thread are released when references to places inside the
  shared segment are no longer anywhere in any @_ on the locking
  thread's call stack, or in use by any opcode (is that good enough?),
  and are acquired for writing when anything needs to be changed.

  Virtual shared memory segments can then easily be cleaned up by
  normal GC.

The major problem I can see is that upgrading a lock from reading to
writing can't work if there are concurrent writes (and the read lock
to be upgraded cannot sensibly be released).  But that should be OK,
since operation signatures will mark variables that need changing as
read-write as early as possible.

For example, in this sort of code (sorry for P5 code);

  sub changer {
 my $shared_object = shift;
 $shared_object->{bar} = "baz";
  }

A read lock to the segment \$shared_object is in is acquired, then
released when it is `shifted' off.  As the next instruction has a
writable lvalue, it acquires a write lock.  But this code:

  sub changer {
 my $shared_object = shift;
 $shared_object->{bar} = &somefunc();
  }

Will hold the write lock on $shared_object open until &somefunc
runs.
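
In pthreads terms the upgrade hazard looks like this (a hypothetical
sketch, with a rwlock standing in for the segment's semaphore):

    #include <pthread.h>

    pthread_rwlock_t seg = PTHREAD_RWLOCK_INITIALIZER;

    void update_shared(void) {
        pthread_rwlock_rdlock(&seg);
        /* ... inspect the shared object ... */
        /* pthread_rwlock_wrlock(&seg) here would self-deadlock:
           the lock can't be upgraded while we still hold it read. */
        pthread_rwlock_unlock(&seg);    /* must drop the read lock  */
        pthread_rwlock_wrlock(&seg);    /* reacquire for writing -- but
                                           the state may have changed */
        /* ... modify ... */
        pthread_rwlock_unlock(&seg);
    }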

My 2¢ :).  This discussion will certainly reach a dollar soon ;).
-- 
Sam Vilain, [EMAIL PROTECTED]

  Start every day with a smile and get it over with. 
W C FIELDS




Re: Threads: Time to get the terminology straight

2004-01-04 Thread Nigel Sandever
On Sun, 4 Jan 2004 15:47:35 -0500, [EMAIL PROTECTED] (Dan Sugalski) wrote:

> *) INTERPRETER - those bits of the Parrot_Interp structure that are 
> absolutely required to be thread-specific. This includes the current 
> register sets and stack pointers, as well as security context 
> information. Basically if a continuation captures it, it's the 
> interpreter.
> 
> *) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp 
> structure that aren't required to be thread-specific (though I'm not 
> sure there are any) *PLUS* anything pointed to that doesn't have to 
> be thread-specific.
> 
> The environment includes the global namespaces, pads, stack chunks, 
> memory allocation regions, arenas, and whatnots. Just because the 
> pointer to the current pad is thread-specific doesn't mean the pad 
> *itself* has to be. It can be shared.
> 

> *) SHARED THREAD - A thread that's part of a group of threads sharing 
> a common interpreter environment.

Ignoring the implementation of the synchronisation required, the basic
premise of my long post was that each SHARED THREAD should have its
own INTERPRETER (a VM in my terms), and that these should share a 
common INTERPRETER ENVIRONMENT.

Simplistically, 5005threads shared an INTERPRETER ENVIRONMENT 
and a single INTERPRETER. Synchronising threaded access to the shared
INTERPRETER (rather than its environment) was the biggest headache.
(I *think*).

With ithreads, each SHARED THREAD has its own INTERPRETER *and*
INTERPRETER ENVIRONMENT. This removes the contention for, and the
need to synchronise access to, the INTERPRETER, but requires the 
duplication of shared elements of the INTERPRETER ENVIRONMENT and
copy-on-read, with the inherent costs of the duplication at start-up 
and slow, indirect access to shared entities across the duplicated 
INTERPRETER ENVIRONMENTS.

My proposal was that each SHARED THREAD 
should have a separate copy of the INTERPRETER, 
but share a copy of the INTERPRETER ENVIRONMENT.
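
In rough struct terms, the split I'm picturing looks something like
this (field names invented; not the real Parrot_Interp layout):

    typedef struct Environment {      /* ONE, shared by all threads  */
        void *global_namespaces;
        void *pads;
        void *arenas;                 /* allocation regions, GC data */
    } Environment;

    typedef struct Interpreter {      /* one per SHARED THREAD */
        void        *pmc_reg[32];     /* register sets         */
        long         int_reg[32];
        void        *stack_top;       /* stack pointers        */
        Environment *env;             /* --> the shared part   */
    } Interpreter;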

Everything else was my attempt at solving the requirements of 
synchronisation that this would entail, whilst minimising the cost
of that synchronisation by avoiding the need for a mutex on every
shared entity, and the cost of attempting to acquire a mutex except
when two SHARED THREADS attempted concurrent access to a
shared entity. 

I think that by having SHARED THREAD == INTERPRETER, sharing
a common INTERPRETER ENVIRONMENT, you can avoid some of 
the problems associated with 5005threads but retain the direct 
access of shared entities. 

This imposes its own set of requirements and costs, but (I believe)
the ideas that underlie the mechanisms I offered as solutions
are sound. The specific implementation is a platform-specific detail
that could be pushed down to a lower level.

> ...those bits of the Parrot_Interp 
> structure that aren't required to be thread-specific (though I'm not 
> sure there are any) 

This is where I have a different (and quite possibly incorrect) view.
My mental picture of the INTERPRETER ENVIRONMENT includes
both the implementation of all the classes in the process *plus* all
the memory of every instance of those classes. 

I think your statement above implies that these would not be a part
of the INTERPRETER ENVIRONMENT per se, but would be allocated 
from the global heap and only referenced from the bytecode that would
live in the INTERPRETER ENVIRONMENT? 

I realise that this is possible, and maybe even desirable, but the cost
of the GC walking a global heap, especially in the situation of a single 
process that contains two entirely separate instances of the 
INTERPRETER ENVIRONMENT, would be (I *think*) rather high.

I realise that this is a fairly rare occurrence on most platforms,
but in the Win32 situation of emulated forks, each pseudo-process
must have an entirely separate INTERPRETER ENVIRONMENT,
potentially with each having multiple SHARED THREADS. 

If the memory for all entities in all pseudo-processes is allocated from 
a (real) process-global heap, then the multiple GCs required by the 
multiple pseudo-processes are going to be walking the same heap,
possibly concurrently.  I realise that this problem (if it is such)
does not occur on platforms that have real forks available, but it
would be useful if the high-level design allowed for the use of
separate (virtual) heaps tied to the INTERPRETER ENVIRONMENTs,
which Win32 has the ability to do.


> 
>  Dan
>

Nigel.





Threads: Time to get the terminology straight

2004-01-04 Thread Dan Sugalski
I think some of the massive back and forth that's going on is in part 
due to terminology problems, which are in part causing some 
conceptual problems. So, for the moment, let's agree on the following 
things:

*) MUTEX - this is a low level, under the hood, not exposed to users, 
thing that can be locked. They're non-recursive, non-read/write, 
exclusive things. When a thread gets a mutex, any other attempt to 
get that mutex will block until the owning thread releases the mutex. 
The platform-native lock construct will be used for this.

*) LOCK - This is an exposed-to-HLL-code thing that can be locked. 
Only PMCs can be locked, and the lock may or may not be recursive or 
read/write.

*) CONDITION VARIABLE - the "sleep until something pings me" 
construct. Useful for queue construction, always associated with a 
MUTEX.

*) RENDEZVOUS POINT - A HLL version of a condition variable. *not* 
associated with a lock -- these are standalone.

Note that the mutex/condition association's a POSIX limitation, and 
POSIX threads is what we have on some platforms. If you want to 
propose abstracting it away, go for it. The separation doesn't buy us 
anything, though it's useful in other circumstances.
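
For reference, the POSIX pairing in question (bog-standard pthreads;
the queue state is just an invented example):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int items = 0;               /* invented queue state */

    void consumer(void) {
        pthread_mutex_lock(&m);
        while (items == 0)              /* loop guards against  */
            pthread_cond_wait(&c, &m);  /* spurious wakeups     */
        items--;
        pthread_mutex_unlock(&m);
    }

    void producer(void) {
        pthread_mutex_lock(&m);
        items++;
        pthread_cond_signal(&c);        /* "ping" a sleeper */
        pthread_mutex_unlock(&m);
    }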

*) INTERPRETER - those bits of the Parrot_Interp structure that are 
absolutely required to be thread-specific. This includes the current 
register sets and stack pointers, as well as security context 
information. Basically if a continuation captures it, it's the 
interpreter.

*) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp 
structure that aren't required to be thread-specific (though I'm not 
sure there are any) *PLUS* anything pointed to that doesn't have to 
be thread-specific.

The environment includes the global namespaces, pads, stack chunks, 
memory allocation regions, arenas, and whatnots. Just because the 
pointer to the current pad is thread-specific doesn't mean the pad 
*itself* has to be. It can be shared.

*) INDEPENDENT THREAD - A thread that has no contact *AT ALL* with 
the internal data of any other thread in the current process. 
Independent threads need no synchronization for anything other than 
what few global things we have. And the fewer the better, though alas 
we can't have none at all.

Note that independent threads may still communicate back and forth by 
passing either atomic things (ints, floats, and pointers) or static 
buffers that can become the property of the destination thread.

*) SHARED THREAD - A thread that's part of a group of threads sharing 
a common interpreter environment.

Anyway, there's some terminology. It doesn't solve the design 
problem, but hopefully it'll help everyone talk the same language.

Remember that everything from the wrapped OS interface on up is up 
for grabs -- while we're not going to build our own mutexes or thread 
scheduler, everything that's been implemented or designed to date can 
be changed with sufficient good reason. (Though, again, the more you 
want to change the more spectacular the design has to be)
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk