Re: Threads: Time to get the terminology straight
At 2:43 AM + 1/5/04, Nigel Sandever wrote:
> 05/01/04 01:22:32, Sam Vilain <[EMAIL PROTECTED]> wrote:
> [STUFF] :)
>
> In another post you mention Intel hyperthreading. Essentially,
> duplicate sets of registers within a single CPU. Do these need to
> apply a lock on every machine level entity that they access? No.
> Why not? Because they can only modify an entity if it is loaded
> into a register, and the logic behind hyperthreading won't allow
> both register sets to load the same entity concurrently.

This seems generally common for SMT core CPUs (though some of them may allow multiple references in different threads on-chip and just keep the data sync'd up -- I've not looked) but, unfortunately, is untenable for Parrot. The hardware gets to throw massive amounts of parallelism at the problem, a luxury we, alas, don't have. (I'm pretty sure they also play some serious register remapping games and deferred & speculative execution tricks to not deadlock.)

To do this would require that every time a PMC register was loaded, all the PMC registers of every other active interpreter in the interpreter group be scanned. Kind of a pricey operation.

-- Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED]  have teddy bears and even teddy bears get drunk
Re: Threads: Time to get the terminology straight
05/01/04 04:51:20, Sam Vilain <[EMAIL PROTECTED]> wrote:

>On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;
> > I accept that it may not be possible on all platforms, and it may
> > be too expensive on some others. It may even be undesirable in the
> > context of Parrot, but I have seen no argument that goes to
> > invalidate the underlying premise.
>
>I think you missed this:
>
>LT> Different VMs can run on different CPUs. Why should we make atomic
>LT> instructions out of these? We have a JIT runtime performing at 1
>LT> Parrot instruction per CPU instruction for native integers. Why
>LT> should we slow down that by a magnitude of many tenths?
>
>LT> We have to lock shared data, then you have to pay the penalty, but
>LT> not for each piece of code.

So far, I have only suggested using the mechanism in conjunction with PMCs and PMC registers. You can't add an in-use flag to a native integer. But then, native integers are not a part of the VHLLs (Perl/Python/Ruby). They are a constituent part of scalars, but scalars use a different register set and opcodes.

Copying the integer value of a scalar into an I register would require locking the scalar's PMC. Once the value is in the I register, operations performed on it would not need to be synchronised. Once the resultant is calculated, it needs to be moved back to the PMC and the lock cleared. There should be no need to interlock on most opcodes dealing with the I and N register sets.

The S registers are a different kettle of fish, and I haven't worked through the implications for these. My gut feeling is that the C-style strings pointed at by S registers would be protected by the in-use flag set on the PMCs for the scalars from which they are derived.

This means that when a single PMC opcode results in a sequence of non-PMC operations, other shared threads would be blocked from operations until the sequence of non-PMC ops in the first shared thread was complete -- but ONLY if they attempt access to the same PMC.
If they are processing PMC or non-PMC operations that do not involve the in-use PMC, then they will not be blocked and will be scheduled for their timeslices in the normal way.

>and this:
>
>LT> I think, that you are missing multiprocessor systems totally.

If the mutex mechanism that is used to block the shared threads is SMP, NUMA, AMP etc. safe, then the mechanism I describe is also safe in these environments.

>You are effectively excluding true parallelism by blocking other
>processors from executing Parrot ops while one has the lock.

The block only occurs *IF* concurrent operations on the same data are attempted.

>You may
>as well skip the thread libraries altogether and multi-thread the ops
>in a runloop like Ruby does.
>
>But let's carry the argument through, restricting it to UP systems,
>with hyperthreading switched off, and running Win32. Is it even true
>that masking interrupts is enough on these systems?

No masking of interrupts is involved anywhere! I don't know where the idea arises, but it wasn't from me.

>Win32 `Critical Sections' must be giving the scheduler hints not to
>run other pending threads whilst a critical section is running. Maybe
>it uses the CPU sti/cli flags for that, to avoid the overhead of
>setting a memory word somewhere (bad enough) or calling the system
>(crippling). In that case, setting STI/CLI might only incur a ~50%
>performance penalty for integer operations.

I don't have access to the sources, but I do know that when one thread has entered a critical section, all other threads and processes continue to be scheduled in the normal way, except those that also try to enter the critical section. Scheduling is only disabled for those threads that *ask* to be so, and no others, either within the process or in other processes.

How the mechanism works I can only speculate, but no CLI/STI instructions are involved. When the first thread enters the critsec, a flag is set in the critsec memory.
When a second thread attempts to enter the critsec, a flag is set in the corresponding scheduler table to indicate that it should not be scheduled again until the flag is cleared. When the first thread leaves the critsec, the flag in the critsec memory is cleared, and the flag(s) in the scheduler tables for any thread(s) blocking on the critsec are also cleared. Whichever of the blocked threads is next scheduled acquires the critsec, sets the flag in the critsec memory, and the process repeats. No masking of interrupts is involved.

>but then there's this:
>
> NS> Other internal housekeeping operations, memory allocation, garbage
> NS> collection etc. are performed as "sysopcodes", performed by the VMI
> NS> within the auspices of the critical section, and thus secured.
>
>UG> there may be times when a GC run needs to be initiated DURING a VM
>UG> operation. if the op requires an immediate large chunk of ram it
>UG> can trigger a GC pass or allocation request. you can't for
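[Editorial note: the flag-then-block behaviour speculated about above can be sketched as follows. This is an illustrative model only, not Win32's actual implementation; all names are made up, and a condition variable stands in for the scheduler-table bookkeeping.]

```python
import threading

class CritSec:
    """Model of the described critical section: a flag in the critsec
    memory, plus a place for threads that tried to enter while the
    flag was set to sleep until it is cleared."""

    def __init__(self):
        self._guard = threading.Lock()                 # protects the flag itself
        self._cleared = threading.Condition(self._guard)
        self.held = False                              # "flag set in the critsec memory"

    def enter(self):
        with self._guard:
            while self.held:          # only threads that *ask* to enter block here;
                self._cleared.wait()  # everything else schedules normally
            self.held = True

    def leave(self):
        with self._guard:
            self.held = False             # clear the flag in the critsec memory ...
            self._cleared.notify_all()    # ... and unpark the blocked threads;
                                          # whichever runs next re-sets the flag
```

Note how `notify_all` plus the `while` loop reproduces "whichever of the blocked threads is next scheduled acquires the critsec": all sleepers wake, one wins the race, the rest go back to waiting.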
Re: Threads: Time to get the terminology straight
05/01/04 04:34:15, Luke Palmer <[EMAIL PROTECTED]> wrote:

LP> I think you're saying that each thread gets its own register set and
LP> register stacks (the call chain is somewhat hidden within the register
LP> stacks). Dan has been saying we'll do this all along.

That has been my interpretation, but there are others that appear to be contradicting that view.

LP> But you're also saying that a particular PMC can only be loaded into one
LP> register at a time, in any thread. So if thread A has a string in its
LP> P17, then if thread B tries to load that string into its, say, P22, it
LP> will block until thread A releases it. Is this correct?

Yes!

LP> This is a novel idea. It reduces lock acquisition/release from every
LP> opcode to only the C opcodes. That's a win.

That is the intent.

LP> Sadly, it greatly increases the chances of deadlocks. Take this
LP> example:

    my ($somestr, $otherstr);
    sub A { $somestr .= "foo"; $otherstr .= "bar"; }
    sub B { $otherstr .= "foo"; $somestr .= "bar"; }
    Thread->new(\&A);
    Thread->new(\&B);

LP> This is a fairly trivial example, and it should work smoothly if we're
LP> automatically locking for the user.

LP> But consider your scheme: A loads $somestr into its P17, and performs
LP> the concatenation. B loads $otherstr into its P17, and performs the
LP> concatenation. A tries to load $otherstr into its P18, but blocks
LP> because it's in B's P17. B then tries to load $somestr into its P18,
LP> but blocks because it's in A's P17. Deadlock.

LP> Did I accurately portray your scheme? If not, could you explain what
LP> yours does in terms of this example?

Where your description falls down is in considering each sub as an opcode. I *think* that each statement within the sub is an opcode. Therefore the sequence of opcodes in A (simplified) is:

  A sets the in-use flag on $somestr when it loads it into P17.
    If the in-use flag was already set, it acquires the mutex.
  A executes the CONCAT opcode against $somestr, constant 'foo' as operand.
    Releases the mutex.
  A then clears the in-use flag for $somestr.
  A then sets the in-use flag for $otherstr.
    If the in-use flag was already set, it acquires the mutex.
  A executes the CONCAT opcode against P18, constant 'bar' as operand.
    Releases the mutex.
  A then clears the in-use flag for $otherstr.

And the sequence of opcodes in B (simplified) is:

  B sets the in-use flag on $otherstr when it loads it into P17.
    If the in-use flag was already set, it acquires the mutex.
  B executes the CONCAT opcode against $otherstr, constant 'foo' as operand.
    Releases the mutex.
  B then clears the in-use flag for $otherstr.
  B then sets the in-use flag for $somestr.
    If the in-use flag was already set, it acquires the mutex.
  B executes the CONCAT opcode against P18, constant 'bar' as operand.
    Releases the mutex.
  B then clears the in-use flag for $somestr.

The deadlock doesn't occur (if I'm right) because however you overlap those two sequences, only one of the two in-use flags is set at any given time. Blocking only occurs if the two threads attempt simultaneous operations on the same PMC. When this occurs, the second thread blocks until the first completes its operation, whereupon it immediately releases the lock and clears the in-use flag.

I.e. (in this example) only one in-use flag per thread will be set at any given time, so the possibility of deadlock cannot arise. If the situation requires a single opcode to manipulate two PMCs, then other threads will be prevented from accessing either of those PMCs until the single opcode on the first thread clears both in-use flags.

It may be possible to set up a scenario where loops of dual-PMC, single opcodes could deadlock. I haven't convinced myself of this one way or the other. But notionally, opcodes with dual PMC operands should be fairly rare, if they exist at all. Mostly, perl-level statements which involve multiple PMCs will resolve down to sequences of opcodes that involve a single PMC and intermediate results from previous opcodes held in non-PMC registers.
NOTE: That last paragraph is *highly* speculative on my part. It's how I think it should work, not how it actually does.

LP> Luke

Does it work? If it does, can it be implemented, cross-platform, within Dan's constraints? I do not know the answer to this. I'm trying to work that out now. I would have preferred to have had categorical answers before bringing it here, but time ran out, and my knowledge of Parrot and other platforms lets me down.

Cheers, Nigel.

End of forwarded message
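[Editorial note: the per-statement sequence above can be modelled in a few lines. This is a sketch under the proposal's assumptions, with invented names (`PMC`, `concat_op`, `shared_mutex`); crucially, the flag check below is ordinary check-then-act, whereas the proposal relies on a hardware-atomic bit test-and-set, so this model is not itself race-free.]

```python
import threading

# One mutex shared by the whole interpreter group, taken only when a
# thread finds the PMC it is loading already flagged as in use.
shared_mutex = threading.Lock()

class PMC:
    def __init__(self, value=""):
        self.value = value
        self.in_use = False      # the single header bit of the proposal

def concat_op(pmc, constant):
    """One 'opcode': flag the PMC, do the work, unflag it."""
    contended = pmc.in_use       # flag already set by another thread?
    if contended:
        shared_mutex.acquire()   # then wait on the single shared mutex
    pmc.in_use = True            # set the in-use flag on load
    pmc.value += constant        # the CONCAT opcode itself
    pmc.in_use = False           # clear the flag ...
    if contended:
        shared_mutex.release()   # ... and release the mutex if taken
```

Because each `concat_op` flags only one PMC at a time and unflags it before the next statement, interleaving two such sequences never leaves two flags held by one thread, which is the no-deadlock argument above.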
Re: Threads: Time to get the terminology straight
On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;

> I accept that it may not be possible on all platforms, and it may
> be too expensive on some others. It may even be undesirable in the
> context of Parrot, but I have seen no argument that goes to
> invalidate the underlying premise.

I think you missed this:

LT> Different VMs can run on different CPUs. Why should we make atomic
LT> instructions out of these? We have a JIT runtime performing at 1
LT> Parrot instruction per CPU instruction for native integers. Why
LT> should we slow down that by a magnitude of many tenths?

LT> We have to lock shared data, then you have to pay the penalty, but
LT> not for each piece of code.

and this:

LT> I think, that you are missing multiprocessor systems totally.

You are effectively excluding true parallelism by blocking other processors from executing Parrot ops while one has the lock. You may as well skip the thread libraries altogether and multi-thread the ops in a runloop like Ruby does.

But let's carry the argument through, restricting it to UP systems, with hyperthreading switched off, and running Win32. Is it even true that masking interrupts is enough on these systems?

Win32 `Critical Sections' must be giving the scheduler hints not to run other pending threads whilst a critical section is running. Maybe it uses the CPU sti/cli flags for that, to avoid the overhead of setting a memory word somewhere (bad enough) or calling the system (crippling). In that case, setting STI/CLI might only incur a ~50% performance penalty for integer operations.

but then there's this:

NS> Other internal housekeeping operations, memory allocation, garbage
NS> collection etc. are performed as "sysopcodes", performed by the VMI
NS> within the auspices of the critical section, and thus secured.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation. if the op requires an immediate large chunk of ram it
UG> can trigger a GC pass or allocation request.
UG> you can't force those
UG> things to only happen between normal ops (which is what making
UG> them into ops does). so GC and allocation both need to be able to
UG> lock all shared things in their interpreter (and not just do a
UG> process global lock) so those things won't be modified by the
UG> other threads that share them.

I *think* this means that even if we *could* use critical sections for each op, where this works and isn't terribly inefficient, GC throws a spanner in the works. This could perhaps be worked around. In any case, it won't work on the fastest known threading implementations (Solaris, Linux NPTL, etc.), as they won't know to block all the other threads in a given process just because one of them set a CPU flag cycles before it was pre-empted.

So, in summary: it won't work on MP, and on UP, it couldn't possibly be as overhead-free as the other solutions. Clear as mud? :-)

[back to processors]

> Do these need to apply lock on every machine level entity that
> they access?

Yes, but the only resource that matters here is memory. Locking *does* take place inside the processor, but the locks are all close enough to be inspected in under a cycle, and misses incur a penalty of several cycles -- maybe dozens, depending on who has the memory locked. Registers are also "locked", by virtue of the fact that the out-of-order execution and pipelining logic will not schedule/allow an instruction to proceed until its data is ready. Any CPU with pipelining has this problem.

There is an interesting comparison to be drawn between the JIT assembly happening inside the processor -- from the bytecode being executed (x86) into a RISC core machine language (µ-ops) on hyperthreading systems -- and Parrot's compiling PASM to native machine code. In each case it is the µ-ops that are ordered to maximise performance and fed into the execution units.
On a hyperthreading processor, it has the luxury of knowing how long it will take to check the necessary locks for each instruction -- probably under a cycle -- so the µ-ops may scream along. With Parrot, it might have to contact another host over an ethernet controller to acquire a lock (e.g., threads running in an OpenMOSIX cluster). This cannot happen for every instruction!

--
Sam Vilain, [EMAIL PROTECTED]

The golden rule is that there are no golden rules
 GEORGE BERNARD SHAW
Re: Threads: Time to get the terminology straight
Nigel Sandever writes:

> Whilst the state of higher level objects, that the machine level
> objects are a part of, may have their state corrupted by two
> threads modifying things concurrently. The state of the threads
> (registers sets+stack) themselves cannot be corrupted.

I'm going to ask for some clarification here. I think you're saying that each thread gets its own register set and register stacks (the call chain is somewhat hidden within the register stacks). Dan has been saying we'll do this all along.

But you're also saying that a particular PMC can only be loaded into one register at a time, in any thread. So if thread A has a string in its P17, then if thread B tries to load that string into its, say, P22, it will block until thread A releases it. Is this correct?

This is a novel idea. It reduces lock acquisition/release from every opcode to only the C opcodes. That's a win.

Sadly, it greatly increases the chances of deadlocks. Take this example:

    my ($somestr, $otherstr);
    sub A { $somestr .= "foo"; $otherstr .= "bar"; }
    sub B { $otherstr .= "foo"; $somestr .= "bar"; }
    Thread->new(\&A);
    Thread->new(\&B);

This is a fairly trivial example, and it should work smoothly if we're automatically locking for the user.

But consider your scheme: A loads $somestr into its P17, and performs the concatenation. B loads $otherstr into its P17, and performs the concatenation. A tries to load $otherstr into its P18, but blocks because it's in B's P17. B then tries to load $somestr into its P18, but blocks because it's in A's P17. Deadlock.

Did I accurately portray your scheme? If not, could you explain what yours does in terms of this example?

Luke

> Regards, Nigel
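[Editorial note: the lock-ordering hazard Luke describes is the textbook ABBA deadlock. A minimal sketch, with two plain locks standing in for the two PMCs; timeouts are used so the example terminates instead of genuinely deadlocking, and one thread stands in for both A and B, which works because a plain `Lock` blocks even its own holder.]

```python
import threading

somestr_lock = threading.Lock()   # stands in for $somestr's PMC
otherstr_lock = threading.Lock()  # stands in for $otherstr's PMC

somestr_lock.acquire()            # A loads $somestr into its P17
otherstr_lock.acquire()           # B loads $otherstr into its P17

# A now tries to load $otherstr into P18; B tries to load $somestr.
# Under whole-sub locking, neither acquisition can ever succeed.
a_blocked = not otherstr_lock.acquire(timeout=0.01)
b_blocked = not somestr_lock.acquire(timeout=0.01)
assert a_blocked and b_blocked    # both stuck: deadlock
```

Nigel's reply below argues this doesn't arise in his scheme precisely because each *statement*, not each sub, is the locked unit, so the first lock is released before the second is requested.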
Re: Threads: Time to get the terminology straight
05/01/04 01:22:32, Sam Vilain <[EMAIL PROTECTED]> wrote:

[STUFF] :)

In another post you mention Intel hyperthreading. Essentially, duplicate sets of registers within a single CPU. Do these need to apply a lock on every machine level entity that they access? No. Why not? Because they can only modify an entity if it is loaded into a register, and the logic behind hyperthreading won't allow both register sets to load the same entity concurrently. (I know this is a gross simplification of the interactions between the on-board logic and L1/L2 caching!)

--- Not an advert or glorification of Intel. Just an example ---

Hyper-Threading Technology provides thread-level parallelism (TLP) on each processor, resulting in increased utilization of processor execution resources. As a result, resource utilization yields higher processing throughput. Hyper-Threading Technology is a form of simultaneous multi-threading technology (SMT) where multiple threads of software applications can be run simultaneously on one processor. This is achieved by duplicating the *architectural state* on each processor, while *sharing one set of processor execution resources*.

---

The last paragraph is the salient one as far as I am concerned. The basic premise of my original proposal was that multi-threaded, machine level applications don't have to interlock on machine level entities, because each operation they perform is atomic. Whilst the state of higher level objects, that the machine level objects are a part of, may have their state corrupted by two threads modifying things concurrently, the state of the threads (register sets + stack) themselves cannot be corrupted.

This is because they have their own internally consistent state, that only changes atomically, and that is completely separated, each from the other. They only share common data (code is data to the CPU, just as bytecode is data to a VM).

So, if you are going to emulate a (hyper)threaded CPU in a register-based virtual machine interpreter,
and allow for concurrent threads of execution within that VMI, then one way of ensuring that the internal state of the VMI is never corrupted would be to have each thread have its own copy of the *architectural state* of the VM, whilst sharing *one set of processor execution resources*.

For this to work, you would need to achieve the same opcode atomicity at the VMI level. Interlocking the threads, so that one shared thread cannot start an opcode until another shared thread has completed, gives this atomicity. The penalty is that if the interlocking is done for every opcode, then shared threads end up with very long virtual timeslices.

To prevent that being the case (most of the time), the interlocking should only come into effect *if* concurrent access to a VM level entity is imminent. As the VMI cannot access (modify) the state of a VM level entity (PMC) until it has loaded it into a VM register, the interlocking need only come into effect *if* the entity whose reference is being loaded into a PMC register is currently in use by (another) thread.

The state of a PMC's in-useness can be flagged by a single bit in its header. This can be detected by a shared thread when the reference to it is loaded into the PMC register, and when it is, that shared thread then waits on the single, shared mutex before proceeding.

It is only when the combination of atomised VM opcodes and lightweight in-use detection come together that the need for a mutex per entity can be avoided. If the mutex used is capable of handling SMP, NUMA, clusters etc., then the mechanism will work. If a lightweight bit test-and-set opcode isn't available, then a heavyweight equivalent could be used, though the advantages would be reduced.

>Sam Vilain, [EMAIL PROTECTED]

I hope that clarifies my thinking and how I arrived at it. I accept that it may not be possible on all platforms, and it may be too expensive on some others.
It may even be undesirable in the context of Parrot, but I have seen no argument that goes to invalidate the underlying premise.

Regards, Nigel
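[Editorial note: the "lightweight bit test-and-set" the proposal leans on can be sketched as below. Python has no such CPU primitive, so this illustrative model (invented name `InUseFlag`) emulates an atomic test-and-set with a tiny lock guarding the flag word; on real hardware this would be a single instruction on the PMC header bit.]

```python
import threading

class InUseFlag:
    """Model of the single in-use bit in a PMC header."""

    def __init__(self):
        self._guard = threading.Lock()  # stands in for hardware atomicity
        self._set = False

    def test_and_set(self):
        """Atomically set the flag and return its *previous* value.
        A False return means this thread may proceed lock-free;
        True means it must wait on the single shared mutex."""
        with self._guard:
            was = self._set
            self._set = True
            return was

    def clear(self):
        with self._guard:
            self._set = False
```

The first loader sees `False` and proceeds without ever touching a mutex, which is the whole point: the mutex cost is paid only on actual contention.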
Re: Threads: Time to get the terminology straight
On Mon, 05 Jan 2004 12:58, Nigel Sandever wrote;

> Everything else, were my attempts at solving the requirements of
> synchronisation that this would require, whilst minimising the
> cost of that synchronisation, by avoiding the need for a mutex on
> every shared entity, and the costs of attempting to acquire a mutex
> except when two SHARED THREADS attempted concurrent access to a
> shared entity.

This paragraph sounds like you're trying to solve an intractable problem. Try posting some pseudocode to explain what you mean. But it has given me an idea that could minimise the number of locks, i.e. not require a mutex on each PMC, just on each shared PMC. 10 points to the person who finds a flaw in this approach :)

Each object you share could create a new (virtual) shared memory segment with its own semaphore. This virtual memory segment is considered its own COW domain (i.e., its own thread to the GC); references inserted back to non-shared memory will pull the structures into that virtual COW thread.

Access to the entire structure is controlled via a multiple reader/single writer lock (close to what a semaphore is, IIRC); locks for a thread are released when references to places inside the shared segment are no longer anywhere in any @_ on the locking thread's call stack, or in use by any opcode (is that good enough?), and are acquired for writing when anything needs to be changed. Virtual shared memory segments can then easily be cleaned up by normal GC.

The major problem I can see is that upgrading a lock from reading to writing can't work if there are concurrent writes (and the read lock to be upgraded cannot sensibly be released). But that should be OK, since operation signatures will mark variables that need changing as read-write as early as possible.
For example, in this sort of code (sorry for P5 code);

    sub changer {
        my $shared_object = shift;
        $shared_object->{bar} = "baz";
    }

a read lock on the segment \$shared_object is in is acquired, then released when it is `shifted' off. As the next instruction has a writable lvalue, it acquires a write lock. But this code:

    sub changer {
        my $shared_object = shift;
        $shared_object->{bar} = &somefunc();
    }

will hold the write lock on $shared_object open until &somefunc returns.

My 2¢ :). This discussion will certainly reach a dollar soon ;).

--
Sam Vilain, [EMAIL PROTECTED]

Start every day with a smile and get it over with.
 W C FIELDS
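[Editorial note: the multiple reader/single writer lock proposed above has no direct Python stdlib equivalent, so here is a minimal illustrative implementation (invented name `RWLock`); writers wait until there are no readers, and readers wait out any writer.]

```python
import threading

class RWLock:
    """Minimal multiple-reader/single-writer lock, one per shared segment."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:          # readers wait out any active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may now proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()        # writers need exclusive access
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Note this sketch also exhibits the upgrade problem Sam flags: a thread holding a read lock that calls `acquire_write` waits for `_readers` to drop to zero, which includes itself.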
Re: Threads: Time to get the terminology straight
On Sun, 4 Jan 2004 15:47:35 -0500, [EMAIL PROTECTED] (Dan Sugalski) wrote:

> *) INTERPRETER - those bits of the Parrot_Interp structure that are
> absolutely required to be thread-specific. This includes the current
> register sets and stack pointers, as well as security context
> information. Basically if a continuation captures it, it's the
> interpreter.
>
> *) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp
> structure that aren't required to be thread-specific (though I'm not
> sure there are any) *PLUS* anything pointed to that doesn't have to
> be thread-specific.
>
> The environment includes the global namespaces, pads, stack chunks,
> memory allocation regions, arenas, and whatnots. Just because the
> pointer to the current pad is thread-specific doesn't mean the pad
> *itself* has to be. It can be shared.
>
> *) SHARED THREAD - A thread that's part of a group of threads sharing
> a common interpreter environment.

Ignoring the implementation of the synchronisation required, the basic premise of my long post was that each SHARED THREAD should have its own INTERPRETER (a VM in my terms), and that these should share a common INTERPRETER ENVIRONMENT.

Simplistically, 5005threads shared an INTERPRETER ENVIRONMENT and a single INTERPRETER. Synchronising threaded access to the shared INTERPRETER (rather than its environment) was the biggest headache. (I *think*.)

With ithreads, each SHARED THREAD has its own INTERPRETER *and* INTERPRETER ENVIRONMENT. This removes the contention for, and the need to synchronise access to, the INTERPRETER, but requires the duplication of shared elements of the INTERPRETER ENVIRONMENT and the copy_on_read, with the inherent costs of the duplication at start-up and slow, indirect access to shared entities across the duplicated INTERPRETER ENVIRONMENTS.

My proposal was that each SHARED THREAD should have a separate copy of the INTERPRETER, but share a copy of the INTERPRETER ENVIRONMENT.
Everything else was my attempt at solving the requirements of synchronisation that this would require, whilst minimising the cost of that synchronisation: by avoiding the need for a mutex on every shared entity, and the cost of attempting to acquire a mutex, except when two SHARED THREADS attempted concurrent access to a shared entity.

I think that by having SHARED THREAD == INTERPRETER, sharing a common INTERPRETER ENVIRONMENT, you can avoid (some) of the problems associated with 5005threads but retain the direct access of shared entities. This imposes its own set of requirements and costs, but (I believe) the ideas that underlie the mechanisms I offered as solutions are sound. The specific implementation is a platform-specific detail that could be pushed down to a lower level.

> ...those bits of the Parrot_Interp
> structure that aren't required to be thread-specific (though I'm not
> sure there are any)

This is where I have a different (and quite possibly incorrect) view. My mental picture of the INTERPRETER ENVIRONMENT includes both the implementation of all the classes in the process *plus* all the memory of every instance of those classes.

I think your statement above implies that these would not be a part of the INTERPRETER ENVIRONMENT per se, but would be allocated from the global heap and only referenced from the bytecode that would live in the INTERPRETER ENVIRONMENT?

I realise that this is possible, and maybe even desirable, but the cost of the GC walking a global heap, especially in the situation of a single process that contains two entirely separate instances of the INTERPRETER ENVIRONMENT, would be (I *think*) rather high. I realise that this is a fairly rare occurrence on most platforms, but in the win32 situation of emulated forks, each pseudo-process must have an entirely separate INTERPRETER ENVIRONMENT, potentially with each having multiple SHARED THREADS.
If the memory for all entities in all pseudo-processes is allocated from a (real) process-global heap, then the multiple GCs required by the multiple pseudo-processes are going to be walking the same heap, possibly concurrently. I realise that this problem (if it is such) does not occur on platforms that have real forks available, but it would be useful if the high level design would allow for the use of separate (virtual) heaps tied to the INTERPRETER ENVIRONMENTs, which win32 has the ability to do.

> Dan

Nigel.
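[Editorial note: the proposed split, each SHARED THREAD owning an INTERPRETER while all of them point at one shared INTERPRETER ENVIRONMENT, can be sketched structurally. The class and field names below are illustrative, not Parrot's actual Parrot_Interp layout.]

```python
class InterpreterEnvironment:
    """Shared per group: global namespaces, pads, arenas, bytecode..."""
    def __init__(self):
        self.globals = {}
        self.arenas = []

class Interpreter:
    """Per SHARED THREAD: register sets, stacks, security context."""
    def __init__(self, env):
        self.p_registers = [None] * 32   # thread-specific register set
        self.stack = []                  # thread-specific stack
        self.env = env                   # shared, NOT duplicated (unlike ithreads)

env = InterpreterEnvironment()
a = Interpreter(env)
b = Interpreter(env)
assert a.env is b.env                    # one environment for the group ...
assert a.p_registers is not b.p_registers  # ... but separate architectural state
```

The contrast with ithreads is the last comment: ithreads would give `b` a duplicated environment, paying copy costs at start-up; this scheme keeps a single `env` and pays synchronisation costs instead.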
Threads: Time to get the terminology straight
I think some of the massive back and forth that's going on is in part due to terminology problems, which are in part causing some conceptual problems. So, for the moment, let's agree on the following things:

*) MUTEX - this is a low level, under the hood, not exposed to users, thing that can be locked. They're non-recursive, non-read/write, exclusive things. When a thread gets a mutex, any other attempt to get that mutex will block until the owning thread releases the mutex. The platform-native lock construct will be used for this.

*) LOCK - This is an exposed-to-HLL-code thing that can be locked. Only PMCs can be locked, and the lock may or may not be recursive or read/write.

*) CONDITION VARIABLE - the "sleep until something pings me" construct. Useful for queue construction, always associated with a MUTEX.

*) RENDEZVOUS POINT - A HLL version of a condition variable. *not* associated with a lock -- these are standalone. Note that the mutex/condition association's a POSIX limitation, and POSIX threads is what we have on some platforms. If you want to propose abstracting it away, go for it. The separation doesn't buy us anything, though it's useful in other circumstances.

*) INTERPRETER - those bits of the Parrot_Interp structure that are absolutely required to be thread-specific. This includes the current register sets and stack pointers, as well as security context information. Basically if a continuation captures it, it's the interpreter.

*) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp structure that aren't required to be thread-specific (though I'm not sure there are any) *PLUS* anything pointed to that doesn't have to be thread-specific.

The environment includes the global namespaces, pads, stack chunks, memory allocation regions, arenas, and whatnots. Just because the pointer to the current pad is thread-specific doesn't mean the pad *itself* has to be. It can be shared.
*) INDEPENDENT THREAD - A thread that has no contact *AT ALL* with the internal data of any other thread in the current process. Independent threads need no synchronization for anything other than what few global things we have. And the fewer the better, though alas we can't have none at all. Note that independent threads may still communicate back and forth by passing either atomic things (ints, floats, and pointers) or static buffers that can become the property of the destination thread.

*) SHARED THREAD - A thread that's part of a group of threads sharing a common interpreter environment.

Anyway, there's some terminology. It doesn't solve the design problem, but hopefully it'll help everyone talk the same language. Remember that everything from the wrapped OS interface on up is up for grabs -- while we're not going to build our own mutexes or thread scheduler, everything that's been implemented or designed to date can be changed with sufficient good reason. (Though, again, the more you want to change the more spectacular the design has to be.)

-- Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED]  have teddy bears and even teddy bears get drunk