05/01/04 04:51:20, Sam Vilain <[EMAIL PROTECTED]> wrote:

>On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;
>
>  > I accept that it may not be possible on all platforms, and it may
>  > be too expensive on some others. It may even be undesirable in the
>  > context of Parrot, but I have seen no argument that goes to
>  > invalidate the underlying premise.
>
>I think you missed this:
>
>LT> Different VMs can run on different CPUs. Why should we make atomic
>LT> instructions out if these? We have a JIT runtime performing at 1
>LT> Parrot instruction per CPU instruction for native integers. Why
>LT> should we slow down that by a magnitude of many tenths?
>
>LT> We have to lock shared data, then you have to pay the penalty, but
>LT> not for each piece of code.
.
So far, I have only suggested using the mechanism in conjunction with
PMCs and PMC registers.

You can't add an in-use flag to a native integer. But then, native integers
are not a part of the VHLLs (Perl/Python/Ruby). They are a constituent part
of scalars, but they use a different register set and opcodes. Copying the
integer value of a scalar into an I register would require locking the scalar's
PMC. Once the value is in the I register, operations performed on it would
not need to be synchronised. Once the result is calculated, it needs to be
moved back to the PMC and the lock cleared. There should be no need to
interlock on most opcodes dealing with the I and N register sets.
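
To make that concrete, here is a minimal sketch in C. None of these
names (lock_pmc, unlock_pmc, pmc_get_int, pmc_set_int, the PMC struct)
are Parrot's real API; they are placeholders for "set/clear the in-use
flag" and "access the scalar's integer value", however those end up
being spelled. One possible lock_pmc/unlock_pmc is sketched near the
end of this mail.

    /* Hypothetical sketch only: none of these names are Parrot's real API. */

    typedef struct PMC PMC;             /* opaque scalar PMC                 */

    extern void lock_pmc(PMC *p);       /* set the in-use flag (may block)   */
    extern void unlock_pmc(PMC *p);     /* clear the in-use flag             */
    extern long pmc_get_int(PMC *p);    /* read the scalar's integer value   */
    extern void pmc_set_int(PMC *p, long v);

    /* An "increment a scalar" style PMC opcode, as described above.         */
    void op_inc_scalar(PMC *p)
    {
        long i0;                        /* stand-in for an I register        */

        lock_pmc(p);                    /* flag set: other threads touching  */
                                        /* this PMC now block                */
        i0 = pmc_get_int(p);            /* copy the value into the I reg     */

        i0 = i0 + 1;                    /* non-PMC ops: no synchronisation   */

        pmc_set_int(p, i0);             /* move the result back ...          */
        unlock_pmc(p);                  /* ... and clear the flag            */
    }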

The S registers are a different kettle of fish, and I haven't worked through
the implications for these. My gut feeling is that the C-style strings pointed
at by S registers would be protected by the in-use flag set on the PMCs for
the scalars from which they are derived.

This means that when a single PMC opcode results in a sequence of non-PMC
operations, other shared threads would be blocked until the sequence of
non-PMC ops in the first shared thread was complete.
But ONLY if they attempt access to the same PMC.

If they are processing PMC or non-PMC operations that do not involve the 
in-use PMC, then they will not be blocked and will be scheduled for their 
timeslices in the normal way.
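
(In terms of the hypothetical sketch above: while one thread is inside
op_inc_scalar() on a given PMC, a second thread running the same op on
a different PMC never touches the first PMC's flag, so it never blocks;
only a thread that calls lock_pmc() on that same PMC has to wait.)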

>
>and this:
>
>LT> I think, that you are missing multiprocessor systems totally.
>

If the mutex mechanism that is used to block the shared threads
is SMP, NUMA, AMP etc. safe, then the mechanism I describe is also
safe in these environments.

>You are effectively excluding true parallellism by blocking other
>processors from executing Parrot ops while one has the lock. 
.
The block only occurs *IF* concurrent operations on the same data
are attempted.

> You may
>as well skip the thread libraries altogether and multi-thread the ops
>in a runloop like Ruby does.
>
>But let's carry the argument through, restricting it to UP systems,
>with hyperthreading switched off, and running Win32.  Is it even true
>that masking interrupts is enough on these systems?
.
No masking of interrupts is involved anywhere!
I don't know where the idea arises, but it wasn't from me.

>
>Win32 `Critical Sections' must be giving the scheduler hints not to
>run other pending threads whilst a critical section is running.  Maybe
>it uses the CPU sti/cli flags for that, to avoid the overhead of
>setting a memory word somewhere (bad enough) or calling the system
>(crippling).  In that case, setting STI/CLI might only incur a ~50%
>performance penalty for integer operations.
.
I don't have access to the sources, but I do know that when one 
thread has entered a critical section, all other threads and processes
continue to be scheduled in the normal way except those that also try
to enter the critical section. 

Scheduling is only disabled for those threads that *ask* to be so,
and no others, whether within the same process or in other processes.
How the mechanism works I can only speculate, but no CLI/STI
instructions are involved.

<total speculation> 
When the first thread enters the critsec, a flag is set in the 
critsec memory.

When a second thread attempts to enter the critsec, a flag is 
set in the corresponding scheduler table to indicate that it should 
not be scheduled again until the flag is cleared. 

When the first thread leaves the critsec, the flag in the critsec
memory is cleared and the flag(s) in the scheduler tables for any 
thread(s) blocking on the critsec are also cleared. 

Whichever of the blocked threads is next scheduled, it acquires the
critsec, sets the flag in the critsec memory, and the process repeats.

</total speculation>
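
Purely to illustrate that speculation, and emphatically not as a claim
about how Windows really implements CRITICAL_SECTIONs, a toy model in C
might look like the following: an interlocked flag for the uncontended
path, plus a kernel event that contended threads sleep on, so that the
scheduler simply stops running them until the owner clears the flag.

    /* Toy model of the speculated behaviour, NOT the real implementation.  */
    #include <windows.h>

    typedef struct {
        volatile LONG in_use;   /* the "flag in the critsec memory"          */
        HANDLE       wake;      /* event that blocked threads sleep on       */
    } TOY_CRITSEC;

    void toy_init(TOY_CRITSEC *cs)
    {
        cs->in_use = 0;
        cs->wake   = CreateEvent(NULL, FALSE, FALSE, NULL); /* auto-reset    */
    }

    void toy_enter(TOY_CRITSEC *cs)
    {
        /* Fast path: the flag was clear, so we own the critsec with no wait. */
        while (InterlockedCompareExchange(&cs->in_use, 1, 0) != 0) {
            /* Slow path: another thread owns it.  Sleeping here is the       */
            /* "don't schedule me again until the flag is cleared" part.      */
            WaitForSingleObject(cs->wake, INFINITE);
        }
    }

    void toy_leave(TOY_CRITSEC *cs)
    {
        InterlockedExchange(&cs->in_use, 0); /* clear the flag ...            */
        SetEvent(cs->wake);                  /* ... and wake one waiter       */
    }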

No masking of interrupts is involved.

>
>but then there's this:
>
>  NS> Other internal housekeeping operations, memory allocation, garbage
>  NS> collection etc. are performed as "sysopcodes", performed by the VMI
>  NS> within the auspices of the critical section, and thus secured.
>
>UG> there may be times when a GC run needs to be initiated DURING a VM
>UG> operation. if the op requires an immediate lare chunk of ram it
>UG> can trigger a GC pass or allocation request. you can't force those
>UG> things to only happen between normal ops (which is what making
>UG> them into ops does). so GC and allocation both need to be able to
>UG> lock all shared things in their interpreter (and not just do a
>UG> process global lock) so those things won't be modified by the
>UG> other threads that share them.
.
I did answer that.

However, if the GC needs to run globally, then all threads must 
be stopped or blocked regardless. Nothing in this mechanism
changes that fact for better or worse.

Personally, I think that a non-global GC is possible and preferable
but that is an entirely unrelated discussion.

>
>I *think* this means that even if we *could* use critical sections for
>each op, where this works and isn't terribly inefficient, GC throws a
>spanner in the works.  This could perhaps be worked around.
.
I disagree that it would be inefficient. I'm open to being proved wrong,
but no proof has been forthcoming, only opinion.

>
>In any case, it won't work on the fastest known threading
>implementations (Solaris, Linux NPTL, etc), as they won't know to
>block all the other threads in a given process just because one of
>them set a CPU flag cycles before it was pre-empted.
.
You cannot divorce one part of the mechanism from the other.
It requires both the critsec (or equivalent mutex) between the threads
*plus* the in-use flags to achieve its purpose.

>
>So, in summary - it won't work on MP, and on UP, it couldn't possibly
>be as overhead-free as the other solutions.
>
>Clear as mud ?  :-)
.
Your summary does not fit with mine. :-)

>
>[back to processors]
>> Do these need to apply lock on every machine level entity that
>> they access?
>
>Yes, but the only resource that matters here is memory.  Locking
>*does* take place inside the processor, but the locks are all close
>enough to be inspected in under a cycle.  And misses incur a penalty
>of several cycles - maybe dozens, depending on who has the memory
>locked.
>
>Registers are also "locked" by virtue of the fact that the
>out-of-order execution and pipelining logic will not schedule/allow an
>instruction to proceed until its data is ready.  Any CPU with
>pipelining has this problem.
.
Essentially, you are saying that there is a mechanism of interlocks
that works.

That these are implemented at the silicon/microcode level is pretty
irrelevant. If they can be implemented there, they could also be
implemented at the VM level.

Whether they would be worthwhile is a different argument.

>
>There is an interesting comparison to be drawn between the JIT
>assembly happening inside the processor from the bytecode being
>executed (x86) into a RISC core machine language (5-ops) on
>hyperthreading systems, and Parrot's compiling PASM to native machine
>code.  It each case is the 5-ops that are ordered to maximize
>performance and fed into the execution units.
>
>On a hyperthreading processor, it has the luxury of knowing how long
>it will take to check the necessary locks for each instruction,
>probably under a cycle, so that 5-ops may scream along.
>
>With Parrot, it might have to contact another host over an ethernet
>controller to acquire a lock (eg, threads running in an OpenMOSIX
>cluster).  

>This cannot happen for every instruction!
.
By instruction, I assume you mean a VM-level operation.

Which is why you need the in-use flag to ensure that the lock
is only acquired when it is needed.

Again, if there is a mutex available on the target system that
is able to operate correctly within that system, then this can be
used in place of the critsec on Win32.

The combination of that mutex *plus* the in-use flags (however they are
best achieved on the target system) is the mechanism. 

The particular terms 'critical section' and BTS are just the terms I
used because they are what I am familiar with, and what I perceive as
the best choices (amongst many others) for use on Win32.
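
For what it is worth, here is one way the lock_pmc()/unlock_pmc()
helpers assumed earlier could combine the two: an interlocked
bit-test-and-set (the BTS) on the PMC's own flag for the uncontended
case, with the critsec and a wait primitive only ever entered when that
test fails. The names are still hypothetical and the Win32 calls are
just the primitives I am most familiar with; any equivalent mutex and
wait mechanism on the target system would do.

    /* Hypothetical sketch only, not Parrot's real API.                      */
    #include <windows.h>

    typedef struct PMC {
        volatile LONG flags;                /* bit 0 is the in-use flag       */
        /* ... the rest of the PMC ...                                        */
    } PMC;

    static CRITICAL_SECTION   vm_cs;        /* shared by one pool of threads  */
    static CONDITION_VARIABLE vm_released;  /* signalled when a flag clears   */
    static volatile LONG      waiters = 0;  /* threads currently contending   */

    void vm_lock_init(void)
    {
        InitializeCriticalSection(&vm_cs);
        InitializeConditionVariable(&vm_released);
    }

    void lock_pmc(PMC *p)
    {
        /* Fast path: BTS on the in-use bit.  If it was clear, this thread    */
        /* owns the PMC and the critsec is never touched at all.              */
        if (!InterlockedBitTestAndSet(&p->flags, 0))
            return;

        /* Slow path: only a thread that collides on the same PMC pays for    */
        /* the critsec, sleeping until an unlock wakes it to retry the BTS.   */
        InterlockedIncrement(&waiters);
        EnterCriticalSection(&vm_cs);
        while (InterlockedBitTestAndSet(&p->flags, 0))
            SleepConditionVariableCS(&vm_released, &vm_cs, INFINITE);
        LeaveCriticalSection(&vm_cs);
        InterlockedDecrement(&waiters);
    }

    void unlock_pmc(PMC *p)
    {
        InterlockedAnd(&p->flags, ~1L);     /* clear the in-use bit           */
        if (waiters > 0) {                  /* wake contenders, if any        */
            EnterCriticalSection(&vm_cs);
            WakeAllConditionVariable(&vm_released);
            LeaveCriticalSection(&vm_cs);
        }
    }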

>-- 
>Sam Vilain, [EMAIL PROTECTED]

Regards, Nigel.




