Re: Threads: Time to get the terminology straight
On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote:

> I accept that it may not be possible on all platforms, and it may
> be too expensive on some others. It may even be undesirable in the
> context of Parrot, but I have seen no argument that goes to
> invalidate the underlying premise.

I think you missed this:

LT> Different VMs can run on different CPUs. Why should we make atomic
LT> instructions out of these? We have a JIT runtime performing at 1
LT> Parrot instruction per CPU instruction for native integers. Why
LT> should we slow that down by a factor of many tens? If we have to
LT> lock shared data, then we pay the penalty there, but not for each
LT> piece of code.

and this:

LT> I think that you are missing multiprocessor systems totally.

You are effectively excluding true parallelism by blocking other processors from executing Parrot ops while one has the lock. You may as well skip the thread libraries altogether and multi-thread the ops in a runloop like Ruby does.

But let's carry the argument through, restricting it to UP systems with hyperthreading switched off, running Win32. Is it even true that masking interrupts is enough on these systems? Win32 `Critical Sections' must be giving the scheduler hints not to run other pending threads whilst a critical section is running. Maybe it uses the CPU STI/CLI flags for that, to avoid the overhead of setting a memory word somewhere (bad enough) or calling into the system (crippling). In that case, setting STI/CLI might only incur a ~50% performance penalty for integer operations.

But then there's this:

NS> Other internal housekeeping operations, memory allocation, garbage
NS> collection etc. are performed as "sysopcodes", performed by the VMI
NS> within the auspices of the critical section, and thus secured.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation. if the op requires an immediate large chunk of ram it
UG> can trigger a GC pass or allocation request.
UG> you can't force those things to only happen between normal ops
UG> (which is what making them into ops does). so GC and allocation
UG> both need to be able to lock all shared things in their
UG> interpreter (and not just do a process global lock) so those
UG> things won't be modified by the other threads that share them.

I *think* this means that even if we *could* use critical sections for each op, where that works and isn't terribly inefficient, GC throws a spanner in the works. This could perhaps be worked around. In any case, it won't work on the fastest known threading implementations (Solaris, Linux NPTL, etc.), as they won't know to block all the other threads in a given process just because one of them set a CPU flag cycles before it was pre-empted.

So, in summary: it won't work on MP, and on UP it couldn't possibly be as overhead-free as the other solutions. Clear as mud? :-)

[back to processors]

> Do these need to apply a lock on every machine level entity that
> they access?

Yes, but the only resource that matters here is memory. Locking *does* take place inside the processor, but the locks are all close enough to be inspected in under a cycle, and misses incur a penalty of several cycles - maybe dozens, depending on who has the memory locked. Registers are also "locked", by virtue of the fact that the out-of-order execution and pipelining logic will not schedule/allow an instruction to proceed until its data is ready. Any CPU with pipelining has this problem.

There is an interesting comparison to be drawn between the JIT-style translation a hyperthreading processor performs, from the bytecode being executed (x86) into a RISC-core machine language (µ-ops), and Parrot's compiling PASM to native machine code. In each case it is the µ-ops that are reordered to maximise performance and fed into the execution units.
A hyperthreading processor has the luxury of knowing how long it will take to check the necessary locks for each instruction - probably under a cycle - so the µ-ops may scream along. Parrot, on the other hand, might have to contact another host over an ethernet controller to acquire a lock (eg, threads running in an OpenMOSIX cluster). This cannot happen for every instruction! -- Sam Vilain, [EMAIL PROTECTED] The golden rule is that there are no golden rules GEORGE BERNARD SHAW
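[Editor's note] The per-op locking cost being argued about above can be made concrete. The sketch below - illustrative Java, with hypothetical names, not Parrot code - guards every "VM op" on shared data with one mutex acquire/release. The result is correct, but each op now pays the lock overhead Leo objects to, and on MP systems the lock serialises the two "VM threads":

```java
import java.util.concurrent.locks.ReentrantLock;

// Two "VM threads" each run 100,000 "ops" on shared data. Making
// every op a critical section gives correctness at the price of one
// lock/unlock pair per op - the overhead under discussion.
public class PerOpLock {
    static long counter = 0;                         // the shared "PMC"
    static final ReentrantLock opLock = new ReentrantLock();

    static void runOps(int n) {
        for (int i = 0; i < n; i++) {
            opLock.lock();                            // enter per-op critical section
            try {
                counter++;                            // the (non-atomic) op itself
            } finally {
                opLock.unlock();                      // leave critical section
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> runOps(100_000));
        Thread b = new Thread(() -> runOps(100_000));
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(counter);                  // prints 200000
    }
}
```

Without the lock, the two plain `counter++` sequences would interleave and lose updates; with it, every op is atomic - which is exactly the trade being debated.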
Re: Threads: Time to get the terminology straight
Nigel Sandever writes:

> Whilst the state of higher level objects, that the machine level
> objects are a part of, may have their state corrupted by two
> threads modifying things concurrently. The state of the threads
> (register sets+stack) themselves cannot be corrupted.

I'm going to ask for some clarification here. I think you're saying that each thread gets its own register set and register stacks (the call chain is somewhat hidden within the register stacks). Dan has been saying we'll do this all along. But you're also saying that a particular PMC can only be loaded into one register at a time, in any thread. So if thread A has a string in its P17, then if thread B tries to load that string into its, say, P22, it will block until thread A releases it. Is this correct?

This is a novel idea. It reduces lock acquisition/release from every opcode to only the opcodes that load PMCs into registers. That's a win. Sadly, it greatly increases the chances of deadlock. Take this example:

    my ($somestr, $otherstr);
    sub A {
        $somestr  .= "foo";
        $otherstr .= "bar";
    }
    sub B {
        $otherstr .= "foo";
        $somestr  .= "bar";
    }
    Thread->new(\&A);
    Thread->new(\&B);

This is a fairly trivial example, and it should work smoothly if we're automatically locking for the user. But consider your scheme: A loads $somestr into its P17 and performs the concatenation. B loads $otherstr into its P17 and performs the concatenation. A tries to load $otherstr into its P18, but blocks because it's in B's P17. B then tries to load $somestr into its P18, but blocks because it's in A's P17. Deadlock.

Did I accurately portray your scheme? If not, could you explain what yours does in terms of this example?

Luke

> Regards, Nigel
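[Editor's note] Luke's example is the classic lock-ordering cycle: each thread holds one lock while waiting for the other. The standard fix is to impose a single global acquisition order, so no thread can ever hold a later lock while waiting for an earlier one. A minimal Java sketch of the same two subs with that fix applied (names mine, not Parrot's):

```java
import java.util.concurrent.locks.ReentrantLock;

// Luke's $somestr/$otherstr example, deadlock-proofed: both threads
// take the locks in the same global order (someLock before otherLock),
// regardless of which string they actually touch first.
public class LockOrder {
    static final ReentrantLock someLock  = new ReentrantLock();
    static final ReentrantLock otherLock = new ReentrantLock();
    static final StringBuilder somestr  = new StringBuilder();
    static final StringBuilder otherstr = new StringBuilder();

    static void subA() {
        someLock.lock(); otherLock.lock();       // fixed global order
        try {
            somestr.append("foo");
            otherstr.append("bar");
        } finally {
            otherLock.unlock(); someLock.unlock();
        }
    }

    static void subB() {
        someLock.lock(); otherLock.lock();       // same order, so no cycle
        try {
            otherstr.append("foo");
            somestr.append("bar");
        } finally {
            otherLock.unlock(); someLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(LockOrder::subA);
        Thread b = new Thread(LockOrder::subB);
        a.start(); b.start();
        a.join(); b.join();                       // terminates: no deadlock
        System.out.println(somestr.length() + otherstr.length()); // prints 12
    }
}
```

The catch, of course, is that an automatic per-PMC locking scheme has no way to know the global order in advance - which is exactly why Luke's objection bites.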
Re: Threads: Time to get the terminology straight
05/01/04 01:22:32, Sam Vilain <[EMAIL PROTECTED]> wrote: [STUFF] :)

In another post you mention Intel hyperthreading: essentially, duplicate sets of registers within a single CPU. Do these need to apply a lock on every machine level entity that they access? No. Why not? Because they can only modify an entity if it is loaded into a register, and the logic behind hyperthreading won't allow both register sets to load the same entity concurrently. (I know this is a gross simplification of the interactions between the on-board logic and L1/L2 caching!)

--- Not an advert or glorification of Intel. Just an example -

Hyper-Threading Technology provides thread-level parallelism (TLP) on each processor, resulting in increased utilization of processor execution resources. As a result, resource utilization yields higher processing throughput. Hyper-Threading Technology is a form of simultaneous multi-threading technology (SMT) where multiple threads of software applications can be run simultaneously on one processor. This is achieved by duplicating the *architectural state* on each processor, while *sharing one set of processor execution resources*.

--

The last paragraph is the salient one as far as I am concerned. The basic premise of my original proposal was that multi-threaded, machine level applications don't have to interlock on machine level entities, because each operation they perform is atomic. Whilst the state of higher level objects, that the machine level objects are a part of, may have their state corrupted by two threads modifying things concurrently, the state of the threads (register sets+stack) themselves cannot be corrupted. This is because they have their own internally consistent state, that only changes atomically, and that is completely separated, each from the other. They only share common data (code is data to the CPU, just as bytecode is data to a VM).

So, if you are going to emulate a (hyper)threaded CPU in a register-based virtual machine interpreter, and allow for concurrent threads of execution within that VMI, then one way of ensuring that the internal state of the VMI is never corrupted would be to have each thread have its own copy of the *architectural state* of the VM, whilst sharing *one set of processor execution resources*.

For this to work, you would need to achieve the same opcode atomicity at the VMI level. Interlocking the threads, so that one shared thread cannot start an opcode until another shared thread has completed its current one, gives this atomicity. The penalty is that if the interlocking is done for every opcode, then shared threads end up with very long virtual timeslices. To prevent that being the case (most of the time), the interlocking should only come into effect *if* concurrent access to a VM level entity is imminent. As the VMI cannot access (modify) the state of a VM level entity (PMC) until it has loaded it into a VM register, the interlocking need only come into effect *if* the entity whose reference is being loaded into a PMC register is currently in use by (another) thread.

The in-use state of a PMC can be flagged by a single bit in its header. This can be detected by a shared thread when the reference to it is loaded into the PMC register, and when it is set, that shared thread then waits on the single, shared mutex before proceeding. It is only when the combination of atomised VM opcodes and lightweight in-use detection come together that the need for a mutex per entity can be avoided. If the mutex used is capable of handling SMP, NUMA, clusters etc., then the mechanism will work. If a lightweight bit test-and-set opcode isn't available, then a heavyweight equivalent could be used, though the advantages would be reduced.

>Sam Vilain, [EMAIL PROTECTED]

I hope that clarifies my thinking and how I arrived at it. I accept that it may not be possible on all platforms, and it may be too expensive on some others. It may even be undesirable in the context of Parrot, but I have seen no argument that goes to invalidate the underlying premise.

Regards, Nigel
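[Editor's note] Nigel's "single bit in the PMC header" scheme amounts to an atomic test-and-set on load. A rough model of just that part, in illustrative Java (names and structure mine - this is a sketch of the idea, not Parrot's design):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Model of a PMC header carrying Nigel's in-use bit. "Loading the
// PMC into a VM register" is an atomic bit test-and-set; a thread
// that loses the race would then wait (e.g. on the single shared
// mutex he describes) until the bit clears.
class Pmc {
    private final AtomicBoolean inUse = new AtomicBoolean(false);

    // Attempt to load this PMC into a register: succeeds only if no
    // other thread currently has it loaded.
    boolean tryLoad() {
        return inUse.compareAndSet(false, true);  // lightweight bit test-and-set
    }

    // Release the PMC when the register no longer references it.
    void release() {
        inUse.set(false);
    }
}
```

The cheap path (bit already clear) costs one atomic instruction per PMC load; only the contended path ever touches the heavyweight mutex - which is the whole point of the proposal.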
Re: Threads: Time to get the terminology straight
On Mon, 05 Jan 2004 12:58, Nigel Sandever wrote:

> Everything else was my attempt at solving the requirements of
> synchronisation that this would require, whilst minimising the
> cost of that synchronisation, by avoiding the need for a mutex on
> every shared entity, and the cost of attempting to acquire a mutex
> except when two SHARED THREADS attempted concurrent access to a
> shared entity.

This paragraph sounds like you're trying to solve an intractable problem. Try posting some pseudocode to explain what you mean. But it has given me an idea that could minimise the number of locks, ie not require a mutex on each PMC, just each shared PMC. 10 points to the person who finds a flaw in this approach :)

Each object you share could create a new (virtual) shared memory segment with its own semaphore. This virtual memory segment is considered its own COW domain (ie, its own thread to the GC); references inserted back to non-shared memory will pull those structures into that virtual COW thread. Access to the entire structure is controlled via a multiple-reader/single-writer lock (close to what a semaphore is, IIRC); locks for a thread are released when references to places inside the shared segment are no longer anywhere in any @_ on the locking thread's call stack, or in use by any opcode (is that good enough?), and are acquired for writing when anything needs to be changed. Virtual shared memory segments can then easily be cleaned up by normal GC.

The major problem I can see is that upgrading a lock from reading to writing can't work if there are concurrent writes (and the read lock to be upgraded cannot sensibly be released). But that should be OK, since operation signatures will mark variables that need changing as read-write as early as possible.
For example, in this sort of code (sorry for P5 code):

    sub changer {
        my $shared_object = shift;
        $shared_object->{bar} = "baz";
    }

A read lock on the segment \$shared_object is in is acquired, then released when it is `shifted' off. As the next instruction has a writable lvalue, it acquires a write lock. But this code:

    sub changer {
        my $shared_object = shift;
        $shared_object->{bar} = &somefunc();
    }

will hold the write lock on $shared_object open until &somefunc returns.

My 2¢ :). This discussion will certainly reach a dollar soon ;). -- Sam Vilain, [EMAIL PROTECTED] Start every day with a smile and get it over with. W C FIELDS
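[Editor's note] The upgrade problem Sam flags here is real in existing reader/writer lock implementations, not just in theory. Java's `ReentrantReadWriteLock` (used below purely as a stand-in for his per-segment lock) explicitly forbids upgrading: a thread holding the read lock can never be granted the write lock, because the grant would have to wait for all readers - including itself - to leave:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Demonstrates the read-to-write upgrade problem: while we hold the
// read lock, a write-lock attempt cannot succeed (blocking lock()
// would deadlock on ourselves; tryLock() just reports failure).
public class UpgradeDemo {
    public static boolean canUpgrade() {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        rw.readLock().lock();                  // reader enters the segment
        try {
            // Upgrade attempt: refused, since the write lock cannot be
            // granted while any read lock (even our own) is held.
            return rw.writeLock().tryLock();
        } finally {
            rw.readLock().unlock();
        }
    }
}
```

Hence Sam's conclusion: writer intent has to be declared up front (acquire the write lock from the start), which is what "mark variables as read-write as early as possible" buys.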
Re: Threads Design. A Win32 perspective.
On Sun, Jan 04, 2004 at 12:17:33PM -0800, Jeff Clites wrote: > What are these standard techniques? The JVM spec does seem to guarantee > that even in the absence of proper locking by user code, things won't > go completely haywire, but I can't figure out how this is possible > without actual locking. (That is, I'm wondering if Java is doing > something clever.) For instance, inserting something into a collection > will often require updating more than one memory location (especially > if the collection is out of space and needs to be grown), and I can't > figure out how this could be guaranteed not to completely corrupt > internal state in the absence of locking. (And if it _does_ require > locking, then it seems that the insertion method would in fact then be > synchronized.)

My understanding is that Java Collections are generally implemented in Java. Since the underlying Java bytecode does not permit unsafe operations, Collections are therefore safe. (Of course, unsynchronized writes to a Collection will probably result in exceptions--but it won't crash the JVM.) For example, insertion into a list might be handled something like this (apologies for rusty Java skills):

    void append(Object new_entry) {
        if (a.length <= size) {
            Object new_a[] = new Object[size * 2];
            for (int i = 0; i < size; i++) {
                new_a[i] = a[i];
            }
            a = new_a;
        }
        a[size++] = new_entry;
    }

If two threads call this function at the same time, they may well leave the list object in an inconsistent state--but there is no way that the above code can cause JVM-level problems.

The key decision in Java threading is to forbid modification of all bytecode-level types that cannot be atomically modified. For example, the size of an array cannot be changed, and strings are constant. If it WERE possible to resize arrays, the above code would require locks to avoid potential JVM corruption--every access to 'a' would need a lock against the possibility that another thread was in the process of resizing it.
It's my understanding that Parrot has chosen to take the path of using many mutable data structures at the VM level; unfortunately, this is pretty much incompatible with a fast or elegant threading model. - Damien
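[Editor's note] The "synchronized insertion method" Jeff alludes to is the other half of the picture: one monitor per list makes the grow-and-store sequence atomic, so concurrent appenders can no longer leave the list inconsistent. A self-contained sketch (class name mine, not from the JDK):

```java
// A synchronized variant of Damien's append example: the whole
// grow-and-store sequence runs under the list's monitor, so two
// concurrent appenders cannot interleave and lose entries.
public class SafeList {
    private Object[] a = new Object[4];
    private int size = 0;

    public synchronized void append(Object newEntry) {
        if (a.length <= size) {
            Object[] newA = new Object[size * 2];   // grow by doubling
            System.arraycopy(a, 0, newA, 0, size);  // copy the elements
            a = newA;                               // publish the grown array
        }
        a[size++] = newEntry;                       // store, still under the lock
    }

    public synchronized int size() {
        return size;
    }
}
```

The cost is one monitor acquisition per call, which is precisely the per-operation overhead the rest of this thread is arguing about for Parrot's mutable PMCs.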
Re: Thread notes
On Mon, 05 Jan 2004 10:13, Dan Sugalski wrote:

[...]
> these things. It's a set of 8 4-processor nodes with a fast
> interconnect between them which functions as a 32 CPU system. The
> four processors in each node are in a traditional SMP setup with a
> shared memory bus, tightly coupled caches, and fight-for-the-bus
[...]

I know what a NUMA system is; I was just a little worried by the combination of the terms SMP and NUMA in the same sentence :). Normally "SMP" means "Shared Everything" - implying Uniform Memory Access. Compared to the terms MPP or AMP (in which different CPUs are put to different tasks), it is true that each node in a NUMA system could be put to any task. So the term "SMP" would seem to fit partially; but the implication with NUMA is that there are clear benefits to *not* having each processor doing *exactly* the same thing, all the time. "CPU affinity" & all that. Groups of processors have also shared a pool of memory in all the NUMA systems I've seen (SGI Origin/Onyx, and the Sun Enterprise servers *must* be, though they don't mention it!). So I'd say that the "SMP" is practically redundant, bordering on confusing. Maybe a term like "8 x 4MP NUMA" is better. I did apologise at the beginning for being pedantic. But hey, didn't this digression serve to elaborate on the meaning of NUMA? :)

> Given the increases in processor vs memory vs bus speeds, this
> setup may not hold for that much longer, as it's only really
> workable when a single CPU doesn't saturate the memory bus with
> any regularity, which is getting harder and harder to do.
> (backplane and memory speeds can be increased pretty
> significantly with a sufficient application of cash, which is why
> the mini and mainframe systems can actually do it, but there are
> limits beyond which cash just won't get you)

Opteron and SPARC IV (IIRC) both have 3 bi-directional high speed (=core speed) interconnects, so these could `easily' be arranged into NUMA configurations with SMP groups. Also, some high-end processors are going multicore, which presumably has different characteristics again (especially if the two chips on the die share a cache!).

Then of course there's single-processor multi-threading (eg, Intel HyperThreading). These systems have twice the registers internally and interleave instructions from each `thread' as the processor can deal with them; using separate registers for them all helps keep the execution units busy (kind of like what GCC does with unrolled loops on a RISC system with more registers in the first place). These perform like little NUMAs, because the cache is `hotter' (ie, locks on those memory pages are held) on the other virtual processor than on other CPUs on the motherboard.

If my understanding is correct, the Intel implementation is not truly SMP, as the other virtual processor must share code segments to run threads. If that is true, doing JIT (or otherwise changing the executable code, eg dlopen()) in a thread might break Hyperthreading. But then again, it might not. Maybe someone who gives a flying fork() would like to devise a test to see if this is the case.

Apparently current dual Opteron systems are also effectively NUMA (as each chip has its own memory controller), but at the moment, NUMA mode with Linux is slower than straight SMP mode. Presumably because it's a bitch to code for ;-)

So these fun systems are here to stay! :)

-- Sam Vilain, [EMAIL PROTECTED] All things being equal, a fat person uses more soap than a thin person. - anon.
Re: Threads: Time to get the terminology straight
On Sun, 4 Jan 2004 15:47:35 -0500, [EMAIL PROTECTED] (Dan Sugalski) wrote:

> *) INTERPRETER - those bits of the Parrot_Interp structure that are
> absolutely required to be thread-specific. This includes the current
> register sets and stack pointers, as well as security context
> information. Basically if a continuation captures it, it's the
> interpreter.
>
> *) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp
> structure that aren't required to be thread-specific (though I'm not
> sure there are any) *PLUS* anything pointed to that doesn't have to
> be thread-specific.
>
> The environment includes the global namespaces, pads, stack chunks,
> memory allocation regions, arenas, and whatnots. Just because the
> pointer to the current pad is thread-specific doesn't mean the pad
> *itself* has to be. It can be shared.
>
> *) SHARED THREAD - A thread that's part of a group of threads sharing
> a common interpreter environment.

Ignoring the implementation of the synchronisation required, the basic premise of my long post was that each SHARED THREAD should have its own INTERPRETER (a VM in my terms), and that these should share a common INTERPRETER ENVIRONMENT.

Simplistically, 5005threads shared an INTERPRETER ENVIRONMENT and a single INTERPRETER. Synchronising threaded access to the shared INTERPRETER (rather than its environment) was the biggest headache. (I *think*.)

With ithreads, each SHARED THREAD has its own INTERPRETER *and* INTERPRETER ENVIRONMENT. This removes the contention for, and the need to synchronise access to, the INTERPRETER, but requires the duplication of shared elements of the INTERPRETER ENVIRONMENT and copy-on-read, with the inherent costs of the duplication at start-up and slow, indirect access to shared entities across the duplicated INTERPRETER ENVIRONMENTs.

My proposal was that each SHARED THREAD should have a separate copy of the INTERPRETER, but share a copy of the INTERPRETER ENVIRONMENT. Everything else was my attempt at solving the requirements of synchronisation that this would require, whilst minimising the cost of that synchronisation, by avoiding the need for a mutex on every shared entity, and the cost of attempting to acquire a mutex except when two SHARED THREADS attempted concurrent access to a shared entity.

I think that by having SHARED THREAD == INTERPRETER, sharing a common INTERPRETER ENVIRONMENT, you can avoid (some of) the problems associated with 5005threads but retain the direct access of shared entities. This imposes its own set of requirements and costs, but (I believe) the ideas that underlie the mechanisms I offered as solutions are sound. The specific implementation is a platform-specific detail that could be pushed down to a lower level.

> ...those bits of the Parrot_Interp
> structure that aren't required to be thread-specific (though I'm not
> sure there are any)

This is where I have a different (and quite possibly incorrect) view. My mental picture of the INTERPRETER ENVIRONMENT includes both the implementation of all the classes in the process *plus* all the memory of every instance of those classes. I think your statement above implies that these would not be a part of the INTERPRETER ENVIRONMENT per se, but would be allocated from the global heap and only referenced from the bytecode that would live in the INTERPRETER ENVIRONMENT?

I realise that this is possible, and maybe even desirable, but the cost of the GC walking a global heap, especially in the situation of a single process that contains two entirely separate instances of the INTERPRETER ENVIRONMENT, would be (I *think*) rather high. I realise that this is a fairly rare occurrence on most platforms, but in the win32 situation of emulated forks, each pseudo-process must have an entirely separate INTERPRETER ENVIRONMENT, potentially with each having multiple SHARED THREADS. If the memory for all entities in all pseudo-processes is allocated from a (real) process-global heap, then the multiple GCs required by the multiple pseudo-processes are going to be walking the same heap. Possibly concurrently.

I realise that this problem (if it is such) does not occur on platforms that have real forks available, but it would be useful if the high level design would allow for the use of separate (virtual) heaps tied to the INTERPRETER ENVIRONMENTs, which win32 has the ability to do.

> Dan

Nigel.
Re: Thread Question and Suggestion -- Matt
On Jan 4, 2004, at 1:58 PM, Matt Fowles wrote:

> Dave Mitchell wrote:
>> Why on earth would they be all one kernel-level thread?
>
> Truth to tell I got the idea from Ruby. As I said, it makes
> synchronization easier, because the interpreter can dictate when
> threads context switch, allowing them to only switch at safe points.
> There are some tradeoffs to this, though. I had forgotten about
> threads calling into C code. Although the example of regular
> expressions doesn't work, as I think those are supposed to compile
> to byte code...

Ah yes, I think you are right about regexes, judging from 'perldoc ops/rx.ops'. I was thinking of a Perl5-style regex engine, in which regex application is a call into compiled code (I believe...).

Jeff
Re: Thread Question and Suggestion -- Matt
Dave Mitchell wrote:

> On Sat, Jan 03, 2004 at 08:24:06PM -0500, Matt Fowles wrote:
>> All~ I have a naive question: Why must each thread have its own
>> interpreter? I understand that this suggestion will likely be
>> disregarded because of the answer to the above question. But here
>> goes anyway... Why not have the threads that share everything share
>> interpreters? We can have these threads be within a single
>> interpreter, thus eliminating the need for complicated GC locking
>> and resource sharing complexity. Because all of these threads will
>> be one kernel level thread...
>
> Why on earth would they be all one kernel-level thread?

Truth to tell I got the idea from Ruby. As I said, it makes synchronization easier, because the interpreter can dictate when threads context switch, allowing them to only switch at safe points. There are some tradeoffs to this, though. I had forgotten about threads calling into C code. Although the example of regular expressions doesn't work, as I think those are supposed to compile to byte code...

Matt
Re: Thread notes
At 10:01 AM +1300 1/5/04, Sam Vilain wrote:

> On Sun, 04 Jan 2004 17:53, Dan Sugalski wrote:
>> Given that it's not a SMP, massively out of order NUMA system with
>> delayed writes... no. 'Fraid not.
>
> Sorry to be pedantic, but I always thought that the NU in NUMA
> implied a contradiction of the S in SMP! "NUMA MP" or "SMP", what
> does it mean to have *both*?

It means you've got loosely coupled clusters of SMP things. For example, if you go buy an Alpha GS3200 32-processor system (assuming DEC^WCompaq^HP still knows how to sell the things) you have one of these things. It's a set of 8 4-processor nodes with a fast interconnect between them which functions as a 32 CPU system. The four processors in each node are in a traditional SMP setup with a shared memory bus, tightly coupled caches, and fight-for-the-bus access to the memory on that node. Access to memory on another node goes over a slower bus, though it still looks and acts like local memory.

Nearly all of the NUMA systems I know of act like this, because it's still feasible to have tightly coupled 2- or 4-CPU SMP systems. The global slowdown generally occurs past that point, so NUMA systems usually group 2 or 4 CPU SMP systems together this way. Given the increases in processor vs memory vs bus speeds, this setup may not hold for that much longer, as it's only really workable when a single CPU doesn't saturate the memory bus with any regularity, which is getting harder and harder to do. (Backplane and memory speeds can be increased pretty significantly with a sufficient application of cash, which is why the mini and mainframe systems can actually do it, but there are limits beyond which cash just won't get you.)

-- Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Thread notes
On Sun, 04 Jan 2004 17:53, Dan Sugalski wrote:

> Given that it's not a SMP, massively out of order NUMA system with
> delayed writes... no. 'Fraid not.

Sorry to be pedantic, but I always thought that the NU in NUMA implied a contradiction of the S in SMP! "NUMA MP" or "SMP", what does it mean to have *both*?

-- Sam Vilain, [EMAIL PROTECTED] What would life be if we had no courage to attempt anything ? VINCENT van GOGH
Re: Threads Design. A Win32 perspective.
At 9:27 AM +1300 1/5/04, Sam Vilain wrote:

> On Sat, 03 Jan 2004 20:51, Luke Palmer wrote:
>> Parrot is platform-independent, but that doesn't mean we can't
>> take advantage of platform-specific instructions to make it faster
>> on certain machines. Indeed, this is precisely what JIT is.
>> But a lock on every PMC is still pretty heavy for those non-x86
>> platforms out there, and we should avoid it if we can.
>
> So implement threading on architectures that don't support interrupt
> masking with completely user-space threading (ie, runloop
> round-robin) like Ruby does. *That* is available on *every* platform.

Interrupt masking and a proper threading interface can be considered a prerequisite for threads of any sort under Parrot, the same way an ANSI C89-compliant compiler is a requirement. Platforms that can't muster at least thread spawning, mutexes, and condition variables don't get threads, and don't have to be considered. (You can consider them, it's just not required, and you'd be hard-pressed to find anything outside the embedded realm that doesn't support at least that level of functionality. And I'm OK if there are no threads on the Gameboy port.)

-- Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Threads Design. A Win32 perspective.
At 3:17 PM -0500 1/4/04, Uri Guttman wrote: > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: DS> And don't forget the libraries that are picky about which thread calls DS> into them -- there are some that require that the thread that created DS> the handle for the library be the thread that calls into the library DS> with that handle. (Though luckily those are pretty rare) And of course DS> the non-reentrant libraries that require a global library lock for all DS> calls otherwise the library state gets corrupted. DS> Aren't threads fun? :) hence my love for events and forked procs. Forks make some of this worse. There are more libraries that don't work with connections forked across processes than libraries that don't work with calls from a different thread. -- Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Threads: Time to get the terminology straight
I think some of the massive back and forth that's going on is in part due to terminology problems, which are in part causing some conceptual problems. So, for the moment, let's agree on the following things:

*) MUTEX - this is a low level, under the hood, not exposed to users, thing that can be locked. They're non-recursive, non-read/write, exclusive things. When a thread gets a mutex, any other attempt to get that mutex will block until the owning thread releases the mutex. The platform-native lock construct will be used for this.

*) LOCK - This is an exposed-to-HLL-code thing that can be locked. Only PMCs can be locked, and the lock may or may not be recursive or read/write.

*) CONDITION VARIABLE - the "sleep until something pings me" construct. Useful for queue construction, always associated with a MUTEX.

*) RENDEZVOUS POINT - A HLL version of a condition variable. *not* associated with a lock -- these are standalone. Note that the mutex/condition association's a POSIX limitation, and POSIX threads is what we have on some platforms. If you want to propose abstracting it away, go for it. The separation doesn't buy us anything, though it's useful in other circumstances.

*) INTERPRETER - those bits of the Parrot_Interp structure that are absolutely required to be thread-specific. This includes the current register sets and stack pointers, as well as security context information. Basically if a continuation captures it, it's the interpreter.

*) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp structure that aren't required to be thread-specific (though I'm not sure there are any) *PLUS* anything pointed to that doesn't have to be thread-specific. The environment includes the global namespaces, pads, stack chunks, memory allocation regions, arenas, and whatnots. Just because the pointer to the current pad is thread-specific doesn't mean the pad *itself* has to be. It can be shared.
*) INDEPENDENT THREAD - A thread that has no contact *AT ALL* with the internal data of any other thread in the current process. Independent threads need no synchronization for anything other than what few global things we have. And the fewer the better, though alas we can't have none at all. Note that independent threads may still communicate back and forth by passing either atomic things (ints, floats, and pointers) or static buffers that can become the property of the destination thread.

*) SHARED THREAD - A thread that's part of a group of threads sharing a common interpreter environment.

Anyway, there's some terminology. It doesn't solve the design problem, but hopefully it'll help everyone talk the same language. Remember that everything from the wrapped OS interface on up is up for grabs -- while we're not going to build our own mutexes or thread scheduler, everything that's been implemented or designed to date can be changed with sufficient good reason. (Though, again, the more you want to change the more spectacular the design has to be)

-- Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
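[Editor's note] Dan's MUTEX / CONDITION VARIABLE pairing ("useful for queue construction, always associated with a MUTEX") is the standard blocking-queue idiom. A minimal sketch in Java dress - the object monitor plays the MUTEX, wait/notify plays the CONDITION VARIABLE; names and structure are mine, not Parrot's:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// The "sleep until something pings me" construct in action: a
// one-lock blocking queue. put() pings; take() sleeps while empty.
public class PingQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();

    public synchronized void put(T item) {     // acquire the MUTEX
        items.addLast(item);
        notify();                              // ping one sleeping consumer
    }

    public synchronized T take() throws InterruptedException {
        while (items.isEmpty()) {              // loop: wakeups can be spurious
            wait();                            // atomically release MUTEX and sleep
        }
        return items.removeFirst();
    }
}
```

The wait-in-a-loop shape is exactly why the condition variable is "always associated with a MUTEX": the emptiness check and the sleep must happen atomically with respect to the producer, or a ping can be lost between them.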
Re: Threads Design. A Win32 perspective.
On Sat, 03 Jan 2004 20:51, Luke Palmer wrote: > Parrot is platform-independent, but that doesn't mean we can't > take advantage of platform-specific instructions to make it faster > on certain machines. Indeed, this is precisely what JIT is. > But a lock on every PMC is still pretty heavy for those non-x86 > platforms out there, and we should avoid it if we can. So implement threading on architectures that don't support interrupt masking with completely user-space threading (i.e., runloop round-robin) like Ruby does. *That* is available on *every* platform. -- Sam Vilain, [EMAIL PROTECTED] Seeing a murder on television... can help work off one's antagonisms. And if you haven't any antagonisms, the commercials will give you some. -- Alfred Hitchcock
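The Ruby-style user-space round-robin Sam describes can be modeled with cooperative tasks that yield after each "op". A minimal, hypothetical Python sketch (all names invented): one kernel thread, a runloop that hands out fixed op slices, and therefore no locking of VM state at all:

```python
from collections import deque

def run_round_robin(tasks, slice_ops=2):
    """Cooperative scheduler: each 'thread' is a generator; the runloop
    executes up to slice_ops 'ops' (yields) per thread before switching.
    Everything runs on one kernel thread, so no mutexes are needed."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            for _ in range(slice_ops):
                trace.append(next(task))   # execute one 'op'
        except StopIteration:
            continue                       # thread finished; drop it
        ready.append(task)                 # still runnable; requeue
    return trace

def counter(name, n):
    """A toy 'VM thread' that performs n ops."""
    for i in range(n):
        yield f"{name}{i}"

trace = run_round_robin([counter("a", 3), counter("b", 3)])
print(trace)  # ['a0', 'a1', 'b0', 'b1', 'a2', 'b2']
```

The tradeoff is exactly the one Leo raises elsewhere in the thread: this cannot use multiple processors.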
Re: Threads Design. A Win32 perspective.
> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: DS> And don't forget the libraries that are picky about which thread calls DS> into them -- there are some that require that the thread that created DS> the handle for the library be the thread that calls into the library DS> with that handle. (Though luckily those are pretty rare) And of course DS> the non-reentrant libraries that require a global library lock for all DS> calls otherwise the library state gets corrupted. DS> Aren't threads fun? :) hence my love for events and forked procs. i even have a solution to that very problem by forking the DBI (or other nonthreaded lib) process and communicating to that via messages. but we still have to support threads. just gonna be messy and i will prolly rarely use them (especially since we will have a core event loop and async i/o (which will prolly use kernel threads (according to dan) but not be parrot threads ) ) (end of lisp text) :-/ uri -- Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com --Perl Consulting, Stem Development, Systems Architecture, Design and Coding- Search or Offer Perl Jobs http://jobs.perl.org
Re: Threads Design. A Win32 perspective.
On Jan 3, 2004, at 8:59 PM, Gordon Henriksen wrote: On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote: Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). Completely specious, and repeatedly proven unwise. Shouldn't even be pursued. Atomic guarantees on collections (or other data structures) are rarely meaningful; providing them is simply a waste of time. Witness the well-deserved death of Java's synchronized Vector class in favor of ArrayList. The interpreter absolutely shouldn't crash due to threading errors—it should protect itself using standard techniques—but it would be a mistake for parrot to mandate that all ops and PMCs be thread-safe. What are these standard techniques? The JVM spec does seem to guarantee that even in the absence of proper locking by user code, things won't go completely haywire, but I can't figure out how this is possible without actual locking. (That is, I'm wondering if Java is doing something clever.) For instance, inserting something into a collection will often require updating more than one memory location (especially if the collection is out of space and needs to be grown), and I can't figure out how this could be guaranteed not to completely corrupt internal state in the absence of locking. (And if it _does_ require locking, then it seems that the insertion method would in fact then be synchronized.) So my question is, how do JVMs manage to protect internal state in the absence of locking? Or do they? JEff
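Jeff's example, an insertion touching more than one memory location (a count plus the storage), is easy to model. A hypothetical Python sketch (not JVM or Parrot code; the class and method names are invented): the two-step update is only guaranteed consistent under a mutex, which is why a lock-free internal-consistency guarantee looks so hard:

```python
import threading

class NaiveVector:
    """Insertion touches two memory locations (count, then storage),
    the non-atomic pattern Jeff describes. Without a lock, interleaved
    threads can write the same slot and desync count from storage."""
    def __init__(self):
        self.count = 0
        self.storage = {}
        self.lock = threading.Lock()

    def append_unsafe(self, x):
        i = self.count          # read the count
        self.storage[i] = x     # write the slot -- another thread may
                                # be writing the same slot i right now
        self.count = i + 1      # write the count back

    def append_locked(self, x):
        with self.lock:         # the 'synchronized' approach
            self.append_unsafe(x)

def hammer(vec, n=10000):
    for _ in range(n):
        vec.append_locked(1)

v = NaiveVector()
threads = [threading.Thread(target=hammer, args=(v,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(v.count == len(v.storage) == 40000)  # True: the lock keeps state consistent
```

With append_unsafe from several threads, the same interleaving can silently lose elements, so the count and the storage drift apart; that corruption, not high-level atomicity, is what internal locking must prevent.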
Re: Problem during "make test"
Leopold Toetsch wrote: Harry Jackson <[EMAIL PROTECTED]> wrote: Can someone tell me if there is an error in the code below. The code is fine. When I run it repeatedly from the command line it sometimes freezes, i.e. it prints the contents of the array and then just stops and I need to do a CTRL-C to get back to the command line. Are you sure that there is no hardware problem? Run memcheck for a couple of hours for example. I managed to compile gcc which is a fairly good indication that my hardware is ok but you never know. I will try memtest86 and see how it goes. They are the same. The first one is PASM syntax, the second is PIR syntax. E.g. running your "imc trouble" code

$ parrot -o- hj.imc
_MAIN:
    new P16, 31    # .PerlArray
    new P17, 36    # .PerlString
    set P16, 10
    set P16[0], "Zero"
    ...

yields the generated PASM code (with variable names allocated to Parrot registers). I tried that as well, it spits out identical PASM each time but on the odd occasion I need to use CTRL-C to get back to the shell. H
Re: Threads Design. A Win32 perspective.
At 12:05 PM -0800 1/4/04, Jeff Clites wrote: On Jan 4, 2004, at 5:47 AM, Leopold Toetsch wrote: Elizabeth Mattijsen <[EMAIL PROTECTED]> wrote: When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl data-structures contain pointers to the libxml data-structures. In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml data-structures. Voila, instant sharing. I see. Our library loading code should take care of that. On thread creation we call again the _init code, so that the external lib can prepare itself to be used from multiple threads. But don't ask me about details ;) But I think we'll never be able to make this work as the user would initially expect. For instance if we have a DBI implementation, and some PMC is holding an external reference to a database cursor for an open transaction, then we can't properly duplicate the necessary state to make the copy of the PMC work correctly (that is, independently). (And I'm not saying just that we can't do it from within parrot, I'm saying the native database libraries can't do this.) And don't forget the libraries that are picky about which thread calls into them -- there are some that require that the thread that created the handle for the library be the thread that calls into the library with that handle. (Though luckily those are pretty rare) And of course the non-reentrant libraries that require a global library lock for all calls otherwise the library state gets corrupted. Aren't threads fun? :) -- Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Threads Design. A Win32 perspective.
On Jan 4, 2004, at 5:47 AM, Leopold Toetsch wrote: Elizabeth Mattijsen <[EMAIL PROTECTED]> wrote: When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl data-structures contain pointers to the libxml data-structures. In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml data-structures. Voila, instant sharing. I see. Our library loading code should take care of that. On thread creation we call again the _init code, so that the external lib can prepare itself to be used from multiple threads. But don't ask me about details ;) But I think we'll never be able to make this work as the user would initially expect. For instance if we have a DBI implementation, and some PMC is holding an external reference to a database cursor for an open transaction, then we can't properly duplicate the necessary state to make the copy of the PMC work correctly (that is, independently). (And I'm not saying just that we can't do it from within parrot, I'm saying the native database libraries can't do this.) So some objects such as this would always have to end up shared (or else non-functional in the new thread), which is bad for users because they have to be concerned with what objects are backed by native libraries and which ones cannot be made to conform to each of our thread styles. That seems like a major caveat. JEff
Re: NCI callback functions
At 8:19 PM +0100 1/4/04, Leopold Toetsch wrote: It's a bit complicated and brain-mangling, the more so in the absence of any examples, but the current design in pdd16 seems to lack some flexibility and is IMHO missing the proper handling of the (library-provided) external_data. The latter will be passed to the Sub somehow, but what then? Well... The current system's simple on purpose, because making sure all the possible callback function signatures are supported would just be a massive pain in the neck. (Not that we're not going well into the nuts category with the current NCI setup--nci.o is 143K on my system right now--but there are limits even for me :) Before we go extend things any more, let's get the current system fleshed out some (heck, let's get it working!) and then use it as a base to proceed. The first order of business is for me to get pdd16's examples a bit better fleshed out so folks know what I'm talking about, and then go from there. -- Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
[ANNOUNCE] Devel::Cover 0.32
This release fixes a few bugs and introduces the concept of runs in the database whereby data for tests are stored separately and merged later when needed. I'm hoping this will speed things up somewhat. - Actually include do test. - Create run concept in database. - Belatedly remove check for Template. - Add branch_return_sub test. - Add finalise_conditions() to collect previously missed coverage. - Fix incorrect coverage results associated with "and" conditions. - Add all_versions utility script. - Put /usr/bin/perl on all shebang lines. A couple of tests fail on Win32, but they are just rounding errors in the percentages. I'll look at that later. Enjoy, -- Paul Johnson - [EMAIL PROTECTED] http://www.pjcj.net
Re: Problem during "make test"
Harry Jackson <[EMAIL PROTECTED]> wrote: > Can someone tell me if there is an error in the code below. The code is fine. > When I run it repeatedly from the command line it sometimes freezes, i.e. it prints > the contents of the array and then just stops and I need to do a CTRL-C > to get back to the command line. Are you sure that there is no hardware problem? Run memcheck for a couple of hours for example. > ...I have noticed that > set a[0], "one" > or > a[0] = "one" > appear to do the same thing. I cannot confirm that they do due to the > bug above. They are the same. The first one is PASM syntax, the second is PIR syntax. E.g. running your "imc trouble" code

$ parrot -o- hj.imc
_MAIN:
    new P16, 31    # .PerlArray
    new P17, 36    # .PerlString
    set P16, 10
    set P16[0], "Zero"
    ...

yields the generated PASM code (with variable names allocated to Parrot registers). > Harry leo
NCI callback functions
It's a bit complicated and brain-mangling, the more so in the absence of any examples, but the current design in pdd16 seems to lack some flexibility and is IMHO missing the proper handling of the (library-provided) external_data. The latter will be passed to the Sub somehow, but what then? Here is (I think) a more flexible approach:

1) A new opcode "callback" (or "register_cb") or such, which works like the current dlfunc opcode:

  callback (out Pcb, in Psub, in Sig)

  Pcb  ... NCI function object for that callback function
  Psub ... Parrot Sub PMC to be called on behalf of that C callback
  Sig  ... Signature of the C-callback function

Sig additionally allows one special signature char "U": user-data, which is "Z" in pdd16, but I can remember "U" for user-data better ;) So

  void (*PQnoticeProcessor)(void *, const char*)

would have Sig "vUt" and call a Parrot function f(P, S). A pdd16-type C callback is e.g. "vpU".

2) Actually registering the callback:

  dlfunc (out Pfunc, in Plib, "func_with_cb", "vCU")
  .pcc_begin prototyped
  .arg Pcb
  .arg P_user_data
  .nci_call Pfunc

That is, instead of passing in the callback and the Parrot Sub ("CY" in pdd16), the PMC obtained from 1) is passed with signature "C". The calling signature again matches the C function which we call. When this function is now called, the action behind the scenes is the same: the passed user_data PMC is combined with the callback PMC obtained from 1) and passed on to the C function. When the C function does the callback, the NCI stub generated in 1) is called, which extracts the Parrot subroutine from the passed user data, passes on the original user PMC, and finally calls the PASM callback function. But as the generated NCI stub in 1) knows the callback signature, this scheme should be appropriate for all callback functions that have at least one "void *" user parameter to be passed on transparently. Comments welcome, leo
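Leo's two-step scheme can be mocked up in miniature. The following Python sketch is purely illustrative (the function names and the fake library are invented, and a Python closure stands in for the generated NCI stub): step 1 builds a stub that knows the signature, step 2 combines the user PMC with the callback before handing both to the library as its void* user parameter:

```python
def make_callback_stub(parrot_sub, signature):
    """Sketch of step 1 (the 'callback' opcode): wrap a Parrot sub in an
    NCI-style stub matching the C callback signature. The stub is what
    the C library will actually invoke."""
    def stub(user_data, *c_args):
        # The stub knows `signature`, extracts the original user PMC
        # from the combined user_data, and calls the PASM sub with it.
        original_user_pmc, sub = user_data
        return sub(original_user_pmc, *c_args)
    stub.signature = signature
    return stub, parrot_sub

def register_with_library(c_func, stub_pair, user_pmc):
    """Sketch of step 2: the combined (user PMC, Parrot sub) blob is
    what gets passed as the library's void* user parameter."""
    stub, sub = stub_pair
    combined = (user_pmc, sub)
    return c_func(stub, combined)

# A pretend C library function taking a callback and a void* user arg:
def fake_c_library_func(cb, user_data):
    return cb(user_data, "notice from C")   # the library fires the callback

def parrot_notice_sub(user_pmc, msg):       # the PASM-level f(P, S)
    return (user_pmc, msg)

pair = make_callback_stub(parrot_notice_sub, "vUt")
result = register_with_library(fake_c_library_func, pair, {"conn": 42})
print(result)  # ({'conn': 42}, 'notice from C')
```

The key property is the one Leo ends on: the library only ever sees an opaque user pointer, so any callback with at least one void* user parameter fits the scheme.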
Re: Threads Design. A Win32 perspective.
At 11:59 PM -0500 1/3/04, Gordon Henriksen wrote: On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote: Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). Completely specious, and repeatedly proven unwise. Shouldn't even be pursued. Erm... that turns out not to be the case. A lot. (Yeah, I know, I said I wasn't paying attention) An interpreter *must* lock any shared data structure, including PMCs, when accessing them. Otherwise they may be in an inconsistent state when being accessed, which will lead to data corruption or process crashing, which is unacceptable. These locks do not have to correspond to HLL-level locks, though it'd be a reasonable argument that they share the same mutex. -- Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Problem during "make test"
Dan Sugalski wrote: Let us know either way -- if upgrading gcc works then we're going to have to figure out how RH/GCC2.96 is breaking things so we can make it not happen. :( I have now upgraded gcc to 3.3.2 and I am getting the same error. We are still freezing during test. I have also noticed something that might be my crap "imc" or related to the problem. Can someone tell me if there is an error in the code below. When I run it repeatedly from the command line it sometimes freezes, i.e. it prints the contents of the array and then just stops and I need to do a CTRL-C to get back to the command line.

.pcc_sub _MAIN prototyped
    .param pmc argv
    .local PerlArray a
    a = new PerlArray
    .local PerlString s
    s = new PerlString
    a = 10
    a[0] = "Zero"
    a[1] = "One"
    a[2] = "Two"
    a[3] = "Three"
    a[4] = "Four"
    a[5] = "Five"
    a[6] = "Six"
    a[7] = "Seven"
    a[8] = "Eight"
    s = a[2]
    print "\n"
    print s
    print "\n"
    end
.end

I have also tried the above code using the "set" syntax and I get the same problem. Are there any recommended examples of IMC in the source tree, and which docs are the most recent? I have noticed that there are a lot of different ways of doing things (typical perl). I am trying to pick it up from the FAQ, some examples and the docs but it's an uphill struggle. For instance I have noticed that set a[0], "one" or a[0] = "one" appear to do the same thing. I cannot confirm that they do due to the bug above. I have got to the point where I am trying to put rows from Postgres into arrays and this is slowing me down a bit. Harry
Re: Threads Design. A Win32 perspective.
At 14:47 +0100 1/4/04, Leopold Toetsch wrote: Elizabeth Mattijsen <[EMAIL PROTECTED]> wrote: > When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl > data-structures contain pointers to the libxml data-structures. > In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml > data-structures. Voila, instant sharing. I see. Our library loading code should take care of that. On thread creation we call again the _init code, so that the external lib can prepare itself to be used from multiple threads. But don't ask me about details ;) What you need, is basically being able to: - register a class method to be called on cloning - register an object method that is called whenever an _object_ is cloned The CLONE sub that Perl5 has, is the class method. The object method is missing from Perl (Thread::Bless is a way to remedy this problem). I don't know what the _init code does, but judging by its name, it's not giving enough info to be able to properly clone an object with external data structures. Liz
Re: Threads Design. A Win32 perspective.
Elizabeth Mattijsen <[EMAIL PROTECTED]> wrote: > When you use an external library in Perl, such as e.g. libxml, you > have Perl data-structures and libxml data-structures. The Perl > data-structures contain pointers to the libxml data-structures. > In comes the starting of an ithread and Perl clones all of the Perl > data-structures. But it _only_ copies things it knows about. > And thus leaves the pointers to the libxml data-structures untouched. > Now you have 2 Perl data-structures that point to the _same_ libxml > data-structures. Voila, instant sharing. I see. Our library loading code should take care of that. On thread creation we call again the _init code, so that the external lib can prepare itself to be used from multiple threads. But don't ask me about details ;) > Liz leo
Re: Extenders interface
Mattia Barbon <[EMAIL PROTECTED]> wrote: > AFAIR nothing but Parrot sources should #include parrot/parrot.h. > The public interface is available through parrot/embed.h and > parrot/extend.h. Correct? Yep. But by far not all necessary interface functions & types are done. > Mattia leo
Re: Thread Question and Suggestion -- Matt
Matt Fowles <[EMAIL PROTECTED]> wrote: > Why not have the threads that share everything share interpreters. We > can have these threads be within a single interpreter thus > eliminating the need for complicated GC locking and resource sharing > complexity. Because all of these threads will be one kernel level > thread, they will not actually run concurrently and there will be no > need to lock them. We will have to implement a rudimentary scheduler in > the interpreter, but I don't think that is actually that hard. Jeff already answered that. Above model is e.g. implemented in Ruby. But we want preemptive threads that can take advantage of multiple processors. > Matt leo
Re: Thread Question and Suggestion -- Matt
Jeff Clites <[EMAIL PROTECTED]> wrote: > On Jan 3, 2004, at 5:24 PM, Matt Fowles wrote: >> Why must each thread have its own interpreter? > The short answer is that the bulk of the state of the virtual machine > (including, and most importantly, its registers and register stacks) > needs to be per-thread, since it represents the "execution context" > which is logically thread-local. Yep. A struct Parrot_Interp has all the information to run one thread of execution. When you start a new VM thread, you need a new Parrot_Interp to run the code. But it depends on the thread type how this new Parrot_Interp is created. The range is from everything new except the opcode stream (type 1 - the nothing-shared thread) to only registers + stacks + some more being distinct for type 4 - the shared-everything case. Perl5 doesn't have a real interpreter structure, it's mainly a bunch of globals. But when compiled with threads enabled, tons of macros convert these to the thread context, which is then passed around as the first argument of API calls - mostly (that's at least how I understand the src). This thread context is our interpreter structure with all the necessary information or state to run a piece of code as the only one or as a distinct thread. > That said, I do think we have a terminology problem, ... > ... It would be clearer to say that we > have two "threads" in one "interpreter", and just note that almost all > of our state lives in the "thread" structure. (That would mean that the > thing which is being passed into all of our API would be called the > thread, not the interpreter, Yep. But the thing passed around happens to be named interpreter, so that's our thread state; whether you run single-threaded or not doesn't matter. A thread-enabled interpreter is created by filling one additional structure "thread_data" with thread-specific items like thread handle or thread ID. But anyway the state is called interpreter. > JEff leo
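Leo's point, that the structure passed as the first argument of every API call *is* the thread state, can be sketched as follows. This is a hypothetical Python model (names invented; real Parrot does this in C with a Parrot_Interp struct): each VM thread gets its own interpreter holding registers and stacks, while the interpreter environment is shared:

```python
import threading

class Interp:
    """Per-thread execution context: registers, stacks, thread_data --
    the bits that a continuation would capture."""
    def __init__(self, shared_env):
        self.registers = [0] * 32
        self.stack = []
        self.env = shared_env        # the shared interpreter environment
        self.thread_data = None      # thread handle / ID lives here

def op_add(interp, dst, a, b):
    # Every 'API call' takes the interpreter (thread context) first,
    # like Perl5's thread-context-as-first-argument convention.
    interp.registers[dst] = interp.registers[a] + interp.registers[b]

shared_env = {"globals": {}}

def vm_thread(results, idx):
    interp = Interp(shared_env)      # a new per-thread context
    interp.thread_data = threading.get_ident()
    interp.registers[0], interp.registers[1] = idx, 10
    op_add(interp, 2, 0, 1)          # each thread computes idx + 10
    results[idx] = interp.registers[2]

results = {}
threads = [threading.Thread(target=vm_thread, args=(results, i)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results.values()))  # [10, 11, 12]
```

The registers never need locking because each thread owns its Interp; only shared_env would, which is Jeff's "almost all of our state lives in the thread structure" observation in miniature.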
Re: Thread Question and Suggestion -- Matt
On Sat, Jan 03, 2004 at 08:24:06PM -0500, Matt Fowles wrote: > All~ > > I have a naive question: > > Why must each thread have its own interpreter? > > > I understand that this suggestion will likely be disregarded because of > the answer to the above question. But here goes anyway... > > Why not have the threads that share everything share interpreters. We > can have these threads be within a single interpreter thus > eliminating the need for complicated GC locking and resource sharing > complexity. Because all of these threads will be one kernel level > thread Why on earth would they be all one kernel-level thread? -- Monto Blanco... scorchio!
Re: Threads Design. A Win32 perspective.
On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote: Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). Completely specious, and repeatedly proven unwise. Shouldn't even be pursued. Atomic guarantees on collections (or other data structures) are rarely meaningful; providing them is simply a waste of time. Witness the well-deserved death of Java's synchronized Vector class in favor of ArrayList. The interpreter absolutely shouldn't crash due to threading errors—it should protect itself using standard techniques—but it would be a mistake for parrot to mandate that all ops and PMCs be thread-safe. The details of threaded programming cannot be hidden from the programmer. It's tempting to come up with clever ways to try, but the engine really has to take a back seat here. Smart programmers will narrow the scope of potential conflicts by reducing sharing of data structures in their threaded programs. Having done so, any atomicity guarantees on individual objects proves to be wasted effort: It will be resented by parrot's users as needless overhead, not praised. Consider the potential usage cases. 1. All objects in a non-threaded program. 2. Unshared objects in a threaded program. 3. Shared objects in a threaded program. The first two cases will easily comprise 99% of all usage. In only the third case are synchronized objects even conceivably useful, and even then the truth of the matter is that they are of extremely limited utility: Their guarantees are more often than not too fine-grained to provide the high-level guarantees that the programmer truly needs. In light of this, the acquisition of a mutex (even a mutex that's relatively cheap to acquire) to push an element onto an array, or to access a string's data—well, it stops looking so good. That said, the interpreter can't be allowed to crash due to threading errors. It must protect itself. 
But should a PerlArray written to concurrently from 2 threads guarantee its state make sense at the end of the program? I say no based upon precedent; the cost is too high. — Gordon Henriksen [EMAIL PROTECTED]
Extenders interface
Hello, I have some questions about which parts of the actual headers are internal, which ones are for embedders and which ones are for extenders. AFAIR nothing but Parrot sources should #include parrot/parrot.h. The public interface is available through parrot/embed.h and parrot/extend.h. Correct?

* types: parrot/embed.h uses Parrot_Interp, Parrot_String, ... parrot/extend.h uses Parrot_INTERP, Parrot_STRING. I like the former more (matches Parrot_Int, Parrot_Float), but the two interfaces must at least agree on type names, whatever they are.

* ParrotIO: It is currently not available. I think that API functions (PIO_open, PIO_read, ...) should be available for embedders (maybe not all of them, but at least some ought to be), while the layer stuff should be available to extenders. Correct? Should they be available as PIO_* or as Parrot_io_*/Parrot_IO_*?

* custom PMCs and vtables: I assume custom PMCs are for extenders. If this is correct, the vtable structure needs to be available to extenders, together with APIs for accessing PMC flags, data pointer, etc. (should that use macros as in Parrot-provided PMCs (PMC_data), or function calls?)

* Various functions: accessing globals, Parrot_runops_fromc*, mem_sys_*, Parrot_load_bytecode and registering IMCC compilers are some examples of functions that (I think) ought to be available in one form or the other, but aren't. Should they be? Thanks! Mattia
Re: Threads Design. A Win32 perspective.
At 00:49 +0100 1/4/04, Leopold Toetsch wrote: Elizabeth Mattijsen <[EMAIL PROTECTED]> wrote: > Indeed. But as soon as there is something special such as a datastructure external to Perl between threads (which becomes "shared" automatically, because Perl doesn't know about > the datastructure, Why is it shared automatically? Do you have an example for that? When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl data-structures contain pointers to the libxml data-structures. In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml data-structures. Voila, instant sharing. With disastrous results. Because as soon as the thread ends, the cloned Perl object in the thread goes out of scope. Perl then calls the DESTROY method on the object, which then frees up the libxml data-structures. That's what it's supposed to do. Meanwhile, back in the original thread, the pointers in the Perl object now point at freed memory, rather than a live libxml data-structure. And chaos ensues sooner or later. Of course, chaos could well ensue before the thread is ended, because both threads think they have exclusive access to the libxml data-structure. Hope this explanation made sense. > ... so the cloned objects point to the same memory address), then you're in trouble. Simply because you now have multiple DESTROYs called on the same external data-structure. If the function of the DESTROY is to free the memory of the external data-structure, you're in trouble as soon as the first thread is > done. ;-( Maybe that DOD/GC can help here. A shared object can and will be destroyed only, when the last holder of that object has released it.
But do you see now how complicated this can become if thread === interpreter? Liz
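The DOD/GC fix Leo suggests, destroying the external structure only when the last holder releases it, is plain reference counting on the wrapper. A minimal Python sketch (ExternalHandle and its methods are invented for illustration; the free_func callback stands in for the C library's cleanup routine):

```python
class ExternalHandle:
    """Wraps a pointer into a C library (e.g. a libxml document).
    Cloning for a new thread bumps a refcount instead of copying the
    pointer blindly, so DESTROY frees the C structure exactly once."""
    def __init__(self, free_func):
        self.refcount = 1
        self.freed = 0
        self.free_func = free_func

    def clone_for_thread(self):
        self.refcount += 1      # both threads share one external struct
        return self

    def destroy(self):
        self.refcount -= 1
        if self.refcount == 0:  # only the last holder really frees
            self.freed += 1
            self.free_func(self)

frees = []
h = ExternalHandle(lambda handle: frees.append(handle))
h2 = h.clone_for_thread()       # ithread clone: same pointer, refcount 2
h2.destroy()                    # thread ends: no free yet
h.destroy()                     # original released: free happens once
print(len(frees), h.freed)  # 1 1
```

Without the refcount, both destroy() calls would hit the C library's free routine, which is exactly the double-DESTROY Liz describes.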
Re: Thread Question and Suggestion -- Matt
Matt Fowles wrote: I understand if this suggestion is dismissed for violating the rules, but I would like an answer to the question simply because I do not know the answer. The most admirable reason for asking a question, and I doubt it will be dismissed. H