Fwd: Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 13 of 20)
--- Forwarded message ---
From: nigelsande...@btconnect.com
To: Dave Whipp - d...@whipp.name +nntp+browseruk+e66dbbe0cf.dave#whipp.n...@spamgourmet.com
Cc:
Subject: Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 13 of 20)
Date: Mon, 17 May 2010 22:31:45 +0100

On Mon, 17 May 2010 20:33:24 +0100, Dave Whipp - dave_wh...@yahoo.com +nntp+browseruk+2dcf7cf254.dave_whipp#yahoo@spamgourmet.com wrote:

From that statement, you do not appear to understand the subject matter of this thread: the Perl 6 concurrency model.

Actually, the reason for my post was that I fear that I did understand the subject matter of the thread: it seems to me that any reasonable discussion of Perl 6 concurrency should not be too focused on pthreads-style threading.

Okay. Now we're at cross-purposes about the heavily overloaded term "threading". Whilst GPUs overload the term threading for their internal operations, those are for the most part invisible to the applications programmer. And quite different to the 100,000-thread demos in the Go and Erlang documentation to which I referred. The latter are MIMD algorithms, and significantly harder to find applications for than SIMD algorithms, which are commonplace and well understood.

My uses of the terms threading and threads are limited specifically to MIMD threading of two forms:

Kernel threading: pthreads, Win32/64 threads, etc.
User-space threading: green threads; coroutines; goroutines; Actors; etc.

See below for why I've been limiting myself to these two definitions.

OpenCL/CUDA are not exotic $M hardware: they are available (and performant) on any PC (or Mac) that is mainstream or above. Millions of threads is not a huge number: it's one thread per pixel on a 720p video frame (and I see no reason, other than performance, not to use Perl 6 for image processing).
If the discussion is strictly limited to abstracting remote procedure calls, then I'll back away. But the exclusion of modules that map hyper-operators (and feeds, etc.) to OpenCL from the generic concept of Perl 6 concurrency seems rather blinkered.

FWIW, I absolutely agree with you that the mapping between Perl 6 hyper-operators and (GPU-based or otherwise) SIMD instructions is a natural fit. But, in your post above you said:

Pure SIMD (vectorization) is insufficient for many of these workloads: programmers really do need to think in terms of threads (most likely mapped to OpenCL or Cuda under the hood).

By which I took you to mean that in-box SIMD (be it x86/x64 CPU or GPU SIMD instruction sets) was insufficient for many of the[se] workloads you were considering. And therefore took you to be suggesting that Perl 6 should also be catering for the heterogeneous aspects of OpenCL in core. I now realise that you were distinguishing between CPU SIMD instructions and GPU SIMD instructions.

But the real point here is that Perl 6 doesn't need a threading model to use and benefit from GPU SIMD. Any bog-standard single-threaded process can benefit from using CUDA, or the homogeneous aspect of OpenCL where available, for SIMD algorithms. Their use can be entirely transparent to the language semantics for built-in operations like the hyper-operators. Ideally, the Perl 6 runtime would implement roles for OpenCL or CUDA for hyper-operations; fall back to CPU SIMD instructions; and fall back again to old-fashioned loops if neither were available. This would all be entirely transparent to the Perl 6 programmer, just as utilising discrete FPUs was transparent to the C programmer back in the day.
In an ideal world, Perl 6.0.0.0.0 would ship with just the looping hyper-operator implementation; and it would be down to the user to load an appropriately named Role, matched to the hardware's capabilities, that would then be transparently picked up and used by the hyper-operations to give them CPU-SIMD or GPU-SIMD as available. Or perhaps these would become perl6 build-time configuration options.

The discussion (which originally started outside of this list) was about MIMD threading--the two categories above--in order to utilise the multiple *C*PU cores that are now ubiquitous. For this, Perl 6 does need to sort out a threading model. The guts of the discussion has been whether kernel threading (and mutable shared state) is necessary. The perception is that by using user-threading (on a single core at a time), you avoid the need for, and complexities of, locking and synchronisation.

And one of the (I believe spurious) arguments for the use of user-space (MIMD) threading is that such threads are lightweight, which allows you to run thousands of concurrent threads. And it does. I've done it with Erlang right here on my dirt-cheap Intel Core2 Quad Q6600 processor. But, no matter how hard you try, you can never push the CPU utilisation above 25%, because those 100,000 user-threads all run in a single kernel thread, on a single core.
Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 14 of 20)
On Tue, 18 May 2010 11:39:04 +0100, Daniel Ruoso dan...@ruoso.com wrote:

This is the point I was trying to address, actually. Having *only* explicitly shared variables makes it very cumbersome to write threaded code, especially because explicitly shared variables have a lot of restrictions on what they can be (this is from my experience in Perl 5 and SDL, which was what brought me to the message-passing idea).

Well, do not base anything upon the restrictions and limitations of the Perl 5 threads/shared modules. They are broken-by-design in so many ways that they are not a good reference point. That particular restriction--what a :shared var can and cannot hold--is in some cases just an arbitrary restriction for no good reason that I can see. For example: that file handles cannot be assigned to :shared vars is totally arbitrary. This can be demonstrated in two ways:

1) If you pass the fileno of the filehandle to a thread and have it dup(2) a copy, then it can use it concurrently with the originating thread without problems--subject to the obvious locking requirements.

2) I've previously hacked the sources to bypass this restriction by adding SVt_PVGV to the switch in the following function:

    SV *
    Perl_sharedsv_find(pTHX_ SV *sv) {
        MAGIC *mg;
        if (SvTYPE(sv) >= SVt_PVMG) {
            switch(SvTYPE(sv)) {
            case SVt_PVAV:
            case SVt_PVHV:
            case SVt_PVGV:  /* !!! */
                if ((mg = mg_find(sv, PERL_MAGIC_tied))
                    && mg->mg_virtual == &sharedsv_array_vtbl) {
                    return ((SV *)mg->mg_ptr);
                }
                break;
            default:
                /* This should work for elements as well as they
                 * have scalar magic as well as their element magic */
                if ((mg = mg_find(sv, PERL_MAGIC_shared_scalar))
                    && mg->mg_virtual == &sharedsv_scalar_vtbl) {
                    return ((SV *)mg->mg_ptr);
                }
                break;
            }
        }
        /* Just for tidyness of API also handle tie objects */
        if (SvROK(sv) && sv_derived_from(sv, "threads::shared::tie")) {
            return (S_sharedsv_from_obj(aTHX_ sv));
        }
        return (NULL);
    }

And with that one change, sharing file/directory handles in Perl 5 became possible and worked.
The problem is, GVs can hold far more than just those handles. And many of the glob-modules utilise the other slots in a GV (array, hash, scalar, etc.) for storing state, and bless them as objects. At that point--when I tried the change--there was a conflict between the blessing that shared.xs uses to make sharing work and any other type of blessing. The net result was that whilst the change lifted the restriction upon simple globs, it still didn't work with many of the most useful glob-based modules--IO::Socket::*; HTTP::Daemon; etc. I guess that now the sharing of blessed objects has been made possible, I should try the hack again and see if it would allow those blessed globs to work.

Anyway, the point is that the limitations and restrictions of the Perl 5 implementation of the iThreads model should not be considered as fundamental problems with the iThreads model itself. They aren't.

However, interpreters already have to detect closed-over variables in order to 'lift' them and extend their lifetimes beyond their natural scope.

Actually, the interpreter might choose to implement the closed-over variables by keeping the entire associated scope alive when it is still referenced by another value, i.e.:

    {
        my $a;
        {
            my $b = 1;
            { $a = sub { $b++ } }
        }
    }

This would happen by having every lexical scope hold a reference to its outer scope, so when a scope in the middle exits, but some coderef was returned keeping it as its lexical outer, the entire scope would be kept. This means two things:

1) the interpreter doesn't need to detect the closed-over variables, so even string-eval'ed access to such variables would work (which is, imho, a good thing)

You'd have to explain further for me to understand why it is necessary to keep whole scopes around:
- in order to make closures accessible from string-eval;
- and why that is desirable?

2) all the values in that lexical scope are also preserved with the closure, even if they won't be used (which is a bad thing).

Please no!
:) This is essentially the biggest problem with the Perl 5 iThreads implementation. It is the *need* (though I have serious doubts that it is actually a need, even for Perl 5) to CLONE entire scope stacks every time you spawn a thread that makes them costly to use: both because of the time it takes to perform the clone at spawn time, and because of the memory used to keep copies of all that stuff that simply isn't wanted--and in many cases isn't even accessible. AFAIK, going by what I can find about the history of iThreads development, this was only done in Perl 5 in order to provide the Windows fork emulation. But as a
Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 18 of 20)
On Tue, 18 May 2010 11:41:08 +0100, Daniel Ruoso dan...@ruoso.com wrote:

On Sun, 2010-05-16 at 19:34 +0100, nigelsande...@btconnect.com wrote:

Interoperability with Perl 5 and its reference counting should not be a high priority in the decision-making process for defining the Perl 6 concurrency model.

If we drop that requirement then we can simply go to the we-can-spawn-as-many-os-threads-as-we-want model.

I do not see that as a requirement. But I am painfully aware that I am playing catch-up with all the various versions, flavours and colors of Perl 6 interpreter. And, more importantly, the significance of each of them. When I recently started following #perl6 I was blown away (and totally confused) by all the various flavours that the eval bot responded to.

The funny thing is that I have a serious soft spot for the timeliness of reference-counting GC. And I recently came across a paper on a new RCGC that claimed to address the circular reference problem without resorting to weak references or other labour-intensive mechanisms; nor a stop-the-world GC cycle. I scanned the paper and it was essentially a multi-pass coloring scheme, but it achieved better performance than most by a) running locally (to scopes, I think) so that it had far fewer arenas to scan; and b) using an innovative coloring scheme that meant it was O(N) rather than the usual O(N * M).

Most of it went over my head (as is often the case with academic papers), but it seems real. But I think that is a boat that has long sailed for Perl 6?
Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 20 of 20 -last one!-)
On Fri, 14 May 2010 17:35:20 +0100, B. Estrade - estr...@gmail.com +nntp+browseruk+c4c81fb0fa.estrabd#gmail@spamgourmet.com wrote:

The future is indeed multicore--or, rather, *many*-core. What this means is that however the hardware jockeys have to strap them together on a single node, we'll be looking at the ability to invoke hundreds (or thousands) of threads on a single SMP machine.

There are very few algorithms that actually benefit from using even low hundreds of threads, let alone thousands. The ability of Erlang (and Go and Io and many others) to spawn 100,000 threads makes an impressive demo for the uninitiated, but finding practical uses of such abilities is very hard.

One example cited is that of gaming software that runs each sprite in a separate thread. The claim is that this simplifies code because each sprite only has to respond to situations directly applicable to it, rather than some common sprite handler having to select which sprite to operate upon. But all it does is move the goal posts. You either have to select which sprite to send a message to; or send a message to the sprite handler and have it select the sprite to operate upon. A third technique is to send the message to all the sprites and have them decide if it is applicable to them. But it still requires a loop, and you then have the communications overhead * 100,000 plus the context switch costs * 100,000. The numbers do not add up.

Then, inevitably, someone will want to strap these together into a cluster, thus making message passing an attractive way to glue related threads together over a network.

Getting back to the availability of many threads on a single SMP box: issues of data locality and affinity and thread binding will become of critical importance.

Perhaps surprisingly, these are not the issues they once were.
Whilst cache misses are horribly expensive, the multi-layered caching in modern CPUs combines with deep pipelines, branch prediction, register renaming and other features in ways that are beyond the ability of the human mind to reason about. For a whirlwind introduction to the complexities, see the short video here: http://www.infoq.com/presentations/click-crash-course-modern-hardware

The only way to test the effects is to profile, and most of the research into the effects of cache locality tends to be done in isolation from real-world application mixes. Very few machines, even servers of various types, run a single application these days. This is even truer as server virtualisation becomes ubiquitous. Mix in a soupçon of virtual server load-balancing, and trying to code for cache locality becomes almost impossible.

These issues are closely related to the operating system's capabilities and paging policies, but eventually (hopefully) current, provably beneficial strategies will be available on most platforms.

Brett
Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 9 of 20)
On Fri, 14 May 2010 10:01:41 +0100, Ruud H.G. van Tol - rv...@isolution.nl +nntp+browseruk+014f2ed3f9.rvtol#isolution...@spamgourmet.com wrote:

The support of threading should be completely optional. The threading support should not be active by default.

I'd like to understand why you say that? Two reasons I can think of:

1: Performance. The perception that adding support for threading will impact the performance of non-threaded applications.

If you don't use threads, the presence of the ability to use them if you need to will not affect you at all. The presence of Unicode support will have a far more measurable effect upon performance. And it will be unavoidable.

2: Complexity. The perception that the presence of threading support will complicate non-threaded apps.

Again, the presence of Unicode support adds far more complexity to the mix than that for threading. But with either, if you choose not to use it, you shouldn't even be aware of its presence. Do you believe that Unicode support should be dropped?

See also http://www.ibm.com/developerworks/linux/library/l-posix1.html and fathom why "Threads are fun" reads to me like how a drug dealer lures you to at least try it once.

To me, that reads far more like some of the advocacy I've seen for giving blood. If you're squeamish, get a friend to distract you, or listen to some good music whilst they put the needle in.

Rather fork-join! For platforms where fork is native, it doesn't go away just because threads support is present. (Do Perl 6 hyper-operators need pthreads?)

Buk.
Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 9 of 20)
On Fri, 14 May 2010 15:05:44 +0100, B. Estrade estr...@gmail.com wrote:

On Fri, May 14, 2010 at 12:27:18PM +0100, nigelsande...@btconnect.com wrote:

On Fri, 14 May 2010 10:01:41 +0100, Ruud H.G. van Tol - rv...@isolution.nl +nntp+browseruk+014f2ed3f9.rvtol#isolution...@spamgourmet.com wrote:

The support of threading should be completely optional. The threading support should not be active by default.

I'd like to understand why you say that? Two reasons I can think of:

1: Performance. The perception that adding support for threading will impact the performance of non-threaded applications.

I think that perhaps he's thinking of the overhead associated with spawning and managing threads--even just one...so, if only 1 thread bound to a single core is desired, then I think this is a reasonable and natural thing to want. Maybe the core binding on an SMP box would be the more challenging issue to tackle. Then, again, this is the role of the OS and libnuma (on Linux, anyway)...

Hm. Every process gets one thread by default. There is no overhead there. And spawning 1000 (do-nothing-but-sleep) threads takes 0.171 seconds?

Buk.
Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 5 of 20)
This should be a reply to Daniel Ruoso's post above, but I cannot persuade my nntp reader to reply to a post made before I subscribed here. Sorry.

On Wed, 12 May 2010 14:16:35 +0100, Daniel Ruoso dan...@ruoso.com wrote:

I have 3 main problems with your thinking.

1: You are conflating two fundamentally different views of the problem:

a) The Perl 6 programmer's semantic view.
b) The P6 compiler (writers') implementation view.

These two views need to be kept cleanly separated in order that the reference implementation does not define the *only possible* implementation. But it is important that when designing the semantic view, it is done with considerable regard for what /can/ be implemented.

2: You appear to be taking your references at face value.

For example, you've cited Erlang as one of your reference points. And the Erlang docs describe the units of concurrency as processes, with the parallelism provided by Erlang and not the host operating system. But if I run one of the Erlang examples, http://www.erlang.org/examples/small_examples/tetris.erl , it uses two processes: one with 13 OS threads, and the other with 5 OS threads; even if I only run tetris:start(1).

Until recently, Erlang did not use OS threads, relying instead upon an internal, user-space scheduler--green threading--though you may find some denials of that by Erlangers, because of the unfavorable comparison with Java green threading in Java versions 1 thru 4. But recent versions have implemented multiple OS threads, each running a coroutine scheduler. They had to do this in order to achieve SMP scaling. Here is a little salient information:

The Erlang VM without SMP support has 1 scheduler which runs in the main process thread. The scheduler picks runnable Erlang processes and IO-jobs from the run-queue and there is no need to lock data structures since there is only one thread accessing them. The Erlang VM with SMP support can have 1 to many schedulers which are run in 1 thread each.
The schedulers pick runnable Erlang processes and IO-jobs from one common run-queue. In the SMP VM all shared data structures are protected with locks; the run-queue is one example of a data structure protected with locks.

Lock-free at the semantic level is a nice-to-have. But whenever you have kernel threads talking to each other through shared memory--and you have to have that if you are going to achieve SMP scalability--then there will be some form of locking required. All talk of message-passing protocols is simply disguising the realities of the implementation. That is not a bad thing from the applications programmer's point of view--nor even the language designer's POV--but it still leaves the problem to be dealt with by the language system implementers.

Whilst lock-free queues are possible--there are implementations of these available for Java 5 (which, of necessity, and to great effect, has now moved away from green threads and gone the kernel threading route)--they are very, very hardware-dependent, relying as they do upon CAS, which not all processor architectures support and not all languages give adequate access to. For a very interesting, if rather long, insight into some of this, see Cliff Click's video about fast wait-free hashtables: http://www.youtube.com/watch?v=WYXgtXWejRM&feature=player_embedded

One thing to note, if you watch it all the way through, is that your claim (in an earlier revision?) that shared memory doesn't scale is incorrect in the light of this video, where 786 SMP processors are using a hash for caching at very high speed.

3: By conflating the POVs of the semantic design and the implementation, you are in danger of reinventing several bad wheels, badly.

a) A green threading scheduler: The Java guys spent a long time trying to get theirs right before abandoning it. The Erlang guys have taken a long time tuning theirs, but due to Moore's Law running out, they have had to bow to the inevitability of kernel threading.
And are now having to go through the pain of understanding and addressing how multiple event driven and cooperative schedulers running under the control of (various) preemptive scheduler(s) interact. Even Haskell has to use kernel threading: In GHC, threads created by forkIO are lightweight threads, and are managed entirely by the GHC runtime. Typically Haskell threads are an order of magnitude or two more efficient (in terms of both time and space) than operating system threads. The downside of having lightweight threads is that only one can run at a time, so if one thread blocks in a foreign call, for example, the other threads cannot continue. The GHC runtime works around this by making use of full OS threads where necessary. When the program is built with the -threaded option (to link against the