Fwd: Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 13 of 20)

2010-05-18 Thread nigelsandever



--- Forwarded message ---
From: nigelsande...@btconnect.com
To: Dave Whipp - d...@whipp.name  
+nntp+browseruk+e66dbbe0cf.dave#whipp.n...@spamgourmet.com

Cc:
Subject: Re: Parallelism and Concurrency was Re: Ideas for a  
Object-Belongs-to-Thread threading model (nntp: message 13 of 20)

Date: Mon, 17 May 2010 22:31:45 +0100

On Mon, 17 May 2010 20:33:24 +0100, Dave Whipp - dave_wh...@yahoo.com
+nntp+browseruk+2dcf7cf254.dave_whipp#yahoo@spamgourmet.com wrote:


From that statement, you do not appear to understand the subject matter
of this thread: the Perl 6 concurrency model.

Actually, the reason for my post was that I fear that I did understand  
the subject matter of the thread: seems to me that any reasonable  
discussion of perl 6 concurrency should not be too focused on  
pthreads-style threading.


Okay. Now we're at cross-purposes about the heavily overloaded term
threading. Whilst GPUs overload the term threading for their internal
operations, those are for the most part invisible to the applications
programmer. And they are quite different from the 100,000-thread demos in
the Go and Erlang documentation to which I referred. The latter are MIMD
algorithms, and it is significantly harder to find applications for them
than for SIMD algorithms, which are commonplace and well understood.

My uses of the terms threading and threads are limited specifically to
MIMD threading of two forms:

Kernel threading: pthreads, Win32/64 threads etc.
User-space threading: green threads; coroutines; goroutines; Actors; etc.

See below for why I've been limiting myself to these two definitions.

OpenCL/CUDA are not exotic $M hardware: they are available (and  
performant) on any PC (or Mac) that is mainstream or above. Millions of  
threads is not a huge number: it's one thread per pixel on a 720p video  
frame (and I see no reason, other than performance, not to use Perl 6 for  
image processing).


If the discussion is strictly limited to abstracting remote procedure  
calls, then I'll back away. But excluding modules that map  
hyper-operators (and feeds, etc.) to OpenCL from the generic concept of  
Perl 6 concurrency seems rather blinkered.





FWIW, I absolutely agree with you that the mapping between Perl 6
hyper-operators and (GPU-based or otherwise) SIMD instructions is a
natural fit. But, in your post above you said:

Pure SIMD (vectorization) is insufficient for many of these workloads:
programmers really do need to think in terms of threads (most likely
mapped to OpenCL or Cuda under the hood).

By which I took you to mean that in-box SIMD (be it x86/x64 CPU or GPU
SIMD instruction sets) was insufficient for many of the[se] workloads
you were considering. And I therefore took you to be suggesting that
Perl 6 should also cater for the heterogeneous aspects of OpenCL in
core.

I now realise that you were distinguishing between CPU SIMD instructions
and GPU SIMD instructions. But the real point here is that Perl 6 doesn't
need a threading model to use, and benefit from, GPU SIMD.

Any bog-standard single-threaded process can benefit from using CUDA or
the homogeneous aspect of OpenCL, where available, for SIMD algorithms.
Their use can be entirely transparent to the language semantics for
built-in operations like the hyper-operators. Ideally, the Perl 6 runtime
would implement roles for OpenCL or CUDA for hyper-operations; fall back
to CPU SIMD instructions; and fall back again to old-fashioned loops if
neither were available. This would all be entirely transparent to the
Perl 6 programmer, just as utilising discrete FPUs was transparent to the
C programmer back in the day. In an ideal world, Perl 6.0.0.0.0 would ship
with just the looping hyper-operator implementation; and it would be down
to users loading an appropriately named role matching the hardware's
capabilities, which would then get transparently picked up and used by the
hyper-operations to give them CPU-SIMD or GPU-SIMD as available. Or
perhaps these would become Perl 6 build-time configuration options.

The discussion (which originally started outside of this list), was about
MIMD threading--the two categories above--in order to utilise the multiple
*C*PU cores that are now ubiquitous. For this Perl 6 does need to sort out
a threading model.

The guts of the discussion has been whether kernel threading (and mutable
shared state) is necessary. The perception is that by using user-threading
(on a single core at a time), you avoid the need for, and complexities of,
locking and synchronisation. And one of the (I believe spurious) arguments
for the use of user-space (MIMD) threading is that it is lightweight,
which allows you to run thousands of concurrent threads.

And it does. I've done it with Erlang right here on my dirt-cheap Intel
Core2 Quad Q6600 processor. But, no matter how hard you try, you can never
push the CPU utilisation above 25%, because those 100,000 user-threads all
run in 

Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 14 of 20)

2010-05-18 Thread nigelsandever

On Tue, 18 May 2010 11:39:04 +0100, Daniel Ruoso dan...@ruoso.com wrote:


This is the point I was trying to address, actually. Having *only*
explicitly shared variables makes it very cumbersome to write threaded
code, especially because explicitly shared variables have a lot of
restrictions on what they can be (this is from my experience in Perl 5
and SDL, which was what brought me to the message-passing idea).



Well, do not base anything upon the restrictions and limitations of the  
Perl 5 threads/shared modules. They are broken-by-design in so many ways  
that they are not a good reference point. That particular  
restriction--what a :shared var can and cannot hold--is in some cases just  
an arbitrary restriction for no good reason that I can see.


For example: the rule that file handles cannot be assigned to :shared vars  
is totally arbitrary. This can be demonstrated in two ways:


1) If you pass the fileno of the filehandle to a thread and have it dup(2)  
a copy, then it can use it concurrently with the originating thread  
without problems--subject to the obvious locking requirements.


2) I've previously hacked the sources to bypass this restriction by adding  
SVt_PVGV to the switch in the following function:



SV *
Perl_sharedsv_find(pTHX_ SV *sv)
{
    MAGIC *mg;
    if (SvTYPE(sv) >= SVt_PVMG) {
        switch(SvTYPE(sv)) {
        case SVt_PVAV:
        case SVt_PVHV:
        case SVt_PVGV: /* !!! */
            if ((mg = mg_find(sv, PERL_MAGIC_tied))
                && mg->mg_virtual == &sharedsv_array_vtbl) {
                return ((SV *)mg->mg_ptr);
            }
            break;
        default:
            /* This should work for elements as well as they
             * have scalar magic as well as their element magic
             */
            if ((mg = mg_find(sv, PERL_MAGIC_shared_scalar))
                && mg->mg_virtual == &sharedsv_scalar_vtbl) {
                return ((SV *)mg->mg_ptr);
            }
            break;
        }
    }
    /* Just for tidyness of API also handle tie objects */
    if (SvROK(sv) && sv_derived_from(sv, "threads::shared::tie")) {
        return (S_sharedsv_from_obj(aTHX_ sv));
    }
    return (NULL);
}

And with that one change, sharing file/directory handles in Perl 5 became  
possible and worked.


The problem is, GVs can hold far more than just those handles. And many of  
the glob-modules utilise the other slots in a GV (array/hash/scalar etc.)  
for storing state, and bless them as objects. At that point--when I tried  
the change--there was a conflict between the blessing that Shared.XS uses  
to make sharing work and any other type of blessing. The net result was  
that whilst the change lifted the restriction upon simple globs, it still  
didn't work with many of the most useful glob-based modules--IO::Socket::*;  
HTTP::Daemon; etc. I guess that now the sharing of blessed objects has  
been made possible, I should try the hack again and see if it would allow  
those blessed globs to work.


Anyway, the point is that the limitations and restrictions of the Perl 5  
implementation of the iThreads model should not be considered fundamental  
problems with the iThreads model itself. They aren't.


However, interpreters already have to detect closed-over variables in
order to 'lift' them and extend their lifetimes beyond their natural
scope.


Actually, the interpreter might choose to implement the closed-over
variables by keeping the entire associated scope alive while it is still
referenced by another value, i.e.:

 { my $a;
   { my $b = 1;
     { $a = sub { $b++ } }
   }
 }

This would happen by having every lexical scope hold a reference to its
outer scope; so when a scope in the middle exits, but some coderef was
returned keeping it as its lexical outer, the entire scope chain would
be kept.

This means two things:

1) the interpreter doesn't need to detect the closed over variables, so
even string eval'ed access to such variables would work (which is, imho,
a good thing)


You'd have to explain further for me to understand why it is necessary to  
keep whole scopes around:

- in order to make closures accessible from string-eval;
- and why that is desirable?



2) all the values in that lexical scope are also preserved with the
closure, even if they won't be used (which is a bad thing).



Please no! :)

This is essentially the biggest problem with the Perl 5 iThreads  
implementation. It is the *need* (though I have serious doubts that it is  
actually a need, even for Perl 5) to CLONE entire scope stacks every time  
you spawn a thread that makes them costly to use: both because of the time  
it takes to perform the clone at spawn time, and the memory used to keep  
copies of all that stuff that simply isn't wanted--and in many cases isn't  
even accessible. AFAIK, going by what I can find about the history of  
iThreads development, this was only done in Perl 5 in order to provide the  
Windows fork emulation.


But as a 

Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 18 of 20)

2010-05-18 Thread nigelsandever

On Tue, 18 May 2010 11:41:08 +0100, Daniel Ruoso dan...@ruoso.com wrote:


On Sun, 2010-05-16 at 19:34 +0100, nigelsande...@btconnect.com wrote:

Interoperability with Perl 5 and its reference counting should not be a  
high priority in the decision-making process for defining the Perl 6  
concurrency model.


If we drop that requirement then we can simply go to the
we-can-spawn-as-many-os-threads-as-we-want model..



I do not see that as a requirement. But, I am painfully aware that I am  
playing catch-up with all the various versions, flavours and colors of  
Perl 6 interpreter. And more importantly, the significance of each of them.  
When I recently started following #perl6 I was blown away (and totally  
confused) by all the various flavours that the eval bot responded to.


The funny thing is that I have a serious soft spot for the timeliness of  
reference-counting GC. And I recently came across a paper on a new RCGC  
that claimed to address the circular reference problem without resorting  
to weak references or other labour-intensive mechanisms; nor a  
stop-the-world GC cycle. I scanned the paper and it was essentially a  
multi-pass coloring scheme, but achieved better performance than most by


a) running locally (to scopes, I think) so that it had far fewer arenas to  
scan;
b) using an innovative coloring scheme that meant it was O(N) rather than  
the usual O(N * M).


Most of it went over my head (as is often the case with academic papers),  
but it seems real. But I think that is a boat that has long sailed for  
Perl 6?




daniel



Re: Parallelism and Concurrency was Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 20 of 20 -last one!-)

2010-05-16 Thread nigelsandever
On Fri, 14 May 2010 17:35:20 +0100, B. Estrade - estr...@gmail.com  
+nntp+browseruk+c4c81fb0fa.estrabd#gmail@spamgourmet.com wrote:



The future is indeed multicore - or, rather, *many-core. What this
means is that however the hardware jockeys have to strap them together
on a single node, we'll be looking at the ability to invoke hundreds
(or thousands) of threads on a single SMP machine.


There are very few algorithms that actually benefit from using even low  
hundreds of threads, let alone thousands. The ability of Erlang (and Go,  
and Io, and many others) to spawn 100,000 threads makes an impressive demo  
for the uninitiated, but finding practical uses for such abilities is very  
hard.


One example cited is that of gaming software that runs each sprite in a  
separate thread. The claim is that this simplifies code, because each  
sprite only has to respond to situations directly applicable to it, rather  
than some common sprite handler having to select which sprite to operate  
upon. But all it does is move the goal posts. You either have to select  
which sprite to send a message to; or send a message to the sprite  
handler and have it select the sprite to operate upon.


A third technique is to send the message to all the sprites and have them  
decide if it is applicable to them. But it still requires a loop, and you  
then have the communications overhead * 100,000 plus the context switch  
costs * 100,000. The numbers do not add up.



Then, inevitably,
*someone will want to strap these together into a cluster, thus making
message passing an attractive way to glue related threads together
over a network.  Getting back to the availability of many threads on a
single SMP box, issues of data locality and affinity and thread
binding will become of critical importance.


Perhaps surprisingly, these are not the issues they once were. Whilst  
cache misses are horribly expensive, the multi-layered caching in modern  
CPUs combines with deep pipelines, branch prediction, register renaming  
and other features in ways that are beyond the ability of the human mind  
to reason about.


For a whirlwind introduction to the complexities, see the short video here:

http://www.infoq.com/presentations/click-crash-course-modern-hardware

The only way to test the effects is to profile, and most of the research  
into the effects of cache locality tends to be done in isolation from  
real-world application mixes. Very few machines, even servers of various  
types, run a single application these days. This is even truer as server  
virtualisation becomes ubiquitous. Mix in a soupçon of virtual-server  
load-balancing, and trying to code for cache locality becomes almost  
impossible.



These issues are closely
related to the operating system's capabilities and paging policies, but
eventually (hopefully) current, provably beneficial strategies will be
available on most platforms.

Brett



Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 9 of 20)

2010-05-14 Thread nigelsandever
On Fri, 14 May 2010 10:01:41 +0100, Ruud H.G. van Tol - rv...@isolution.nl  
+nntp+browseruk+014f2ed3f9.rvtol#isolution...@spamgourmet.com wrote:





The support of threading should be completely optional. The threading  
support should not be active by default.


I'd like to understand why you say that?

Two reasons I can think of:

1: Performance. The perception that adding support for threading will  
impact the performance of non-threaded applications.


If you don't use threads, the presence of the ability to use them if you  
need to will not affect you at all.
The presence of Unicode support will have a far more measurable effect  
upon performance. And it will be unavoidable.


2: Complexity. The perception that the presence of threading support will  
complicate non-threaded apps.


Again, the presence of Unicode support adds far more complexity to the mix  
than that for threading.
But with either, if you choose not to use it, you shouldn't even be aware  
of its presence.


Do you believe that Unicode support should be dropped?



See also http://www.ibm.com/developerworks/linux/library/l-posix1.html
and fathom why "Threads are fun" reads to me like how a drug dealer  
lures you to at least try it once.


To me, that reads far more like some of the advocacy I've seen for giving  
blood:
"If you're squeamish, get a friend to distract you, or listen to some good  
music whilst they put the needle in."




Rather fork-join!


For platforms where fork is native, it doesn't go away just because  
threads support is present.




(Do Perl_6 hyper-operators need pthreads?)



Buk.


Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 9 of 20)

2010-05-14 Thread nigelsandever

On Fri, 14 May 2010 15:05:44 +0100, B. Estrade estr...@gmail.com wrote:

On Fri, May 14, 2010 at 12:27:18PM +0100, nigelsande...@btconnect.com  
wrote:
On Fri, 14 May 2010 10:01:41 +0100, Ruud H.G. van Tol -  
rv...@isolution.nl

+nntp+browseruk+014f2ed3f9.rvtol#isolution...@spamgourmet.com wrote:



The support of threading should be completely optional. The threading
support should not be active by default.

I'd like to understand why you say that?

Two reasons I can think of:

1: Performance. The perception that adding support for threading will
impact the performance of non-threaded applications.


I think that perhaps he's thinking of overhead associated with
spawning and managing threads - even just one...so, if only 1 thread
bound to a single core is desired, then I think this is a reasonable
and natural thing to want. Maybe the core binding on an SMP box would
be the more challenging issue to tackle. Then, again, this is the role
of the OS and libnuma (on Linux, anyway)...



Hm. Every process gets one thread by default. There is no overhead there.

And spawning 1000 (do nothing but sleep) threads takes 0.171 seconds?

Buk.


Re: Ideas for a Object-Belongs-to-Thread threading model (nntp: message 5 of 20)

2010-05-13 Thread nigelsandever
This should be a reply to Daniel Ruoso's post above, but I cannot persuade  
my nntp reader to reply to a post made before I subscribed here. Sorry.

On Wed, 12 May 2010 14:16:35 +0100, Daniel Ruoso dan...@ruoso.com wrote:

I have 3 main problems with your thinking.

1: You are conflating two fundamentally different views of the problem.
  a) The Perl 6 programmer's semantic view.
  b) The P6 compiler writer's implementation view.

These two views need to be kept cleanly separated, so that the reference
implementation does not define the *only possible* implementation.

But, it is important that the semantic view is designed with considerable
regard for what /can/ be implemented.

2: You appear to be taking your references at face value.

For example, you've cited Erlang as one of your reference points. And the
Erlang docs describe the units of concurrency as processes, with the
parallelism provided by Erlang and not the host operating system.

But, if I run one of the Erlang examples,
http://www.erlang.org/examples/small_examples/tetris.erl
it uses two processes: one with 13 OS threads and the other with 5 OS
threads; even if I only run tetris:start(1).

Until recently, Erlang did not use OS threads, relying instead upon an
internal, user-space scheduler--green threading, though you may find some
denials of that by Erlangers because of the unfavorable comparison with
the Java green threading of Java versions 1 through 4.

But recent versions have implemented multiple OS threads, each running a
coroutine scheduler. They had to do this in order to achieve SMP scaling.
Here is a little salient information:


The Erlang VM without SMP support has 1 scheduler which runs in the
main process thread. The scheduler picks runnable Erlang processes and
IO-jobs from the run-queue and there is no need to lock data structures
since there is only one thread accessing them.

The Erlang VM with SMP support can have 1 to many schedulers which are
run in 1 thread each. The schedulers pick runnable Erlang processes
and IO-jobs from one common run-queue. In the SMP VM all shared data
structures are protected with locks, the run-queue is one example of a
data structure protected with locks.

Lock-free at the semantic level is a nice-to-have. But, whenever you have  
kernel threads talking to each other through shared memory--and you have  
to have that if you are going to achieve SMP scalability--then there will  
be some form of locking required.

All talk of message-passing protocols is simply disguising the realities  
of the implementation. That is not a bad thing from the applications  
programmer's point of view--nor even the language designer's POV--but it  
still leaves the problem to be dealt with by the language and system  
implementers.


Whilst lock-free queues are possible--there are implementations of these  
available for Java 5 (which, of necessity, and to great effect, has now  
moved away from green threads and gone the kernel threading route)--they  
are very, very hardware dependent. They rely upon CAS, which not all  
processor architectures support and not all languages give adequate  
access to.


For a very interesting, if rather long, insight into some of this, see  
Cliff Click's video about fast wait-free hashtables:

http://www.youtube.com/watch?v=WYXgtXWejRM&feature=player_embedded#!

One thing to note, if you watch it all the way through, is that your claim  
(in an earlier revision?) that shared memory doesn't scale is incorrect in  
the light of this video, where 786 SMP processors are using a hash for  
caching at very high speed.

3: By conflating the POVs of the semantic design and the implementation,  
you are in danger of reinventing several bad wheels, badly.

a) A green-threading scheduler:

The Java guys spent a long time trying to get theirs right before  
abandoning it.

The Erlang guys have taken a long time tuning theirs but, due to Moore's  
Law running out, have had to bow to the inevitability of kernel threading.  
And they are now having to go through the pain of understanding and  
addressing how multiple event-driven and cooperative schedulers, running  
under the control of (various) preemptive scheduler(s), interact.


Even Haskell has to use kernel threading:

In GHC, threads created by forkIO are lightweight threads, and are  
managed entirely by the GHC runtime. Typically Haskell threads are an  
order of magnitude or two more efficient (in terms of both time and  
space) than operating system threads.

The downside of having lightweight threads is that only one can run at a  
time, so if one thread blocks in a foreign call, for example, the other  
threads cannot continue. The GHC runtime works around this by making use  
of full OS threads where necessary. When the program is built with the  
-threaded option (to link against the