Re: [Haskell-cafe] Dynamic thread management?

2007-09-17 Thread Hugh Perkins
Couple of thoughts/observations:

- Erlang has a vm, so that would avoid building a vm.

On the downside, Erlang is not pure: message passing and the io:
commands imply the possibility of side effects.

Still, it could be good enough for a proof-of-concept?

- implementation as a library function?

Instead of building/modifying a VM to re-implement how map works,
perhaps an easy way of doing this would be to create a new map
function (dpmap = dynamic parallel map?), which talks to a scheduler
process that decides whether or not to split the map.

Downside: constant overhead for all maps, even really small ones that
would otherwise run really fast, and will ultimately be assigned only
a single process anyway.

(note: using erlang terminology here, where process means
essentially thread).
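
For illustration, here is a rough static sketch of such a dpmap using
GHC's parallel package (the chunk-size parameter is a stand-in for the
scheduler's decision; a real implementation would consult the scheduler
process at runtime, and dpmap itself is hypothetical):

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Hypothetical dpmap: stay sequential for short lists, where the
-- overhead would dominate; otherwise evaluate fixed-size chunks of
-- the result list in parallel.
dpmap :: NFData b => Int -> (a -> b) -> [a] -> [b]
dpmap chunkSize f xs
  | length xs <= chunkSize = map f xs
  | otherwise              = map f xs `using` parListChunk chunkSize rdeepseq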

Of course, even modifying the vm to parallelize maps etc would have a
constant overhead, but possibly the overhead could be much smaller if
implemented directly in a vm?

Possibly an approach could be:
- implement it as a new dpmap library function
- someone could optimize this in a vm later
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-22 Thread Neil Bartlett
 Ok, that's good encouragement.  Practical question: I don't have a
 multicore box ;-)  It's my main show-stopper right now.  Any clues on
 how to get access to one, eg via ssh?  32-core or higher would be
 favorite ;-) but I guess even just a 4-core or so is enough for
 proof-of-concept?

I think you'll have plenty of work to do before you get to the stage
of needing a box with more than 4 cores. Even a dual-core machine is
likely to be enough for initial experimentation, I would say.

 ... or build a minimal vm, enough to get 3 or 4 somewhat interesting
 algorithms / programs to run, and get automatic threading working on a
 couple of targets, eg on maps, and whatever [ x | x <- somelist ]
 these things are called.  (folds are a little harder from an
 implementation point of view, so can be a future upgrade).

The other thing to consider is that there are several other VMs out
there, including many under open source licenses, that can be used as
a testbed. Examples include the Java VM, Mono, Parrot, LLVM, etc.

 Would you or Neil fancy being a mentor for this, if I can start to get
 somewhere on it?

Not me! I'm not an academic... I've never had a paper published and
I'm not likely to either.

Regards
Neil
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-22 Thread Brandon Michael Moore
On Wed, Aug 22, 2007 at 04:07:22AM +0100, Hugh Perkins wrote:
 On 8/21/07, Andrew Coppin [EMAIL PROTECTED] wrote:
  I highly doubt that automatic threading will happen any time this decade
  - but I just learned something worth while from reading this email. ;-)
 
 That's an interesting observation.  I can't say I don't believe it, but
 I'm interested to know why (but it could be just a feeling, or an
 observation about time-to-market lead times?).  Are you saying this
 because multicores aren't sufficiently widespread or powerful enough
 yet (4 cores doesn't really even make up for the overhead of using
 automatic threading, at least in initial implementations)? Or are you
 saying this because you think the technical implementations are not
 sufficiently advanced?

Automatic threading is inherently limited by data dependencies.
You can't run a function that branches on an argument in parallel
with the computation producing that argument. Even with arbitrarily
many processors and no communication overhead there is a limit to
how much parallelism you can squeeze from any given program.
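
A tiny sketch of the point, assuming GHC's Control.Parallel (the
producer and consumer here are illustrative):

import Control.Parallel (par)

-- The consumer branches on the producer's result, so sparking the
-- producer buys nothing: no branch can be chosen until x is evaluated,
-- and there is no other work to overlap with.
consumer :: Int -> Int
consumer n = if even n then n `div` 2 else 3 * n + 1

noOverlap :: Int -> Int
noOverlap arg = x `par` consumer x
  where x = sum [1 .. arg]  -- stand-in for an expensive producer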

You should read
Feedback Directed Implicit Parallelism
http://research.microsoft.com/~tharris/papers/2007-fdip.pdf
and perhaps
Limits to Implicit Parallelism in Functional Applications
http://www.detreville.org/papers/Limits.pdf

In short, with zero overhead and an oracle for scheduling you can
get a factor of at most 2x to 32x by implicitly parallelizing
existing Haskell code. In practice, with execution overhead it's a
gain of perhaps 10% to 80% on a 4-core system. The experiments in
the first paper are based on a fairly sophisticated implementation
that reduces overhead by using profiling results at compile time
to decide which thunks might be worth evaluating in parallel. For a
fixed benchmark there's probably not much lost by using canned
profiling results instead of adapting at runtime, and in any case
the hard bounds from data dependencies still apply.

You can do a lot better if you expect people to rewrite code,
but automatic threading suggests something completely effortless.
I think you can get much better results if you work on the programming
style in connection with a better runtime.  You can think of data parallel
Haskell as a new programming style with more implicit parallelism,
and the runtime support to exploit it.
 
 I kind of think automatic threading is like 3d graphics: as soon as the
 hardware became sufficiently powerful, 3d graphics became trivial.
 Enough money was thrown at the problem in a very short time by a few
 powerful companies that it was a non-issue.

If you have cores to waste, you might try rewrites like

f x = case x of
  C1 a1 a2 -> f (C1 a1 a2)
  C2 b     -> f (C2 b)
  C3       -> f C3

and then speculatively execute several of the case branches.
If you don't throw away too much type information you should
even be able to do it at runtime.
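
For instance, a hand-rolled sketch of that idea with Control.Parallel
(the branch bodies are illustrative, and must be total for the
speculation to be safe):

import Control.Parallel (par)

heavyA, heavyB :: Int -> Int
heavyA n = sum [1 .. n]
heavyB n = sum (map (* 2) [1 .. n])

-- Spark both branch bodies before the scrutinee is inspected, then
-- return whichever one the conditional actually selects; the losing
-- spark's work is simply wasted.
speculate :: Bool -> Int -> Int
speculate b n = a `par` (c `par` (if b then a else c))
  where
    a = heavyA n
    c = heavyB n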

Brandon
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-22 Thread ok

On 8/21/07, Andrew Coppin [EMAIL PROTECTED] wrote:
I highly doubt that automatic threading will happen any time this decade


Hm.  I happen to have an old Sun C compiler on my old UltraSPARC box.
cc -V reports: Sun Workshop 6 update 2 C 5.3 2001/05/15.
One of its options is '-xautopar'.  I'll let you guess what that does...

I note that a company called Tilera claim to be shipping a
64 CPU (some flavour of MIPS, I gather) chip.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-22 Thread Hugh Perkins
On 8/22/07, Brandon Michael Moore [EMAIL PROTECTED] wrote:
 Automatic threading is inherently limited by data dependencies.
 You can't run a function that branches on an argument in parallel
 with the computation producing that argument. Even with arbitrarily
 many processors and no communication overhead there is a limit to
 how much parallelism you can squeeze from any given program.

Yes.  It's worse than that in fact, because many real-world problems
will involve functions that have side-effects, but we know the
side-effects don't matter.  The parallelisation algo of course doesn't
know they don't matter (unless we tell it).

Example: imagine we want to copy files from one machine to five
others.  Copying a file has a clear side-effect, but since we're
copying to 5 independent machines, we can copy to each machine in
parallel.  The algo doesn't know that this is ok.
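
A sketch of that example, assuming the async package and an scp binary
on the path (copyToHosts and the host list are illustrative):

import Control.Concurrent.Async (mapConcurrently_)
import System.Process (callProcess)

-- Copy one file to several machines in parallel.  This is only safe
-- under the assumption that the five targets really are independent.
copyToHosts :: FilePath -> [String] -> IO ()
copyToHosts file = mapConcurrently_ copyTo
  where
    copyTo host = callProcess "scp" [file, host ++ ":" ++ file]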


 You should read
 Feedback Directed Implicit Parallelism
 http://research.microsoft.com/~tharris/papers/2007-fdip.pdf
 and perhaps
 Limits to Implicit Parallelism in Functional Applications
 http://www.detreville.org/papers/Limits.pdf

Ok

 In short, with zero overhead and an oracle for scheduling you can
 get a factor of at most 2x to 32x by implicitly parallelizing
 existing Haskell code. In practice, with execution overhead it's a
 gain of perhaps 10% to 80% on a 4-core system.

This is a good argument that it's not enough to prototype on a 4 core
system, but we really need some way to simulate a 1024 core system to
carry out meaningful benchmarks.


 You can do a lot better if you expect people to rewrite code,
 but automatic threading suggests something completely effortless.

Yes, I tend to overuse the word automatic ;-)

 I think you can get much better results if you work on the programming
 style in connection with a better runtime.  You can think of data parallel
 Haskell as a new programming style with more implicit parallelims,
 and the runtime support to exploit it.

Yes, you're right.

 If you have cores to waste, you might try rewrites like

 f x = case x of
   C1 a1 a2 -> f (C1 a1 a2)
   C2 b     -> f (C2 b)
   C3       -> f C3

 and then speculatively execute several of the case branches.
 If you don't throw away too much type information you should
 even be able to do it at runtime.

That's a good observation.  Sort of anti-laziness :-D
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-22 Thread Stefan O'Rear
On Thu, Aug 23, 2007 at 06:27:43AM +0100, Hugh Perkins wrote:
 On 8/22/07, Brandon Michael Moore [EMAIL PROTECTED] wrote:
  Automatic threading is inherently limited by data dependencies.
  You can't run a function that branches on an argument in parallel
  with the computation producing that argument. Even with arbitrarily
  many processors and no communication overhead there is a limit to
  how much parallelism you can squeeze from any given program.
 
 Yes.  It's worse than that in fact, because many real-world problems
 will involve functions that have side-effects, but we know the
 side-effects don't matter.  The parallelisation algo of course doesn't
 know they don't matter (unless we tell it).
 
 Example: imagine we want to copy files from one machine to five
 others.  Copying a file has a clear side-effect, but since we're
 copying to 5 independent machines, we can copy to each machine in
 parallel.  The algo doesn't know that this is ok.

Actually, the algorithm was right, and the intuitive argument is wrong.
Suppose all five computers are actually re-exporting a network filesystem
hosted on a sixth server, and the file names alias.  By overruling the
algorithm, you've introduced a corner-case bug that will go unnoticed
for years.

  If you have cores to waste, you might try rewrites like
 
  f x = case x of
    C1 a1 a2 -> f (C1 a1 a2)
    C2 b     -> f (C2 b)
    C3       -> f C3
 
  and then speculatively execute several of the case branches.
  If you don't throw away too much type information you should
  even be able to do it at runtime.
 
 That's a good observation.  Sort of anti-laziness :-D

It's called speculative execution, and microprocessors have been doing
it for many years as part of their built-in automatic parallelization
circuitry.

(If you think emulator-based autoparallelization is going to be easy,
consider that it takes a couple million transistors to do a reasonable
job - and a chunk of 90's silicon can already do a million things at
once, so you're at a huge starting disadvantage.  Just my two cents.)

Stefan


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-21 Thread Hugh Perkins
On 8/11/07, Neil Bartlett [EMAIL PROTECTED] wrote:
 You're absolutely right that a dynamic/adaptive approach is the only
 one that will work when the tasks are of unknown size. Whether this
 approach is as easy as you think is open for you to prove. I look
 forward to testing your VM implementation,

Well... obviously migrating Haskell to use a VM is itself non-trivial
;-)  There are two obstacles:
- technical
- political

The technical obstacle means implementing it.  Given that high
performance VMs exist this is largely pure software engineering,
rather than research?

The political obstacle means: persuading people to use it if it were
written.  If no-one uses it, it won't be maintained, and is basically
pretty useless.  The main reasons why it might not be used are:
- breaks the status quo / de facto standard
- provides no advantage in a single-core environment

Breaking the status quo is not an inconsiderable obstacle, but it
would be broken if there was a real advantage of using automatic
threading, which there is not right now because most machines are
single-cored.  Whilst it's the right time to think about how to
implement things, it's maybe a year or two early to actually implement
it and expect people to use it.

What I think is:
- automatic threading is not really that hard.  Once you've got a pure
FP running in a VM, the actual automatic threading bit is pretty easy
(largely software engineering, not research)
- when machines become multicored, Microsoft will just take F# (which
already runs in a VM I guess? but not sure if it's an FP-dedicated VM,
they might need to build one), and just slot in the automatic
threading bit.

 or at the very least
 reading your paper on the subject ;-)

Writing a paper would be fun.  I think I'm a little out of my depth to
be writing a paper ;-) but just on the off-chance, how does one go
about writing a paper and getting it published?  Does one have to be a
member of an accredited institution, or can one write one as a
freelancer?  If one has to be a member of an accredited institution,
what are the options?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-21 Thread Tim Chevalier
On 8/21/07, Hugh Perkins [EMAIL PROTECTED] wrote:
 On 8/11/07, Neil Bartlett [EMAIL PROTECTED] wrote:
  You're absolutely right that a dynamic/adaptive approach is the only
  one that will work when the tasks are of unknown size. Whether this
  approach is as easy as you think is open for you to prove. I look
  forward to testing your VM implementation,

 Well... obviously migrating Haskell to use a VM is itself non-trivial
 ;-)  There are two obstacles:
 - technical
 - political

 The technical obstacle means implementing it.  Given that high
 performance VMs exist this is largely pure software engineering,
 rather than research?


GHCi, of course, is a bytecode interpreter, so that's sort of like a
VM. You might start by looking at how GHCi works and see what you
would need to change if performance rather than interactivity was your
goal.

 The political obstacle means: persuading people to use it if it were
 written.  If no-one uses it, it won't be maintained, and is basically
 pretty useless.  The main reasons why it might not be used are:
 - breaks the status quo / de facto standard
 - provides no advantage in a single-core environment

 Breaking the status quo is not an inconsiderable obstacle, but it
 would be broken if there was a real advantage of using automatic
 threading, which there is not right now because most machines are
 single-cored.  Whilst it's the right time to think about how to
 implement things, it's maybe a year or two early to actually implement
 it and expect people to use it.


I don't think you have to worry too much about the political
obstacles. People want automatic multithreading, and in a year or two
we'll all have multicore boxen. In any case, if you don't solve the
technical problems, the political ones will never surface; if you
build it, people will come, or if they don't, you at least know
something that you wouldn't have known if you didn't build it :-)

 Writing a paper would be fun.  I think I'm a little out of my depth to
 be writing a paper ;-) but just on the off-chance, how does one go
 about writing a paper and getting it published?  Does one have to be a
 member of an accredited institution, or can one write one as a
 freelancer?  If one has to be a member of an accredited institution,
 what are the options?

Anyone can submit a paper to a CS journal or conference. While most
people who do so are affiliated with universities, research labs, or
(more rarely) non-research companies, there are independent
researchers out there, and sometimes you'll notice a paper where
someone is listed by just their name with no affiliation. Conferences
issue calls for papers (you might see some posted on this mailing
list) that give you an idea for the rough format of the paper and
submission guidelines. But really, you'll want to find a mentor who
can give you advice on how to write a paper that will fit the mold.
First come up with a technical result that you believe is
paper-worthy, then find other people to talk to who can confirm that
opinion and help you get your paper submitted :-)

Cheers,
Tim


-- 
Tim Chevalier * catamorphism.org * Often in error, never in doubt
I cannot remember a time when I did not take it as understood that
everybody has at least two, if not twenty-two, sides to
him.--Robertson Davies
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-21 Thread Hugh Perkins
On 8/21/07, Tim Chevalier [EMAIL PROTECTED] wrote:
 I don't think you have to worry too much about the political
 obstacles. People want automatic multithreading, and in a year or two
 we'll all have multicore boxen. In any case, if you don't solve the
 technical problems, the political ones will never surface; if you
 build it, people will come, or if they don't, you at least know
 something that you wouldn't have known if you didn't build it :-)

Ok, that's good encouragement.  Practical question: I don't have a
multicore box ;-)  It's my main show-stopper right now.  Any clues on
how to get access to one, eg via ssh?  32-core or higher would be
favorite ;-) but I guess even just a 4-core or so is enough for
proof-of-concept?

 GHCi, of course, is a bytecode interpreter, so that's sort of like a
 VM. You might start by looking at how GHCi works and see what you
 would need to change if performance rather than interactivity was your
 goal.

Yes, I guess two approaches are to take GHCi and make it handle
automatic threading, but keeping the interactivity, ie not seeking to
rival ghc in real performance, but merely providing a PoC, ...

... or build a minimal vm, enough to get 3 or 4 somewhat interesting
algorithms / programs to run, and get automatic threading working on a
couple of targets, eg on maps, and whatever [ x | x <- somelist ]
these things are called.  (folds are a little harder from an
implementation point of view, so can be a future upgrade).

 Anyone can submit a paper to a CS journal or conference. While most
 people who do so are affiliated with universities, research labs, or
 (more rarely) non-research companies, there are independent
 researchers out there, and sometimes you'll notice a paper where
 someone is listed by just their name with no affiliation.

Again, that's quite encouraging :-)  I'm far too lazy to sign my life
away for 7 years of PhD :-D  (unless someone knows anyone looking for
PhD students in somewhere exotic like Paris, China or Singapore???),
but working on it in my own time sounds fun.

 First come up with a technical result that you believe is
 paper-worthy

I guess a paper-worthy technical result doesn't have to be a fully
fledged Haskell VM with in-depth automatic threading, but either GHCi
with some simple automatic threading, or a minimal vm implementation,
again with some simple automatic threading?

Basically, we give it one or three algorithms with automatic threading
switched off, time execution, then run it on a multicore machine
(ideally 32 or 64 cores, but 4 is ok), and show that the elapsed
execution time is less?
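
For the timing step, a crude harness might look like this (a sketch
assuming the time and deepseq packages; compile with -threaded and
compare +RTS -N1 against +RTS -N4):

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Wall-clock seconds needed to fully evaluate a value.
timeIt :: NFData a => a -> IO Double
timeIt x = do
  t0 <- getCurrentTime
  _  <- evaluate (force x)
  t1 <- getCurrentTime
  return (realToFrac (diffUTCTime t1 t0))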

 But really, you'll want to find a mentor who
 can give you advice on how to write a paper that will fit the mold.
 First come up with a technical result that you believe is
 paper-worthy, then find other people to talk to who can confirm that
 opinion and help you get your paper submitted :-)

Would you or Neil fancy being a mentor for this, if I can start to get
somewhere on it?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-21 Thread Andrew Coppin

Tim Chevalier wrote:

Anyone can submit a paper to a CS journal or conference. While most
people who do so are affiliated with universities, research labs, or
(more rarely) non-research companies, there are independent
researchers out there, and sometimes you'll notice a paper where
someone is listed by just their name with no affiliation. Conferences
issue calls for papers (you might see some posted on this mailing
list) that give you an idea for the rough format of the paper and
submission guidelines. But really, you'll want to find a mentor who
can give you advice on how to write a paper that will fit the mold.
First come up with a technical result that you believe is
paper-worthy, then find other people to talk to who can confirm that
opinion and help you get your paper submitted :-)
  


I highly doubt that automatic threading will happen any time this decade 
- but I just learned something worth while from reading this email. ;-)


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-21 Thread Hugh Perkins
On 8/21/07, Andrew Coppin [EMAIL PROTECTED] wrote:
 I highly doubt that automatic threading will happen any time this decade
 - but I just learned something worth while from reading this email. ;-)


That's an interesting observation.  I can't say I don't believe it, but
I'm interested to know why (but it could be just a feeling, or an
observation about time-to-market lead times?).  Are you saying this
because multicores aren't sufficiently widespread or powerful enough
yet (4 cores doesn't really even make up for the overhead of using
automatic threading, at least in initial implementations)? Or are you
saying this because you think the technical implementations are not
sufficiently advanced?

I kind of think automatic threading is like 3d graphics: as soon as the
hardware became sufficiently powerful, 3d graphics became trivial.
Enough money was thrown at the problem in a very short time by a few
powerful companies that it was a non-issue.

Nevertheless, if I could get a paper out of it before the big
companies notice, that could be fun :-D
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-13 Thread Mitar
Hi!

I am thinking about a model where you would have only n threads on a
n-core (or processor) machine. They would be your worker threads and
you would spawn them only once (at the beginning of the program) and
then just delegate work between them.
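
A minimal sketch of that model with plain Control.Concurrent (the pool
size and the job representation are illustrative assumptions):

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Monad (forever, replicateM_)

-- Spawn n workers once; afterwards work is only ever delegated to
-- them through the shared job channel.
startPool :: Int -> IO (IO () -> IO ())
startPool n = do
  jobs <- newChan
  replicateM_ n (forkIO (forever (do job <- readChan jobs; job)))
  return (writeChan jobs)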

On 8/13/07, Jan-Willem Maessen [EMAIL PROTECTED] wrote:
 The problem here is that while Cilk spawns are incredibly cheap,
 they're still more than a simple procedure call (2-10x as expensive
 if my fading memory serves me rightly).  Let's imagine we have a
 nice, parallelizable computation that we've expressed using recursive
 subdivision (the Cilk folks like to use matrix multiplication as an
 example here).  Near the leaves of that computation we still spend
 the majority of our time paying the overhead of spawning.  So we end
 up actually imposing a depth bound, and writing two versions of our
 computation---the one that spawns, which we use for coarse-grained
 computations, and the one that doesn't, which we use when computation
 gets fine-grained.  It makes a really big difference in practice.

But this could be done at runtime too. If the lazy, not-yet-evaluated
chunk is big, then divide it into a few parts and run each part in its
own thread. But if the chunk is small (you are at the end of the
evaluation and have already evaluated the necessary subexpressions),
you do it in the thread which encounters this situation (which is
probably near the end of the program or the end of the monadic IO
action).

And the threshold at which you choose whether or not to delegate could
be determined at runtime too.

In combination with some transactional memory, or some other trick
behind this delegation, this could probably be made to work.

We could also hint to the runtime that a function will probably take a
long time to compute (even if it is lazy) by making a type for such
functions which would signal this. Of course this could also make
things worse if used improperly. But sometimes you know that you will
be running a map of a time-consuming function.

Yes, you have parMap, but the problem I saw with it (and please correct
me if I am wrong) is that it spawns a new thread for every application
of the function to an element. But what if functions are small? Then
this is quite an overhead. And you have to know this in advance if you
want to use something other than the default parMap, which is not always
possible (if we are passing a function as an argument to the function
which calls map).

For example:

calculate f xs = foldl (+) 0 $ map f xs -- or foldr, I am not sure
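
With the strategies vocabulary, the granularity decision can at least
become a parameter instead of one spark per element (a sketch assuming
the parallel package; calculatePar and chunk are illustrative):

import Control.Parallel.Strategies (parListChunk, rseq, withStrategy)

-- Same fold as above, but the mapped results are evaluated in
-- parallel, a chunk at a time, rather than one spark per element.
calculatePar :: Int -> (a -> Int) -> [a] -> Int
calculatePar chunk f xs =
  foldl (+) 0 (withStrategy (parListChunk chunk rseq) (map f xs))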

And I would like to see something like this: it gets to the point where
we need to evaluate this function call, for some big f and some big
list xs, so the thread which gets to it starts evaluating the first
value, and when it starts on another (it is a recursive call, so it is
a similar evaluation) it sees that the other thread is idle and that
the function would probably take a long time (it speculates), so it
delegates that element there and continues with the third element,
which is a dummy recursive call to the function, in this case foldl
(dummy because it did not really evaluate everything at the previous
level). Now, if the second thread finishes first, it goes to the next
element (recursive call) but sees that it is already (still) being
evaluated, so it moves on to the fourth. Otherwise, if the first thread
finishes first, it just goes to the next element.

This would be some kind of speculative evaluation. If CPUs know how to
do that, why couldn't we at a higher level?

It would also be an interesting lazy evaluation of (in this example)
the foldl function. The system would make a recursive call but would
just go another call deeper if it finds that it is impossible (because
it is still being evaluated somewhere else) to evaluate it. And at
every level it finishes, it would check previous levels to see if it
can collapse them (and maybe prematurely finish the whole thing).

It would be as if I unwind the loop (in this case, the recursion),
evaluate everything (on many threads) and then (try to) calculate the
value. If it finishes prematurely ... sad, but for big lists and big
functions it would be a saver. And the size of this window (the
unwinding) would be a heuristic, speculative function of the number of
possible threads (cores, processors) and of how long previous
evaluations of the same function have taken.

I really messed this explanation up. Or maybe it is completely wrong
and this is why it looks like a mess to me. :-)


Mitar
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-13 Thread Jan-Willem Maessen


On Aug 13, 2007, at 2:53 PM, Mitar wrote:


Hi!

I am thinking about a model where you would have only n threads on a
n-core (or processor) machine. They would be your worker threads and
you would spawn them only once (at the beginning of the program) and
then just delegate work between them.

On 8/13/07, Jan-Willem Maessen [EMAIL PROTECTED] wrote:

The problem here is that while Cilk spawns are incredibly cheap,
they're still more than a simple procedure call (2-10x as expensive
if my fading memory serves me rightly).  Let's imagine we have a
nice, parallelizable computation that we've expressed using recursive
subdivision (the Cilk folks like to use matrix multiplication as an
example here).  Near the leaves of that computation we still spend
the majority of our time paying the overhead of spawning.  So we end
up actually imposing a depth bound, and writing two versions of our
computation---the one that spawns, which we use for coarse-grained
computations, and the one that doesn't, which we use when computation
gets fine-grained.  It makes a really big difference in practice.


But this could be done at runtime too. If the lazy, not-yet-evaluated
chunk is big, then divide it into a few parts and run each part in its
own thread. But if the chunk is small (you are at the end of the
evaluation and have already evaluated the necessary subexpressions),
you do it in the thread which encounters this situation (which is
probably near the end of the program or the end of the monadic IO
action).


I didn't make my point very well.  The hard part is determining  
exactly when a chunk is big or small without actually computing  
its value.  Consider recursive fib (which I use only because it's  
easy to understand, and has been used as an example of this problem  
by the Cilk implementors):


fib n = if n <= 1 then n else fib (n-1) + fib (n-2)

Imagine we're computing (fib 30).  We can divide and conquer; plenty
of parallelism there!  But do most calls to fib represent enough work
to justify running them in parallel?  No, because most of the calls
are to (fib 0) or (fib 1)!  We should only pay the spawn cost down to a
certain bound --- say n >= 5 --- and then run serially for smaller
n.  This has a dramatic effect on how fast fib runs, but of course
the best possible choice of bound is going to be machine-dependent.
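
For reference, the standard workaround looks like this: a depth-bounded
version assuming Control.Parallel, with the bound chosen by hand:

import Control.Parallel (par, pseq)

-- Spawn only while the argument is big enough to amortise the spark
-- overhead; below the (machine-dependent) bound, run plain serial code.
pfib :: Int -> Int
pfib n
  | n <= 1    = n
  | n < bound = sfib n
  | otherwise = l `par` (r `pseq` l + r)
  where
    bound = 25
    l = pfib (n - 1)
    r = pfib (n - 2)
    sfib m = if m <= 1 then m else sfib (m - 1) + sfib (m - 2)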


We can instrument our program and have some chance of doing OK for
fib :: Int -> Int; it's not at all obvious what to do for:

   myFunction :: [Int] -> (Int -> Bool) -> [Frob]

In effect, I end up needing to write myFunction with a built-in bound
on computation, and I need to do it in such a way that the underlying
system knows that one branch should be serial and the other branch
parallel.  This is the problem I was attempting to allude to above.
Yes, we can decide on a function-by-function or callsite-by-callsite
basis that enough work is being done to justify parallelism.  But
the answer to this question is often maybe, or even no (as in fib).



Yes, you have parMap, but the problem I saw with it (and please correct
me if I am wrong) is that it spawns a new thread for every application
of the function to an element.



But what if functions are small? Then
this is quite an overhead. And you have to know this in advance if you
want to use something other than the default parMap, which is not always
possible (if we are passing a function as an argument to the function
which calls map).

For example:

calculate f xs = foldl (+) 0 $ map f xs -- or foldr, I am not sure


You seem to be arguing that we can pick the right implementation of  
map and fold (serial or parallel) here if we only knew that xs  
was big enough and f expensive enough.


I agree.  But that begs the question: let's assume calculate is a
function that's called from numerous places, with a mixture of big
and small arguments.  Now I need two versions of calculate, and I
need to decide at each call site whether to call big calculate or
small calculate.  We also need to make sure any eagerness we
introduce is semantically sound, too, but I think we've got a pretty
good handle on that part in practice between my work on
resource-bounded evaluation in Eager Haskell, Rob Ennals' work on
eager evaluation in GHC, and the Singh & Harris paper.


That isn't to say that any of this is impossible---but it's going to  
take a while to figure out how to get it right [it'd be a great Ph.D.  
project to even knock off the low-hanging fruit like fib and  
recursive matrix multiply].


Meanwhile, we also need to work hard educating programmers how to  
write code that'll run in parallel in the first place, and giving  
them the libraries that'll make it easy.


-Jan-Willem Maessen



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-13 Thread Jan-Willem Maessen


On Aug 11, 2007, at 12:35 PM, Brian Hurt wrote:



You guys might also want to take a look at the Cilk programming  
language, and how it managed threads.  If you know C, learning Cilk  
is about 2 hours of work, as it's C with half a dozen extra  
keywords and a few new concepts.  I'd love to see Cilk - C +  
Haskell as a programming language.


It was called pH, and we (meaning Alejandro Caro and myself)  
implemented it back in the mid/late 90's using Lennart Augustsson's  
hbcc front end (which he hacked to add a bunch of pH-specific  
syntax).  Arvind and Nikhil wrote a textbook on pH programming.


There are two problems, still: one is that laziness means you can't  
actually prove you need something until very close to the time you  
actually want it.  By the time I know that I'm adding f x to g y,  
it's probably too late to usefully run them in parallel (unless  
they're both *very* large).  We used eager evaluation in pH---to the  
point that we actually gave up the ability to manipulate infinite  
lazy data structures.  In NDP they've done much the same thing, first  
instrumenting the program to see that the eagerness they introduce  
won't disrupt execution.  Even the par annotation has this feel: we  
are telling the implementation that it's OK to do some computation  
even if it isn't yet obvious that we'll need the results.


The second problem is controlling the overhead.  More on this below.

The key idea of Cilk is that it's easier to deparallelize than it  
is to parallelize, especially automatically.  So the idea is that  
the program is written incredibly parallel, with huge numbers of  
microthreads, which are (on average) very cheap to spawn.  The  
runtime then deparallelizes the microthreads, running multiple  
microthreads sequentially within a single real thread (a worker  
thread).  Microthreads that live their entire life within a single  
real thread are cheap to spawn (as in not much more expensive than  
a normal function call cheap).


The problem here is that while Cilk spawns are incredibly cheap,  
they're still more than a simple procedure call (2-10x as expensive  
if my fading memory serves me rightly).  Let's imagine we have a  
nice, parallelizable computation that we've expressed using recursive  
subdivision (the Cilk folks like to use matrix multiplication as an  
example here).  Near the leaves of that computation we still spend  
the majority of our time paying the overhead of spawning.  So we end  
up actually imposing a depth bound, and writing two versions of our  
computation---the one that spawns, which we use for coarse-grained  
computations, and the one that doesn't, which we use when computation  
gets fine-grained.  It makes a really big difference in practice.


The programmer is free to use this trick in any programming  
language.  But we haven't yet figured out a way to *avoid* the need  
to do so.  This continues to worry me to this day, because making the  
right choices is black magic and specific to a particular combination  
of algorithm and machine.


That said, there is some reason for optimism: the overhead of  
creating work in Cilk is comparable to the overhead of creating a  
thunk in Haskell.


The problem that Cilk runs into is that it's, well, C.  It doesn't
deal with contention well at all- a single microthread blocking
blocks the whole worker thread- meaning, among other things, that
you can have false deadlocks, where one microthread blocks on
another microthread in the same real thread, and thus acts like
it's deadlocked even though it really isn't.


This is actually a fundamental problem with the threading model:  
there is no guarantee of fairness using work stealing, so if you do  
something that requires fair scheduling you get into a lot of trouble  
fast.  It's not fair to blame C for this.  You have to be very  
careful to define the interaction between fair IO-style threads and  
unfair get-my-work-done threads.


You have greatly increased the likelihood of race conditions as well
(mutable data and multithreading just don't mix).  Plus you have
all the normal fun you have with C bugs- wild pointers, buffer
overruns, etc.


This, however, *is* C's fault. :-)

More on pH: we got our programs to scale, but had troubles going past  
8 threads.  We found ourselves limited by a non-parallel GC (my  
fault, but labor-intensive to get right) and the absence of  
parallelism in the underlying algorithms.  For the latter problem  
there simply is no magic bullet.


-Jan-Willem Maessen



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-11 Thread Neil Bartlett
Hugh,

I certainly think it would be wrong to declare that NDP is doomed to
failure... not because you would be making an enemy of SPJ (I'm pretty
sure you wouldn't!) but because it actually aims to solve a less
ambitious problem: the problem of parallelising the SAME task applied
to different data, rather than a collection of arbitrary tasks.
Because the task does not change, we know that e.g. taking half the
data cuts the amount of work in half. Therefore an up-front scheduling
approach can work.

If you fast forward to about 42 minutes into the London HUG video, you
see that Simon talks about the problem of parallelizing (f x) + (g y),
and says that he spent quite a lot of time in the eighties trying to
make this game go fast [..] and the result was pretty much abject
failure.
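
(For concreteness, a minimal sketch of that building block with par and
pseq, assuming GHC's Control.Parallel; any win depends entirely on f x
being expensive enough:)

import Control.Parallel (par, pseq)

-- Evaluate f x on a spark while this thread computes g y, then add.
parAdd :: (Int -> Int) -> (Int -> Int) -> Int -> Int -> Int
parAdd f g x y = fx `par` (gy `pseq` (fx + gy))
  where
    fx = f x
    gy = g y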

You're absolutely right that a dynamic/adaptive approach is the only
one that will work when the tasks are of unknown size. Whether this
approach is as easy as you think is open for you to prove. I look
forward to testing your VM implementation, or at the very least
reading your paper on the subject ;-)

Regards,
Neil


On 8/11/07, Hugh Perkins [EMAIL PROTECTED] wrote:
 On 8/11/07, Thomas Conway [EMAIL PROTECTED] wrote:
  There are many papers about this in the Parallel Logic Programming
  area. It is commonly called Embarrassing Parallelism.

 Ah, I wasn't very precise ;-)  I didn't mean I don't understand the
 problem; I meant I don't understand why people think it is difficult to
 solve ;-)  (And then I tried to explain by examples why it is easy,
 but it is true that my communication sucks ;-) )

  you'll only get a benefit if you can break xs into chunks
 of e.g. 10^3 elements or more, and more like 10^5 or more for more
 usual 'f's.

 Actually, the number of elements is irrelevant.  The only measure that
 is important is how long the function is taking to execute, and
 whether it is parallelizable.

 Example, the following only has 3 elements but will take a while to run:

 print $ sum $ map getNumberPrimes [1024, 2048, 4096]

 The following has 10 million elements but runs quickly:

 print $ sum $ map (+1) [1..10000000]

 In the second, we start the function running, in a single thread, and
 after a second, the function has already finished, so great! We're
 done!

 In the first, we start the function running, in a single thread.
 After a second the function is still running, so we look at what is
 taking the time and whether it is parallelizable.

 Turns out the vm has been chugging away on the map for the last
 second, and that maps are parallelizable, so we split the map
 into Math.Min( numberElements, numberCores ) pieces, which on our
 64-core machine, given we have 3 elements, is Math.Min( 64, 3 ),
 which is 3.

 So, we assign each of the 3 elements of the map to a thread.  So, now
 we're using 3 of our 64 cores.

 A second later, each thread is still chugging away at its element, so
 we think, ok, maybe we can parallelize more, because we still have 61
 threads/cores available, so now we look at the getNumberPrimes
 function itself, and continue the above type of analysis.

 This continues until all 64 cores have been assigned some work to do.

 Whenever a thread finishes, we look around for some work for it to do.
  If there is some, we assign it.  If there isn't, we look around for an
 existing thread that is doing something parallelizable, and split it
 into two pieces, giving one to the old thread, and one to the
 available thread.

 Not sure why this is perceived as difficult (added the word
 perceived this time ;-) ).  I think the main barrier to
 understanding why it is easy is understanding that this needs to be
 done from a VM, at runtime.  It is not possible to do it statically at
 compilation, but I really need to finish watching SPJ's video before
 declaring that SPJ's proposal is doomed to fail ;-)  Not least,
 probably not good to make an enemy of SPJ ;-)
 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-11 Thread Brian Hurt


You guys might also want to take a look at the Cilk programming language, 
and how it managed threads.  If you know C, learning Cilk is about 2 hours 
of work, as it's C with half a dozen extra keywords and a few new 
concepts.  I'd love to see Cilk - C + Haskell as a programming language.


The key idea of Cilk is that it's easier to deparallelize than it is to 
parallelize, especially automatically.  So the idea is that the program is 
written incredibly parallel, with huge numbers of microthreads, which are 
(on average) very cheap to spawn.  The runtime then deparallelizes the 
microthreads, running multiple microthreads sequentially within a single 
real thread (a worker thread).  Microthreads that live their entire life 
within a single real thread are cheap to spawn (as in not much more 
expensive than a normal function call cheap).


The problem that Cilk runs into is that it's, well, C.  It doesn't deal
with contention well at all- a single microthread blocking blocks the
whole worker thread- meaning, among other things, that you can have false
deadlocks, where one microthread blocks on another microthread in the
same real thread, and thus acts like it's deadlocked even though it really
isn't.  You have greatly increased the likelihood of race conditions as
well (mutable data and multithreading just don't mix).  Plus you have all
the normal fun you have with C bugs- wild pointers, buffer overruns, etc.


All of which you avoid if you replace the C aspects of Cilk with Haskell. 
With STM you avoid the deadlock problem entirely, and with immutable data 
(except for carefully coded monads) you avoid the whole race condition 
problem.  Plus all the normal advantages Haskell has over C.


Just my $0.02.

Brian


Re: [Haskell-cafe] Dynamic thread management?

2007-08-11 Thread Andrew Coppin

Brian Hurt wrote:
The key idea of Cilk is that it's easier to deparallelize than it is 
to parallelize, especially automatically.  So the idea is that the 
program is written incredibly parallel, with huge numbers of 
microthreads, which are (on average) very cheap to spawn.  The runtime 
then deparallelizes the microthreads, running multiple microthreads 
sequentially within a single real thread (a worker thread).  
Microthreads that live their entire life within a single real thread 
are cheap to spawn (as in not much more expensive than a normal 
function call cheap).


That's so absurd, it might even work!

I quite like the concept though - rather than asking "what can we make
parallel?", you say "what should we do serially?" That sounds like an
easier question to answer...


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-11 Thread Sebastian Sylvan
On 11/08/07, Brian Hurt [EMAIL PROTECTED] wrote:

 You guys might also want to take a look at the Cilk programming language,
 and how it managed threads.  If you know C, learning Cilk is about 2 hours
 of work, as it's C with half a dozen extra keywords and a few new
 concepts.  I'd love to see Cilk - C + Haskell as a programming language.

 The key idea of Cilk is that it's easier to deparallelize than it is to
 parallelize, especially automatically.  So the idea is that the program is
 written incredibly parallel, with huge numbers of microthreads, which are
 (on average) very cheap to spawn.  The runtime then deparallelizes the
 microthreads, running multiple microthreads sequentially within a single
 real thread (a worker thread).  Microthreads that live their entire life
 within a single real thread are cheap to spawn (as in not much more
 expensive than a normal function call cheap).

 The problem that Cilk runs into is that it's, well, C.  It doesn't deal
 with contention well at all- a single microthread blocking blocks the
 whole worker thread- meaning, among other things, that you can have false
 deadlocks, where one microthread blocks on another microthread in the
 same real thread, and thus acts like it's deadlocked even though it really
 isn't.  You have greatly increased the likelihood of race conditions as
 well (mutable data and multithreading just don't mix).  Plus you have all
 the normal fun you have with C bugs- wild pointers, buffer overruns, etc.

 All of which you avoid if you replace the C aspects of Cilk with Haskell.
 With STM you avoid the deadlock problem entirely, and with immutable data
 (except for carefully coded monads) you avoid the whole race condition
 problem.  Plus all the normal advantages Haskell has over C.


How is this any better than using par in Haskell?

-- 
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-11 Thread Brian Hurt



On Sat, 11 Aug 2007, Sebastian Sylvan wrote:


How is this any better than using par in Haskell?


Mainly how the threads are actually scheduled.  Mind you, I'm an 
*incredible* Haskell newbie, so take all of my comments with a 5-pound 
salt block, but as I understand how the current implementation of parallel 
Haskell works, all threads spawned go into a communal heap.  When a thread 
blocks, it's put back into the communal queue, and a new thread is 
selected to run by the worker thread.


In Cilk, the task structure is more tree-like.  When thread A spawns 
thread B, the worker thread stops executing thread A and starts executing 
thread B.  When thread B exits, the worker thread then resumes thread A. 
So in any given point in time, the worker thread has a stack of processes 
waiting for subthreads to complete so they can be resumed- not unlike 
function calls in other languages, or nested lazy blocks in Haskell.


When a worker thread runs out of work to do, when it has no more threads 
to execute, it picks another worker thread at random, and asks the other 
worker thread for work to do.  The other worker thread (assuming it has 
work to do) then picks the microthread at the bottom of the resume stack, 
that is farthest away from being resumed, and passes that microthread's 
state to the original worker thread.


From the user program's perspective, this is no different from the current 
implementation.  Threads get spawned, get executed in some unspecified 
order, and then complete.


What's interesting (to me, at least) are the optimization opportunities 
this model provides that the shared-queue model doesn't.  Consider the 
following structural model: we have a two-tiered heap.  Each worker thread 
has its own local heap, which only microthreads it is managing can see.
Plus there is a global heap with objects that all microthreads in all
worker threads can see.  Objects are originally allocated in the local
heap.  When a microthread starts being executed on another worker
thread, all objects it can see (reach, in the conservative GC sense) are
promoted to the global heap.


The advantage of all of this is that now we can easily distinguish between 
objects that can be seen from other real threads, and those that can't. 
All of the mutable data structures- tvars, lazy thunks, everything- can be 
much more cheaply implemented if you know that no other thread can access 
them.


Take the example of a parallel map, where I'm using a tvar in my map
function.  The likelihood is that all of the (short-lived) microthreads
I'm spawning will execute on the same worker thread- and that thus the
tvar will never leave the local heap, and thus can be optimized down to
something close to a single-threaded mvar.  I have, via optimization,
turned a parallel map and a synchronized tvar back into something
approaching a serial map and an mvar.


Brian

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Dynamic thread management?

2007-08-10 Thread Andrew Coppin

Hugh Perkins wrote:


- parallelism must be quite coarse to offset overheads
   (which I think is the problem with expecting things like map and fold
   to be parallelised automagically; they're just too small grained for
   it to be worthwhile)


Someone else said that.  I don't understand what you mean.


If you do 1+2+3+4, it's not very efficient to spawn a new thread, have
that compute 1+2, compute 3+4 in the current thread, synchronise, and do
the final addition. You waste more time starting and stopping threads
than doing useful work. If you're going to spawn a new thread, it needs
to do lots of work for it to be worth it. And this is kinda what makes
this stuff so tricky - between laziness and the Halting Problem, it's
hard to work out what is worth sparking a thread for...
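
Concretely (a sketch with par and pseq; here the spark bookkeeping
costs far more than the additions it could ever hide):

import Control.Parallel (par, pseq)

-- Sparking (1+2) to overlap with (3+4) can only lose: each addition
-- is a few cycles, and the spark machinery is not.
tiny :: Int
tiny = a `par` (b `pseq` (a + b))
  where
    a = 1 + 2
    b = 3 + 4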



Imagine I have a map like this:

map f [1..100000]

... and I have 64 cores..

So I divide the 100000 elements in my list by 64, and give each chunk
of 1000-ish elements to each thread.  Easy I think?


...you realise that that's a linked-list you have there, right? ;-)

Now, if you had an *array*, and you know that f does a constant amount 
of work for all elements, and that all array elements will eventually be 
needed...


Note that NDP is doing this too, *but NDP is trying to do this for
asymmetric elements at compile-time, which is generally impossible*.
I propose doing it at runtime, inside the VM, which is trivial and
optimal.


Parallel and trivial in the same sentence? Hmm, I wonder... how did this 
problem not get solved 20 years ago? Oh, wait, maybe because it's harder 
than it looks. ;-)


It certainly looks *easier* to partition the work at run-time than at
compile-time. But trivial? I doubt it. (For a start, any autosensing
magic you add mustn't slow the program down more than the parallelism
it's supposed to be generating!)



It's trivial to improve on NDP at runtime.  For example:

- we have a map of 64 elements, so we give one element to each thread
- all of the pieces but one finishes really quickly (say, within 1 
second)

- so, we start to look at what is happening within the remaining piece
- turns out it contains another map, so we split that map amongst our 
idle threads

- and so on

Easy, flexible.


Seems to require running some sort of monitor process which can 
somehow see what operation(s) a thread is trying to do so it can split 
up the work and share it around. That's going to add overhead. Now 
that's OK so long as the gains outweight the overhead...



- lazy evaluation makes it harder


Not at runtime ;-)


O RLY?

The problem with things being lazy is that you don't know whether some
data is needed until the moment you actually come to need it. OTOH,
ideally you'd want to know way in advance so that you can compute it in
parallel and it will already be ready when you come to use it. But
with laziness, you can't [easily] do that. Doing the testing at run-time
isn't helping you here...



Basically, it's harder than it sounds.


No ;-)   The more I talk about this the more confident I feel that 
this is an easy, optimal solution.


How does the saying go? Anything is easy for the man who does not have 
to implement it...


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Dynamic thread management?

2007-08-10 Thread Bayley, Alistair
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Hugh Perkins
 
 Not many replies on this thread?  Am I so wrong that no-one's 
 even telling me?  I find it hard to believe that if there 
 were obvious errors in the proposition that anyone would 
 resist pointing them out to me ;-)

Well, I vaguely recall a study which analysed a number of programs and
determined that there was not a lot of inherent parallelism in them, so
the effort to automate parallelism wouldn't have much payoff.  Wait a
sec...

This isn't it, but seems relevant:
http://www.detreville.org/papers/Limits.pdf

Here you go:
http://citeseer.ist.psu.edu/tremblay95impact.html


 So, that leaves a couple of possibilities: some people are
 agreeing, but see no point in saying; or no-one cares, because
 we all only have 1 or 2 core machines.

Well, the Harris/Singh paper summarises the common problems:
 - not many programs are inherently parallel
 - parallelism must be quite coarse to offset overheads
   (which I think is the problem with expecting things like map and fold
to be parallelised automagically; they're just too small grained for it to
be worthwhile)
 - lazy evaluation makes it harder

Basically, it's harder than it sounds.

This is where NDP looks good. Although you have to explicitly label
parallel operations, you get to use your functional tools like map and
fold, etc.
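
For instance, something like this (a sketch from memory of the NDP /
Data Parallel Haskell papers -- treat the module name and the [: :] /
sumP / zipWithP spellings as approximate):

import Data.Array.Parallel

dotp :: [:Double:] -> [:Double:] -> Double
dotp xs ys = sumP (zipWithP (*) xs ys)

The [: :] type is what labels the arrays as parallel, but the code
itself is still just the familiar zipWith-and-fold style.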

Alistair


Re: [Haskell-cafe] Dynamic thread management?

2007-08-10 Thread Jan-Willem Maessen


On Aug 10, 2007, at 9:31 AM, Hugh Perkins wrote:

Not many replies on this thread?  Am I so wrong that no-one's even  
telling me?  I find it hard to believe that if there were obvious  
errors in the proposition that anyone would resist pointing them  
out to me ;-)


So, that leaves a couple of possibilities: some people are agreeing,  
but see no point in saying; or no-one cares, because we all only  
have 1 or 2 core machines.


I'm going to kind of run with the second possibility for now.   
However, I do believe it's the right time to solve this, what with  
64-core Niagaras around the corner and so on.


What would be neat would be a way to test solutions on simulated  
1024-core machines, using a single-core machine.  Are there any  
utilities or virtual environments around that might make this kind  
of testing feasible?


It's actually difficult to do realistic simulations of large machines  
like this; most of the performance effects you'll see depend on the  
behavior of the cache and memory subsystems, and it's difficult and  
expensive to simulate those well.  So, for example, you could use  
something like Simics to simulate a 1024-core machine, but it'd be  
expensive (Simics costs money), slow (100x? slower than ordinary  
execution) and non-parallel (so you wouldn't be able to run it on a  
currently-extant multiprocessor box in the hopes of speeding up the  
simulation).  Between the simulator slowdown and the fact that you're  
simulating 1024 cores using only 1 thread, you can expect to wait a  
long time for simulation results.


Also, these things tend to require an awful lot of care and feeding.

[Full disclosure: I don't personally work with Simics or its ilk, but  
my colleagues do.]


-Jan-Willem Maessen





Re: [Haskell-cafe] Dynamic thread management?

2007-08-10 Thread Hugh Perkins
On 8/10/07, Bayley, Alistair [EMAIL PROTECTED] wrote:

 Well, the Harris/Singh paper summarises the common problems:
 - not many programs are inherently parallel


If that's the case, then multicores are not going to be very useful.  Where
there's a will, there's a way.

What I think is: if maps etc will be parallelized out where necessary by the
runtime, then people will adapt their programs to use them.

Right now, many people use imperative loops in Haskell to get the highest
performance (see the shootout examples, for example).  In the future, this
will be shunned in favour of standard map and foldb operations.

- parallelism must be quite coarse to offset overheads
(which I think is the problem with expecting things like map and fold
 to be parallelised automagically; they're just too small-grained for it
 to be worthwhile)


Someone else said that.  I don't understand what you mean.

Imagine I have a map like this:

map f [1..100000]

... and I have 64 cores..

So I divide the 100,000 elements in my list by 64, and give each chunk of
1000-ish elements to each thread.  Easy, I think?
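
Something like this sketch, even at the library level (chunkInto and
parMapStatic are names I'm inventing; parMap and rdeepseq come from
GHC's Control.Parallel.Strategies):

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parMap, rdeepseq)

-- split a list into n roughly equal chunks
chunkInto :: Int -> [a] -> [[a]]
chunkInto n xs = go xs
  where
    size = max 1 ((length xs + n - 1) `div` n)
    go [] = []
    go ys = let (c, rest) = splitAt size ys in c : go rest

-- one chunk per core, each chunk evaluated in parallel
parMapStatic :: NFData b => Int -> (a -> b) -> [a] -> [b]
parMapStatic nCores f xs =
  concat (parMap rdeepseq (map f) (chunkInto nCores xs))

So parMapStatic 64 f [1..100000] hands each core a chunk of 1000-ish
elements, exactly as above.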

Note that NDP is doing this too, *but NDP is trying to do this for
asymmetric elements at compile-time, which is generally impossible*.  I
propose doing it at runtime, inside the VM, which is trivial and optimal.

More details/examples:

It's trivial to improve on NDP at runtime.  For example:

- we split our map into 64 pieces, give each one to a thread
- one piece finishes earlier than the others - we split one of the other
pieces in two, and give one piece to the finished thread
- and so on...

- reasonably optimal

Another example:

- we have a map of 64 elements, so we give one element to each thread
- all of the pieces but one finish really quickly (say, within 1 second)
- so, we start to look at what is happening within the remaining piece
- turns out it contains another map, so we split that map amongst our idle
threads
- and so on

Easy, flexible.
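
Here's a rough sketch of the dynamic half of that, done at the library
level with ordinary concurrency primitives rather than in a VM (dynParMap,
its pool and its batch size are all my inventions, not an existing API):

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Exception (evaluate)
import Control.Monad (forM_, replicateM_)
import Data.List (sortOn)

dynParMap :: Int -> Int -> (a -> b) -> [a] -> IO [b]
dynParMap nWorkers batchSize f xs = do
  pool    <- newMVar (zip [0 :: Int ..] xs)  -- remaining work, indexed
  results <- newMVar []                      -- (index, result) pairs, any order
  dones   <- newEmptyMVar                    -- workers signal retirement here
  let worker = do
        batch <- modifyMVar pool $ \p ->
                   let (b, rest) = splitAt batchSize p in return (rest, b)
        if null batch
          then putMVar dones ()              -- pool empty: this worker retires
          else do
            forM_ batch $ \(i, x) -> do
              y <- evaluate (f x)            -- force (to WHNF) in this thread
              modifyMVar_ results (return . ((i, y) :))
            worker                           -- come back for more
  replicateM_ nWorkers (forkIO worker)
  replicateM_ nWorkers (takeMVar dones)      -- wait for every worker
  fmap (map snd . sortOn fst) (takeMVar results)

A worker that draws a cheap batch just comes back for another one, so
the fast threads soak up the leftover work automatically.  What this
can *not* do is split a batch that is already running -- that is the
bit that genuinely needs VM support.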

- lazy evaluation makes it harder


Not at runtime ;-)

Basically, it's harder than it sounds.


No ;-)   The more I talk about this the more confident I feel that this is
an easy, optimal solution.

The bits that stand out as being hard:

- people making their programs use map, foldb etc - but I don't have to do
that, it will happen automatically.  It's sort of like distributing bread
automatically to people in London, via the free market.  Do you know that
the Russians once asked the Western world who was responsible for
distributing bread to people in London?  Answer: no-one is, it happens
automatically ;-)

- migrate Haskell/GHC/Erlang/some-other-FP to have a bytecode/VM layer

- create a 1024-ish-core simulator.  In fact this is the only bit that I
don't have a clear vision for.


Re: [Haskell-cafe] Dynamic thread management?

2007-08-10 Thread Hugh Perkins
Not many replies on this thread?  Am I so wrong that no-one's even telling
me?  I find it hard to believe that if there were obvious errors in the
proposition that anyone would resist pointing them out to me ;-)

So, that leaves a couple of possibilities: some people are agreeing, but see
no point in saying; or no-one cares, because we all only have 1 or 2 core
machines.

I'm going to kind of run with the second possibility for now.  However, I do
believe it's the right time to solve this, what with 64-core Niagaras
around the corner and so on.

What would be neat would be a way to test solutions on simulated 1024-core
machines, using a single-core machine.  Are there any utilities or virtual
environments around that might make this kind of testing feasible?


Re: [Haskell-cafe] Dynamic thread management?

2007-08-10 Thread Thomas Conway
On 8/11/07, Hugh Perkins [EMAIL PROTECTED] wrote:
  - parallelism must be quite coarse to offset overheads
 (which I think is the problem with expecting things like map and fold
  to be parallelised automagically; they're just too small-grained for it
  to be worthwhile)

 Someone else said that.  I don't understand what you mean.

There are many papers about this in the Parallel Logic Programming
area. It is commonly called Embarrassing Parallelism. Creating a
thread, or even just scheduling a chunk of work for evaluation has
packaging-up costs, synchronization costs, and so on. It is all too
easy for these costs to outweigh the work to be done, so by
parallelizing your code, you make it run slower.

So, if you want to parallelize map f xs, unless f is *really*
expensive, you'll only get a benefit if you can break xs into chunks
of e.g. 10^3 elements or more, and more like 10^5 or more for more
usual 'f's. Some tasks, like Monte Carlo integration, are very amenable
to such, but most tasks are not.
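
To put those numbers in code (here f is (*2), about the cheapest f
imaginable; parList and parListChunk are from GHC's
Control.Parallel.Strategies):

import Control.Parallel.Strategies (parList, parListChunk, rseq, using)

-- one spark per element: the bookkeeping dwarfs the work itself
perElement :: [Double] -> [Double]
perElement xs = map (* 2) xs `using` parList rseq

-- one spark per chunk of 10^4 elements: the overhead is amortised
chunked :: [Double] -> [Double]
chunked xs = map (* 2) xs `using` parListChunk 10000 rseq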

cheers,
T.
-- 
Dr Thomas Conway
[EMAIL PROTECTED]

Silence is the perfectest herald of joy:
I were but little happy, if I could say how much.


Re: [Haskell-cafe] Dynamic thread management?

2007-08-10 Thread Hugh Perkins
On 8/11/07, Thomas Conway [EMAIL PROTECTED] wrote:
 There are many papers about this in the Parallel Logic Programming
 area. It is commonly called Embarrassing Parallelism.

Ah, I wasn't very precise ;-)  I didn't mean I don't understand the
problem; I meant I don't understand why people think it is difficult to
solve ;-)  (And then I tried to explain by examples why it is easy,
but it is true that my communication sucks ;-) )

 you'll only get a benefit if you can break xs into chunks
 of e.g. 10^3 elements or more, and more like 10^5 or more for more
 usual 'f's.

Actually, the number of elements is irrelevant.  The only measure that
is important is how long the function is taking to execute, and
whether it is parallelizable.

Example: the following only has 3 elements, but will take a while to run:

print $ sum $ map getNumberOfPrimes [1024, 2048, 4096]

The following has 10 million elements but runs quickly:

print $ sum $ map (+1) [1..10000000]
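
(For concreteness, here's a deliberately naive trial-division definition
you could plug in for getNumberOfPrimes -- the name and the choice of
algorithm are just for illustration; any suitably slow version will do:)

getNumberOfPrimes :: Int -> Int
getNumberOfPrimes n = length [ k | k <- [2 .. n], isPrime k ]
  where
    isPrime k = all (\d -> k `mod` d /= 0)
                    [2 .. floor (sqrt (fromIntegral k :: Double))]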

In the second, we start the function running, in a single thread, and
after a second, the function has already finished, so great!  We're
done!

In the first, we start the function running, in a single thread.
After a second the function is still running, so we look at what is
taking the time and whether it is parallelizable.

Turns out the vm has been chugging away on the map for the last
second, and that maps are parallelizable, so we split the map
into Math.Min(numberElements, numberCores) pieces, which on our
64-core machine, given we have 3 elements, is Math.Min(64, 3),
which is 3.

So, we assign each of the 3 elements of the map to a thread.  So, now
we're using 3 of our 64 cores.

A second later, each thread is still chugging away at its element, so
we think, ok, maybe we can parallelize more, because we still have 61
threads/cores available, so now we look at the getNumberOfPrimes
function itself, and continue the above type of analysis.

This continues until all 64 cores have been assigned some work to do.

Whenever a thread finishes, we look around for some work for it to do.
If there is some, we assign it.  If there isn't, we look around for an
existing thread that is doing something parallelizable, and split it
into two pieces, giving one to the old thread, and one to the
available thread.
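
At the library level you can already fake a crude version of this
start-it-then-reconsider policy (runAdaptively is my invention;
System.Timeout.timeout is real): give the sequential plan one second,
and if it hasn't finished, abandon it and run a parallel plan instead.
The obvious waste -- the first second's work is thrown away rather than
split in flight -- is exactly what a VM-level version would avoid:

import System.Timeout (timeout)

runAdaptively :: IO a -> IO a -> IO a
runAdaptively sequentialPlan parallelPlan = do
  r <- timeout 1000000 sequentialPlan   -- one second, in microseconds
  case r of
    Just x  -> return x                 -- finished quickly: done
    Nothing -> parallelPlan             -- still running: switch plans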

Not sure why this is perceived as difficult (added the word
"perceived" this time ;-) ).  I think the main barrier to
understanding why it is easy is understanding that this needs to be
done from a VM, at runtime.  It is not possible to do it statically at
compilation time, but I really need to finish watching SPJ's video before
declaring that SPJ's proposal is doomed to fail ;-)  Not least, it's
probably not good to make an enemy of SPJ ;-)


Re: [Haskell-cafe] Dynamic thread management?

2007-08-09 Thread Donald Bruce Stewart
hughperkins:
 
Haskell/FP seems to have solved the hardest bit of
threading, which is making it obvious which bits of a
program are parallelizable, and which are not.
Remains to actually parallelize out the programs.  Am I
being naive or is this trivial?

Is there some reason why we can't just start a function
running in a single thread, whilst running profiling, then
after a while we check which bits of the function are taking
time to run, and are parallelizable, and we parallelize those
out?

Perhaps have a look at this new paper:

Feedback directed implicit parallelism in Haskell
http://research.microsoft.com/~tharris/papers/2007-fdip.pdf

-- Don