Re: parallel garbage collection performance

2012-06-27 Thread Simon Marlow

On 26/06/2012 00:42, Ryan Newton wrote:

However, the parallel GC will be a problem if one or more of your
cores is being used by other process(es) on the machine.  In that
case, the GC synchronisation will stall and performance will go down
the drain.  You can often see this on a ThreadScope profile as a big
delay during GC while the other cores wait for the delayed core.
  Make sure your machine is quiet and/or use one fewer core than
the total available.  It's not usually a good idea to use
hyperthreaded cores either.


Does it ever help to set the number of GC threads greater than
numCapabilities to over-partition the GC work?  The idea would be to
enable some load balancing in the face of perturbation from external
load on the machine...

It looks like GHC 6.10 had a "-g" flag for this that later went away?


The GC threads map one-to-one onto mutator threads now (since 6.12). 
This change was crucial for performance; before that, we hardly ever got 
any speedup from parallel GC because there was no guarantee of locality.


I don't think it would help to have more threads.  The load-balancing is 
already done with work-stealing; it isn't statically partitioned.


Cheers,
Simon
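
To observe the behaviour described above, a minimal spark-based program 
along these lines (a sketch; the fib workload and input range are purely 
illustrative) can be compiled with -threaded -rtsopts and run with 
+RTS -N4 -s.  The summary it prints has the same SPARKS and 
"Parallel GC work balance" lines as the runs quoted further down in this 
thread:

    import Control.Parallel.Strategies (parMap, rdeepseq)

    -- Naive Fibonacci: a cheap way to generate CPU work and allocation.
    fib :: Int -> Integer
    fib n | n < 2     = fromIntegral n
          | otherwise = fib (n - 1) + fib (n - 2)

    -- Evaluate the calls as parallel sparks; with the threaded RTS the
    -- parallel GC then uses one GC thread per capability (-N).
    main :: IO ()
    main = print (sum (parMap rdeepseq fib [28 .. 36]))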




Re: parallel garbage collection performance

2012-06-25 Thread John Lato
Thanks very much for this information.  My observations match your
recommendations, insofar as I can test them.

Cheers,
John

On Mon, Jun 25, 2012 at 11:42 PM, Simon Marlow  wrote:
> On 19/06/12 02:32, John Lato wrote:
>>
>> Thanks for the suggestions.  I'll try them and report back.  Although
>> I've since found that out of 3 not-identical systems, this problem
>> only occurs on one.  So I may try different kernel/system libs and see
>> where that gets me.
>>
>> -qg is funny.  My interpretation from the results so far is that, when
>> the parallel collector doesn't get stalled, it results in a big win.
>> But when parGC does stall, it's slower than disabling parallel GC
>> entirely.
>
>
> Parallel GC is usually a win for idiomatic Haskell code; it may or may not
> be a good idea for things like Repa - I haven't done much analysis of those
> types of programs yet.  Experiment with the -A flag, e.g. -A1m is often
> better than the default if your processor has a large cache.
>
> However, the parallel GC will be a problem if one or more of your cores is
> being used by other process(es) on the machine.  In that case, the GC
> synchronisation will stall and performance will go down the drain.  You can
> often see this on a ThreadScope profile as a big delay during GC while the
> other cores wait for the delayed core.  Make sure your machine is quiet
> and/or use one fewer core than the total available.  It's not usually a
> good idea to use hyperthreaded cores either.
>
> I'm also seeing unpredictable performance on a 32-core AMD machine with
> NUMA.  I'd avoid NUMA for Haskell for the time being if you can.  Indeed you
> get unpredictable performance on this machine even for single-threaded code,
> because it makes a difference on which node the pages of your executable are
> cached (I heard a rumour that Linux has some kind of a fix for this in the
> pipeline, but I don't know the details).
>
>
>> I had thought the last core parallel slowdown problem was fixed a
>> while ago, but apparently not?
>
>
> We improved matters by inserting some "yield"s into the spinlock loops.
>  This helped a lot, but the problem still exists.
>
> Cheers,
>        Simon
>
>
>
>> Thanks,
>> John
>>
>> On Tue, Jun 19, 2012 at 8:49 AM, Ben Lippmeier  wrote:
>>>
>>>
>>> On 19/06/2012, at 24:48 , Tyson Whitehead wrote:
>>>
 On June 18, 2012 04:20:51 John Lato wrote:
>
> Given this, can anyone suggest any likely causes of this issue, or
> anything I might want to look for?  Also, should I be concerned about
> the much larger gc_alloc_block_sync level for the slow run?  Does that
> indicate the allocator waiting to alloc a new block, or is it
> something else?  Am I on completely the wrong track?


 A total shot in the dark here, but wasn't there something about really
 bad
 performance when you used all the CPUs on your machine under Linux?

 Presumably very tight coupling that is causing all the threads to stall
 every time the OS needs to do something or something?
>>>
>>>
>>> This can be a problem for data parallel computations (like in Repa). In
>>> Repa all threads in the gang are supposed to run for the same time, but if
>>> one gets swapped out by the OS then the whole gang is stalled.
>>>
>>> I tend to get best results using -N7 for an 8 core machine.
>>>
>>> It is also important to enable thread affinity (with the -qa flag).
>>>
>>> For a Repa program on an 8 core machine I use +RTS -N7 -qa -qg
>>>
>>> Ben.
>>>
>>>
>>
>
>



Re: parallel garbage collection performance

2012-06-25 Thread Ryan Newton
>
> However, the parallel GC will be a problem if one or more of your cores is
> being used by other process(es) on the machine.  In that case, the GC
> synchronisation will stall and performance will go down the drain.  You can
> often see this on a ThreadScope profile as a big delay during GC while the
> other cores wait for the delayed core.  Make sure your machine is quiet
> and/or use one fewer core than the total available.  It's not usually a
> good idea to use hyperthreaded cores either.
>

Does it ever help to set the number of GC threads greater than
numCapabilities to over-partition the GC work?  The idea would be to enable
some load balancing in the face of perturbation from external load on the
machine...

It looks like GHC 6.10 had a "-g" flag for this that later went away?

  -Ryan


Re: parallel garbage collection performance

2012-06-25 Thread Simon Marlow

On 19/06/12 02:32, John Lato wrote:

Thanks for the suggestions.  I'll try them and report back.  Although
I've since found that out of 3 not-identical systems, this problem
only occurs on one.  So I may try different kernel/system libs and see
where that gets me.

-qg is funny.  My interpretation from the results so far is that, when
the parallel collector doesn't get stalled, it results in a big win.
But when parGC does stall, it's slower than disabling parallel GC 
entirely.


Parallel GC is usually a win for idiomatic Haskell code; it may or may 
not be a good idea for things like Repa - I haven't done much analysis 
of those types of programs yet.  Experiment with the -A flag: e.g. -A1m 
is often better than the default if your processor has a large cache.


However, the parallel GC will be a problem if one or more of your cores 
is being used by other process(es) on the machine.  In that case, the GC 
synchronisation will stall and performance will go down the drain.  You 
can often see this on a ThreadScope profile as a big delay during GC 
while the other cores wait for the delayed core.  Make sure your machine 
is quiet and/or use one fewer core than the total available.  It's not 
usually a good idea to use hyperthreaded cores either.
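
As a concrete reading of that advice (a sketch; "prog" is a placeholder 
for the executable), on an 8-core machine with other load one might 
leave a core free and try a larger allocation area, comparing the two 
-s summaries:

    ./prog +RTS -N8 -s          # all cores, default settings
    ./prog +RTS -N7 -A1m -s     # one core left free, larger -A

The GC elapsed time and the "Parallel GC work balance" line are the 
numbers to watch when comparing runs.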


I'm also seeing unpredictable performance on a 32-core AMD machine with 
NUMA.  I'd avoid NUMA for Haskell for the time being if you can.  Indeed 
you get unpredictable performance on this machine even for 
single-threaded code, because it makes a difference on which node the 
pages of your executable are cached (I heard a rumour that Linux has 
some kind of a fix for this in the pipeline, but I don't know the details).



I had thought the last core parallel slowdown problem was fixed a
while ago, but apparently not?


We improved matters by inserting some "yield"s into the spinlock loops. 
 This helped a lot, but the problem still exists.


Cheers,
Simon




Thanks,
John

On Tue, Jun 19, 2012 at 8:49 AM, Ben Lippmeier  wrote:


On 19/06/2012, at 24:48 , Tyson Whitehead wrote:


On June 18, 2012 04:20:51 John Lato wrote:

Given this, can anyone suggest any likely causes of this issue, or
anything I might want to look for?  Also, should I be concerned about
the much larger gc_alloc_block_sync level for the slow run?  Does that
indicate the allocator waiting to alloc a new block, or is it
something else?  Am I on completely the wrong track?


A total shot in the dark here, but wasn't there something about really bad
performance when you used all the CPUs on your machine under Linux?

Presumably very tight coupling that is causing all the threads to stall
every time the OS needs to do something or something?


This can be a problem for data parallel computations (like in Repa). In Repa 
all threads in the gang are supposed to run for the same time, but if one gets 
swapped out by the OS then the whole gang is stalled.

I tend to get best results using -N7 for an 8 core machine.

It is also important to enable thread affinity (with the -qa flag).

For a Repa program on an 8 core machine I use +RTS -N7 -qa -qg

Ben.









Re: parallel garbage collection performance

2012-06-18 Thread Manuel M T Chakravarty
Bryan O'Sullivan :
> On Mon, Jun 18, 2012 at 9:32 PM, John Lato  wrote:
> 
> I had thought the last core parallel slowdown problem was fixed a
> while ago, but apparently not?
> 
> Simon Marlow has thought so in the not too distant past (since he did the 
> work), if my recollection is correct.

It may very well be fixed for non-data-parallel programs. For data-parallel 
programs the situation is trickier, as we rely on all threads participating 
in a DP computation to be scheduled simultaneously. If one core is currently 
tied up by the OS, then GHC's RTS can't do anything about that.  As it has no 
concept of gang scheduling (but treats the threads participating in a DP 
computation individually), it also doesn't know that scheduling a subset of the 
threads in the gang is counterproductive.



Re: parallel garbage collection performance

2012-06-18 Thread Ben Lippmeier

On 19/06/2012, at 13:53 , Ben Lippmeier wrote:

> 
> On 19/06/2012, at 10:59 , Manuel M T Chakravarty wrote:
> 
>> I wonder, do we have a Repa FAQ (or similar) that explains such issues? (And 
>> is easily discoverable?)
> 
> I've been trying to collect the main points in the haddocs for the main 
> module [1], but this one isn't there yet.
> 
> I need to update the Repa tutorial on the Haskell wiki, and this should also 
> go in it.


I also added thread affinity to the Repa FAQ [1].

Ben.

[1] http://repa.ouroborus.net/






Re: parallel garbage collection performance

2012-06-18 Thread Ben Lippmeier

On 19/06/2012, at 10:59 , Manuel M T Chakravarty wrote:

> I wonder, do we have a Repa FAQ (or similar) that explains such issues? (And 
> is easily discoverable?)

I've been trying to collect the main points in the haddocs for the main module 
[1], but this one isn't there yet.

I need to update the Repa tutorial on the Haskell wiki, and this should also 
go in it.

Ben.

[1] http://hackage.haskell.org/packages/archive/repa/3.2.1.1/doc/html/Data-Array-Repa.html




Re: parallel garbage collection performance

2012-06-18 Thread Bryan O'Sullivan
On Mon, Jun 18, 2012 at 9:32 PM, John Lato  wrote:

>
> I had thought the last core parallel slowdown problem was fixed a
> while ago, but apparently not?
>

Simon Marlow has thought so in the not too distant past (since he did the
work), if my recollection is correct.


Re: parallel garbage collection performance

2012-06-18 Thread John Lato
Thanks for the suggestions.  I'll try them and report back.  Although
I've since found that out of 3 not-identical systems, this problem
only occurs on one.  So I may try different kernel/system libs and see
where that gets me.

-qg is funny.  My interpretation from the results so far is that, when
the parallel collector doesn't get stalled, it results in a big win.
But when parGC does stall, it's slower than disabling parallel GC
entirely.

I had thought the last core parallel slowdown problem was fixed a
while ago, but apparently not?

Thanks,
John

On Tue, Jun 19, 2012 at 8:49 AM, Ben Lippmeier  wrote:
>
> On 19/06/2012, at 24:48 , Tyson Whitehead wrote:
>
>> On June 18, 2012 04:20:51 John Lato wrote:
>>> Given this, can anyone suggest any likely causes of this issue, or
>>> anything I might want to look for?  Also, should I be concerned about
>>> the much larger gc_alloc_block_sync level for the slow run?  Does that
>>> indicate the allocator waiting to alloc a new block, or is it
>>> something else?  Am I on completely the wrong track?
>>
>> A total shot in the dark here, but wasn't there something about really bad
>> performance when you used all the CPUs on your machine under Linux?
>>
>> Presumably very tight coupling that is causing all the threads to stall
>> every time the OS needs to do something or something?
>
> This can be a problem for data parallel computations (like in Repa). In Repa 
> all threads in the gang are supposed to run for the same time, but if one 
> gets swapped out by the OS then the whole gang is stalled.
>
> I tend to get best results using -N7 for an 8 core machine.
>
> It is also important to enable thread affinity (with the -qa flag).
>
> For a Repa program on an 8 core machine I use +RTS -N7 -qa -qg
>
> Ben.
>
>



Re: parallel garbage collection performance

2012-06-18 Thread Manuel M T Chakravarty
I wonder, do we have a Repa FAQ (or similar) that explains such issues? (And is 
easily discoverable?)

Manuel


Ben Lippmeier :
> On 19/06/2012, at 24:48 , Tyson Whitehead wrote:
> 
>> On June 18, 2012 04:20:51 John Lato wrote:
>>> Given this, can anyone suggest any likely causes of this issue, or
>>> anything I might want to look for?  Also, should I be concerned about
>>> the much larger gc_alloc_block_sync level for the slow run?  Does that
>>> indicate the allocator waiting to alloc a new block, or is it
>>> something else?  Am I on completely the wrong track?
>> 
>> A total shot in the dark here, but wasn't there something about really bad 
>> performance when you used all the CPUs on your machine under Linux?
>> 
>> Presumably very tight coupling that is causing all the threads to stall 
>> every time the OS needs to do something or something?
> 
> This can be a problem for data parallel computations (like in Repa). In Repa 
> all threads in the gang are supposed to run for the same time, but if one 
> gets swapped out by the OS then the whole gang is stalled.
> 
> I tend to get best results using -N7 for an 8 core machine. 
> 
> It is also important to enable thread affinity (with the -qa flag). 
> 
> For a Repa program on an 8 core machine I use +RTS -N7 -qa -qg
> 
> Ben.
> 
> 
> 




Re: parallel garbage collection performance

2012-06-18 Thread Ben Lippmeier

On 19/06/2012, at 24:48 , Tyson Whitehead wrote:

> On June 18, 2012 04:20:51 John Lato wrote:
>> Given this, can anyone suggest any likely causes of this issue, or
>> anything I might want to look for?  Also, should I be concerned about
>> the much larger gc_alloc_block_sync level for the slow run?  Does that
>> indicate the allocator waiting to alloc a new block, or is it
>> something else?  Am I on completely the wrong track?
> 
> A total shot in the dark here, but wasn't there something about really bad 
> performance when you used all the CPUs on your machine under Linux?
> 
> Presumably very tight coupling that is causing all the threads to stall 
> every time the OS needs to do something or something?

This can be a problem for data parallel computations (like in Repa). In Repa 
all threads in the gang are supposed to run for the same time, but if one gets 
swapped out by the OS then the whole gang is stalled.

I tend to get best results using -N7 for an 8 core machine. 

It is also important to enable thread affinity (with the -qa flag). 

For a Repa program on an 8 core machine I use +RTS -N7 -qa -qg

Ben.
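
Spelled out as commands (a sketch; MyRepaProg is a placeholder name), 
that recipe is: build with the threaded runtime and RTS options enabled, 
then pass the flags when running:

    ghc -O2 -threaded -rtsopts MyRepaProg.hs
    ./MyRepaProg +RTS -N7 -qa -qg -s

Here -qa asks the RTS to pin its OS threads to CPUs, -qg turns the 
parallel collector off, and -s prints the summary statistics quoted 
elsewhere in this thread.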





Re: parallel garbage collection performance

2012-06-18 Thread Tyson Whitehead
On June 18, 2012 04:20:51 John Lato wrote:
> Given this, can anyone suggest any likely causes of this issue, or
> anything I might want to look for?  Also, should I be concerned about
> the much larger gc_alloc_block_sync level for the slow run?  Does that
> indicate the allocator waiting to alloc a new block, or is it
> something else?  Am I on completely the wrong track?

A total shot in the dark here, but wasn't there something about really bad 
performance when you used all the CPUs on your machine under Linux?

Presumably very tight coupling that is causing all the threads to stall 
every time the OS needs to do something or something?

Cheers!  -Tyson



parallel garbage collection performance

2012-06-18 Thread John Lato
Hello,

I have a program that is intermittently experiencing performance
issues that I believe are related to parallel GC, and I was hoping to
get some advice on how I might improve it.  Essentially, any given
execution is either slow or fast (the same executable, without
recompiling), most often slow.  So far I can't find anything that
would trigger either case.  This is with ghc-7.4.2 on 64bit linux.
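
For reference, a summary like the ones below is produced by building 
with the threaded runtime and RTS options enabled, and appending the 
flags at run time (a sketch; Main.hs is a placeholder):

    ghc -O2 -threaded -rtsopts Main.hs
    ./Main +RTS -N4 -A8m -s
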
Here are the statistics from running with -N4 -A8m -s:

slow run:

  16,647,460,328 bytes allocated in the heap
     313,767,248 bytes copied during GC
      17,305,120 bytes maximum residency (22 sample(s))
         601,952 bytes maximum slop
              73 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1268 colls,  1267 par    8.62s     8.00s     0.0063s    0.0389s
  Gen  1        22 colls,    22 par    0.63s     0.60s     0.0275s    0.0603s

  Parallel GC work balance: 1.53 (39176141 / 25609887, ideal 4)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    ( 13.66s)       0.01s    (  0.04s)
  Task  2 (bound)  :    0.00s    ( 13.98s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.00s    ( 18.14s)       0.16s    (  0.44s)
  Task  4 (worker) :    0.53s    ( 17.49s)       1.29s    (  4.25s)
  Task  5 (worker) :    0.00s    ( 17.45s)       1.25s    (  4.42s)
  Task  6 (worker) :    0.00s    ( 14.98s)       1.75s    (  6.90s)
  Task  7 (worker) :    0.00s    ( 21.87s)       0.02s    (  0.06s)
  Task  8 (worker) :    0.01s    ( 37.12s)       0.06s    (  0.17s)
  Task  9 (worker) :    0.00s    ( 21.41s)       4.88s    ( 15.99s)
  Task 10 (worker) :    0.84s    ( 43.06s)       1.99s    (  8.25s)
  Task 11 (bound)  :    6.39s    ( 51.13s)       0.06s    (  0.18s)
  Task 12 (worker) :    0.00s    (  0.00s)       8.04s    ( 21.42s)
  Task 13 (worker) :    0.43s    ( 28.38s)       8.14s    ( 22.94s)
  Task 14 (worker) :    5.35s    ( 29.30s)       5.81s    ( 22.02s)

  SPARKS: 7 (7 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.03s  (  0.01s elapsed)
  MUT     time   43.88s  ( 42.71s elapsed)
  GC      time    9.26s  (  8.60s elapsed)
  EXIT    time    0.01s  (  0.01s elapsed)
  Total   time   53.65s  ( 51.34s elapsed)

  Alloc rate    374,966,825 bytes per MUT second

  Productivity  82.7% of total user, 86.4% of total elapsed

gc_alloc_block_sync: 1388000
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

-- --
Fast run:

  42,061,441,560 bytes allocated in the heap
     725,062,720 bytes copied during GC
      36,963,480 bytes maximum residency (21 sample(s))
       1,382,536 bytes maximum slop
             141 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3206 colls,  3205 par    8.34s     1.87s     0.0006s    0.0089s
  Gen  1        21 colls,    21 par    0.76s     0.17s     0.0081s    0.0275s

  Parallel GC work balance: 1.78 (90535973 / 50955059, ideal 4)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  2 (worker) :    0.00s    ( 11.50s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.00s    ( 12.40s)       0.00s    (  0.00s)
  Task  4 (worker) :    0.58s    ( 12.20s)       0.59s    (  0.61s)
  Task  5 (bound)  :    0.00s    ( 12.89s)       0.00s    (  0.00s)
  Task  6 (worker) :    0.00s    ( 13.40s)       0.02s    (  0.02s)
  Task  7 (worker) :    0.00s    ( 14.66s)       0.00s    (  0.00s)
  Task  8 (worker) :    0.95s    ( 14.18s)       0.69s    (  0.76s)
  Task  9 (worker) :    2.82s    ( 13.50s)       1.37s    (  1.44s)
  Task 10 (worker) :    1.72s    ( 17.59s)       1.07s    (  1.16s)
  Task 11 (worker) :    3.99s    ( 24.68s)       0.37s    (  0.38s)
  Task 12 (worker) :    1.24s    ( 24.25s)       0.80s    (  0.82s)
  Task 13 (bound)  :    6.18s    ( 25.02s)       0.04s    (  0.04s)
  Task 14 (worker) :    1.46s    ( 23.42s)       1.59s    (  1.65s)
  Task 15 (worker) :    0.00s    (  0.00s)       0.66s    (  0.66s)
  Task 16 (worker) :   11.00s    ( 23.36s)       1.67s    (  1.70s)

  SPARKS: 28 (28 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.04s  (  0.02s elapsed)
  MUT     time   42.08s  ( 23.02s elapsed)
  GC      time    9.10s  (  2.04s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   51.69s  ( 25.09s elapsed)

  Alloc rate    987,695,300 bytes per MUT second

  Productivity  82.3% of total user, 169.6% of total elapsed

gc_alloc_block_sync: 164572
whitehole_spin: 0
gen[0].sync: 164
gen[1].sync: 18147

When I record an eventlog and view it with ThreadScope, the slow run
shows long frequent pauses for GC, whereas on a fast r