Re: memory fragmentation with ghc-7.6.1

2012-10-01 Thread Ryan Newton
Hi Ben,

I would bet on the same memory issues Simon mentioned.  But while you're
at it, would you mind trying a little experiment: sharing your work items
through a lock-free queue rather than a TQueue?

   http://hackage.haskell.org/package/lockfree-queue

In some situations this can yield a real benefit.  But again, since you
didn't see workers retrying transactions, this is probably not the issue.
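
Roughly what I have in mind, as a sketch only: the names below (newQ,
pushL, tryPopR in Data.Concurrent.Queue.MichaelScott) are from
lockfree-queue as I remember it, so double-check them against the version
you install.

    -- Sketch: replace the shared TQueue with a Michael-Scott lock-free
    -- queue.  tryPopR is non-blocking, so workers need a retry loop
    -- instead of blocking in STM.
    import Control.Concurrent (threadDelay)
    import Data.Concurrent.Queue.MichaelScott (LinkedQueue, newQ, pushL, tryPopR)

    popBlocking :: LinkedQueue a -> IO a
    popBlocking q = do
      m <- tryPopR q
      case m of
        Just x  -> return x
        Nothing -> threadDelay 100 >> popBlocking q  -- crude backoff

    main :: IO ()
    main = do
      q <- newQ
      pushL q "work item"
      putStrLn =<< popBlocking q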

  -Ryan

On Mon, Oct 1, 2012 at 4:18 AM, Simon Marlow  wrote:

> Hi Ben,
>
> My guess would be that you're running into some kind of memory bottleneck.
>  Three common ones are:
>
>   (1) total memory bandwidth
>   (2) cache ping-ponging
>   (3) NUMA overheads
>
> You would run into (1) if you were using an allocation area size (-A or
> -H) larger than the L2 cache.  Your stats seem to indicate that you're
> running with a large heap - could that be the case?
>
> (2) happens if you share data a lot between cores.  It can also happen if
> the RTS shares data between cores, but I've tried to squash as much of that
> as I can.
>
> (3) is sadly something that happens on these large AMD machines (and to
> some extent large multicore Intel boxes too).  Improving our NUMA support
> is something we really need to do.  NUMA overheads tend to manifest as very
> unpredictable runtimes.
>
> I suggest using perf to gather some low-level stats about cache misses and
> suchlike.
>
>   http://hackage.haskell.org/trac/ghc/wiki/Debugging/LowLevelProfiling/Perf
>
> Cheers,
> Simon
>
>
>
> On 29/09/2012 07:47, Ben Gamari wrote:
>
>> Simon Marlow  writes:
>>
>>> On 28/09/12 17:36, Ben Gamari wrote:
>>>
>>>> Unfortunately, after poking around I found a few obvious problems with
>>>> both the code and my testing configuration which explained the
>>>> performance drop. Things seem to be back to normal now. Sorry for the
>>>> noise! Great job on the new codegen.
>>>
>>> That's good to hear, thanks for letting me know!
>>>
>> Of course!
>>
>> That being said, I have run into a bit of a performance issue which
>> could be related to the runtime system. In particular, as I scale up in
>> thread count (from 6 up to 48, the core count of the machine) in my
>> program[1] (test data available), I'm seeing the total runtime increase,
>> as well as a corresponding increase in CPU-seconds used. This despite
>> the RTS claiming consistently high (~94%) productivity. Meanwhile
>> Threadscope shows that nearly all of my threads are working busily with
>> very few STM retries and no idle time. This is in an application which
>> should scale reasonably well (or so I believe). Attached below you will
>> find a crude listing of various runtime statistics over a variety of
>> thread counts (ranging into what should probably be regarded as the
>> absurdly large).
>>
>> The application is a parallel Gibbs sampler for learning probabilistic
>> graphical models. It involves a set of worker threads (updateWorkers)
>> pulling work units off of a common TQueue. After grabbing a work unit,
>> the thread will read a reference to the current global state from an
>> IORef. It will then begin a long-running calculation, resulting in a
>> small value (forced to normal form with deepseq) which it then
>> communicates back to a global update thread (diffWorker) in the form of
>> a lambda through another TQueue. The global update thread then maps the
>> global state (the same as was read from the IORef earlier) through this
>> lambda with atomicModifyIORef'. This is all implemented in [2].
>>
>> I do understand that I'm asking a lot of the language and I have been
>> quite impressed by how well Haskell and GHC have stood up to the
>> challenge thus far. That being said, the behavior I'm seeing seems a bit
>> strange. If synchronization overhead were the culprit, I'd expect to
>> observe STM retries or thread blocking, which I do not see (by eye it
>> seems that STM retries occur on the order of 5/second and worker threads
>> otherwise appear to run uninterrupted except for GC; GHC event log
>> from a 16 thread run available here[3]). If GC were the problem, I would
>> expect this to be manifested in the productivity, which it is clearly
>> not. Do you have any idea what else might be causing such extreme
>> performance degradation with higher thread counts? I would appreciate
>> any input you would have to offer.
>>
>> Thanks for all of your work!
>>
>> Cheers,
>>
>> - Ben
>>
>>
>> [1] https://github.com/bgamari/bayes-stack/v2
>> [2] https://github.com/bgamari/bayes-stack/blob/v2/BayesStack/Core/Gibbs.hs
>> [3] http://goldnerlab.physics.umass.edu/~bgamari/RunCI.eventlog
>>
>>
>>
>> Performance of Citation Influence model on lda-handcraft data set
>> 1115 arcs, 702 nodes, 50 items per node average

Re: Installing binary tarball fails on Linux

2012-10-01 Thread Ganesh Sittampalam
On 01/10/2012 12:05, Simon Marlow wrote:

> This probably means that you have packages installed in your ~/.cabal
> from a 32-bit GHC and you're using a 64-bit one, or vice-versa.  To
> avoid this problem you can configure cabal to put built packages into a
> directory containing the platform name.

How does one do this? I ran into this problem a while ago and couldn't
figure it out:
http://stackoverflow.com/questions/12393750/how-can-i-configure-cabal-to-use-different-folders-for-32-bit-and-64-bit-package

Ganesh




Re: Installing binary tarball fails on Linux

2012-10-01 Thread Simon Marlow

On 28/09/2012 00:47, Andrea Vezzosi wrote:

On Thu, Feb 23, 2012 at 10:27 PM, Johan Tibell  wrote:

The 64-bit version works, so I guess that means that a 32-bit GHC/64-bit
OS combination only works on OS X?


Did you get it working on Linux?

I'm getting the same error myself:
...
ghc-cabal: Bad interface file: dist-install/build/GHC/Classes.hi
magic number mismatch: old/corrupt interface file? (wanted 33214052, got
129742)
make[1]: *** [install_packages] Error 1
make: *** [install] Error 2


This probably means that you have packages installed in your ~/.cabal 
from a 32-bit GHC and you're using a 64-bit one, or vice-versa.  To 
avoid this problem you can configure cabal to put built packages into a 
directory containing the platform name.
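
From memory, something like this in ~/.cabal/config does it; treat the
exact field names as approximate, but $arch and $os are the path template
variables that cabal-install understands:

    -- in ~/.cabal/config: prefix the library subdirectory with the platform
    install-dirs user
      libsubdir: $arch-$os/$pkgid/$compiler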


Cheers,
Simon





Re: memory fragmentation with ghc-7.6.1

2012-10-01 Thread Simon Marlow

Hi Ben,

My guess would be that you're running into some kind of memory 
bottleneck.  Three common ones are:


  (1) total memory bandwidth
  (2) cache ping-ponging
  (3) NUMA overheads

You would run into (1) if you were using an allocation area size (-A or 
-H) larger than the L2 cache.  Your stats seem to indicate that you're 
running with a large heap - could that be the case?
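
For example, on a machine with a 512KB L2, keeping the allocation area
cache-sized would look something like this (program name hypothetical):

    ./YourProgram +RTS -N16 -A512k -s -RTS

whereas an -A or -H measured in hundreds of megabytes or more defeats the
cache entirely.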


(2) happens if you share data a lot between cores.  It can also happen 
if the RTS shares data between cores, but I've tried to squash as much 
of that as I can.


(3) is sadly something that happens on these large AMD machines (and to 
some extent large multicore Intel boxes too).  Improving our NUMA 
support is something we really need to do.  NUMA overheads tend to 
manifest as very unpredictable runtimes.
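
As a stopgap on Linux you can try interleaving pages across the NUMA
nodes; this is just the standard numactl tool, nothing GHC-specific
(program name hypothetical):

    numactl --interleave=all ./YourProgram +RTS -N48 -RTS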


I suggest using perf to gather some low-level stats about cache misses 
and suchlike.


  http://hackage.haskell.org/trac/ghc/wiki/Debugging/LowLevelProfiling/Perf
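
Something along these lines, for instance (exact event names vary a bit
between CPUs and perf versions; program name hypothetical):

    perf stat -e cycles,instructions,cache-references,cache-misses \
        ./YourProgram +RTS -N16 -RTS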

Cheers,
Simon


On 29/09/2012 07:47, Ben Gamari wrote:

Simon Marlow  writes:


On 28/09/12 17:36, Ben Gamari wrote:

Unfortunately, after poking around I found a few obvious problems with
both the code and my testing configuration which explained the
performance drop. Things seem to be back to normal now. Sorry for the
noise! Great job on the new codegen.


That's good to hear, thanks for letting me know!


Of course!

That being said, I have run into a bit of a performance issue which
could be related to the runtime system. In particular, as I scale up in
thread count (from 6 up to 48, the core count of the machine) in my
program[1] (test data available), I'm seeing the total runtime increase,
as well as a corresponding increase in CPU-seconds used. This despite
the RTS claiming consistently high (~94%) productivity. Meanwhile
Threadscope shows that nearly all of my threads are working busily with
very few STM retries and no idle time. This is in an application which
should scale reasonably well (or so I believe). Attached below you will
find a crude listing of various runtime statistics over a variety of
thread counts (ranging into what should probably be regarded as the
absurdly large).

The application is a parallel Gibbs sampler for learning probabilistic
graphical models. It involves a set of worker threads (updateWorkers)
pulling work units off of a common TQueue. After grabbing a work unit,
the thread will read a reference to the current global state from an
IORef. It will then begin a long-running calculation, resulting in a
small value (forced to normal form with deepseq) which it then
communicates back to a global update thread (diffWorker) in the form of
a lambda through another TQueue. The global update thread then maps the
global state (the same as was read from the IORef earlier) through this
lambda with atomicModifyIORef'. This is all implemented in [2].
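
In outline it looks something like this -- a stripped-down sketch, not
the actual code in [2], with the state, work, and diff types abstracted:

    import Control.Concurrent.STM (TQueue, atomically, readTQueue, writeTQueue)
    import Control.DeepSeq (NFData, force)
    import Control.Exception (evaluate)
    import Control.Monad (forever)
    import Data.IORef (IORef, readIORef, atomicModifyIORef')

    -- Each worker: pull a unit, snapshot the global state, run the
    -- long computation, force the small result, ship an update lambda.
    updateWorker :: NFData d
                 => TQueue w -> IORef s -> TQueue (s -> s)
                 -> (s -> w -> d) -> (d -> s -> s) -> IO ()
    updateWorker workQ stateRef diffQ compute apply = forever $ do
      unit  <- atomically (readTQueue workQ)
      state <- readIORef stateRef
      d     <- evaluate (force (compute state unit))  -- deepseq'd result
      atomically (writeTQueue diffQ (apply d))

    -- The single update thread folds each lambda into the global state.
    diffWorker :: IORef s -> TQueue (s -> s) -> IO ()
    diffWorker stateRef diffQ = forever $ do
      f <- atomically (readTQueue diffQ)
      atomicModifyIORef' stateRef (\s -> (f s, ()))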

I do understand that I'm asking a lot of the language and I have been
quite impressed by how well Haskell and GHC have stood up to the
challenge thus far. That being said, the behavior I'm seeing seems a bit
strange. If synchronization overhead were the culprit, I'd expect to
observe STM retries or thread blocking, which I do not see (by eye it
seems that STM retries occur on the order of 5/second and worker threads
otherwise appear to run uninterrupted except for GC; GHC event log
from a 16 thread run available here[3]). If GC were the problem, I would
expect this to be manifested in the productivity, which it is clearly
not. Do you have any idea what else might be causing such extreme
performance degradation with higher thread counts? I would appreciate
any input you would have to offer.

Thanks for all of your work!

Cheers,

- Ben


[1] https://github.com/bgamari/bayes-stack/v2
[2] https://github.com/bgamari/bayes-stack/blob/v2/BayesStack/Core/Gibbs.hs
[3] http://goldnerlab.physics.umass.edu/~bgamari/RunCI.eventlog



Performance of Citation Influence model on lda-handcraft data set
1115 arcs, 702 nodes, 50 items per node average
100 sweeps in blocks of 10, 200 topics
Running with +RTS -A1G
ghc-7.7 9c15249e082642f9c4c0113133afd78f07f1ade2

Cores  User time (s)  Walltime (s)  CPU %  Productivity
=====  =============  ============  =====  ============
2             488.66        269.41   188%  93.7%
3             533.43        195.28   281%  94.1%
4             603.92        166.94   374%  94.3%
5             622.40        138.16   466%  93.8%
6             663.73        123.00   558%  94.2%
7             713.96        114.17   647%  94.0%
8             724.66        101.98   736%  93.7%
9             802.75        100.59   826%  .
10            865.05         97.69   917%  .
11            966.97         99.09  1010%  .
12           1238.42        114.28  1117%
13           1242.43        106.53  1206%
14           1428.59        112.48  1310%
15
15