[Haskell-cafe] Re: Optimizing a high-traffic network architecture

2006-01-05 Thread Simon Marlow

Bulat Ziganshin wrote:

Hello Simon,

Thursday, December 15, 2005, 4:53:27 PM, you wrote:

SM The 3k threads are still GC'd, but they are not actually *copied* during
SM GC.

SM It'll increase the memory overhead per thread from 2k (1k * 2 for
SM copying) to 4k (4k block, no overhead for copying).

Simon, why not to include this in the base package? either change
something so that a 1k-threads will be not copied during GC, or at
least increment default stack size? this will improve performance of
other hyper-threaded programs. memory expenses seems not so great


Because it doesn't always improve things.  This is a slightly modified 
version of the cheap concurrency benchmark from the shootout, first 
without tweaking -k:


 ./threads003 1 +RTS -sstderr
./threads003 1 +RTS -sstderr
 93,908,920 bytes allocated in the heap
159,724,208 bytes copied during GC (scavenged)
  1,559,376 bytes copied during GC (not scavenged)
 10,415,848 bytes maximum residency (4 sample(s))

177 collections in generation 0 (  1.05s)
  4 collections in generation 1 (  0.02s)

 21 Mb total memory in use

  INIT  time0.00s  (  0.00s elapsed)
  MUT   time1.28s  (  1.28s elapsed)
  GCtime1.06s  (  1.09s elapsed)
  EXIT  time0.00s  (  0.00s elapsed)
  Total time2.35s  (  2.37s elapsed)

  %GC time  45.3%  (45.9% elapsed)

  Alloc rate73,149,011 bytes per MUT second

  Productivity  54.7% of total user, 54.1% of total elapsed

and now tweaking -k (using -k6k, because this is a 64-bit machine and 
storage manager blocks are 8k):


  ./threads003 1 +RTS -sstderr -k6k
./threads003 1 +RTS -sstderr -k6k
168,837,736 bytes allocated in the heap
109,203,160 bytes copied during GC (scavenged)
  1,497,728 bytes copied during GC (not scavenged)
 71,180,464 bytes maximum residency (2 sample(s))

156 collections in generation 0 (  1.06s)
  2 collections in generation 1 (  0.01s)

 86 Mb total memory in use

  INIT  time0.00s  (  0.00s elapsed)
  MUT   time2.48s  (  2.58s elapsed)
  GCtime1.08s  (  1.08s elapsed)
  EXIT  time0.00s  (  0.00s elapsed)
  Total time3.56s  (  3.65s elapsed)

  %GC time  30.3%  (29.5% elapsed)

  Alloc rate68,007,748 bytes per MUT second

  Productivity  69.7% of total user, 67.9% of total elapsed

My hypothesis is that when we give each thread its own memory block, all 
the thread stacks occupy the same cache lines and we end up with a lot 
more cache misses (notice it's the MUT time that increased, not the GC 
time).


Test program attached, if anyone's interested in digging further.

Cheers,
Simon
-- $Id: message-ghc-2.code,v 1.3 2005/09/17 04:36:26 bfulgham Exp $
-- The Great Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- Contributed by Einar Karttunen
-- Modified by Simon Marlow

-- This is the shootout cheap concurrency benchmark, modified
-- slightly.  Modification noted below (***) to add more concurrency
-- and make a speedup on multiple processors available.

-- Creates 500 threads arranged in a sequence where each takes a value
-- from the left, adds 1, and passes it to the right (via MVars).
-- N more threads pump zeros in at the left.  A sub-thread
-- takes N values from the right and sums them.
-- 

import Control.Concurrent
import Control.Monad
import System

thread :: MVar Int - MVar Int - IO ()
thread inp out = do x - takeMVar inp; putMVar out $! x+1; thread inp out

spawn cur _ = do next - newEmptyMVar
 forkIO $ thread cur next
 return next

main = do n - getArgs = readIO.head
  s - newEmptyMVar
  e - foldM spawn s [1..500]
  f - newEmptyMVar
  forkIO $ replicateM n (takeMVar e) = putMVar f . sum
  replicateM n (forkIO $ putMVar s 0)
-- ***replicateM n (putMVar s 0)
  takeMVar f

-- vim: ts=4 ft=haskell
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Optimizing a high-traffic network architecture

2006-01-05 Thread Joel Reymont

My apologies if this has been described somewhere but what is MUT time?

Also, isn't 30% GC a bit high? This is something that totally  
surprised me when I first saw it as my program was spending 60-70% on  
GC.


Is there a good low % number that should be used as a benchmark?

Thanks, Joel

On Jan 5, 2006, at 10:01 AM, Simon Marlow wrote:


  INIT  time0.00s  (  0.00s elapsed)
  MUT   time2.48s  (  2.58s elapsed)
  GCtime1.08s  (  1.08s elapsed)
  EXIT  time0.00s  (  0.00s elapsed)
  Total time3.56s  (  3.65s elapsed)

  %GC time  30.3%  (29.5% elapsed)

  Alloc rate68,007,748 bytes per MUT second

  Productivity  69.7% of total user, 67.9% of total elapsed

My hypothesis is that when we give each thread its own memory  
block, all the thread stacks occupy the same cache lines and we end  
up with a lot more cache misses (notice it's the MUT time that  
increased, not the GC time).


--
http://wagerlabs.com/





___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: Optimizing a high-traffic network architecture

2006-01-05 Thread Simon Marlow

Joel Reymont wrote:

My apologies if this has been described somewhere but what is MUT time?


MUTator time, i.e. the time spent doing real work by your program.  (the 
term mutator isn't used so much these days, but it comes from the view 
of a functional program as a graph, and the engine that mutates or 
reduces the graph is called the mutator.)


Also, isn't 30% GC a bit high? This is something that totally  surprised 
me when I first saw it as my program was spending 60-70% on  GC.


It is a little high, yes.  GC time on a 64-bit machine tends to be 
higher because there are twice as many bits to shuffle around.



Is there a good low % number that should be used as a benchmark?


As a rule of thumb I usually consider 30% to be a bit on the high side, 
it would be time to investigate with profiling and try to reduce 
residency or allocation rate.


Cheers,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe