[Haskell-cafe] Re: Joels Time Leak

2006-01-05 Thread Simon Marlow

Bulat Ziganshin wrote:

Hello Simon,

Wednesday, January 04, 2006, 7:33:22 PM, you wrote:



The minimum time between context switches is 20 milliseconds.




SM> Sure, there's no reason why we couldn't do this.  Of course, even
SM> idle Haskell processes will be ticking away in the background, so
SM> there's a reason not to make the interval too short.  What do  
SM> you think is reasonable?


Simon, the talk is about changing the GHC _tick_, which is the _minimum_
possible context switch interval



SM> Yes, I know.

In this case, why do you say that idle processes will be ticking in the
background? It's entirely up to program writers/users whether they want
to decrease this interval. By default, switches will still occur every
20 ms; a smaller tick will have no impact on ordinary programs as long
as the default switching time is not changed.


It will have an impact, because the tick signal has to be delivered to 
the process and the signal handler run.  The impact on CPU time is 
small, but could be noticeable if we were to choose too small an 
interval.  Furthermore, the ticker prevents idle Haskell processes from 
being completely swapped out (that problem already exists).


Cheers,
Simon



[Haskell-cafe] Re: Joels Time Leak

2006-01-04 Thread Simon Marlow

Bulat Ziganshin wrote:


Tuesday, January 03, 2006, 7:43:21 PM, you wrote:



The minimum time between context switches is 20 milliseconds.

Is there any good reason why 0.02 seconds is the best that you can get
here? Couldn't GHC's internal timer tick at a _much_ faster rate (like
50-100µs or so)?



SM> Sure, there's no reason why we couldn't do this.  Of course, even
SM> idle Haskell processes will be ticking away in the background, so
SM> there's a reason not to make the interval too short.  What do  
SM> you think is reasonable?


Simon, the talk is about changing the GHC _tick_, which is the _minimum_
possible context switch interval. So we want to decrease this tick and
retain the current 20 ms _default_ switch interval. This will make it
possible to decrease the switch interval for programs that really need
it, which is currently impossible.


Yes, I know.

Cheers,
Simon




Re: Re[2]: [Haskell-cafe] Re: Joels Time Leak

2006-01-04 Thread Joel Reymont

This is my latest version. Based on Don's tweaks.

{-# INLINE sequ #-}

sequ :: (b -> a) -> PU a -> (a -> PU b) -> PU b
sequ a b c | a `seq` b `seq` c `seq` False = undefined
sequ f pa k = PU fn1 fn2 fn3
  where
    {-# INLINE fn1 #-}
    fn1 ptr b =
        case f b of
          a -> case k a of
                 pb -> do ptr' <- appP pa ptr a
                          appP pb ptr' b
    {-# INLINE fn2 #-}
    fn2 ptr = do (a, ptr') <- appU pa ptr
                 case k a of pb -> appU pb ptr'
    {-# INLINE fn3 #-}
    fn3 b = case f b of
              a -> case k a of
                     pb -> do sz1 <- appS pa a
                              sz2 <- appS pb b
                              return $! sz1 + sz2
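
For reference, the code above assumes a Ptr-based pickler record along
these lines - a sketch inferred from the uses of appP/appU/appS, not
necessarily the exact definition in Joel's Pickle module:

data PU a = PU
    { appP :: Ptr Word8 -> a -> IO (Ptr Word8)  -- pickle, return advanced pointer
    , appU :: Ptr Word8 -> IO (a, Ptr Word8)    -- unpickle, return advanced pointer
    , appS :: a -> IO Int                       -- pickled size in bytes
    }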

On Jan 4, 2006, at 4:18 PM, Bulat Ziganshin wrote:


Have you tried to inline it? And all the other pickling combinators?

The problem is that when you write

put (Cmd a b) = do putWord16 a; putWord32 b

and inline putWord16/putWord32, you can be sure that you will get
sequencing for free. But what is a pickling combinator? It's a
higher-order function which combines drivers for simple types like
Word16 into a final driver that can read an entire Command. The
principal question is whether this final driver will be interpreted,
i.e. executed as a large number of nested calls to pickler-combination
functions, or compiled, i.e. executed as a simple sequence of getByte
calls which then builds the final value. In the first case your program
will spend all its time in these combinator function calls, so you will
not get much benefit from using Ptrs in the elementary picklers.


--
http://wagerlabs.com/






Re[2]: [Haskell-cafe] Re: Joels Time Leak

2006-01-04 Thread Bulat Ziganshin
Hello Joel,

Wednesday, January 04, 2006, 12:42:24 AM, you wrote:

JR> contribute to my delays and timeouts. There are also quite a few
JR> unanswered questions at the moment (why is 'sequ' slow?

Have you tried to inline it? And all the other pickling combinators?

The problem is that when you write

put (Cmd a b) = do putWord16 a; putWord32 b

and inline putWord16/putWord32, you can be sure that you will get
sequencing for free. But what is a pickling combinator? It's a
higher-order function which combines drivers for simple types like
Word16 into a final driver that can read an entire Command. The
principal question is whether this final driver will be interpreted,
i.e. executed as a large number of nested calls to pickler-combination
functions, or compiled, i.e. executed as a simple sequence of getByte
calls which then builds the final value. In the first case your program
will spend all its time in these combinator function calls, so you will
not get much benefit from using Ptrs in the elementary picklers.

Because of this issue I told you that pickler combinators can't
guarantee performance, in contrast to the Binary package, which uses an
artless approach to combining individual "picklers" - separate get and
put functions - so that each one becomes a straightforward imperative
program, as opposed to a tuple carrying several functional values.
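
For illustration, a record pickler in this combinator style looks
roughly like the following - a sketch in the style of Kennedy's paper,
where Cmd, word16, word32 and lift are assumed names, not Joel's actual
code:

data Cmd = Cmd Word16 Word32

cmdPU :: PU Cmd
cmdPU = sequ (\(Cmd a _) -> a) word16 $ \a ->
        sequ (\(Cmd _ b) -> b) word32 $ \b ->
        lift (Cmd a b)

Unless sequ and the elementary picklers are inlined, running cmdPU
means interpreting this nest of closures for every packet.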

JR> does the scheduler need to be tuned?)

try something like this:

forever $ do
  recvPacket
  withMVar global $ \_ -> do
    unzip
    unpickle
    runScript
    sendAnswer
  yield

This will ensure that commands are processed sequentially; I think it's
the best you can do in this program: all the tasks inside the lock are
CPU-bound, so it is better to finish them in one thread before going on
to another.

In production code you will also need to guard the whole withMVar block
with a small timeout (say, 0.02 s).
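
A sketch of that guard using System.Timeout, where timeout takes
microseconds and returns Nothing on expiry (everything except timeout
and withMVar is an assumed name):

import Control.Concurrent.MVar (MVar, withMVar)
import System.Timeout (timeout)

guarded :: MVar () -> IO a -> IO (Maybe a)
guarded global body = timeout 20000 (withMVar global (const body))  -- 0.02 s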


-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]





[Haskell-cafe] Re: Joels Time Leak

2006-01-04 Thread Simon Marlow

Joel Reymont wrote:


I don't think CPU usage is the issue. An individual thread will take  a 
fraction of a second to deserialize a large packet.


It's a combination of CPU usage by the pickler and GC load.

Those 50k packets take 0.03 seconds to unpickle (version of unstuff.hs
to measure that is attached).  With 100 threads running, even with a
completely fair scheduler the time taken for one thread to unpickle that
packet is going to be 3 seconds.

If optimisation is turned on, the time to unpickle that packet goes down
to 0.007 seconds (on my machine).  So, it should be more like 400
threads before a fair scheduler would run into problems, but that
doesn't take into account GC load, which increases with more threads
running, so in fact you still run into trouble with 50 threads.  You can
reduce the GC load using RTS options: eg. +RTS -H256m and that will
reduce the number of timeouts you get.
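
For example, assuming the test program below is compiled to a binary
called unstuff and given a packet file (both names hypothetical):

  $ ./unstuff packets.dat +RTS -H256m -RTS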

GHC's scheduler may not be completely fair, but I haven't found anything
gratuitous in my investigations.  Sometimes the time between context
switches is more like 0.04 seconds instead of 0.02 seconds, and I still
don't understand exactly why, but that's not a serious issue.  Sometimes
a thread is unlucky enough to have to do a major GC during its
timeslice, so it doesn't get its fair share of CPU, but the effect is
random and therefore amortized (this isn't a realtime system, after all).

What can you do about this?

   (a) improve performance of the unpickler.  As you say, a 2x boost
   here will double the number of threads you can run in parallel
   in the time limits you have.

   (b) try to reduce your heap residency, which will reduce GC load
   which again means you can run more threads in parallel without
   hitting your limits.

   (c) try to manage your latency better, by limiting the number of
   threads that try to unpickle in parallel (see the sketch below).
   You may reduce the GC load this way, too.
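
A sketch of (c), bounding the number of concurrent unpicklers with a
quantity semaphore (QSem and bracket_ are standard; limitWith is an
assumed name):

import Control.Concurrent.QSem (QSem, newQSem, waitQSem, signalQSem)
import Control.Exception (bracket_)

-- run an action while holding one unit of the semaphore
limitWith :: QSem -> IO a -> IO a
limitWith sem = bracket_ (waitQSem sem) (signalQSem sem)

Create the semaphore once, say 'sem <- newQSem 4', and wrap each call
as 'limitWith sem (unstuff packet 0 size)'.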

Or failing that, just get faster hardware.  Or more CPUs, and use GHC's
new SMP support.

I'm surprised if your real application is this CPU-bound, though.  The
network communication latency should mean you can run a lot more
threads, provided you can improve that pickler so it isn't the bottleneck.

I don't have a lot of time to investigate the unpickling code in detail,
but I have worked on similar problems in the past and I know that the
unpickler in GHC is very fast, for example.  It is derived from the
original nhc98 interface, with tweaks by me to improve performance, and
later NewBinary was derived from it.  I haven't measured NewBinary's
performance relative to GHC's Binary library, but I don't expect there
to be much difference.

Cheers,
Simon

module Main where

import System.IO
import System.Time
import System.Environment
import Control.Monad
import Control.Concurrent
import Control.Exception
import Foreign
import System.IO.Unsafe (unsafePerformIO) -- for 'lock' below; older GHCs re-exported it from Foreign
import Pickle
import Endian
import Util
import ZLib
import Records
import Prelude hiding (read)
import Text.Printf

main = 
do args <- getArgs
   process (head args) 100
   waitToFinish

{-# NOINLINE lock #-}
lock :: MVar ()
lock = unsafePerformIO $ newMVar ()

trace s = withMVar lock $ const $ putStrLn s

process _ 0 = return ()
process file n = 
do h <- openBinaryFile file ReadMode
   forkChild $ read_ h 
   process file (n - 1)

read_ :: Handle -> IO ()
read_ h = 
do cmd <- read h (\_ -> return ()) -- lots of ALERTs
   -- you should not get any alerts if you pass in trace
   -- below and comment the line above. the lock synch seems 
   -- to have a magical effect
   -- cmd <- read h trace
   eof <- hIsEOF h
   unless eof $ read_ h 

read :: Handle -> (String -> IO ()) -> IO Command
read h trace =
    do TOD time1 _ <- getClockTime
       allocaBytes 4 $ \p1 ->
         do hGetBuf h p1 4
            TOD time2 _ <- getClockTime
            (size', _) <- unpickle endian32 p1 0
            TOD time3 _ <- getClockTime
            let size = fromIntegral $ size' - 4
            allocaBytes size $ \packet ->
              do TOD time4 _ <- getClockTime
                 hGetBuf h packet size
                 TOD time5 _ <- getClockTime
                 cmd <- unstuff packet 0 size
                 TOD time6 _ <- getClockTime
                 trace $ "read: " ++ cmdDesc cmd ++ ": "
                       ++ show (time6 - time1) ++ "s: "
                       ++ show (time2 - time1) ++ "s, "
                       ++ show (time3 - time2) ++ "s, "
                       ++ show (time4 - time3) ++ "s, "
                       ++ show (time5 - time4) ++ "s, "
                       ++ show (time6 - time5) ++ "s"
                 when (time6 - time5 > 3) $
                   fail $ "RED ALERT: time: " ++ show (time6 - time5)
                        ++ "s, size: " ++ show size'
                        -- the archived message is truncated here; the
                        -- alert presumably ends with the command name:
                        ++ ", cmd: " ++ cmdDesc cmd
                 return cmd

Re[2]: [Haskell-cafe] Re: Joels Time Leak

2006-01-04 Thread Bulat Ziganshin
Hello Simon,

Tuesday, January 03, 2006, 7:43:21 PM, you wrote:

>> The minimum time between context switches is 20 milliseconds.
>> 
>> Is there any good reason why 0.02 seconds is the best that you can get
>> here? Couldn't GHC's internal timer tick at a _much_ faster rate (like
>> 50-100µs or so)?

SM> Sure, there's no reason why we couldn't do this.  Of course, even
SM> idle Haskell processes will be ticking away in the background, so
SM> there's a reason not to make the interval too short.  What do  
SM> you think is reasonable?

Simon, the talk is about changing the GHC _tick_, which is the _minimum_
possible context switch interval. So we want to decrease this tick and
retain the current 20 ms _default_ switch interval. This will make it
possible to decrease the switch interval for programs that really need
it, which is currently impossible.


-C[s]:
Sets the context switch interval to s seconds. A context switch will
occur at the next heap block allocation after the timer expires (a
heap block allocation occurs every 4k of allocation). With -C0 or -C,
context switches will occur as often as possible (at every heap block
allocation). By default, context switches occur every 20ms. Note that
GHC's internal timer ticks every 20ms, and the context switch timer is
always a multiple of this timer, so 20ms is the maximum granularity
available for timed context switches.
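
For example, to request a context switch at every heap block
allocation, per the documentation above (program name hypothetical):

  $ ./bots +RTS -C0 -RTS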


-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]





RE: [Haskell-cafe] Re: Joels Time Leak

2006-01-04 Thread Simon Marlow
On 03 January 2006 17:32, Chris Kuklewicz wrote:

> Thanks for the answer, but I should have written a longer comment. I have
> added such a longer comment below:
> 
> Simon Marlow wrote:
>> Chris Kuklewicz wrote:
>> 
>>> Another comment: between 1000's of threads and writing a custom
>>> continuation based scheduler, what about using a thread pool?  Does
>>> anyone have a library with a "fork-IO-Pool" command?
>> 
>> 
>> You don't need a thread pool, because threads are so cheap.  Thread
>> pools are just a workaround for lack of lightweight concurrency.
>> 
>> Cheers,
>> Simon
>> 
> 
> Since the round-robin scheduler has (0.02 * N) seconds of delay for N
> threads, then one could trade off latency between time spent waiting
> for the thread pool to start a job and time spend running the job and
> getting interrupted.
> 
> In the limit of 1 worker thread, all the latency is waiting to get
> run, and there are no interruptions, so the time taken *while
> running* is very short.  With 10 threads, there can be a delay to
> start, and each interruption adds 0.2 seconds to the job's run time
> once it has started. 
> 
> For a server, the client requests queue up and wait for room in the
> thread pool, and the pool is kept small enough that the
> round-robin-scheduler-delay keeps requests from timing out while being
> serviced.  Otherwise 1000 client requests would cause 20 seconds of
> reschedule penalty for all threads and they could all timeout.  With a
> thread pool, one can drop threads that have been waiting for too long
> instead of running them, so those threads will timeout. But the pool
> keeps servicing at least some of the client requests on time.
> 
> All hypothetical to me, of course.

I see - you want to use a thread pool as a way of restricting
parallelism, so as to control latency.  Good idea, it would make a
useful library.
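
A minimal sketch of such a pool - a job channel drained by n worker
threads (forkIOPool is a hypothetical name, not an existing function):

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Monad (forever, join, replicateM_)

-- returns a 'submit' action; at most n jobs run at once, the rest
-- queue up in the Chan (where a drop/timeout policy could be applied)
forkIOPool :: Int -> IO (IO () -> IO ())
forkIOPool n = do
    jobs <- newChan
    replicateM_ n $ forkIO $ forever $ join (readChan jobs)
    return (writeChan jobs)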

Cheers,
Simon


Re: [Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Joel Reymont
The timeleak code is just a repro case. In real life I'm reading from  
sockets as opposed to a file.


All I'm trying to do is run poker bots. They talk to the server and  
play poker. Of course some events are more important than others, a  
request to make a bet is more important than, say, a table update. I  
do need to run as many poker bots as I can.


I think that my customer's goal of 4,000 bots is unattainable in a  
single app. It's probably possible per machine. Overall, I find this  
too complex to manage with Haskell as there are many factors that can  
contribute to my delays and timeouts. There are also quite a few  
unanswered questions at the moment (why is 'sequ' slow? does the  
scheduler need to be tuned?) that leave me scratching my head.


On Jan 3, 2006, at 9:17 PM, S. Alexander Jacobson wrote:

You should be grouping incoming events into queues by expected  
workload/event.  Then you can give the client fairly reliable  
information about how long it will have to wait based on the size  
of the queue on which event is waiting.


And if you have no way to differentiate between event workloads a  
priori then you really can't be giving clients response guarantees  
and need to rethink your business logic.


--
http://wagerlabs.com/







Re: [Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread S. Alexander Jacobson

Joel,

In most cases, it just doesn't make sense to run 1000 threads 
simultaneously that are all bottlenecked on the same resource (e.g. 
CPU/memory). See e.g. http://www.eecs.harvard.edu/~mdw/proj/seda/


You should be grouping incoming events into queues by expected 
workload/event.  Then you can give the client fairly reliable 
information about how long it will have to wait based on the size of 
the queue on which event is waiting.


And if you have no way to differentiate between event workloads a 
priori then you really can't be giving clients response guarantees and 
need to rethink your business logic.
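
A sketch of that grouping, with one queue per workload class (Workload
and classify are assumed names; each queue would be drained by its own
small pool of workers, and the expected wait is roughly queue length
times per-event cost):

import Control.Concurrent.Chan (Chan, writeChan)

data Workload = Light | Heavy

dispatch :: (a -> Workload) -> Chan a -> Chan a -> a -> IO ()
dispatch classify lightQ heavyQ ev =
    case classify ev of
      Light -> writeChan lightQ ev
      Heavy -> writeChan heavyQ ev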


FYI: I actually created a Haskell application server based on this 
logic called HAppS (see http://happs.org) and am in the process of 
getting binaryIO added to it.


-Alex-

__
S. Alexander Jacobson tel:917-770-6565 http://alexjacobson.com






On Tue, 3 Jan 2006, Joel Reymont wrote:


Simon,

I don't think CPU usage is the issue. An individual thread will take a 
fraction of a second to deserialize a large packet. The issue is that, as you 
pointed out, you can get alerts even with 50 threads. Those fractions of a 
second add up in a certain way that's detrimental to the performance of the 
app.


The timeleak code uses Ptr Word8 to pickle which should be very efficient. I 
believe the delay comes from the way 'sequ' is compiled by GHC. I'll take the 
liberty of quoting Andrew Kennedy (your colleague from MS Research) who wrote 
the picklers:


--
My original pickler implementation was for SML. It was used in the MLj 
compiler, and is still used in the SML.NET compiler, and has acceptable 
performance (few ms pickling/unpickling for typical intermediate language 
object files). I must admit that I've not used the Haskell variant in anger. 
Apart from the inherent slowdown associated with laziness, is there a 
particular reason for poor performance?

--

'sequ' by itself does not seem like a big deal, but when used to model records
it builds a large nested lambda-list, and I don't think that list is being
compiled efficiently. I would appreciate it if you could look at that and issue
a verdict, now that Andrew confirms having used the picklers in a real-life
environment without major problems.


Suppose I chose a different implementation of binary IO and disposed of 
pickler combinators.  Suppose I gained a 2x speed-up by doing so. I would now 
be getting alerts with 100 threads instead of 50, no? That's still far from 
ideal.


Joel

On Jan 3, 2006, at 4:43 PM, Simon Marlow wrote:

The reason things are the way they are is that a large number of *running* 
threads is not a workload we've optimised for.  In fact, Joel's program is 
the first one I've seen with a lot of running threads, apart from our 
testsuite.  And I suspect that when Joel uses a better binary I/O 
implementation a lot of that CPU usage will disappear.


--
http://wagerlabs.com/







Re: [Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Tomasz Zielonka
On Tue, Jan 03, 2006 at 02:30:53PM +, Simon Marlow wrote:
> I measured the time taken to unpickle those large 50k packets as 0.03
> seconds on my amd64 box (program compiled *without* optimisation), so
> the thread can get descheduled twice while unpickling a large
> packet, giving a >4s delay with 100 threads running.

I might have made an error when counting the packets. I simply placed a
putStrLn in "read", but some of the packets are nested
(SrvCompressedCommands), so "read" is called more than once for a
top-level packet.

Best regards
Tomasz

-- 
I am searching for programmers who are good at least in
(Haskell || ML) && (Linux || FreeBSD || math)
for work in Warsaw, Poland


[Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Chris Kuklewicz
Thanks for the answer, but I should have written a longer comment. I have
added such a longer comment below:

Simon Marlow wrote:
> Chris Kuklewicz wrote:
> 
>> Another comment: between 1000's of threads and writing a custom
>> continuation based scheduler, what about using a thread pool?  Does
>> anyone have a library with a "fork-IO-Pool" command?
> 
> 
> You don't need a thread pool, because threads are so cheap.  Thread
> pools are just a workaround for lack of lightweight concurrency.
> 
> Cheers,
> Simon
> 

Since the round-robin scheduler has (0.02 * N) seconds of delay for N
threads, then one could trade off latency between time spent waiting for
the thread pool to start a job and time spend running the job and
getting interrupted.

In the limit of 1 worker thread, all the latency is waiting to get run,
and there are no interruptions, so the time taken *while running* is
very short.  With 10 threads, there can be a delay to start, and each
interruption adds 0.2 seconds to the job's run time once it has started.

For a server, the client requests queue up and wait for room in the
thread pool, and the pool is kept small enough that the
round-robin-scheduler-delay keeps requests from timing out while being
serviced.  Otherwise 1000 client requests would cause 20 seconds of
reschedule penalty for all threads and they could all timeout.  With a
thread pool, one can drop threads that have been waiting for too long
instead of running them, so those threads will timeout. But the pool
keeps servicing at least some of the client requests on time.

All hypothetical to me, of course.

-- 
Chris


Re: [Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Joel Reymont

Simon,

I don't think CPU usage is the issue. An individual thread will take  
a fraction of a second to deserialize a large packet. The issue is  
that, as you pointed out, you can get alerts even with 50 threads.  
Those fractions of a second add up in a certain way that's  
detrimental to the performance of the app.


The timeleak code uses Ptr Word8 to pickle which should be very  
efficient. I believe the delay comes from the way 'sequ' is compiled  
by GHC. I'll take the liberty of quoting Andrew Kennedy (your  
colleague from MS Research) who wrote the picklers:


--
My original pickler implementation was for SML. It was used in the  
MLj compiler, and is still used in the SML.NET compiler, and has  
acceptable performance (few ms pickling/unpickling for typical  
intermediate language object files). I must admit that I've not used  
the Haskell variant in anger. Apart from the inherent slowdown  
associated with laziness, is there a particular reason for poor  
performance?

--

'sequ' by itself does not seem like a big deal, but when used to model
records it builds a large nested lambda-list, and I don't think that
list is being compiled efficiently. I would appreciate it if you could
look at that and issue a verdict, now that Andrew confirms having used
the picklers in a real-life environment without major problems.


Suppose I chose a different implementation of binary IO and disposed  
of pickler combinators.  Suppose I gained a 2x speed-up by doing so.  
I would now be getting alerts with 100 threads instead of 50, no?  
That's still far from ideal.


Joel

On Jan 3, 2006, at 4:43 PM, Simon Marlow wrote:

The reason things are the way they are is that a large number of  
*running* threads is not a workload we've optimised for.  In fact,  
Joel's program is the first one I've seen with a lot of running  
threads, apart from our testsuite.  And I suspect that when Joel  
uses a better binary I/O implementation a lot of that CPU usage  
will disappear.


--
http://wagerlabs.com/







Re: [Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Sebastian Sylvan
On 1/3/06, Simon Marlow <[EMAIL PROTECTED]> wrote:
> On 03 January 2006 15:37, Sebastian Sylvan wrote:
>
> > On 1/3/06, Simon Marlow <[EMAIL PROTECTED]> wrote:
> >> Tomasz Zielonka wrote:
> >>> On Thu, Dec 29, 2005 at 01:20:41PM +, Joel Reymont wrote:
> >>>
>  Why does it take a fraction of a second for 1 thread to unpickle
>  and several seconds per thread for several threads to do it at the
>  same time? I think this is where the mystery lies.
> >>>
> >>>
> >>> Have you considered any of this:
> >>>
> >>> - too big memory pressure: more memory means more frequent and more
> >>>   expensive GCs, 1000 threads using so much memory means bad cache
> >>>   performance
> >>> - a deficiency of GHC's thread scheduler - giving too much time to
> >>>   one thread steals it from others (Simons, don't get angry at me -
> >>>   I am probably wrong here ;-)
> >>
> >> I don't think there's anything really strange going on here.
> >>
> >> The default context switch interval in GHC is 0.02 seconds, measured
> >> in CPU time by default.  GHC's scheduler is strictly round-robin, so
> >> therefore with 100 threads in the system it can be 2 seconds between
> >> a thread being descheduled and scheduled again.
> >
> > According to this:
> > http://www.haskell.org/ghc/docs/latest/html/users_guide/sec-using-parallel.html#parallel-rts-opts
> >
> > The minimum time between context switches is 20 milliseconds.
> >
> > Is there any good reason why 0.02 seconds is the best that you can get
> > here? Couldn't GHC's internal timer tick at a _much_ faster rate (like
> > 50-100µs or so)?
>
> Sure, there's no reason why we couldn't do this.  Of course, even idle 
> Haskell processes will be ticking away in the background, so there's a reason 
> not to make the interval too short.  What do you think is reasonable?

Not sure. Could it be configurable via a command line flag? If the
profiler could report the % of time spent doing context switches (or
maybe it already does?) the user could fine tune this to his liking.

For the (hypothetical) real-time simulation app I would *guess* that
something along the lines of 500µs would be more than enough to not
introduce any unnecessary lag in rendering (seeing as the target frame
time would be around 15ms, and you'd want to have a good amount of
context switches to allow some of the next frame to be computed in
parallel to all the render-surface optimizations etc. for the current
frame).

But then again, there may be other apps which need it to be even
lower.. So a command line flag sure would be nice.

/S

--
Sebastian Sylvan
+46(0)736-818655
UIN: 44640862


RE: [Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Simon Marlow
On 03 January 2006 15:37, Sebastian Sylvan wrote:

> On 1/3/06, Simon Marlow <[EMAIL PROTECTED]> wrote:
>> Tomasz Zielonka wrote:
>>> On Thu, Dec 29, 2005 at 01:20:41PM +, Joel Reymont wrote:
>>> 
 Why does it take a fraction of a second for 1 thread to unpickle
 and several seconds per thread for several threads to do it at the
 same time? I think this is where the mystery lies.
>>> 
>>> 
>>> Have you considered any of this:
>>> 
>>> - too big memory pressure: more memory means more frequent and more
>>>   expensive GCs, 1000 threads using so much memory means bad cache
>>>   performance
>>> - a deficiency of GHC's thread scheduler - giving too much time to
>>>   one thread steals it from others (Simons, don't get angry at me -
>>>   I am probably wrong here ;-)
>> 
>> I don't think there's anything really strange going on here.
>> 
>> The default context switch interval in GHC is 0.02 seconds, measured
>> in CPU time by default.  GHC's scheduler is strictly round-robin, so
>> therefore with 100 threads in the system it can be 2 seconds between
>> a thread being descheduled and scheduled again.
> 
> According to this:
> http://www.haskell.org/ghc/docs/latest/html/users_guide/sec-using-parallel.html#parallel-rts-opts
> 
> The minimum time between context switches is 20 milliseconds.
> 
> Is there any good reason why 0.02 seconds is the best that you can get
> here? Couldn't GHC's internal timer tick at a _much_ faster rate (like
> 50-100µs or so)?

Sure, there's no reason why we couldn't do this.  Of course, even idle Haskell 
processes will be ticking away in the background, so there's a reason not to 
make the interval too short.  What do you think is reasonable?

> Apart from meaning big trouble for applications with a large number of
> threads (such as Joel's), it'll also make life difficult for any sort of
> real-time application. For instance if you want to use HOpenGL to
> render a simulation engine and you split it up into tons of concurrent
> processes (say one for each dynamic entity in the engine), the 20ms
> granularity would make it quite hard to achieve 60 frames per second
> in that case...

The reason things are the way they are is that a large number of *running* 
threads is not a workload we've optimised for.  In fact, Joel's program is the 
first one I've seen with a lot of running threads, apart from our testsuite.  
And I suspect that when Joel uses a better binary I/O implementation a lot of 
that CPU usage will disappear.

Cheers,
Simon


Re: [Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Chris Kuklewicz
General follow-up questions:

Would adding Control.Concurrent.yield commands cause a context switch
more often than every 0.02 seconds?

Is there any command in GHC to allow a thread to prevent itself from
being rescheduled while computing something?

Another comment: between 1000's of threads and writing a custom
continuation based scheduler, what about using a thread pool?  Does
anyone have a library with a "fork-IO-Pool" command?
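
On the first question: yield should force a switch right away whenever
other threads are runnable, without waiting for the 20ms tick. A
sketch, where processOne is an assumed name:

import Control.Concurrent (yield)

-- offer a voluntary switch point after each work item
processAll :: (a -> IO ()) -> [a] -> IO ()
processAll processOne = mapM_ (\p -> processOne p >> yield)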

-- 
Chris

Simon Marlow wrote:
> Tomasz Zielonka wrote:
> 
>> On Thu, Dec 29, 2005 at 01:20:41PM +, Joel Reymont wrote:
>>
>>> Why does it take a fraction of a second for 1 thread to unpickle and 
>>> several seconds per thread for several threads to do it at the same 
>>> time? I think this is where the mystery lies.
>>
>>
>>
>> Have you considered any of this:
>>
>> - too big memory pressure: more memory means more frequent and more
>>   expensive GCs, 1000 threads using so much memory means bad cache
>>   performance
>> - a deficiency of GHC's thread scheduler - giving too much time to
>>   one thread steals it from others (Simons, don't get angry at me - I am
>>   probably wrong here ;-)
> 
> 
> I don't think there's anything really strange going on here.
> 
> The default context switch interval in GHC is 0.02 seconds, measured in
> CPU time by default.  GHC's scheduler is strictly round-robin, so
> therefore with 100 threads in the system it can be 2 seconds between a
> thread being descheduled and scheduled again.
> 
> I measured the time taken to unpickle those large 50k packets as 0.03
> seconds on my amd64 box (program compiled *without* optimisation), so
> the thread can get descheduled twice while unpickling a large
> packet, giving a >4s delay with 100 threads running.
> 
> The actual context switch interval often seems to be larger than 0.02
> seconds; I'm not sure exactly why this is, it might be due to delays in
> the OS delivering the signal.  This does mean that the timeleak program
> reports alerts for as little as 50 threads, though.
> 
> Cheers,
> Simon
> 
> 



Re: [Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Sebastian Sylvan
On 1/3/06, Simon Marlow <[EMAIL PROTECTED]> wrote:
> Tomasz Zielonka wrote:
> > On Thu, Dec 29, 2005 at 01:20:41PM +, Joel Reymont wrote:
> >
> >>Why does it take a fraction of a second for 1 thread to unpickle and
> >>several seconds per thread for several threads to do it at the same
> >>time? I think this is where the mystery lies.
> >
> >
> > Have you considered any of this:
> >
> > - too big memory pressure: more memory means more frequent and more
> >   expensive GCs, 1000 threads using so much memory means bad cache
> >   performance
> > - a deficiency of GHC's thread scheduler - giving too much time to
> >   one thread steals it from others (Simons, don't get angry at me - I am
> >   probably wrong here ;-)
>
> I don't think there's anything really strange going on here.
>
> The default context switch interval in GHC is 0.02 seconds, measured in
> CPU time by default.  GHC's scheduler is strictly round-robin, so
> therefore with 100 threads in the system it can be 2 seconds between a
> thread being descheduled and scheduled again.

According to this:
http://www.haskell.org/ghc/docs/latest/html/users_guide/sec-using-parallel.html#parallel-rts-opts

The minimum time between context switches is 20 milliseconds.

Is there any good reason why 0.02 seconds is the best that you can get
here? Couldn't GHC's internal timer tick at a _much_ faster rate (like
50-100µs or so)?

Apart from meaning big trouble for applications with a large number of
threads (such as Joel's), it'll also make life difficult for any sort of
real-time application. For instance, if you want to use HOpenGL to
render a simulation engine and you split it up into tons of concurrent
processes (say, one for each dynamic entity in the engine), the 20ms
granularity would make it quite hard to achieve 60 frames per second
in that case...

/S

--
Sebastian Sylvan
+46(0)736-818655
UIN: 44640862


[Haskell-cafe] Re: Joels Time Leak

2006-01-03 Thread Simon Marlow

Tomasz Zielonka wrote:

On Thu, Dec 29, 2005 at 01:20:41PM +, Joel Reymont wrote:

Why does it take a fraction of a second for 1 thread to unpickle and  
several seconds per thread for several threads to do it at the same  
time? I think this is where the mystery lies.



Have you considered any of this:

- too big memory pressure: more memory means more frequent and more
  expensive GCs, 1000 threads using so much memory means bad cache
  performance
- a deficiency of GHC's thread scheduler - giving too much time to
  one thread steals it from others (Simons, don't get angry at me - I am
  probably wrong here ;-)


I don't think there's anything really strange going on here.

The default context switch interval in GHC is 0.02 seconds, measured in 
CPU time by default.  GHC's scheduler is strictly round-robin, so 
therefore with 100 threads in the system it can be 2 seconds between a 
thread being descheduled and scheduled again.


I measured the time taken to unpickle those large 50k packets as 0.03 
seconds on my amd64 box (program compiled *without* optimisation), so 
the thread can get descheduled twice while unpickling a large 
packet, giving a >4s delay with 100 threads running.


The actual context switch interval often seems to be larger than 0.02 
seconds; I'm not sure exactly why this is, it might be due to delays in 
the OS delivering the signal.  This does mean that the timeleak program 
reports alerts for as little as 50 threads, though.


Cheers,
Simon
