Re: [Haskell-cafe] Parallel weirdness [new insights]

2008-04-20 Thread Andrew Coppin
OK, well I now have so much data sitting in from of me I don't even know 
*what* I'm seeing any more. I have made several significant discoveries 
though...


Consider the following:

 msort [] = []
 msort [x] = [x]
 msort xs =
   let
 xs0 = msort (split0 xs)
 xs1 = msort (split1 xs)
   in merge xs0 xs1

This takes roughly 14 seconds to sort a list of one million Word32 
values. If I now change the final line to read


 in listSeq rwhnf xs0 `seq` listSeq rwhnf xs1 `seq` merge xs0 xs1

it now takes 8 seconds to do the same job.

Notice that this is still completely sequential execution. It's just 
executing in a different order. (And, at first glance, doing slightly 
more work.) Of all the benchmarks I've performed, I have yet to find 
anything that goes faster than this. If I make it so that xs0 is 
computed in parallel with xs1 instead of in series, then it goes at 
roughly the same speed (but with more variation) if the number of real 
threads is 1. If you add more real threads, execution slows down. (Go 
figure!) I was expecting running parallel at just the top few levels and 
then switching to pure sequential for the lower levels to give the best 
performance. But the numbers I have seem to say that more parallel = 
slower, with 100% sequential giving me the fastest time of all.


The next insight happens when you look at the GC statistics. Both the 
unmarked and the explicitly sequential program are giving me roughly 55% 
GC time and 45% user time. (!!) Obviously this is a Very Bad Thing. I 
discovered that simply by adding -H200m to the command line, I can make 
both programs speed up by about 20% or so. (They then drop down to 
roughly 25% GC time. Adding more RAM doesn't seem to make any difference.)


I had assumed that the explicitly sequential program was faster because 
it was somehow demanding less GC time, but that doesn't appear to be the 
case - both GC time and user time are lower for the explicitly 
sequential version. And adding more heap space doesn't make the (large) 
speed difference go away. Is the strictness of the seq operator making 
GHC come up with different a Core implementation for this function or 
something? I have no idea.


With the extra heap space, the speed difference between the sequential 
and parallel programs becomes smaller. The sequential version *is* still 
faster, however. I have no explanation for why that might be. Adding 
more heap also seems to make the runtimes more variable. (I run each 
test 8 times. One test, the fastest run was 7 seconds and the slowest 
was 11 seconds. That's quite a variation. The sequential algorithm only 
varies by a few milliseconds each time.)


In short, it seems my little sorting algorithm test is *actually* just 
stressing out the GC engine, and I'm really benchmarking different GC 
settings. :-(


Questions:

1. Does running the GC force all threads to stop? I know some GC designs 
do this, but I have no idea how the one GHC implements works.


2. Is the GC still single-threaded? (GHC 6.8.2 here.)

3. Is there any way for a running Haskell program to find out how much 
heap space is currently allocated / used / free? I know you can find out 
how much wall time and CPU time you've used, but I couldn't find 
anything for RAM usage.


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness [new insights]

2008-04-20 Thread Brandon S. Allbery KF8NH


On Apr 20, 2008, at 15:41 , Andrew Coppin wrote:

1. Does running the GC force all threads to stop? I know some GC  
designs do this, but I have no idea how the one GHC implements works.


2. Is the GC still single-threaded? (GHC 6.8.2 here.)


Full GC is single-threaded and stops the entire program, yes.  IIRC  
GHC's runtime tries to do incremental GC to minimize the need for a  
full GC.


3. Is there any way for a running Haskell program to find out how  
much heap space is currently allocated / used / free? I know you  
can find out how much wall time and CPU time you've used, but I  
couldn't find anything for RAM usage.


You're looking for heap profiling in the GHC manual.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon universityKF8NH


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness [new insights]

2008-04-20 Thread Brandon S. Allbery KF8NH


On Apr 20, 2008, at 15:51 , Brandon S. Allbery KF8NH wrote:

On Apr 20, 2008, at 15:41 , Andrew Coppin wrote:
3. Is there any way for a running Haskell program to find out how  
much heap space is currently allocated / used / free? I know you  
can find out how much wall time and CPU time you've used, but I  
couldn't find anything for RAM usage.


You're looking for heap profiling in the GHC manual.


Wait, no, I misread.  I don't know if or how a running program could  
introspect its own heap usage.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon universityKF8NH


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness [new insights]

2008-04-20 Thread Andrew Coppin

Bulat Ziganshin wrote:

3. Is there any way for a running Haskell program to find out how much
heap space is currently allocated / used / free?



i think it's possible by asking internal RTS vars. SM once suggested
to add to GHC library that provides official way to ask this info but
no volunteer was happen :)
  


The RTS can spit out aggregate data just with a CLI switch (and it 
doesn't appear to affect runtime noticably). You don't even need to 
compile with profiling enabled. This seems to indicate that the data is 
easy to collect, there's just no path for accessing it yet. I'm no GHC 
developer, but from the outside it appears to be a fairly simple 
problem. If I knew anything about the RTS, I'd volunteer myself. But I 
suspect this is one of those jobs that requires knowledge of C... :-(


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness [new insights]

2008-04-20 Thread Bryan O'Sullivan
Bulat Ziganshin wrote:

 yes. multi-threaded GC is planned gor next ghc version, afair

To be clear, it'll be a parallel GC, not a concurrent one.  The former
still stops all threads, but runs the collector on multiple cores at
once.  The latter performs collection while mutator threads are still
running, and is a lot trickier to implement.

(For a fine knee-slapping time, try reading a Java GC tuning guide.)

b
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness

2008-04-19 Thread Denis Bueno
On Sat, Apr 19, 2008 at 10:56 AM, Andrew Coppin
[EMAIL PROTECTED] wrote:
  Can anybody explain any of this behaviour? I have no idea what I'm
 benchmarking, but it certainly doesn't appear to be the performance of a
 parallel merge sort!

It would be much easier to draw sound conclusions if you would post your code.

-- 
 Denis
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness [code]

2008-04-19 Thread Andrew Coppin

Denis Bueno wrote:

It would be much easier to draw sound conclusions if you would post your code.
  


Erm... good point.

See attachments.

module Sort where

import Control.Parallel
import Control.Parallel.Strategies

split0 [] = []
split0 (x:xs) = x : split1 xs

split1 [] = []
split1 (x:xs) = split0 xs

merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys)
  | x  y = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys

msort [] = []
msort [x] = [x]
msort xs =
  let
xs0 = msort (split0 xs)
xs1 = msort (split1 xs)
  in merge xs0 xs1

msortP [] = []
msortP [x] = [x]
msortP xs =
  let
xs0 = msort (split0 xs)
xs1 = msort (split1 xs)
  in seqList rwhnf xs0 `par` seqList rwhnf xs1 `seq` merge xs0 xs1

list = [5,4,6,3,7,2,8,1,9,0]
module Time where

import System.CPUTime

time :: IO () - IO Integer
time fn = do
  t0 - getCPUTime
  fn
  t1 - getCPUTime
  return (t1 - t0)

ps_ms = 10 :: Integer

ps_s  = ps_ms * 1000 :: Integer
module Main where

import Data.Word
import System.IO
import GHC.Conc (numCapabilities)

import Sort
import Time

type Test = (String,[Word32])

random = iterate (\x - 1664525 * x + 1013904223) 7 :: [Word32]

test1m = (1M,take 100 random)
test2m = (2M,take 200 random)

type Algo = (String, [Word32] - [Word32])

algo_seq_msort = (MergeSortSeq, msort)
algo_par_msort = (MergeSortPar, msortP)

dump :: [Word32] - String
dump = unlines . map show

run_tests :: Algo - Test - IO ()
run_tests (name,fn) (title,xs) = do
  echo \n
  
  let f1 = name ++ -- ++ title ++ --In.txt
  echo   $ Writing ' ++ f1 ++ '...; hFlush stdout
  nullT - time (writeFile f1 (dump xs))
  echo $   Took  ++ show (nullT `div` ps_ms) ++  ms.\n
  
  mapM_
(\n - do
  let f2 = name ++ -- ++ title ++ --Out ++ show n ++ .txt
  echo   $ Writing ' ++ f2 ++ '...; hFlush stdout
  sortT - time (writeFile f2 (dump (fn xs)))
  echo $   Took  ++ show (sortT `div` ps_ms) ++  ms.\n
)
[1..8]

echo msg = do
  hPutStr stdout msg
  hPutStr stderr msg

main = do
  echo $ CPU threads =  ++ show numCapabilities ++ .\n
  
  mapM_
(\test -
  mapM_
(\algo - run_tests algo test)
[algo_seq_msort, algo_par_msort]
)
[test1m, test2m]
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness

2008-04-19 Thread Bulat Ziganshin
Hello Andrew,

Saturday, April 19, 2008, 6:56:10 PM, you wrote:

 OK, so just for fun, I decided to try implementing a parallel merge sort

coincedence - now i'm writing a parallel compression algorithm, very
much like parallel bzip2, but using ghc, of course

 Weird thing #1: The first time you sort the data, it takes a few
 seconds. The other 7 times, it takes a split second - roughly 100x 
 faster. Wuh?

this looks like disk caching effects. if data are read from disj on
first run and from disk cache on the next runs, this only means that
your algorithm works faster than reading its data from disk

 Weird thing #2: The parallel version runs *faster* than the sequential
 one in all cases - even with SMP disabled! (We're only talking a few 
 percent faster, but still.)

 Weird thing #3: Adding the -threaded compiler option makes
 *everything* run a few percent faster. Even with only 1 OS thread.

there are plenty of reasons: first, -threaded make i/o overlapped
with calculations. second, parallel version may exhibit better cpu
cache behavior - such as processing all data in cache before sending
it back to memory

 Weird thing #4: Adding -N2 makes *everything* slow down a few percent.
 In particular, Task Manager shows only one CPU core in use.

it's easy - your algorithm isn't really parallel, and you just forced
ghc to move it from one core to another. it's overhead of moving data
around :)

 Can anybody explain any of this behaviour? I have no idea what I'm 
 benchmarking, but it certainly doesn't appear to be the performance of a
 parallel merge sort!

there are many subtle effects making optimization much more interesting
than using simple schemas ;)  it's why i like it so much :))


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness

2008-04-19 Thread Murray Gross




I can't offer definite answers to your questions, but I can suggest a few 
issues you should consider:


1. Merge sort doesn't parallelize all that well--when the blocks are 
small, the parallelization overhead is large in comparison with the 
productive work that is to be done, and when the blocks get large, the 
amount of parallelization possible is not great. Quicksort and 
quickersort, of course, suffer from the same issue. The end result is that 
your timings will be heavily dependent on your hardware, software, and the 
properties of the particular data set you use for testing.


2. You need to account for I/O buffering (not only by your OP system in 
RAM, but by your disk controller)--after the first set of I/O operations, 
your data may be in buffers, so subsequent uses may retrieve data from 
buffers rather than from the disk itself. Similarly, you also have to take 
into account paging and cache issues, which could make the first run much 
slower than immediate subsequent runs.


3. A better benchmark would be provided by a counting sort, which does 
parallelize well (O(n * (n/k), where k is the number of processors, and n 
is the number of elements to be sorted). A major advantage of using a 
counting sort for benchmarking is that it runs slowly enough to make it 
relatively easy to compare sequential and parallel timings.


4. Depending on your system defaults, there may also be memory allocation 
issues that need to be taken into account (which could also easily cause 
the first run to be considerably slower than subsequent runs made 
immediately after the first).




Murray Gross
Brooklyn College



On Sat, 19 Apr 2008, Andrew Coppin wrote:

OK, so just for fun, I decided to try implementing a parallel merge sort 
using the seq and par combinators. My plan was to generate some psuedo-random 
data and time how long it takes to sort it. To try to account for lazy 
evaluation, what the program actually does is this:


1. Write the input data to disk without any sorting. (This ought to force it 
to be fully evaluated.)

2. Sort and save the data to disk 8 times. (So I can average the runtimes.)

This is done with two data sets - one with 1 million items, and another with 
2 million rows. Each data set is run through both the purely sequential 
algorithm and a simple parallel one. (Split the list in half, merge-sort each 
half in parallel, and then merge them.)


The results of this little benchmark utterly defy comprehension. Allow me to 
enumerate:


Weird thing #1: The first time you sort the data, it takes a few seconds. The 
other 7 times, it takes a split second - roughly 100x faster. Wuh?


Weird thing #2: The parallel version runs *faster* than the sequential one in 
all cases - even with SMP disabled! (We're only talking a few percent faster, 
but still.)


Weird thing #3: Adding the -threaded compiler option makes *everything* run 
a few percent faster. Even with only 1 OS thread.


Weird thing #4: Adding -N2 makes *everything* slow down a few percent. In 
particular, Task Manager shows only one CPU core in use.


Adding more than 2 OS threads makes everything slow down even further - but 
that's hardly surprising.


Can anybody explain any of this behaviour? I have no idea what I'm 
benchmarking, but it certainly doesn't appear to be the performance of a 
parallel merge sort!


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness

2008-04-19 Thread Jake Mcarthur

On Apr 19, 2008, at 9:56 AM, Andrew Coppin wrote:
Weird thing #3: Adding the -threaded compiler option makes  
*everything* run a few percent faster. Even with only 1 OS thread.


I had a similar thing happen to me once. (http://geekrant.wordpress.com/2007/11/29/holy-shmoly-ghc-does-some-magic-all-by-itself/ 
) It bothered me at the time, but as Simon Marlow said in the comments:


I wouldn’t believe these figures too much. I couldn’t repeat the  
same effects here, but since this is such a tiny fragment of code,  
small effects are magnified, and differences can sometimes appear  
and disappear depending the day of the week.


- Jake McArthur___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness

2008-04-19 Thread Jake Mcarthur

Okay, here are my thoughts:

On Apr 19, 2008, at 9:56 AM, Andrew Coppin wrote:

Weird thing #1: The first time you sort the data, it takes a few  
seconds. The other 7 times, it takes a split second - roughly 100x  
faster. Wuh?


This looks like standard memoization to me. I know, I know, GHC  
doesn't automagically memoize… it still has some behaviors I  
personally would label as a sort of primitive memoization, and this is  
one of them.


I learned about this by losing an argument. ;)

Weird thing #2: The parallel version runs *faster* than the  
sequential one in all cases - even with SMP disabled! (We're only  
talking a few percent faster, but still.)


Weird thing #3: Adding the -threaded compiler option makes  
*everything* run a few percent faster. Even with only 1 OS thread.


I think these are not noteworthy. Weird things happen in benchmarks  
(which is why I have learned not to trust them).


Weird thing #4: Adding -N2 makes *everything* slow down a few  
percent. In particular, Task Manager shows only one CPU core in use.


Then your algorithm must not truly be parallel. That is the only  
explanation I can think of.


- Jake McArthur___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness

2008-04-19 Thread Brandon S. Allbery KF8NH


On Apr 19, 2008, at 11:50 , Andrew Coppin wrote:

Bulat Ziganshin wrote:

there are plenty of reasons: first, -threaded make i/o overlapped
with calculations.


Not with -N1.


Depending on how it's implemented (I not being a ghc guru), possibly  
even with -N1 as long as it's using the thread-capable runtime.   
(Note that make -j2 is known to be optimal on single-processor  
machines, specifically because I/O tends to overlap with CPU.)



second, parallel version may exhibit better cpu
cache behavior - such as processing all data in cache before sending
it back to memory


Again, with -N1, it is *still* only using 1 CPU core.


And again, this may well be an effect of using the thread-*capable*  
runtime.  You can't generally multiplex memory accesses in SMP, so  
you may well want to delay and batch operations to/from main memory  
as much as possible to reduce lock contention for memory access.


Well, based on the results I've seen so far, it seems that  
parallelism is a complete waste of time because it doesn't gain you  
anything. And that doesn't make a lot of sense...


Easy parallelism is still an unsolved problem; naive parallelism  
generally is no better than sequential and often worse, because naive  
parallelism generally fails to account for lock / resource  
contention.  (Note that resource locking will be done by the threaded  
runtime even with only one thread, so you will see some slowdowns  
especially in I/O-related code.)  Haskell can only help you so much  
with this; you need to design your algorithms to parallel properly.


In addition, laziness can result in naive parallelism being a no-op  
because the only thing parallelized is some operation that trivially  
returns a lazy thunk that is later forced in the main thread.   
Careful strictness analysis is necessary in non-strict languages to  
make sure you are actually parallelizing what you want to.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon universityKF8NH


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness

2008-04-19 Thread Brandon S. Allbery KF8NH


On Apr 19, 2008, at 11:53 , Murray Gross wrote:
2. You need to account for I/O buffering (not only by your OP  
system in RAM, but by your disk controller)--after the first set of  
I/O operations, your data may be in buffers, so subsequent uses may  
retrieve data from buffers rather than from the disk itself.  
Similarly, you also have to take into account paging and cache  
issues, which could make the first run much slower than immediate  
subsequent runs.


Note also that, unless you use SCSI or very high-end SATA drives,  
they ignore requests to disable buffering.  (References on request,  
you can probably find them by poking around http://www.pdl.cmu.edu/.   
Short summary:  consumer drives are optimized for benchmarks, not for  
data safety.  This is why early 32-bit Windows releases often lost  
data on shutdown until the shutdown was modified to sleep for 10-15  
seconds.)


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon universityKF8NH


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parallel weirdness

2008-04-19 Thread Andrew Coppin

Bulat Ziganshin wrote:

Hello Brandon,

Saturday, April 19, 2008, 8:24:03 PM, you wrote:

  

contention.  (Note that resource locking will be done by the threaded
runtime even with only one thread, so you will see some slowdowns  
especially in I/O-related code.)



yes, i forget about this. Simon wrote once that locking decrease
performance by a few percents compared to single-threaded runtime
  


...which is why I'm so damned surprised that the threaded RTS is 
*faster* than the monoprocessor RTS [which presumably lacks such locking 
overhead].


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe