On 24/02/2011 13:26, José Pedro Magalhães wrote:
(Forwarding to haskell-cafe)

Hi,

I have a program that computes a matrix of Floats of m rows by n
columns. Computing each Float is relatively expensive. Each line is
completely independent of the others, so I thought I'd try some simple
SMP parallelism on this code:

myFun :: FilePath -> IO ()
myFun fp =
   do fs <- readDataDir fp
      let process f = readFile' f >>= parse
          printLine = putStrLn . foldr (\a b -> show a ++ "\t" ++ b) ""
          runDiff l = [ [ diff x y | y <- l ]
                      | (x,i) <- zip l (map getId fs), myFilter i ]
      ps <- mapM process fs
      sequence_ [ printLine x | x <- runDiff ps _`using` parList rdeepseq_ ]

So, I'm using parList to evaluate the rows in parallel, and fully
evaluating each row. Here are the timings on a Dual Quad Core AMD 2378
@2.4 GHz, ghc-6.12.3, parallel-2.2.0.1:

-N      time (ms)
none    1m50
2       1m33
3       1m35
4       1m22
5       1m11
6       1m06
7       1m45

The increase at 7 is justified by the fact that there were two other
processes running. I don't know how to justify the small increase at N3,
though, but that doesn't matter too much. The problem is that I am not
getting the gains I expected (halving at N2, a third at N3, etc.). Is
this the best one can achieve with this implicit parallelism, or am I
doing something wrong? In particular, is the way I'm printing the
results at the end destroying potential parallel gains?

Any insights on this are appreciated.

I haven't investigated your program in detail, but here's how I would go about analysing it:

  - first compile with -eventlog, run with +RTS -ls, and analyse
    it with ThreadScope.  This will tell you whether the program
    is actually using more than one core, and for what portion of
    its runtime.  Often you find that it is sequential for a large
    fraction of the execution time, such as reading in the input,
    or writing the output.

  - run with +RTS -s, and check the SPARKS.  Make sure you aren't
    generating too many (more than 4096 will be discarded) or too
    few (not enough parallelism).

  - generating large amounts of live data in the heap will tend to
    hurt parallelism, because GC doesn't parallelise well, especially
    when the heap data is linear (i.e. lists rather than trees).
    Look for a lot of time being spent in GC, and a figure for
    "balance" significantly less than the -N value.

Cheers,
        Simon



_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to