Re: [Haskell-cafe] Some quick experiments with GHC 7.0.2 in Intel's Manycore Testing Lab (32 cores)

Simon Marlow Mon, 14 Mar 2011 06:55:33 -0700

Hi José,

On 11/03/2011 14:16, José Pedro Magalhães wrote:


I've played a bit with Intel's Manycore Testing Lab
(http://software.intel.com/en-us/articles/intel-many-core-testing-lab/).
Part of the agreement to use it requires that you report back your
experiences, which I did in an Intel forum post
(http://software.intel.com/en-us/forums/showthread.php?t=81396). I
thought this could be interesting to the Haskell community in general as
well, so I'm reposting here, and pasting the text below for convenience.
I've replaced the images with links.

Is it possible for you to make the code for your benchmarks available?
I'd be interested in analysing the results further.

In our testing I've been able to achieve speedups over 20 on 24 cores
with GHC 7.0.2, so there should be no reason in principle that you
couldn't achieve similar results for obviously parallel problems, which
yours seem to be. Some tweaking of GC parameters might be necessary:
e.g. I've found that +RTS -A1m helps if your L2 caches are large enough.
A good starting point for profiling is ThreadScope, which will tell
you if the program is really trying to use all the cores or not.
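
As a rough sketch of how those settings fit together (the file name
Caps.hs and the exact flag combination are assumptions for illustration,
not taken from the benchmark), a small program can report how many
capabilities the runtime was actually given, which is a quick sanity
check before looking at a ThreadScope trace:

```haskell
-- Hypothetical file name: Caps.hs
-- Assumed build line:  ghc -O2 -threaded -rtsopts -eventlog Caps.hs
-- Assumed run line:    ./Caps +RTS -N24 -A1m -ls -RTS
--   -N24  use 24 capabilities (adjust to the machine)
--   -A1m  1 MB allocation area per capability, as suggested above
--   -ls   write an event log that ThreadScope can open
module Main (main) where

import GHC.Conc (getNumCapabilities, getNumProcessors)

main :: IO ()
main = do
  caps  <- getNumCapabilities   -- value chosen with +RTS -N
  procs <- getNumProcessors     -- processors the runtime can see
  putStrLn ("capabilities in use: " ++ show caps)
  putStrLn ("processors visible:  " ++ show procs)
```

If the first number is 1, the program was probably built without
-threaded or run without -N, and no amount of `par` will help.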


Cheers,
        Simon

Cheers,
Pedro

As per the agreement with Intel, I am reporting my experiences with
the Intel Manycore Testing Lab (Linux). This was my first time in
the lab, and I wanted to test GHC's [1] SMP parallelism [2] features.

The first challenge was to actually get GHC to work in the lab.
There was a working version of ghc under /opt/ghc6.13/bin/ghc, but I
really needed GHC 7. So first I built GHC 7.0.2-rc2, which worked
without much trouble.

Next step was to get all the necessary libraries in place. Since the
lab has no direct internet access, cabal-install [3] wouldn't be of
much use. Instead, I downloaded a snapshot of hackage [4] with the
latest version of every package and manually installed the packages
I needed. A bit boring, but doable.

Finally I was ready to compile my programs and test. First thing I
tried was an existing algorithm I had which, at some point, takes a
list of about 500 trees and, for each tree, computes a measure which
is expressed as a floating point number. This is basically a map
over a list transforming each tree into a float. Each operation is
independent of the others, and all require the same input, so it
seems ideal for parallelisation. A quick benchmark revealed the
following running times:

http://dreixel.net/images/perm/ParList.png

(Note that the x-axis scale becomes non-linear for the highest core counts.)
Apparently there are performance gains with up to 6 cores; adding
more cores after this makes the total running time worse.

While this might sound bad, do note that all that was necessary to
parallelise this algorithm was a one line change: basically, at the
point where the list of floats @l@ is generated, it is replaced with
@l `using` parList rdeepseq@. This change, together with
recompilation using -threaded, is all that is necessary to
parallelise this program.
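
A minimal sketch of that one-line change follows. The tree type, the
`measure` function and the `build` helper are stand-ins invented for
illustration (the original algorithm is not shown in this post); only
the `using` / `parList rdeepseq` part reflects the change described
above. It assumes the parallel and deepseq packages are installed.

```haskell
module Main (main) where

import Control.Parallel.Strategies (parList, rdeepseq, using)

-- Stand-in tree type; the real benchmark's type is not shown in the post.
data Tree = Leaf | Node Tree Tree

-- Hypothetical measure: the number of internal nodes, as a Double.
measure :: Tree -> Double
measure Leaf       = 0
measure (Node l r) = 1 + measure l + measure r

-- Hypothetical helper: a complete tree of the given depth.
build :: Int -> Tree
build 0 = Leaf
build d = Node t t
  where t = build (d - 1)

-- The sequential version would be just `map measure ts`.
-- Adding the strategy evaluates every list element fully, in parallel.
measureAll :: [Tree] -> [Double]
measureAll ts = map measure ts `using` parList rdeepseq

main :: IO ()
main = print (sum (measureAll (replicate 500 (build 12))))
```

Compile with -threaded and run with +RTS -N<cores> to let the runtime
use more than one core.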

Later I performed a more accurate benchmark, this time using the
equality function (take two elements and compare them for equality).
The first step was to parallelise the equality function, which,
again, is a very simple task:

import Control.Parallel (par, pseq)

-- Tree datatype
data Tree a = Leaf | Bin a (Tree a) (Tree a)

-- Parallel equality: spark the left comparison, evaluate the right one,
-- then combine the two results
eqTreePar :: Tree Int -> Tree Int -> Bool
eqTreePar Leaf Leaf = True
eqTreePar (Bin x1 l1 r1) (Bin x2 l2 r2) =
    x1 == x2 && par l (pseq r (l && r))
  where l = eqTreePar l1 l2
        r = eqTreePar r1 r2
eqTreePar _ _ = False
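
The function above creates a spark at every node of the tree. A common
variation, which is not what was benchmarked here, is to spark only
near the root and fall back to a sequential comparison below a depth
cutoff. The sketch below illustrates that idea; the names `eqTreeSeq`,
`eqTreeParTo`, `complete` and the cutoff value 4 are all hypothetical.

```haskell
module Main (main) where

import Control.Parallel (par, pseq)

data Tree a = Leaf | Bin a (Tree a) (Tree a)

-- Plain sequential equality, used below the cutoff.
eqTreeSeq :: Tree Int -> Tree Int -> Bool
eqTreeSeq Leaf Leaf = True
eqTreeSeq (Bin x1 l1 r1) (Bin x2 l2 r2) =
  x1 == x2 && eqTreeSeq l1 l2 && eqTreeSeq r1 r2
eqTreeSeq _ _ = False

-- Spark the left comparison only for the top `d` levels of the tree;
-- once the budget is used up, compare sequentially.
eqTreeParTo :: Int -> Tree Int -> Tree Int -> Bool
eqTreeParTo d t1 t2 | d <= 0 = eqTreeSeq t1 t2
eqTreeParTo _ Leaf Leaf = True
eqTreeParTo d (Bin x1 l1 r1) (Bin x2 l2 r2) =
  x1 == x2 && (l `par` (r `pseq` (l && r)))
  where
    l = eqTreeParTo (d - 1) l1 l2
    r = eqTreeParTo (d - 1) r1 r2
eqTreeParTo _ _ _ = False

-- Hypothetical test input: a complete tree labelled by remaining depth.
complete :: Int -> Tree Int
complete 0 = Leaf
complete n = Bin n sub sub
  where sub = complete (n - 1)

main :: IO ()
main = do
  let a = complete 16
      b = complete 16
  print (eqTreeParTo 4 a b)              -- same shape and labels
  print (eqTreeParTo 4 a (complete 15))  -- root labels differ
```

Whether a cutoff helps on a given machine is an empirical question; the
sketch only shows how such a variant could be written.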

`par` and `pseq` are the two primitives for parallelisation in GHC
[5]. The performance graph follows:

http://dreixel.net/images/perm/ParEq.png

(This time I ran the benchmark several times; the error bars on the
graph are the standard deviations.) Again we get performance
improvements with up to 6 cores, and after that performance
decreases. What I find really nice is the improvement with two
cores, which is almost a 50% decrease in running time. The ratios
for 2 to 4 cores wrt. the running time with 1 core are 0.52, 0.39,
and 0.35, respectively. This is really good for such a simple change
in the source code, and most people only have up to 4 cores anyway.
In any case, the results of this (very preliminary) experiment seem
to indicate that GHC's SMP parallelism is not particularly optimized
for a high number of cores (yet).
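
To put the quoted ratios in more familiar terms, speedup is the inverse
of the relative running time, and parallel efficiency is the speedup
divided by the number of cores. The short sketch below does that
arithmetic for the three ratios given above (roughly 1.9x, 2.6x and
2.9x); the function names are illustrative only.

```haskell
module Main (main) where

import Text.Printf (printf)

-- (cores, running time relative to one core), as quoted in the text.
ratios :: [(Int, Double)]
ratios = [(2, 0.52), (3, 0.39), (4, 0.35)]

-- Speedup is the inverse of the relative running time.
speedup :: Double -> Double
speedup r = 1 / r

-- Efficiency is the speedup divided by the number of cores.
efficiency :: Int -> Double -> Double
efficiency n r = speedup r / fromIntegral n

main :: IO ()
main = mapM_ row ratios
  where
    row :: (Int, Double) -> IO ()
    row (n, r) =
      printf "%d cores: speedup %.2f, efficiency %.0f%%\n"
             n (speedup r) (100 * efficiency n r)
```

The falling efficiency from 2 to 4 cores is consistent with the curve
flattening out well before the higher core counts.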

I'm planning to explore this line of research further, and I'm
hoping to be able to conduct more experiments in the near future.
Feel free to contact me if you want more information on what I've done.

Cheers,
Pedro

[1] http://www.haskell.org/ghc/
[2] http://www.haskell.org/ghc/docs/latest/html/users_guide/using-smp.html
[3] http://hackage.haskell.org/package/cabal-install
[4] http://hackage.haskell.org
[5] http://hackage.haskell.org/packages/archive/parallel/latest/doc/html/Control-Parallel.html

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


