Re: Some initial results with DPH

2008-09-23 Thread Roman Leshchinskiy

On 23/09/2008, at 14:59, Roman Leshchinskiy wrote:


dotp :: [:Int:] -> [:Int:] -> Int
dotp v w = I.sumP [: (I.*) x y | x <- v, y <- w :]


The way the vectoriser works at the moment, it will repeat the array  
w (lengthP v) times, i.e., create an array of length (lengthP v *  
lengthP w). This is quite unfortunate and needs to be fused away but  
isn't at the moment. The only advice I can give is to stay away from  
array comprehensions for now. They work but are extremely slow. This  
definition should work fine:


dotp v w = I.sumP (zipWithP (I.*) v w)


Actually, I didn't pay attention when I wrote this. The two are not  
equivalent, of course. Only the second one computes the dot product.  
With comprehensions, you'd have to write


dotp v w = I.sumP [: (I.*) x y | x <- v | y <- w :]

I suspect that will perform reasonably even now.
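
For what it's worth, the difference between the two separators shows up in
ordinary list comprehensions as well (with GHC's ParallelListComp extension).
The following is plain Haskell over lists, not DPH code, but the semantics
are the same:

{-# LANGUAGE ParallelListComp #-}

-- Ordinary-list analogue of the two comprehension forms.
nested, zipped :: [Int] -> [Int] -> Int
nested v w = sum [ x * y | x <- v, y <- w ]   -- every x paired with every y
zipped v w = sum [ x * y | x <- v | y <- w ]  -- x and y drawn in lockstep

-- nested [1,2,3] [4,5,6] == 90  (sum of all nine products)
-- zipped [1,2,3] [4,5,6] == 32  (the dot product: 1*4 + 2*5 + 3*6)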

Roman




Re: Some initial results with DPH

2008-09-22 Thread Roman Leshchinskiy

Hi Austin,

First of all, thanks a lot for taking the time to report your results!

On 23/09/2008, at 11:48, Austin Seipp wrote:


* The vectorise pass increases compilation times *a lot*. I don't think
 this is exactly unwarranted, since it seems like a pretty complicated
 transformation, but the primitive version using just the unlifted
 interface compiles in about 1.5 seconds, while the vectorised version
 takes on the order of 15 seconds. For something as trivial as this
 dot-product example, that's quite a lot of compilation time.


The problem here is not the vectoriser but rather the subsequent  
optimisations. The vectoriser itself is (or should be - I haven't  
really timed it, to be honest) quite fast. It generates very complex  
code, however, which GHC takes a lot of time to optimise. We'll  
improve the output of the vectoriser eventually, but not before it is  
complete. For the moment, there is no solution for this, I'm afraid.



* It's pretty much impossible to use ghc-core to examine the output
 core of the vectorised version - I let it run and before anything
 started showing up in `less` it was already using on the order of
 100 MB of memory. If I just add -ddump-simpl to the command line, the
 reason is obvious: the core generated is absolutely huge.


Yes. Again, this is something we'll try to improve eventually.


* For the benchmark included, the vectorised version spends about 98% of
 its time in the GC, from what I can see, before it dies from a stack
 overflow. I haven't tried something like +RTS -A1G -RTS yet, though.


IIUC, the code is


dotp :: [:Int:] -> [:Int:] -> Int
dotp v w = I.sumP [: (I.*) x y | x <- v, y <- w :]


The way the vectoriser works at the moment, it will repeat the array w  
(lengthP v) times, i.e., create an array of length (lengthP v *  
lengthP w). This is quite unfortunate and needs to be fused away but  
isn't at the moment. The only advice I can give is to stay away from  
array comprehensions for now. They work but are extremely slow. This  
definition should work fine:


dotp v w = I.sumP (zipWithP (I.*) v w)
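
If you want to sanity-check the results on small inputs, the same
computation over ordinary lists makes a handy reference (plain Haskell,
not DPH); it mirrors the zipWithP version exactly, so the two should agree:

dotpRef :: [Int] -> [Int] -> Int
dotpRef v w = sum (zipWith (*) v w)

-- e.g. dotpRef [1,2,3] [4,5,6] == 32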


* The vectoriser is really, really touchy. For example, the below code
 sample works (from DotPVect.hs):


import Data.Array.Parallel.Prelude.Int as I

dotp :: [:Int:] -> [:Int:] -> Int
dotp v w = I.sumP [: (I.*) x y | x <- v, y <- w :]


This, however, does not work:


dotp :: [:Int:] -> [:Int:] -> Int
dotp v w = I.sumP [: (Prelude.*) x y | x <- v, y <- w :]


This is because the vectorised code needs to call the vectorised  
version of (*). Internally, the vectoriser has a hardwired mapping  
from top-level functions to their vectorised versions. That is, it  
knows that it should replace calls to  
(Data.Array.Parallel.Prelude.Int.*) by calls to  
Data.Array.Parallel.Prelude.Base.Int.plusV. There is no vectorised  
version of (Prelude.*), however, and there won't be one until we can  
vectorise the Prelude. In fact, the vectoriser doesn't even support  
classes at the moment. So the rule of thumb is: unless it's in  
Data.Array.Parallel.Prelude or you wrote and vectorised it yourself,  
it will choke the vectoriser.
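
To illustrate the 'wrote and vectorised it yourself' case: a helper that is
defined in a module which is itself run through the vectoriser can happily
be called from parallel code. Something along these lines should do - the
pragmas and imports below are only a sketch of the usual setup (and sqDiff
is just a made-up helper), so do check them against the dph examples:

{-# LANGUAGE PArr #-}
{-# OPTIONS_GHC -fvectorise #-}
module DotPVect (dotp, sumSqDiff) where

import Data.Array.Parallel.Prelude
import Data.Array.Parallel.Prelude.Int as I

-- Defined and vectorised in this module, so the parallel code
-- below is allowed to call it.
sqDiff :: Int -> Int -> Int
sqDiff x y = (x I.- y) I.* (x I.- y)

sumSqDiff :: [:Int:] -> [:Int:] -> Int
sumSqDiff v w = I.sumP (zipWithP sqDiff v w)

dotp :: [:Int:] -> [:Int:] -> Int
dotp v w = I.sumP (zipWithP (I.*) v w)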



I also ran into a few other errors relating to the vectoriser dying - if
I can find them I'll reply to this with some results.


Please do! And please keep using DPH and reporting your results; that
is really useful to us!


FWIW, we'll include some DPH documentation in 6.10 but it still has to  
be written...


Roman




Some initial results with DPH

2008-09-22 Thread Austin Seipp
(I'm posting this here in the hope people like Manuel and Roman will
see it and take some interest.)

I've built the GHC 6.10 beta (that is, The Glorious Glasgow Haskell
Compilation System, version 6.10.0.20080921) mainly so I could play
around with DPH because it's too exciting to ignore any longer.

The program I'm using here is pretty trivial, but it's a simple enough
test case, I think. It's cribbed almost entirely from dph's examples;
I just made some superficial data sets to test on, and here are the
results.

The code is attached in a .tar file; you'll need to tweak the
Makefile to point to your GHC instead of mine if you use it.

Things to note:

* The vectorise pass increases compilation times *a lot*. I don't think
  this is exactly unwarranted, since it seems like a pretty complicated
  transformation, but the primitive version using just the unlifted
  interface compiles in about 1.5 seconds, while the vectorised version
  takes on the order of 15 seconds. For something as trivial as this
  dot-product example, that's quite a lot of compilation time.
* It's pretty much impossible to use ghc-core to examine the output
  core of the vectorised version - I let it run and before anything
  started showing up in `less` it was already using on the order of
  100 MB of memory. If I just add -ddump-simpl to the command line, the
  reason is obvious: the core generated is absolutely huge.
* For the benchmark included, the vectorised version spends about 98% of
  its time in the GC, from what I can see, before it dies from a stack
  overflow. I haven't tried something like +RTS -A1G -RTS yet, though.
* The vectoriser is really, really touchy. For example, the below code
  sample works (from DotPVect.hs):

> import Data.Array.Parallel.Prelude.Int as I
> 
> dotp :: [:Int:] -> [:Int:] -> Int
> dotp v w = I.sumP [: (I.*) x y | x <- v, y <- w :]

This, however, does not work:

> dotp :: [:Int:] -> [:Int:] -> Int
> dotp v w = I.sumP [: (Prelude.*) x y | x <- v, y <- w :]

That is, just using the version from the Prelude causes the vectoriser to
fail like so:

> ghc -o vect --make test.hs -fcpr-off -threaded -Odph -funbox-strict-fields 
> -fdph-par
> [1 of 2] Compiling DotPVect ( DotPVect.hs, DotPVect.o )
> *** Vectorisation error ***
> Tycon not vectorised: GHC.Num.:TNum
> make: *** [vect] Error 1
> 

This is a particularly strange occurrence; the reason being that if we
dig into the source code of the dph-par package, we see that
Data.Array.Parallel.Prelude.Int simply re-exports
Data.Array.Parallel.Prelude.Base.Int (which is a hidden module, mind
you), which is where (*) is defined, like so:

> (+), (-), (*) :: Int -> Int -> Int
> (+) = (P.+)
> (-) = (P.-)
> (*) = (P.*)

Where 'P' is the qualified Prelude import. Am I misunderstanding
something here about how the dph packages are laid out? I think I've
got this right, and if so it's a really strange situation, to be
honest.

I also ran into a few other errors relating to the vectoriser dying - if
I can find them I'll reply to this with some results.

So far this all seems pretty negative, but on the flip side...

* The unlifted interface exported by the dph-prim-{par,seq} packages
  is wonderful and already works really well. See
  http://hpaste.org/10621 for an example - super low GC time, and both
  of my cores get used.
* As I have increased the data sets for the dot-product example, my
  cores continue to get used and the GC time stays really, really low,
  which is a great thing.
* I've yet to hit any strange compilation error or problem when using
  the primitive packages.
* Strictly speaking, the dph-{par,seq} packages seem to expose more
  code and combinators to code that goes through the vectorisation
  pass, but the combinators here are simple, they work, and you get
  good results.
* GHC hits a lot of optimizations; with ghc-core you can see thousands
  upon thousands of rules firing all over the place to aggressively
  inline/transform things!

The DPH work done so far seems fantastic. I realize this is all in
the works, so everything I say now might not be worth anything
tomorrow, and GHC 6.10 is only shipping with a very limited version of
the system, but I figured some people would like to see initial
results on some small test cases.

I plan on exploiting this package a lot more in the future, testing
it with larger computations and data sets, and when I do I'll be sure
to give feedback to you guys (once HEAD is steaming along again
and the 6.10 branch has calmed down).

Austin


test1.tar.gz
Description: GNU Zip compressed data
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users