Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-19 Thread Kenneth Hoste


On May 18, 2009, at 15:28 , Claus Reinke wrote:

My current best try uses the uvector package, has two 'vectors' of  
type
(UArr Double)  as input, and relies on the sumU and zipWithU  
functions

which use streaming to compute the result:
dist_fast :: UArr Double - UArr Double - Double
dist_fast p1 p2 = sumDs `seq` sqrt sumDs
   where
   sumDs = sumU ds
   ds= zipWithU euclidean p1 p2
   euclidean x y = d*d
   where
   d = x-y


You'll probably want to make sure that 'euclidian' is specialized to
the types you need (here 'Double'), not used overloaded for 'Num a=a'
(check -ddump-tc, or -ddump-simpl output).


I understand from your later post that is was in fact specialized, but
how do I make sure it _is_ specialized? Can I just add a type signature
in the dist_fast definition for euclidean, or should I define euclidean
outside of dist_fast, with an explicit type signature?
If the latter, won't that hurt performance? Or should marking it INLINE
take care of that?

After that, unrolling the fused fold loop (uvector internal) might  
help

a bit, but isn't there yet:

http://hackage.haskell.org/trac/ghc/ticket/3123
http://hackage.haskell.org/trac/ghc/wiki/Inlining

And even if that gets implemented, it doesn't apply directly to your
case, where the loop is in a library, but you might want to control  
its

unrolling in your client code. Having the loop unrolled by a default
factor (8x or so) should help for loops like this, with little  
computation.


This seems rather serious, and might be one of the bigger reasons why
I'm getting nowhere close to C in terms of performance...
The loop body is ridiculously small, so it would make sense to
unroll it somewhat to help avoid the loop overhead.
However, it seems like GHC isn't able to do that now.

Is there any way to unroll the loop myself, to speed things up?
Seems hard, because I'm using uvector...

K.

--

Kenneth Hoste
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.ho...@elis.ugent.be
website: http://www.elis.ugent.be/~kehoste
blog: http://boegel.kejo.be

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-19 Thread Kenneth Hoste


On May 18, 2009, at 20:54 , Claus Reinke wrote:

As I said, I don't get the fusion if I just add the function above  
to the original Dist.hs, export it and compile the module with '-c  
-O2 -ddump-simpl':

I can't reproduce this.


Interesting. I'm using ghc 6.11.20090320 (windows), uvector-0.1.0.3.  
I attach the modified Dist.hs and its simpl output, created via:


  ghc -c Dist.hs -O2 -ddump-tc -ddump-simpl-stats -ddump-simpl   
Dist.dumps


Perhaps others can confirm the effect? Note that the 'dist_fast' in  
the same module does get fused, so it is not likely an options  
issue. I still suspect that the inlining of the 'Dist.zipWith'  
wrapper in the 'dist_fast_inlined'
'__inline_me' has some significance - it is odd to see inlined code  
in an
'__inline_me' and the fusion rule won't trigger on 'Dist.sumU . Dist. 
$wzipWithU',

right?


As far as I can tell, the dist_fast_inlined doesn't get fused, i.e.  
I'm seeing

zipWithU and sumU being used in it, which is not the case in dist_fast.

This is on OS X/PowerPC, using GHC 6.10.1.


Does the complete program fragment I posted earlier yield the desired
result?


Yes. Note that the original poster also reported slowdown from
use of 'dist_fast_inlined'.


Don, you were defining dist inside the main module, while in our
case the dist functions are defined in a seperate Dist.hs module...
Would that matter?

K.

--

Kenneth Hoste
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.ho...@elis.ugent.be
website: http://www.elis.ugent.be/~kehoste
blog: http://boegel.kejo.be

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-19 Thread Daniel Schüssler
Hello!

On Monday 18 May 2009 14:37:51 Kenneth Hoste wrote:
 I'm mostly interested in the range 10D to 100D

is the dimension known at compile-time? Then you could consider Template 
Haskell. I wrote up some code for generating the vector types and vector 
subtraction/inner product below, HTH. One problem is that I'm using a 
typeclass and apparently you can't make {-# SPECIALISE #-} pragmas with TH, 
so let's hope it is automatically specialised by GHC.

Greetings,
Daniel

TH.hs
--


{-# LANGUAGE TemplateHaskell #-}
{-# OPTIONS -fglasgow-exts #-}

module TH where

import Language.Haskell.TH
import Control.Monad

-- Non-TH stuff
class InnerProductSpace v r | v - r where
innerProduct :: v - v - r

class AbGroup v where
minus :: v - v - v

euclidean x y = case minus x y of
  z - sqrt $! innerProduct z z

-- TH
noContext :: Q Cxt
noContext = return []
strict :: Q Type - StrictTypeQ
strict = liftM ((,) IsStrict)
  
makeVectors :: Int -- ^ Dimension
- Q Type -- ^ Component type, assumed to be a 'Num'
- String -- ^ Name for the generated type
- Q [Dec]
makeVectors n ctyp name0 = do
  -- let's assume ctyp = Double, name = Vector for the comments
  
  -- generate names for the variables we will need
  xs - replicateM n (newName x)
  ys - replicateM n (newName y)
   
  let
  name = mkName name0
  
  -- shorthands for arithmetic expressions; the first takes expressions,
  -- the others take variable names 
  sumE  e1 e2 = infixE (Just   e1)  [|(+)|] (Just   e2)
  varDiffE e1 e2  = infixE (Just (varE e1)) [|(-)|] (Just (varE e2))
  varProdE e1 e2  = infixE (Just (varE e1)) [|(*)|] (Just (varE e2))

  
  conPat vars = conP name (fmap varP vars)
  
  --  data Vector = Vector !Double ... !Double
  theDataD = 
  dataD noContext name [] -- no context, no params
[normalC name (replicate n (strict ctyp))] 
[''Eq,''Ord,''Show] -- 'deriving' clause
  
  innerProdD = 
  --  instance InnerProductSpace Vector Double where ...
  instanceD noContext ( conT ''InnerProductSpace 
`appT` conT name 
`appT` ctyp)  
--  innerProduct = ...
[valD 
 (varP 'innerProduct)
 (normalB
  -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) -
  (lamE [conPat xs, conPat ys]
   -- x1*y1 +  + xn*yn + 0
   (foldl sumE [|0|] $
  zipWith varProdE xs ys)
   ))
   
 [] -- no 'where' clause
]
  
  abGroupD = 
  instanceD noContext ( conT ''AbGroup
`appT` conT name)  
--  minus = ...
[valD 
 (varP 'minus)
 (normalB
  -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) -
  (lamE [conPat xs, conPat ys] 
   -- Vector (x1-y1) ... (xn-yn)
   (foldl appE (conE name) $
  zipWith varDiffE xs ys)
   ))
   
 [] -- no 'where' clause
]
  

  sequence [theDataD,innerProdD,abGroupD]



Main.hs
--
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE MultiParamTypeClasses #-}

module Main where

import TH

$(makeVectors 3 [t|Double|] Vec3)

main = print $ euclidean (Vec3 1 1 1) (Vec3 0 0 0)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-19 Thread Claus Reinke

I understand from your later post that is was in fact specialized, but
how do I make sure it _is_ specialized? 


-ddump-tc seems to give the generalized type, so it seems you'd need
to look at the -ddump-simpl output if you want to know whether a
local function is specialized.

http://www.haskell.org/haskellwiki/Performance/GHC#Looking_at_the_Core

Can I just add a type signature in the dist_fast definition for euclidean, 


If you need it at exactly one type, that is the simplest way. There's
also the SPECIALIZE pragma

http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#specialize-pragma

and for local and non-exported functions '-O' should enable auto-specialization

http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html



After that, unrolling the fused fold loop (uvector internal) might  
help a bit, but isn't there yet:


http://hackage.haskell.org/trac/ghc/ticket/3123
http://hackage.haskell.org/trac/ghc/wiki/Inlining

And even if that gets implemented, it doesn't apply directly to your
case, where the loop is in a library, but you might want to control  
its unrolling in your client code. Having the loop unrolled by a default
factor (8x or so) should help for loops like this, with little  
computation.


This seems rather serious, and might be one of the bigger reasons why
I'm getting nowhere close to C in terms of performance...


You should be able to get near (a small factor) without it, but not having 
it leaves a substantial gap in performance, especially for simple loop 
bodies (there is also the case of enabling fusion over multiple loop 
iterations, but that may call for proper fix-point fusion).



The loop body is ridiculously small, so it would make sense to
unroll it somewhat to help avoid the loop overhead.
However, it seems like GHC isn't able to do that now.


Apart from the links above, the new backend also has relevant TODO
items: http://hackage.haskell.org/trac/ghc/wiki/BackEndNotes


Is there any way to unroll the loop myself, to speed things up?
Seems hard, because I'm using uvector...


You could 'cabal unpack' uvector, hand-unroll the core loop all
its operations get fused into, then reinstall the modified package..
(perhaps that should be a package configuration option, at least
until GHC gets loop or recursion unrolling - since this would be
a temporary workaround, it would probably not be worth it to 
have multiple package versions with different unroll factors; if
this actually helps uvector users with performance in practice, 
they could suggest it as a patch).


Claus


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-19 Thread Daniel Schüssler
Hi,

meh, I just realised that there is no sensible way to actually 
introduce/eliminate the generated types. I'm attaching a revised version with 
fromList/toList functions. Maybe the vector type should be polymorphic and be 
an instance of Functor, Monad and Foldable? But then we really depend on 
specialisation.

Greetings,
Daniel
{-# LANGUAGE TemplateHaskell #-}
{-# OPTIONS -fglasgow-exts #-}

module TH where

import Language.Haskell.TH
import Control.Monad

-- Non-TH stuff
class InnerProductSpace v r | v - r where
innerProduct :: v - v - r

class AbGroup v where
minus :: v - v - v

class FromToList v r | v - r where
fromList :: [r] - Maybe v
toList :: v - [r]

euclidean x y = case minus x y of
  z - sqrt $! innerProduct z z

-- TH
noContext :: Q Cxt
noContext = return []
strict :: Q Type - StrictTypeQ
strict = liftM ((,) IsStrict)
  
makeVectors :: Int -- ^ Dimension
- Q Type -- ^ Component type, assumed to be a 'Num'
- String -- ^ Name for the generated type
- Q [Dec]
makeVectors n ctyp name0 = do
  -- let's assume ctyp = Double, name = Vector for the comments
  
  -- generate names for the variables we will need
  xs - replicateM n (newName x)
  ys - replicateM n (newName y)
  lst - newName list
   
  let
  name = mkName name0
  
  -- shorthands for arithmetic expressions; the first takes expressions,
  -- the others take variable names 
  sumE  e1 e2 = infixE (Just   e1)  [|(+)|] (Just   e2)
  varDiffE e1 e2  = infixE (Just (varE e1)) [|(-)|] (Just (varE e2))
  varProdE e1 e2  = infixE (Just (varE e1)) [|(*)|] (Just (varE e2))

  
  conPat vars = conP name (fmap varP vars)
  
  --  data Vector = Vector !Double ... !Double
  theDataD = 
  dataD noContext name [] -- no context, no params
[normalC name (replicate n (strict ctyp))] 
[''Eq,''Ord,''Show] -- 'deriving' clause
  
  innerProdD = 
  --  instance InnerProductSpace Vector Double where ...
  instanceD noContext ( conT ''InnerProductSpace 
`appT` conT name 
`appT` ctyp)  
--  innerProduct = ...
[valD 
 (varP 'innerProduct)
 (normalB
  -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) -
  (lamE [conPat xs, conPat ys]
   -- x1*y1 +  + xn*yn + 0
   (foldl sumE [|0|] $
  zipWith varProdE xs ys)
   ))
   
 [] -- no 'where' clause
]
  
  abGroupD = 
  instanceD noContext ( conT ''AbGroup
`appT` conT name)  
--  minus = ...
[valD 
 (varP 'minus)
 (normalB
  -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) -
  (lamE [conPat xs, conPat ys] 
   -- Vector (x1-y1) ... (xn-yn)
   (foldl appE (conE name) $
  zipWith varDiffE xs ys)
   ))
   
 [] -- no 'where' clause
]
  
  fromToListD =
  instanceD noContext ( conT ''FromToList
`appT` conT name
`appT` ctyp)  
[ funD 'fromList
  [ clause [listP $ fmap varP xs]
 (normalB
  ([|Just|] `appE`
   (foldl appE (conE name) $ fmap varE xs)))
[]
  , clause [wildP] (normalB [|Nothing|]) [] -- wrong number 
of elements
  ]
, funD 'toList
  [ clause [conPat xs]
 (normalB
  (listE (fmap varE xs)))
[]]
]
   

  

  sequence [theDataD,innerProdD,abGroupD,fromToListD]
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-19 Thread Kenneth Hoste


On May 19, 2009, at 13:24 , Daniel Schüssler wrote:


Hello!

On Monday 18 May 2009 14:37:51 Kenneth Hoste wrote:

I'm mostly interested in the range 10D to 100D


is the dimension known at compile-time? Then you could consider  
Template

Haskell.


In general, no. :-)

It will be known for some applications, but not for others.

I'm more and more amazed what comes into play just to implement
something simple like n-dim. Euclidean distance relatively fast using  
Haskell.


It seems to me that GHC is missing several critical optimizations
(yes, I know, patches welcome) to enable it to emit fast code
for HPC applications.

I'm still a big fan of Haskell, for a variety of reasons, but it seems
like it's not ready yet for the task I had in mind, which is a shame.

Just to be clear, this isn't a flame bait post or anything, just my 2  
cents.


K.


I wrote up some code for generating the vector types and vector
subtraction/inner product below, HTH. One problem is that I'm using a
typeclass and apparently you can't make {-# SPECIALISE #-} pragmas  
with TH,

so let's hope it is automatically specialised by GHC.

Greetings,
Daniel

TH.hs
--


{-# LANGUAGE TemplateHaskell #-}
{-# OPTIONS -fglasgow-exts #-}

module TH where

import Language.Haskell.TH
import Control.Monad

-- Non-TH stuff
class InnerProductSpace v r | v - r where
   innerProduct :: v - v - r

class AbGroup v where
   minus :: v - v - v

euclidean x y = case minus x y of
 z - sqrt $! innerProduct z z

-- TH
noContext :: Q Cxt
noContext = return []
strict :: Q Type - StrictTypeQ
strict = liftM ((,) IsStrict)

makeVectors :: Int -- ^ Dimension
   - Q Type -- ^ Component type, assumed to be a 'Num'
   - String -- ^ Name for the generated type
   - Q [Dec]
makeVectors n ctyp name0 = do
 -- let's assume ctyp = Double, name = Vector for the comments

 -- generate names for the variables we will need
 xs - replicateM n (newName x)
 ys - replicateM n (newName y)

 let
 name = mkName name0

 -- shorthands for arithmetic expressions; the first takes  
expressions,

 -- the others take variable names
 sumE  e1 e2 = infixE (Just   e1)  [|(+)|] (Just   e2)
 varDiffE e1 e2  = infixE (Just (varE e1)) [|(-)|] (Just (varE  
e2))
 varProdE e1 e2  = infixE (Just (varE e1)) [|(*)|] (Just (varE  
e2))



 conPat vars = conP name (fmap varP vars)

 --  data Vector = Vector !Double ... !Double
 theDataD =
 dataD noContext name [] -- no context, no params
   [normalC name (replicate n (strict ctyp))]
   [''Eq,''Ord,''Show] -- 'deriving' clause

 innerProdD =
 --  instance InnerProductSpace Vector Double where ...
 instanceD noContext ( conT ''InnerProductSpace
   `appT` conT name
   `appT` ctyp)
   --  innerProduct = ...
   [valD
(varP 'innerProduct)
(normalB
 -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn)  
-

 (lamE [conPat xs, conPat ys]
  -- x1*y1 +  + xn*yn + 0
  (foldl sumE [|0|] $
 zipWith varProdE xs ys)
  ))

[] -- no 'where' clause
   ]

 abGroupD =
 instanceD noContext ( conT ''AbGroup
   `appT` conT name)
   --  minus = ...
   [valD
(varP 'minus)
(normalB
 -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn)  
-

 (lamE [conPat xs, conPat ys]
  -- Vector (x1-y1) ... (xn-yn)
  (foldl appE (conE name) $
 zipWith varDiffE xs ys)
  ))

[] -- no 'where' clause
   ]


 sequence [theDataD,innerProdD,abGroupD]



Main.hs
--
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE MultiParamTypeClasses #-}

module Main where

import TH

$(makeVectors 3 [t|Double|] Vec3)

main = print $ euclidean (Vec3 1 1 1) (Vec3 0 0 0)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


--

Kenneth Hoste
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.ho...@elis.ugent.be
website: http://www.elis.ugent.be/~kehoste
blog: http://boegel.kejo.be

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-19 Thread Patrick Perry

Hi Kenneth,

I wrote a benchmark similar to yours using the haskell blas library I  
wrote (latest version is on github at http://github.com/patperry/blas/tree/master 
, an older version is on hackage).


The pure Haskell implementation is not very good, and the problem  
seems to be repeated boxing/unboxing of doubles.  It may be possible  
to avoid this by writing a custom fold function for Vectors, but I  
haven't tried it.  Using a vectorized subtraction function along with  
the BLAS1 dnrm2 function, I was able to get performance about  
2.8-5.4x worse than C.  This isn't great, but it's not terrible,  
either.  I didn't have the performance problems you ran in to using  
the FFI.  My benchmark includes:


distVectors1: a native implementation using foldl' and zipWith
distVectors2: a vectorize implementation, defined as @distVectors2 x y  
= norm2 (x-y)@

distVectors3: a custom loop in haskell
distVectors4: an FFI call to C

All 4 functions use the Vector type, provided by the blas package.

I compiled with -O -fexcess-preceision and ran timings for vectors  
of size 10, 100, and 1000.  Here are the results:


Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 10
n: 10
* distVectors1: .
  1.366ns per iteration / 732029.06 per second.
* distVectors2: .
  1.286ns per iteration / 34.02 per second.
* distVectors3: ..
  0.950ns per iteration / 1052847.93 per second.
* distVectors4: ...
  0.452ns per iteration / 2213752.56 per second.
Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 100
n: 100
* distVectors1: ..
  12.496ns per iteration / 80025.79 per second.
* distVectors2: 
  2.149ns per iteration / 465311.61 per second.
* distVectors3: ..
  8.728ns per iteration / 114572.53 per second.
* distVectors4: ..
  0.597ns per iteration / 1675765.65 per second.
Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 1000
n: 1000
* distVectors1: ..
  125.920ns per iteration / 7941.52 per second.
* distVectors2: ..
  10.677ns per iteration / 93661.99 per second.
* distVectors3: ...
  85.533ns per iteration / 11691.42 per second.
* distVectors4: 
  1.962ns per iteration / 509668.17 per second.

The distVectors2 function is computing the norm in a more  
numerically-stable way.  It is also allocating a temporary array to  
store x-y.


I haven't looked over your benchmarking code too thoroughly, but parts  
of it seem suspect to me.  Namely, you are using mapM instead of  
mapM_.  Also, it seems like you are including GC times in your  
benchmarks.


Attached is my code.  You will need the latest blas from github to  
compile it.  You also might want to look into hmatrix.  I have very  
little time for haskell code right now (I'm trying to finish my Ph.D.  
in the next month).  This summer I plan to work more on this stuff,  
and I'll probably make a release in July.



Patrick



Euclidean.hs
Description: Binary data





dist.c
Description: Binary data


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-18 Thread Kenneth Hoste

Hello,

For a while now, I've been trying to come up with a fast Haskell-only
function which implements Euclidean distance in n-dimensional space.

So far, I've been disappointed by the performance of my best effort
in Haskell, compared to C. I'm hoping some of the Haskell experts
and/or performance gurus on this list can help me out on resolving this,
and also maybe shed some light on some strange (imho) things I've run
into.

My current best try uses the uvector package, has two 'vectors' of type
(UArr Double)  as input, and relies on the sumU and zipWithU functions
which use streaming to compute the result:

dist_fast :: UArr Double - UArr Double - Double
dist_fast p1 p2 = sumDs `seq` sqrt sumDs
where
sumDs = sumU ds
ds= zipWithU euclidean p1 p2
euclidean x y = d*d
where
d = x-y

I've been benchmarking this function against various alternatives
using the MicroBench [1] package, which allows to get accurate timings  
of

single function calls.
I've also mimicked the MicroBench approach in pure C, to get comparable
timings for a C-only implementation.
The C-only function is quite straightforward:

double dist(int dim, double* p1, double* p2){

int i;
double d = 0.0;

for(i=0; i  dim; i++){
d += (p1[i] - p2[i])*(p1[i] - p2[i]);
}

return sqrt(d);
}

(Note that the optimizer takes care of the p1-p2 operation
appearing twice in the code).

All code is attached if you'd like to play around with it.

All numbers reported below are using GHC 6.10.2 and gcc 4.3.3
on Linux/x86.

The compilation details can be found in the Makefile attached, but
in short, I'm using -O2 -fexcess-precision or
-O2 -fexcess-precision -fvia-C -optc-O3 with GHC, and -O3 with gcc.



Dist.hs
Description: Binary data


microbench_dist.c
Description: Binary data


microbench_dist.hs
Description: Binary data




dist_c.c
Description: Binary data


dist_c.h
Description: Binary data


Makefile
Description: Binary data



Now the bad news: for small dimensions, e.g. 2D/3D-space,
the dist_fast function is 70-240x slower than a pure C implementation,
depending on the architecture.

For example, in 2D-space on an Intel Pentium 4 (3.0GHz, 1M L2 cache),
a single call to dist_fast takes about 1.75 microseconds (or  
0.0175s),

while a call to dist_c (C implementation of Eucl. dist), takes about
0.025 microseconds (70x slowdown).

On a Core 2 2.5GHz with 6MB L2 this maps to 1.9 and 0.008 microseconds,
resp. (i.e. 240x slower), while on a Core i7 2.66GHz with 8MB L2 the  
numbers

are 1.53 and 0.011 microseconds (i.e. 140x slower).

For larger dimensions, the gap becomes less big, but is still
worrying: 10D: 40-110x; 100D: 10-17x; 1kD: 2.5x-6x.

I'm mostly interested in the range 10D to 100D, so seeing that
Haskell is over 10x and up to 100x slower than C is kind of
making me cry.

I've tried some things to improve on this without much luck,
on the contrary:

*) Marking dist_fast for inlining makes things worse; in general
the inlined version is 2x slower for low dimensionality, and even
5x slower for larger dimensionality.
This was somewhat surprising to me...

*) In a moment of weakness, I used the Foreign Function
Interface to call the dist_c C-only implementation from Haskell.
Unfortunately, there seems to be a lot of overhead in calling
dist_c from Haskell. Most of the performance gain from using
C melts away, and sometimes the performance of the FFI'd
dist_c is 15-30% worse than the native dist_fast version
(especially at low dimensionality).

Only for the largest dimensionalities (10k-100kD), the FFI'd
version reaches the performance of the native C approach.
But, since I'm mostly interested in the 10-100D range, this is
of little use to me.

One thing I noticed is that compiling through C using
-fvia-C -optc-O3 might be a bad idea, depending on your system.

On an Intel Pentium 4 system, -fvia-C -optc-O3 was giving me
a speedup of up 70% (large dim.), while on Core 2 and Core i7
it resulted in a slowdown of 15-20% !
I was using roughly equal versions of GCC with this, i.e. a
self-bootstrapped GCC 4.3.x.


So, my question to the list if simple: how can I get better
performance out of a Haskell-only approach?
Any comments/suggestions are highly appreciated.

I'd prefer a Haskell-only approach, but my main concern is speed.
The Euclidean distance function will be used quite heavily in various  
tools.


I currently have a C-version of some of the tools, but the amount of  
code that is
needed for those tools is becoming ridiculously big. I believe using  
Haskell
will allow me to come up with a more easy to maintain code base.  
However,

I don't want to pay a huge price for this in terms of performance.

greetings,

Kenneth

[1] MicroBench: 
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/microbench

--

Kenneth Hoste
Paris research group - 

Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-18 Thread Claus Reinke

My current best try uses the uvector package, has two 'vectors' of type
(UArr Double)  as input, and relies on the sumU and zipWithU functions
which use streaming to compute the result:

dist_fast :: UArr Double - UArr Double - Double
dist_fast p1 p2 = sumDs `seq` sqrt sumDs
where
sumDs = sumU ds
ds= zipWithU euclidean p1 p2
euclidean x y = d*d
where
d = x-y


You'll probably want to make sure that 'euclidian' is specialized to
the types you need (here 'Double'), not used overloaded for 'Num a=a'
(check -ddump-tc, or -ddump-simpl output).

After that, unrolling the fused fold loop (uvector internal) might help
a bit, but isn't there yet:

http://hackage.haskell.org/trac/ghc/ticket/3123
http://hackage.haskell.org/trac/ghc/wiki/Inlining

And even if that gets implemented, it doesn't apply directly to your
case, where the loop is in a library, but you might want to control its
unrolling in your client code. Having the loop unrolled by a default
factor (8x or so) should help for loops like this, with little computation.

Claus


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-18 Thread Don Stewart
kenneth.hoste:
 Hello,

 For a while now, I've been trying to come up with a fast Haskell-only
 function which implements Euclidean distance in n-dimensional space.

 So far, I've been disappointed by the performance of my best effort
 in Haskell, compared to C. I'm hoping some of the Haskell experts
 and/or performance gurus on this list can help me out on resolving this,
 and also maybe shed some light on some strange (imho) things I've run
 into.

 My current best try uses the uvector package, has two 'vectors' of type
 (UArr Double)  as input, and relies on the sumU and zipWithU functions
 which use streaming to compute the result:

 dist_fast :: UArr Double - UArr Double - Double
 dist_fast p1 p2 = sumDs `seq` sqrt sumDs
 where
 sumDs = sumU ds
 ds= zipWithU euclidean p1 p2
 euclidean x y = d*d
 where
 d = x-y

The problem in your uvector code is the use of lists, rather than uvector
generators. Replace [1..n] with enumFromTo:

import Control.Monad
import System.Environment
import System.IO
import Data.Array.Vector

dist :: UArr Double - UArr Double - Double
dist p1 p2 = sumU (zipWithU euclidean p1 p2)
where
euclidean x y = d*d where d = x-y

main = do
[dim] - map read `fmap` getArgs

print $
  dist
(enumFromToFracU 1.0 dim)
(enumFromToFracU 1.0 dim)

Now the entire thiing will fuse to a loop.

  $s$wfold_s1RR :: Double# - Double# - Double# - Double#

  $s$wfold_s1RR =
\ (sc_s1RH :: Double#)
  (sc1_s1RI :: Double#)
  (sc2_s1RJ :: Double#) -
  case ## sc1_s1RI a5_s1QR of wild4_a1tn {
False -
  case ## sc_s1RH a5_s1QR of wild5_X1vx {
False -
  let {
x1_a1Jg [ALWAYS Just L] :: Double#

x1_a1Jg = -## sc1_s1RI sc_s1RH } in
  $s$wfold_s1RR
(+## sc_s1RH 1.0)
(+## sc1_s1RI 1.0)
(+## sc2_s1RJ (*## x1_a1Jg x1_a1Jg));
True - sc2_s1RJ
  };
True - sc2_s1RJ
  }; } in
case $s$wfold_s1RR 1.0 1.0 0.0 of ww_s1QH { __DEFAULT -
a19 (D# ww_s1QH)

and this assembly:

$ ghc -O2 -fvia-C -optc-O3

  s1T9_info:
  movsd   5(%rbx), %xmm7
  ucomisd %xmm7, %xmm6
  ja  .L15
  ucomisd %xmm7, %xmm5
  jbe .L18
.L14:
.L15:
  movsd   (%rbp), %xmm5
  leaq8(%rbp), %rbp
  jmp *(%rbp)
.L18:
  movapd  %xmm6, %xmm7
  subsd   %xmm5, %xmm7
  mulsd   %xmm7, %xmm7
  addsd   (%rbp), %xmm7
  movsd   %xmm7, (%rbp)
  movsd   .LC0(%rip), %xmm7
  addsd   %xmm7, %xmm5
  addsd   %xmm7, %xmm6
  jmp s1T9_info

Which I'd wager will match the C, which has to allocate the two arrays (GHC
essentially decides it doesn't need the arrays any more.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-18 Thread Claus Reinke

dist_fast :: UArr Double - UArr Double - Double
dist_fast p1 p2 = sumDs `seq` sqrt sumDs
where
sumDs = sumU ds
ds= zipWithU euclidean p1 p2
euclidean x y = d*d
where
d = x-y


You'll probably want to make sure that 'euclidian' is specialized to
the types you need (here 'Double'), not used overloaded for 'Num a=a'
(check -ddump-tc, or -ddump-simpl output).


Sorry about that misdirection - as it happened, I was looking at the 
tc output for 'dist_fast' (euclidean :: forall a. (Num a) = a - a - a), 
but the simpl output for 'dist_fast_inline' .., which uses things like 


   __inline_me ..
   case Dist.sumU (Dist.$wzipWithU ..
   GHC.Num.- @ GHC.Types.Double GHC.Float.$f9 x_aLt y_aLv 

Once I actually add a 'dist_fast_inline_caller', that indirection 
disappears in the inlined code, just as it does for dist_fast itself.


   dist_fast_inlined_caller :: UArr Double - UArr Double - Bool
   dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2  2

However, in the simpl output for 'dist_fast_inline_caller', the
'sumU' and 'zipWithU' still don't seem to be fused - Don?

Claus


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-18 Thread Don Stewart
claus.reinke:
 dist_fast :: UArr Double - UArr Double - Double
 dist_fast p1 p2 = sumDs `seq` sqrt sumDs
 where
 sumDs = sumU ds
 ds= zipWithU euclidean p1 p2
 euclidean x y = d*d
 where
 d = x-y

 You'll probably want to make sure that 'euclidian' is specialized to
 the types you need (here 'Double'), not used overloaded for 'Num a=a'
 (check -ddump-tc, or -ddump-simpl output).

 Sorry about that misdirection - as it happened, I was looking at the tc 
 output for 'dist_fast' (euclidean :: forall a. (Num a) = a - a - a),  
 but the simpl output for 'dist_fast_inline' .., which uses things like 

__inline_me ..
case Dist.sumU (Dist.$wzipWithU ..
GHC.Num.- @ GHC.Types.Double GHC.Float.$f9 x_aLt y_aLv 

 Once I actually add a 'dist_fast_inline_caller', that indirection  
 disappears in the inlined code, just as it does for dist_fast itself.

dist_fast_inlined_caller :: UArr Double - UArr Double - Bool
dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2  2

 However, in the simpl output for 'dist_fast_inline_caller', the
 'sumU' and 'zipWithU' still don't seem to be fused - Don?

All the 'seq's and so on should be unnecessary, and even so, I still get
the expected fusion:

import Control.Monad
import System.Environment
import System.IO
import Data.Array.Vector

{-
dist :: UArr Double - UArr Double - Double
dist p1 p2 = sumU (zipWithU euclidean p1 p2)
where
euclidean x y = d*d where d = x-y
-}

main = do
[dim] - map read `fmap` getArgs

print $
  dist_fast_inlined
(enumFromToFracU 1.0 dim)
(enumFromToFracU 1.0 dim)

dist_fast_inlined :: UArr Double - UArr Double - Double
{-# INLINE dist_fast_inlined #-}
dist_fast_inlined p1 p2 = sumDs `seq` sqrt sumDs
where
sumDs = sumU ds
ds= zipWithU euclidean p1 p2
euclidean x y = d*d
where
d = x-y

{-

19 RuleFired
2 /##
3 SC:$wfold0
5 int2Double#
1 map
1 mapList
3 streamU/unstreamU
2 truncate/Double-Int
1 unpack
1 unpack-list

$s$wfold_s1TB :: Double# - Double# - Double# - Double#

-}
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-18 Thread Claus Reinke
Once I actually add a 'dist_fast_inline_caller', that indirection  
disappears in the inlined code, just as it does for dist_fast itself.


   dist_fast_inlined_caller :: UArr Double - UArr Double - Bool
   dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2  2

However, in the simpl output for 'dist_fast_inline_caller', the
'sumU' and 'zipWithU' still don't seem to be fused - Don?


All the 'seq's and so on should be unnecessary, and even so, I still get
the expected fusion:


As I said, I don't get the fusion if I just add the function above to the 
original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl':


Dist.dist_fast_inlined_caller =
 \ (w1_s1nb :: Data.Array.Vector.UArr.UArr GHC.Types.Double)
   (w2_s1nc :: Data.Array.Vector.UArr.UArr GHC.Types.Double) -
   case (Dist.$wzipWithU Dist.lvl2 w1_s1nb w2_s1nc)
`cast` (trans
  Data.Array.Vector.UArr.TFCo:R56:UArr
  Data.Array.Vector.UArr.NTCo:R56:UArr
:: Data.Array.Vector.UArr.UArr GHC.Types.Double
 ~
   Data.Array.Vector.Prim.BUArr.BUArr GHC.Types.Double)
   of _
   { Data.Array.Vector.Prim.BUArr.BUArr ipv_s1lb
ipv1_s1lc
ipv2_s1ld -
   letrec {
 $wfold_s1nN :: GHC.Prim.Double#
- GHC.Prim.Int#
- GHC.Prim.Double#
 LclId
 [Arity 2
  Str: DmdType LL]
 $wfold_s1nN =
   \ (ww_s1mZ :: GHC.Prim.Double#) (ww1_s1n3 :: GHC.Prim.Int#) -
 case GHC.Prim.==# ww1_s1n3 ipv1_s1lc of _ {
   GHC.Bool.False -
 $wfold_s1nN
   (GHC.Prim.+##
  ww_s1mZ
  (GHC.Prim.indexDoubleArray#
 ipv2_s1ld (GHC.Prim.+# ipv_s1lb ww1_s1n3)))
   (GHC.Prim.+# ww1_s1n3 1);
   GHC.Bool.True - ww_s1mZ
 }; } in
   case $wfold_s1nN 0.0 0 of ww_s1n7 { __DEFAULT -
   GHC.Prim.## (GHC.Prim.sqrtDouble# ww_s1n7) 2.0
   }
   }

Claus


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-18 Thread Don Stewart
claus.reinke:
 Once I actually add a 'dist_fast_inline_caller', that indirection   
 disappears in the inlined code, just as it does for dist_fast itself.

dist_fast_inlined_caller :: UArr Double - UArr Double - Bool
dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2  2

 However, in the simpl output for 'dist_fast_inline_caller', the
 'sumU' and 'zipWithU' still don't seem to be fused - Don?

 All the 'seq's and so on should be unnecessary, and even so, I still get
 the expected fusion:

 As I said, I don't get the fusion if I just add the function above to the 
 original Dist.hs, export it and compile the module with '-c -O2 
 -ddump-simpl':

I can't reproduce this.

Does the complete program fragment I posted earlier yield the desired
result?

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C

2009-05-18 Thread Claus Reinke
As I said, I don't get the fusion if I just add the function above to the 
original Dist.hs, export it and compile the module with '-c -O2 
-ddump-simpl':


I can't reproduce this.


Interesting. I'm using ghc 6.11.20090320 (windows), uvector-0.1.0.3. 
I attach the modified Dist.hs and its simpl output, created via:


   ghc -c Dist.hs -O2 -ddump-tc -ddump-simpl-stats -ddump-simpl  Dist.dumps

Perhaps others can confirm the effect? Note that the 'dist_fast' in the 
same module does get fused, so it is not likely an options issue. I still 
suspect that the inlining of the 'Dist.zipWith' wrapper in the 'dist_fast_inlined'

'__inline_me' has some significance - it is odd to see inlined code in an
'__inline_me' and the fusion rule won't trigger on 'Dist.sumU . 
Dist.$wzipWithU',
right?


Does the complete program fragment I posted earlier yield the desired
result?


Yes. Note that the original poster also reported slowdown from
use of 'dist_fast_inlined'.

Claus



Dist.dumps
Description: Binary data


Dist.hs
Description: Binary data
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe