Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
On May 18, 2009, at 15:28 , Claus Reinke wrote: My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double - UArr Double - Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=a' (check -ddump-tc, or -ddump-simpl output). I understand from your later post that is was in fact specialized, but how do I make sure it _is_ specialized? Can I just add a type signature in the dist_fast definition for euclidean, or should I define euclidean outside of dist_fast, with an explicit type signature? If the latter, won't that hurt performance? Or should marking it INLINE take care of that? After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet: http://hackage.haskell.org/trac/ghc/ticket/3123 http://hackage.haskell.org/trac/ghc/wiki/Inlining And even if that gets implemented, it doesn't apply directly to your case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation. This seems rather serious, and might be one of the bigger reasons why I'm getting nowhere close to C in terms of performance... The loop body is ridiculously small, so it would make sense to unroll it somewhat to help avoid the loop overhead. However, it seems like GHC isn't able to do that now. Is there any way to unroll the loop myself, to speed things up? Seems hard, because I'm using uvector... K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
On May 18, 2009, at 20:54 , Claus Reinke wrote: As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl': I can't reproduce this. Interesting. I'm using ghc 6.11.20090320 (windows), uvector-0.1.0.3. I attach the modified Dist.hs and its simpl output, created via: ghc -c Dist.hs -O2 -ddump-tc -ddump-simpl-stats -ddump-simpl Dist.dumps Perhaps others can confirm the effect? Note that the 'dist_fast' in the same module does get fused, so it is not likely an options issue. I still suspect that the inlining of the 'Dist.zipWith' wrapper in the 'dist_fast_inlined' '__inline_me' has some significance - it is odd to see inlined code in an '__inline_me' and the fusion rule won't trigger on 'Dist.sumU . Dist. $wzipWithU', right? As far as I can tell, the dist_fast_inlined doesn't get fused, i.e. I'm seeing zipWithU and sumU being used in it, which is not the case in dist_fast. This is on OS X/PowerPC, using GHC 6.10.1. Does the complete program fragment I posted earlier yield the desired result? Yes. Note that the original poster also reported slowdown from use of 'dist_fast_inlined'. Don, you were defining dist inside the main module, while in our case the dist functions are defined in a seperate Dist.hs module... Would that matter? K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
Hello! On Monday 18 May 2009 14:37:51 Kenneth Hoste wrote: I'm mostly interested in the range 10D to 100D is the dimension known at compile-time? Then you could consider Template Haskell. I wrote up some code for generating the vector types and vector subtraction/inner product below, HTH. One problem is that I'm using a typeclass and apparently you can't make {-# SPECIALISE #-} pragmas with TH, so let's hope it is automatically specialised by GHC. Greetings, Daniel TH.hs -- {-# LANGUAGE TemplateHaskell #-} {-# OPTIONS -fglasgow-exts #-} module TH where import Language.Haskell.TH import Control.Monad -- Non-TH stuff class InnerProductSpace v r | v - r where innerProduct :: v - v - r class AbGroup v where minus :: v - v - v euclidean x y = case minus x y of z - sqrt $! innerProduct z z -- TH noContext :: Q Cxt noContext = return [] strict :: Q Type - StrictTypeQ strict = liftM ((,) IsStrict) makeVectors :: Int -- ^ Dimension - Q Type -- ^ Component type, assumed to be a 'Num' - String -- ^ Name for the generated type - Q [Dec] makeVectors n ctyp name0 = do -- let's assume ctyp = Double, name = Vector for the comments -- generate names for the variables we will need xs - replicateM n (newName x) ys - replicateM n (newName y) let name = mkName name0 -- shorthands for arithmetic expressions; the first takes expressions, -- the others take variable names sumE e1 e2 = infixE (Just e1) [|(+)|] (Just e2) varDiffE e1 e2 = infixE (Just (varE e1)) [|(-)|] (Just (varE e2)) varProdE e1 e2 = infixE (Just (varE e1)) [|(*)|] (Just (varE e2)) conPat vars = conP name (fmap varP vars) -- data Vector = Vector !Double ... !Double theDataD = dataD noContext name [] -- no context, no params [normalC name (replicate n (strict ctyp))] [''Eq,''Ord,''Show] -- 'deriving' clause innerProdD = -- instance InnerProductSpace Vector Double where ... instanceD noContext ( conT ''InnerProductSpace `appT` conT name `appT` ctyp) -- innerProduct = ... [valD (varP 'innerProduct) (normalB -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) - (lamE [conPat xs, conPat ys] -- x1*y1 + + xn*yn + 0 (foldl sumE [|0|] $ zipWith varProdE xs ys) )) [] -- no 'where' clause ] abGroupD = instanceD noContext ( conT ''AbGroup `appT` conT name) -- minus = ... [valD (varP 'minus) (normalB -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) - (lamE [conPat xs, conPat ys] -- Vector (x1-y1) ... (xn-yn) (foldl appE (conE name) $ zipWith varDiffE xs ys) )) [] -- no 'where' clause ] sequence [theDataD,innerProdD,abGroupD] Main.hs -- {-# LANGUAGE TemplateHaskell #-} {-# LANGUAGE MultiParamTypeClasses #-} module Main where import TH $(makeVectors 3 [t|Double|] Vec3) main = print $ euclidean (Vec3 1 1 1) (Vec3 0 0 0) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
I understand from your later post that is was in fact specialized, but how do I make sure it _is_ specialized? -ddump-tc seems to give the generalized type, so it seems you'd need to look at the -ddump-simpl output if you want to know whether a local function is specialized. http://www.haskell.org/haskellwiki/Performance/GHC#Looking_at_the_Core Can I just add a type signature in the dist_fast definition for euclidean, If you need it at exactly one type, that is the simplest way. There's also the SPECIALIZE pragma http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#specialize-pragma and for local and non-exported functions '-O' should enable auto-specialization http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet: http://hackage.haskell.org/trac/ghc/ticket/3123 http://hackage.haskell.org/trac/ghc/wiki/Inlining And even if that gets implemented, it doesn't apply directly to your case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation. This seems rather serious, and might be one of the bigger reasons why I'm getting nowhere close to C in terms of performance... You should be able to get near (a small factor) without it, but not having it leaves a substantial gap in performance, especially for simple loop bodies (there is also the case of enabling fusion over multiple loop iterations, but that may call for proper fix-point fusion). The loop body is ridiculously small, so it would make sense to unroll it somewhat to help avoid the loop overhead. However, it seems like GHC isn't able to do that now. Apart from the links above, the new backend also has relevant TODO items: http://hackage.haskell.org/trac/ghc/wiki/BackEndNotes Is there any way to unroll the loop myself, to speed things up? Seems hard, because I'm using uvector... You could 'cabal unpack' uvector, hand-unroll the core loop all its operations get fused into, then reinstall the modified package.. (perhaps that should be a package configuration option, at least until GHC gets loop or recursion unrolling - since this would be a temporary workaround, it would probably not be worth it to have multiple package versions with different unroll factors; if this actually helps uvector users with performance in practice, they could suggest it as a patch). Claus ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
Hi, meh, I just realised that there is no sensible way to actually introduce/eliminate the generated types. I'm attaching a revised version with fromList/toList functions. Maybe the vector type should be polymorphic and be an instance of Functor, Monad and Foldable? But then we really depend on specialisation. Greetings, Daniel {-# LANGUAGE TemplateHaskell #-} {-# OPTIONS -fglasgow-exts #-} module TH where import Language.Haskell.TH import Control.Monad -- Non-TH stuff class InnerProductSpace v r | v - r where innerProduct :: v - v - r class AbGroup v where minus :: v - v - v class FromToList v r | v - r where fromList :: [r] - Maybe v toList :: v - [r] euclidean x y = case minus x y of z - sqrt $! innerProduct z z -- TH noContext :: Q Cxt noContext = return [] strict :: Q Type - StrictTypeQ strict = liftM ((,) IsStrict) makeVectors :: Int -- ^ Dimension - Q Type -- ^ Component type, assumed to be a 'Num' - String -- ^ Name for the generated type - Q [Dec] makeVectors n ctyp name0 = do -- let's assume ctyp = Double, name = Vector for the comments -- generate names for the variables we will need xs - replicateM n (newName x) ys - replicateM n (newName y) lst - newName list let name = mkName name0 -- shorthands for arithmetic expressions; the first takes expressions, -- the others take variable names sumE e1 e2 = infixE (Just e1) [|(+)|] (Just e2) varDiffE e1 e2 = infixE (Just (varE e1)) [|(-)|] (Just (varE e2)) varProdE e1 e2 = infixE (Just (varE e1)) [|(*)|] (Just (varE e2)) conPat vars = conP name (fmap varP vars) -- data Vector = Vector !Double ... !Double theDataD = dataD noContext name [] -- no context, no params [normalC name (replicate n (strict ctyp))] [''Eq,''Ord,''Show] -- 'deriving' clause innerProdD = -- instance InnerProductSpace Vector Double where ... instanceD noContext ( conT ''InnerProductSpace `appT` conT name `appT` ctyp) -- innerProduct = ... [valD (varP 'innerProduct) (normalB -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) - (lamE [conPat xs, conPat ys] -- x1*y1 + + xn*yn + 0 (foldl sumE [|0|] $ zipWith varProdE xs ys) )) [] -- no 'where' clause ] abGroupD = instanceD noContext ( conT ''AbGroup `appT` conT name) -- minus = ... [valD (varP 'minus) (normalB -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) - (lamE [conPat xs, conPat ys] -- Vector (x1-y1) ... (xn-yn) (foldl appE (conE name) $ zipWith varDiffE xs ys) )) [] -- no 'where' clause ] fromToListD = instanceD noContext ( conT ''FromToList `appT` conT name `appT` ctyp) [ funD 'fromList [ clause [listP $ fmap varP xs] (normalB ([|Just|] `appE` (foldl appE (conE name) $ fmap varE xs))) [] , clause [wildP] (normalB [|Nothing|]) [] -- wrong number of elements ] , funD 'toList [ clause [conPat xs] (normalB (listE (fmap varE xs))) []] ] sequence [theDataD,innerProdD,abGroupD,fromToListD] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
On May 19, 2009, at 13:24 , Daniel Schüssler wrote: Hello! On Monday 18 May 2009 14:37:51 Kenneth Hoste wrote: I'm mostly interested in the range 10D to 100D is the dimension known at compile-time? Then you could consider Template Haskell. In general, no. :-) It will be known for some applications, but not for others. I'm more and more amazed what comes into play just to implement something simple like n-dim. Euclidean distance relatively fast using Haskell. It seems to me that GHC is missing several critical optimizations (yes, I know, patches welcome) to enable it to emit fast code for HPC applications. I'm still a big fan of Haskell, for a variety of reasons, but it seems like it's not ready yet for the task I had in mind, which is a shame. Just to be clear, this isn't a flame bait post or anything, just my 2 cents. K. I wrote up some code for generating the vector types and vector subtraction/inner product below, HTH. One problem is that I'm using a typeclass and apparently you can't make {-# SPECIALISE #-} pragmas with TH, so let's hope it is automatically specialised by GHC. Greetings, Daniel TH.hs -- {-# LANGUAGE TemplateHaskell #-} {-# OPTIONS -fglasgow-exts #-} module TH where import Language.Haskell.TH import Control.Monad -- Non-TH stuff class InnerProductSpace v r | v - r where innerProduct :: v - v - r class AbGroup v where minus :: v - v - v euclidean x y = case minus x y of z - sqrt $! innerProduct z z -- TH noContext :: Q Cxt noContext = return [] strict :: Q Type - StrictTypeQ strict = liftM ((,) IsStrict) makeVectors :: Int -- ^ Dimension - Q Type -- ^ Component type, assumed to be a 'Num' - String -- ^ Name for the generated type - Q [Dec] makeVectors n ctyp name0 = do -- let's assume ctyp = Double, name = Vector for the comments -- generate names for the variables we will need xs - replicateM n (newName x) ys - replicateM n (newName y) let name = mkName name0 -- shorthands for arithmetic expressions; the first takes expressions, -- the others take variable names sumE e1 e2 = infixE (Just e1) [|(+)|] (Just e2) varDiffE e1 e2 = infixE (Just (varE e1)) [|(-)|] (Just (varE e2)) varProdE e1 e2 = infixE (Just (varE e1)) [|(*)|] (Just (varE e2)) conPat vars = conP name (fmap varP vars) -- data Vector = Vector !Double ... !Double theDataD = dataD noContext name [] -- no context, no params [normalC name (replicate n (strict ctyp))] [''Eq,''Ord,''Show] -- 'deriving' clause innerProdD = -- instance InnerProductSpace Vector Double where ... instanceD noContext ( conT ''InnerProductSpace `appT` conT name `appT` ctyp) -- innerProduct = ... [valD (varP 'innerProduct) (normalB -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) - (lamE [conPat xs, conPat ys] -- x1*y1 + + xn*yn + 0 (foldl sumE [|0|] $ zipWith varProdE xs ys) )) [] -- no 'where' clause ] abGroupD = instanceD noContext ( conT ''AbGroup `appT` conT name) -- minus = ... [valD (varP 'minus) (normalB -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) - (lamE [conPat xs, conPat ys] -- Vector (x1-y1) ... (xn-yn) (foldl appE (conE name) $ zipWith varDiffE xs ys) )) [] -- no 'where' clause ] sequence [theDataD,innerProdD,abGroupD] Main.hs -- {-# LANGUAGE TemplateHaskell #-} {-# LANGUAGE MultiParamTypeClasses #-} module Main where import TH $(makeVectors 3 [t|Double|] Vec3) main = print $ euclidean (Vec3 1 1 1) (Vec3 0 0 0) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] fast Eucl. dist. - Haskell vs C
Hi Kenneth, I wrote a benchmark similar to yours using the haskell blas library I wrote (latest version is on github at http://github.com/patperry/blas/tree/master , an older version is on hackage). The pure Haskell implementation is not very good, and the problem seems to be repeated boxing/unboxing of doubles. It may be possible to avoid this by writing a custom fold function for Vectors, but I haven't tried it. Using a vectorized subtraction function along with the BLAS1 dnrm2 function, I was able to get performance about 2.8-5.4x worse than C. This isn't great, but it's not terrible, either. I didn't have the performance problems you ran in to using the FFI. My benchmark includes: distVectors1: a native implementation using foldl' and zipWith distVectors2: a vectorize implementation, defined as @distVectors2 x y = norm2 (x-y)@ distVectors3: a custom loop in haskell distVectors4: an FFI call to C All 4 functions use the Vector type, provided by the blas package. I compiled with -O -fexcess-preceision and ran timings for vectors of size 10, 100, and 1000. Here are the results: Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 10 n: 10 * distVectors1: . 1.366ns per iteration / 732029.06 per second. * distVectors2: . 1.286ns per iteration / 34.02 per second. * distVectors3: .. 0.950ns per iteration / 1052847.93 per second. * distVectors4: ... 0.452ns per iteration / 2213752.56 per second. Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 100 n: 100 * distVectors1: .. 12.496ns per iteration / 80025.79 per second. * distVectors2: 2.149ns per iteration / 465311.61 per second. * distVectors3: .. 8.728ns per iteration / 114572.53 per second. * distVectors4: .. 0.597ns per iteration / 1675765.65 per second. Macintosh-62:~/Code/blas/examples (euclidean)$ ./Euclidean 1000 n: 1000 * distVectors1: .. 125.920ns per iteration / 7941.52 per second. * distVectors2: .. 10.677ns per iteration / 93661.99 per second. * distVectors3: ... 85.533ns per iteration / 11691.42 per second. * distVectors4: 1.962ns per iteration / 509668.17 per second. The distVectors2 function is computing the norm in a more numerically-stable way. It is also allocating a temporary array to store x-y. I haven't looked over your benchmarking code too thoroughly, but parts of it seem suspect to me. Namely, you are using mapM instead of mapM_. Also, it seems like you are including GC times in your benchmarks. Attached is my code. You will need the latest blas from github to compile it. You also might want to look into hmatrix. I have very little time for haskell code right now (I'm trying to finish my Ph.D. in the next month). This summer I plan to work more on this stuff, and I'll probably make a release in July. Patrick Euclidean.hs Description: Binary data dist.c Description: Binary data ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] fast Eucl. dist. - Haskell vs C
Hello, For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space. So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into. My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double - UArr Double - Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y I've been benchmarking this function against various alternatives using the MicroBench [1] package, which allows to get accurate timings of single function calls. I've also mimicked the MicroBench approach in pure C, to get comparable timings for a C-only implementation. The C-only function is quite straightforward: double dist(int dim, double* p1, double* p2){ int i; double d = 0.0; for(i=0; i dim; i++){ d += (p1[i] - p2[i])*(p1[i] - p2[i]); } return sqrt(d); } (Note that the optimizer takes care of the p1-p2 operation appearing twice in the code). All code is attached if you'd like to play around with it. All numbers reported below are using GHC 6.10.2 and gcc 4.3.3 on Linux/x86. The compilation details can be found in the Makefile attached, but in short, I'm using -O2 -fexcess-precision or -O2 -fexcess-precision -fvia-C -optc-O3 with GHC, and -O3 with gcc. Dist.hs Description: Binary data microbench_dist.c Description: Binary data microbench_dist.hs Description: Binary data dist_c.c Description: Binary data dist_c.h Description: Binary data Makefile Description: Binary data Now the bad news: for small dimensions, e.g. 2D/3D-space, the dist_fast function is 70-240x slower than a pure C implementation, depending on the architecture. For example, in 2D-space on an Intel Pentium 4 (3.0GHz, 1M L2 cache), a single call to dist_fast takes about 1.75 microseconds (or 0.0175s), while a call to dist_c (C implementation of Eucl. dist), takes about 0.025 microseconds (70x slowdown). On a Core 2 2.5GHz with 6MB L2 this maps to 1.9 and 0.008 microseconds, resp. (i.e. 240x slower), while on a Core i7 2.66GHz with 8MB L2 the numbers are 1.53 and 0.011 microseconds (i.e. 140x slower). For larger dimensions, the gap becomes less big, but is still worrying: 10D: 40-110x; 100D: 10-17x; 1kD: 2.5x-6x. I'm mostly interested in the range 10D to 100D, so seeing that Haskell is over 10x and up to 100x slower than C is kind of making me cry. I've tried some things to improve on this without much luck, on the contrary: *) Marking dist_fast for inlining makes things worse; in general the inlined version is 2x slower for low dimensionality, and even 5x slower for larger dimensionality. This was somewhat surprising to me... *) In a moment of weakness, I used the Foreign Function Interface to call the dist_c C-only implementation from Haskell. Unfortunately, there seems to be a lot of overhead in calling dist_c from Haskell. Most of the performance gain from using C melts away, and sometimes the performance of the FFI'd dist_c is 15-30% worse than the native dist_fast version (especially at low dimensionality). Only for the largest dimensionalities (10k-100kD), the FFI'd version reaches the performance of the native C approach. But, since I'm mostly interested in the 10-100D range, this is of little use to me. One thing I noticed is that compiling through C using -fvia-C -optc-O3 might be a bad idea, depending on your system. On an Intel Pentium 4 system, -fvia-C -optc-O3 was giving me a speedup of up 70% (large dim.), while on Core 2 and Core i7 it resulted in a slowdown of 15-20% ! I was using roughly equal versions of GCC with this, i.e. a self-bootstrapped GCC 4.3.x. So, my question to the list if simple: how can I get better performance out of a Haskell-only approach? Any comments/suggestions are highly appreciated. I'd prefer a Haskell-only approach, but my main concern is speed. The Euclidean distance function will be used quite heavily in various tools. I currently have a C-version of some of the tools, but the amount of code that is needed for those tools is becoming ridiculously big. I believe using Haskell will allow me to come up with a more easy to maintain code base. However, I don't want to pay a huge price for this in terms of performance. greetings, Kenneth [1] MicroBench: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/microbench -- Kenneth Hoste Paris research group -
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double - UArr Double - Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=a' (check -ddump-tc, or -ddump-simpl output). After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet: http://hackage.haskell.org/trac/ghc/ticket/3123 http://hackage.haskell.org/trac/ghc/wiki/Inlining And even if that gets implemented, it doesn't apply directly to your case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation. Claus ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
kenneth.hoste: Hello, For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space. So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into. My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double - UArr Double - Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y The problem in your uvector code is the use of lists, rather than uvector generators. Replace [1..n] with enumFromTo: import Control.Monad import System.Environment import System.IO import Data.Array.Vector dist :: UArr Double - UArr Double - Double dist p1 p2 = sumU (zipWithU euclidean p1 p2) where euclidean x y = d*d where d = x-y main = do [dim] - map read `fmap` getArgs print $ dist (enumFromToFracU 1.0 dim) (enumFromToFracU 1.0 dim) Now the entire thiing will fuse to a loop. $s$wfold_s1RR :: Double# - Double# - Double# - Double# $s$wfold_s1RR = \ (sc_s1RH :: Double#) (sc1_s1RI :: Double#) (sc2_s1RJ :: Double#) - case ## sc1_s1RI a5_s1QR of wild4_a1tn { False - case ## sc_s1RH a5_s1QR of wild5_X1vx { False - let { x1_a1Jg [ALWAYS Just L] :: Double# x1_a1Jg = -## sc1_s1RI sc_s1RH } in $s$wfold_s1RR (+## sc_s1RH 1.0) (+## sc1_s1RI 1.0) (+## sc2_s1RJ (*## x1_a1Jg x1_a1Jg)); True - sc2_s1RJ }; True - sc2_s1RJ }; } in case $s$wfold_s1RR 1.0 1.0 0.0 of ww_s1QH { __DEFAULT - a19 (D# ww_s1QH) and this assembly: $ ghc -O2 -fvia-C -optc-O3 s1T9_info: movsd 5(%rbx), %xmm7 ucomisd %xmm7, %xmm6 ja .L15 ucomisd %xmm7, %xmm5 jbe .L18 .L14: .L15: movsd (%rbp), %xmm5 leaq8(%rbp), %rbp jmp *(%rbp) .L18: movapd %xmm6, %xmm7 subsd %xmm5, %xmm7 mulsd %xmm7, %xmm7 addsd (%rbp), %xmm7 movsd %xmm7, (%rbp) movsd .LC0(%rip), %xmm7 addsd %xmm7, %xmm5 addsd %xmm7, %xmm6 jmp s1T9_info Which I'd wager will match the C, which has to allocate the two arrays (GHC essentially decides it doesn't need the arrays any more. -- Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
dist_fast :: UArr Double - UArr Double - Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=a' (check -ddump-tc, or -ddump-simpl output). Sorry about that misdirection - as it happened, I was looking at the tc output for 'dist_fast' (euclidean :: forall a. (Num a) = a - a - a), but the simpl output for 'dist_fast_inline' .., which uses things like __inline_me .. case Dist.sumU (Dist.$wzipWithU .. GHC.Num.- @ GHC.Types.Double GHC.Float.$f9 x_aLt y_aLv Once I actually add a 'dist_fast_inline_caller', that indirection disappears in the inlined code, just as it does for dist_fast itself. dist_fast_inlined_caller :: UArr Double - UArr Double - Bool dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2 2 However, in the simpl output for 'dist_fast_inline_caller', the 'sumU' and 'zipWithU' still don't seem to be fused - Don? Claus ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
claus.reinke: dist_fast :: UArr Double - UArr Double - Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=a' (check -ddump-tc, or -ddump-simpl output). Sorry about that misdirection - as it happened, I was looking at the tc output for 'dist_fast' (euclidean :: forall a. (Num a) = a - a - a), but the simpl output for 'dist_fast_inline' .., which uses things like __inline_me .. case Dist.sumU (Dist.$wzipWithU .. GHC.Num.- @ GHC.Types.Double GHC.Float.$f9 x_aLt y_aLv Once I actually add a 'dist_fast_inline_caller', that indirection disappears in the inlined code, just as it does for dist_fast itself. dist_fast_inlined_caller :: UArr Double - UArr Double - Bool dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2 2 However, in the simpl output for 'dist_fast_inline_caller', the 'sumU' and 'zipWithU' still don't seem to be fused - Don? All the 'seq's and so on should be unnecessary, and even so, I still get the expected fusion: import Control.Monad import System.Environment import System.IO import Data.Array.Vector {- dist :: UArr Double - UArr Double - Double dist p1 p2 = sumU (zipWithU euclidean p1 p2) where euclidean x y = d*d where d = x-y -} main = do [dim] - map read `fmap` getArgs print $ dist_fast_inlined (enumFromToFracU 1.0 dim) (enumFromToFracU 1.0 dim) dist_fast_inlined :: UArr Double - UArr Double - Double {-# INLINE dist_fast_inlined #-} dist_fast_inlined p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y {- 19 RuleFired 2 /## 3 SC:$wfold0 5 int2Double# 1 map 1 mapList 3 streamU/unstreamU 2 truncate/Double-Int 1 unpack 1 unpack-list $s$wfold_s1TB :: Double# - Double# - Double# - Double# -} ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
Once I actually add a 'dist_fast_inline_caller', that indirection disappears in the inlined code, just as it does for dist_fast itself. dist_fast_inlined_caller :: UArr Double - UArr Double - Bool dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2 2 However, in the simpl output for 'dist_fast_inline_caller', the 'sumU' and 'zipWithU' still don't seem to be fused - Don? All the 'seq's and so on should be unnecessary, and even so, I still get the expected fusion: As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl': Dist.dist_fast_inlined_caller = \ (w1_s1nb :: Data.Array.Vector.UArr.UArr GHC.Types.Double) (w2_s1nc :: Data.Array.Vector.UArr.UArr GHC.Types.Double) - case (Dist.$wzipWithU Dist.lvl2 w1_s1nb w2_s1nc) `cast` (trans Data.Array.Vector.UArr.TFCo:R56:UArr Data.Array.Vector.UArr.NTCo:R56:UArr :: Data.Array.Vector.UArr.UArr GHC.Types.Double ~ Data.Array.Vector.Prim.BUArr.BUArr GHC.Types.Double) of _ { Data.Array.Vector.Prim.BUArr.BUArr ipv_s1lb ipv1_s1lc ipv2_s1ld - letrec { $wfold_s1nN :: GHC.Prim.Double# - GHC.Prim.Int# - GHC.Prim.Double# LclId [Arity 2 Str: DmdType LL] $wfold_s1nN = \ (ww_s1mZ :: GHC.Prim.Double#) (ww1_s1n3 :: GHC.Prim.Int#) - case GHC.Prim.==# ww1_s1n3 ipv1_s1lc of _ { GHC.Bool.False - $wfold_s1nN (GHC.Prim.+## ww_s1mZ (GHC.Prim.indexDoubleArray# ipv2_s1ld (GHC.Prim.+# ipv_s1lb ww1_s1n3))) (GHC.Prim.+# ww1_s1n3 1); GHC.Bool.True - ww_s1mZ }; } in case $wfold_s1nN 0.0 0 of ww_s1n7 { __DEFAULT - GHC.Prim.## (GHC.Prim.sqrtDouble# ww_s1n7) 2.0 } } Claus ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
claus.reinke: Once I actually add a 'dist_fast_inline_caller', that indirection disappears in the inlined code, just as it does for dist_fast itself. dist_fast_inlined_caller :: UArr Double - UArr Double - Bool dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2 2 However, in the simpl output for 'dist_fast_inline_caller', the 'sumU' and 'zipWithU' still don't seem to be fused - Don? All the 'seq's and so on should be unnecessary, and even so, I still get the expected fusion: As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl': I can't reproduce this. Does the complete program fragment I posted earlier yield the desired result? -- Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl': I can't reproduce this. Interesting. I'm using ghc 6.11.20090320 (windows), uvector-0.1.0.3. I attach the modified Dist.hs and its simpl output, created via: ghc -c Dist.hs -O2 -ddump-tc -ddump-simpl-stats -ddump-simpl Dist.dumps Perhaps others can confirm the effect? Note that the 'dist_fast' in the same module does get fused, so it is not likely an options issue. I still suspect that the inlining of the 'Dist.zipWith' wrapper in the 'dist_fast_inlined' '__inline_me' has some significance - it is odd to see inlined code in an '__inline_me' and the fusion rule won't trigger on 'Dist.sumU . Dist.$wzipWithU', right? Does the complete program fragment I posted earlier yield the desired result? Yes. Note that the original poster also reported slowdown from use of 'dist_fast_inlined'. Claus Dist.dumps Description: Binary data Dist.hs Description: Binary data ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe