> This is a general problem when working with RULES-based
> optimisations. Here is an example of what happens: suppose we have
>
> foo :: Vector Int -> Vector Int
> foo xs = map (+1) xs
>
> Now, GHC will generate a nice tight loop for this but if in a
> different module, we have something like this:
>
> bar xs = foo (foo xs)
>
> then this won't fuse because (a) foo won't be inlined and (b) even if
> GHC did inline here, it would inline the nice tight loop which can't
> possibly fuse instead of the original map which can. By slapping an
> INLINE pragma on foo, you're telling GHC to (almost) always inline the
> function and to use the original definition for inlining, thus giving
> it a chance to fuse.
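The situation described above can be reproduced without the vector package. The following is only a minimal sketch: myMap and its rewrite rule are made-up stand-ins for a fusible library function (they are not part of vector or base), and the rule only has a chance to fire when compiling with optimisation (-O or higher).

```haskell
module Main (main) where

-- Hypothetical stand-in for a fusible library function. NOINLINE keeps the
-- name visible so the rewrite rule below has something to match against.
myMap :: (a -> b) -> [a] -> [b]
myMap f = go
  where
    go []     = []
    go (y:ys) = f y : go ys
{-# NOINLINE myMap #-}

-- Fusion rule: two traversals become one. GHC can only apply it where it
-- sees a literal 'myMap f (myMap g xs)' in the code it is optimising.
{-# RULES
"myMap/myMap" forall f g xs. myMap f (myMap g xs) = myMap (f . g) xs
  #-}

-- INLINE asks GHC to substitute the original right-hand side of 'foo' at
-- call sites, so the two 'myMap' calls in 'bar' end up next to each other.
foo :: [Int] -> [Int]
foo xs = myMap (+1) xs
{-# INLINE foo #-}

bar :: [Int] -> [Int]
bar xs = foo (foo xs)

main :: IO ()
main = print (bar [1, 2, 3])  -- prints [3,4,5]
```

The result is the same whether or not the rule fires; only the number of traversals differs. Compiling with -ddump-rule-firings is one way to check whether "myMap/myMap" was applied.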
thanks for the insight, roman!

>> the downside after adding the INLINE pragmas is that now some of my modules take
>> _really_ long to compile (up to a couple of minutes); any ideas where i can
>> start looking to bring the compilation times down again?
>
> Alas, stream fusion (and fusion in general, I guess) requires what I
> would call whole loop compilation - you need to inline everything into
> loops. That tends to be slow. I don't know what your code looks like
> but you could try to control inlining a bit more. For instance, if you
> have something like this:
>
> foo ... = ... map f xs ...
>   where
>     f x = ...
>
> you could tell GHC not to inline f until fairly late in the game by adding
>
> {-# INLINE [0] f #-}
>
> to the where clause. This helps sometimes.

thanks, i'll check it out.

> I'm surprised -Odph doesn't produce faster code than -O2. In any
> case, you could try turning these flags on individually (esp.
> -fno-method-sharing and the spec-constr flags) to see how they affect
> performance and compilation times.

in the end it turned out that i had forgotten another INLINE pragma and in my crude benchmarks -O2 and -Odph give basically the same results, -O2 being a little faster. i hope i'll have time next week to do proper benchmarks, and i also want to try ghc HEAD with the llvm patches.

         conv_1   conv_2   conv_3
-Odph    1.004    2.715    1.096
-O2      1.000    2.710    1.097

i'm still curious, though, why my three versions of direct convolution perform so differently (see attached file). in particular, i somehow expected conv_3 to be the slowest and conv_2 to perform similarly to conv_1. any ideas? i haven't had a look at the core yet, mainly because i'm lacking the expertise ...

<sk>
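Roman's suggestion about delaying the inlining of a local function might look roughly like this. It is a sketch on plain lists, and scale and f are made-up names used only for illustration; whether the pragma actually helps compile times depends on the surrounding code.

```haskell
module Main (main) where

-- Hypothetical example: 'scale' and its local worker 'f' are made-up names.
scale :: Int -> [Int] -> [Int]
scale k xs = map f xs
  where
    -- Phase-controlled inlining: the simplifier phases count down to 0,
    -- so GHC will not inline 'f' before the final phase.
    {-# INLINE [0] f #-}
    f x = k * x + 1

main :: IO ()
main = print (scale 2 [1, 2, 3])  -- prints [3,5,7]
```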
import Data.Vector.Generic (Vector, (!))
import qualified Data.Vector.Generic as V

conv_1, conv_2, conv_3 :: (Num a, Vector v a) => v a -> v a -> v a

-- explicit accumulator loop over the overlapping indices
{-# INLINE conv_1 #-}
conv_1 h x = V.generate (l+m) f
    where
        m = V.length h - 1
        l = V.length x
        {-# INLINE f #-}
        f n = g 0 n (max 0 (n-l+1)) (min n m)
        g y n m k =
            if m <= k
            then let y' = y + (h ! m) * (x ! (n-m))
                 in y' `seq` g y' n (m+1) k
            else y

-- per-output dot product of a slice of h with a reversed slice of x
{-# INLINE conv_2 #-}
conv_2 h x = V.generate (l+m) f
    where
        l = V.length x
        m = V.length h - 1
        {-# INLINE f #-}
        f n = let j = max 0 (n-l+1)
                  k = (min n m) - j + 1
              in V.sum (V.zipWith (*) (V.slice j k h)
                                      (V.reverse (V.slice (n - j - k + 1) k x)))

-- zero-pad x on both sides, then slide the reversed kernel over it
{-# INLINE conv_3 #-}
conv_3 h x = V.generate (l+m-1) f
    where
        m  = V.length h
        l  = V.length x
        p  = V.replicate (m-1) 0
        x' = p V.++ x V.++ p
        {-# INLINE f #-}
        f i = V.sum (V.zipWith (*) (V.reverse h) (V.slice i m x'))
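Before comparing timings, it may be worth confirming that the three versions compute the same thing. The following is a minimal list-based reference for direct convolution, y[n] = sum over k of h[k] * x[n-k]. convRef is a made-up helper that is not part of the attachment; it is slow by design and is meant only as a cross-check. Returning an empty list for empty inputs is a choice made here, not something taken from the attachment.

```haskell
module Main (main) where

-- Reference direct convolution on lists (hypothetical helper):
-- output n is the sum of h[k] * x[n-k] over all k where both indices are valid.
convRef :: Num a => [a] -> [a] -> [a]
convRef h x
  | null h || null x = []
  | otherwise        = [ term n | n <- [0 .. lh + lx - 2] ]
  where
    lh = length h
    lx = length x
    term n = sum [ (h !! k) * (x !! (n - k))
                 | k <- [max 0 (n - lx + 1) .. min n (lh - 1)] ]

main :: IO ()
main = print (convRef [1, 2, 3] [4, 5, 6 :: Int])  -- prints [4,13,28,27,18]
```

With the vector package available, each conv_i could then be checked on non-empty inputs by converting with V.fromList and V.toList at a concrete vector type and comparing the result against convRef.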
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe