I understand from your later post that is was in fact specialized, but
how do I make sure it _is_ specialized?
-ddump-tc seems to give the generalized type, so it seems you'd need
to look at the -ddump-simpl output if you want to know whether a
local function is specialized.
http://www.haskell.org/haskellwiki/Performance/GHC#Looking_at_the_Core
Can I just add a type signature in the dist_fast definition for euclidean,
If you need it at exactly one type, that is the simplest way. There's
also the SPECIALIZE pragma
http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#specialize-pragma
and for local and non-exported functions '-O' should enable auto-specialization
http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html
After that, unrolling the fused fold loop (uvector internal) might
help a bit, but isn't there yet:
http://hackage.haskell.org/trac/ghc/ticket/3123
http://hackage.haskell.org/trac/ghc/wiki/Inlining
And even if that gets implemented, it doesn't apply directly to your
case, where the loop is in a library, but you might want to control
its unrolling in your client code. Having the loop unrolled by a default
factor (8x or so) should help for loops like this, with little
computation.
This seems rather serious, and might be one of the bigger reasons why
I'm getting nowhere close to C in terms of performance...
You should be able to get near (a small factor) without it, but not having
it leaves a substantial gap in performance, especially for simple loop
bodies (there is also the case of enabling fusion over multiple loop
iterations, but that may call for proper fix-point fusion).
The loop body is ridiculously small, so it would make sense to
unroll it somewhat to help avoid the loop overhead.
However, it seems like GHC isn't able to do that now.
Apart from the links above, the new backend also has relevant TODO
items: http://hackage.haskell.org/trac/ghc/wiki/BackEndNotes
Is there any way to unroll the loop myself, to speed things up?
Seems hard, because I'm using uvector...
You could 'cabal unpack' uvector, hand-unroll the core loop all
its operations get fused into, then reinstall the modified package..
(perhaps that should be a package configuration option, at least
until GHC gets loop or recursion unrolling - since this would be
a temporary workaround, it would probably not be worth it to
have multiple package versions with different unroll factors; if
this actually helps uvector users with performance in practice,
they could suggest it as a patch).
Claus
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe