Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-14 Thread Derek Elkins
On Sat, 2007-07-14 at 11:11 -0700, Anatoly Yakovenko wrote:

> yea, i agree, i am doing a lot of ugly hacks to get things going
> faster.  Actually i think the "cleaner" approach would be to use Harpy
> extension and do the math with x86 assembly :). 

Extension?  It's just a library.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-14 Thread Anatoly Yakovenko

-- Forwarded message --
From: Anatoly Yakovenko <[EMAIL PROTECTED]>
Date: Jul 14, 2007 11:09 AM
Subject: Re: Your SHA1
To: Dominic Steinitz <[EMAIL PROTECTED]>



> 1. Very good.


Thanks, it was a fun experiment.


> 2. It has type hash::BS.ByteString -> IO [Word] but hash is a pure function.
> Can this be changed?


you can call it safely with unsafePerformIO, I do a lot of hacks to
get the performance, like casting a C ptr from a bytestring to a word
array, but each call to hash should have no side effects.


> 3. I haven't tried but I assume it only runs with ghc and not hugs? I guess if
> point 1 could be addressed then we could put it in the crypto library 
(assuming
> you are happy with this) with some sort of conditional flag to use your code 
if
> the library is being built for ghc but to use the slow version for other
> compilers / interpreters.


i would be surprised if this ran on hugs, i havent tried.  Also, i
haven't verified that my implementation is correct, as far as the sha1
algorithm is concerned.  the complexity is the same, so if there any
bugs in the math they can be easily fixed without making it slower.


> On a more discursive note, I still don't think we have found the holy grail
> here: idiomatic functional programming (probably not using StorableArray and
> unsafeRead and unsafeWrite) and lightning fast speed.


yea, i agree, i am doing a lot of ugly hacks to get things going
faster.  Actually i think the "cleaner" approach would be to use Harpy
extension and do the math with x86 assembly :).  This way its at least
straight forward, cause right now, i was just randomly changing
strictness and unboxing things so ghc generates the fastest code, but
a different haskell compiler would have completely different results.


> Dominic.
>
> PS I noticed you have:
>
> splitByN::Int -> BS.ByteString -> [BS.ByteString]
> splitByN nn ll = st : (if (BS.null en) then [] else (splitByN nn en))
>where
>   (st,en) = BS.splitAt nn ll
>
> It's a function I often use:
>
> splitByN n =
>unfoldr k
>   where
>  k [] = Nothing
>  k p = Just (splitAt n p)
>
> Maybe it should be a standard part of List and ByteString?


yea i agree, i've seen a couple other implementations as well.  seems
like its something that should be in Prelude and ByteString.

Anatoly
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-03 Thread Donald Bruce Stewart
aeyakovenko:
> inlining some of the functions definitely gave me a boost, so i am
> about 8.5 times slower then openssl sha1sum.  I dont really understand
> the core output, but after inlining i got a completely different
> profile output, i am guessing its because the cost of the inlined
> functions is spread to the callers.
> 
> COST CENTREMODULE   %time %alloc
> 
> updateElem SHA1  13.40.0
> sRotateL   SHA1  13.40.0
> hashElem   SHA1  12.50.0
> sXor   SHA1  10.90.0
> unboxW SHA1  10.00.0

So I'd now dive in and seriously look at the Core for these guys.
Work out what they're doing, and how they differ from the C version.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-03 Thread Anatoly Yakovenko

Are you using -auto, or -auto-all? Because it makes a difference to
the generated core, and the extent to which inlining takes place. I've
noticed that -auto permits more inlining than -auto-all, so try -auto


-auto doesn't generate any meaningfull profiling info for me


if you can. Also, follow the advice in the GHC manual, and only export
the functions you need to. This will aid both the inliner and
specialiser enormously.


cool, this actually helped quite a bit, now only 7.5 times slower :)



As for reading core (well, actually simplifier output; core has less
"punctuation"), these links might help:

4.16.3. How to read Core syntax
http://www.haskell.org/ghc/docs/latest/html/users_guide/options-debugging.html#id3130643

(and the Encoding module has the actual rules for the Unique names)
http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs

6.2. Faster: producing a program that runs quicker
http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html
(see "How do I find out a function's strictness?")


thanks for the tip, ill take a look at those.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-03 Thread Alistair Bayley

On 03/07/07, Anatoly Yakovenko <[EMAIL PROTECTED]> wrote:

inlining some of the functions definitely gave me a boost, so i am
about 8.5 times slower then openssl sha1sum.  I dont really understand
the core output, but after inlining i got a completely different
profile output, i am guessing its because the cost of the inlined
functions is spread to the callers.


Are you using -auto, or -auto-all? Because it makes a difference to
the generated core, and the extent to which inlining takes place. I've
noticed that -auto permits more inlining than -auto-all, so try -auto
if you can. Also, follow the advice in the GHC manual, and only export
the functions you need to. This will aid both the inliner and
specialiser enormously.

As for reading core (well, actually simplifier output; core has less
"punctuation"), these links might help:

4.16.3. How to read Core syntax
http://www.haskell.org/ghc/docs/latest/html/users_guide/options-debugging.html#id3130643

(and the Encoding module has the actual rules for the Unique names)
http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs

6.2. Faster: producing a program that runs quicker
http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html
(see "How do I find out a function's strictness?")

Alistair
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-03 Thread Anatoly Yakovenko

inlining some of the functions definitely gave me a boost, so i am
about 8.5 times slower then openssl sha1sum.  I dont really understand
the core output, but after inlining i got a completely different
profile output, i am guessing its because the cost of the inlined
functions is spread to the callers.

COST CENTREMODULE   %time %alloc

updateElem SHA1  13.40.0
sRotateL   SHA1  13.40.0
hashElem   SHA1  12.50.0
sXor   SHA1  10.90.0
unboxW SHA1  10.00.0
temp   SHA1   8.10.0
sAdd   SHA1   7.80.0
sAnd   SHA1   5.00.0
do20   SHA1   4.1   18.0
hashA16IntoA80 SHA1   2.80.9
do60   SHA1   2.5   18.0
splitByN   SHA1   2.2   15.6
ffkk   SHA1   2.20.0
sOrSHA1   1.60.0
do40   SHA1   0.9   18.0
hashPtrIntoA80 SHA1   0.62.7
hashA80SHA1   0.61.8
do80   SHA1   0.6   18.0
joinTail   SHA1   0.02.1
main   Main   0.04.8


On 6/30/07, Donald Bruce Stewart <[EMAIL PROTECTED]> wrote:

aeyakovenko:
> So I tried implementing a more efficient sha1 in haskell, and i got to
> about 12 times slower as C.  The darcs implementation is also around
> 10 to 12 times slower, and the crypto one is about 450 times slower.
> I haven't yet unrolled the loop like the darcs implementation does, so
> I can still get some improvement from that, but I want that to be the
> last thing i do.
>
> I think I've been getting speed improvements when minimizing
> unnecessary allocations.  I went from 40 times slower to 12 times
> slower by converting a foldM to a mapM that modifies a mutable array.
>
> Anyone have any pointers on how to get hashElem and updateElem to run
> faster, or any insight on what exactly they are allocating.  To me it
> seems that those functions should be able to do everything they need
> to without a malloc.

Try inlining key small functions, and check the core.

-O2 -ddump-simpl | less

-- Don


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-01 Thread Anatoly Yakovenko

so using mseq didn't seem to make any difference, i still had the same
performance.

On 7/1/07, Benja Fallenstein <[EMAIL PROTECTED]> wrote:

Hi,

2007/7/1, Bulat Ziganshin <[EMAIL PROTECTED]>:
>aa <- unsafeRead a5 0
>return $! aa
>bb <- unsafeRead a5 1
>return $! bb

If this is a useful pattern, would it make sense to have a function to
encapsulate it?

mseq :: Monad m => m a -> m a
mseq m = m >>= (return $!)

- Benja


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-01 Thread Benja Fallenstein

Hi,

2007/7/1, Bulat Ziganshin <[EMAIL PROTECTED]>:

   aa <- unsafeRead a5 0
   return $! aa
   bb <- unsafeRead a5 1
   return $! bb


If this is a useful pattern, would it make sense to have a function to
encapsulate it?

mseq :: Monad m => m a -> m a
mseq m = m >>= (return $!)

- Benja
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-01 Thread Hugh Perkins

Just an outsider's reaction, and for all I know unsafeRead is actually safe,
but if the point of using Haskell (and I'm still trying to discover what
that is ;-) ) is either to be able to rigorously mathematically prove that
your program will work perfectly (target usage 1), or to carry out threading
easily and automatically (what I'm interested specifically in), ... then why
do we have to throw unsafe functions around the place to get any decent
speed???

Ok, I'll go back to my hole; just a reaction.  I know everything has to
start somewhere, and build up, just be aware that having to use unsafe stuff
to get decent speed is not good PR ;-)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-07-01 Thread Bulat Ziganshin
Hello Anatoly,

Sunday, July 1, 2007, 3:58:24 AM, you wrote:

> Anyone have any pointers on how to get hashElem and updateElem to run
> faster, or any insight on what exactly they are allocating.  To me it
> seems that those functions should be able to do everything they need
> to without a malloc.

haskell allocations isn't a malloc, it's just a pointer increment, so
it's very fast. any temporary data created in haskell code need to be
allocated so the only case when you don't have allocations is cycle on
unboxed values

in your particular case you should try the following trick:

   aa <- unsafeRead a5 0
   return $! aa
   bb <- unsafeRead a5 1
   return $! bb

currently, your code implies that unsafeRead may return boxed value.
'let' by itself doesn't enforce unboxing, the compiler implies that
value assigned in 'let' may be actually not used. you can use either
'case' or above-mentioned trick with '$!' (or seq) to avoid boxing

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-06-30 Thread Donald Bruce Stewart
aeyakovenko:
> So I tried implementing a more efficient sha1 in haskell, and i got to
> about 12 times slower as C.  The darcs implementation is also around
> 10 to 12 times slower, and the crypto one is about 450 times slower.
> I haven't yet unrolled the loop like the darcs implementation does, so
> I can still get some improvement from that, but I want that to be the
> last thing i do.
> 
> I think I've been getting speed improvements when minimizing
> unnecessary allocations.  I went from 40 times slower to 12 times
> slower by converting a foldM to a mapM that modifies a mutable array.
> 
> Anyone have any pointers on how to get hashElem and updateElem to run
> faster, or any insight on what exactly they are allocating.  To me it
> seems that those functions should be able to do everything they need
> to without a malloc.

Try inlining key small functions, and check the core.

-O2 -ddump-simpl | less

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

2007-06-30 Thread Anatoly Yakovenko

So I tried implementing a more efficient sha1 in haskell, and i got to
about 12 times slower as C.  The darcs implementation is also around
10 to 12 times slower, and the crypto one is about 450 times slower.
I haven't yet unrolled the loop like the darcs implementation does, so
I can still get some improvement from that, but I want that to be the
last thing i do.

I think I've been getting speed improvements when minimizing
unnecessary allocations.  I went from 40 times slower to 12 times
slower by converting a foldM to a mapM that modifies a mutable array.

Anyone have any pointers on how to get hashElem and updateElem to run
faster, or any insight on what exactly they are allocating.  To me it
seems that those functions should be able to do everything they need
to without a malloc.

This is the profiling statistics generated from my implementation

COST CENTREMODULE   %time %alloc

hashElem   SHA1  42.9   66.2
updateElem SHA1  12.7   16.7
unboxW SHA1  10.60.0
hashA80SHA1   5.20.3
temp   SHA1   4.60.0
sRotateL   SHA1   4.60.0
ffkk   SHA1   3.32.6
hashA16IntoA80 SHA1   3.10.1
sXor   SHA1   2.90.0
do60   SHA1   2.92.6
sAdd   SHA1   2.30.0
do20   SHA1   1.32.6
splitByN   SHA1   1.22.3
do80   SHA1   0.82.6
do40   SHA1   0.42.6

Thanks,
Anatoly
module SHA1 where
import GHC.Exts
import Data.Array.IO
import Foreign.C.String
import Control.Monad
import Foreign.ForeignPtr
import Foreign.Ptr
import Data.List
import Data.Bits
import Control.Exception
import Data.Word
import Data.Array.Storable
import Array
import Data.Array.Base
import qualified Data.ByteString as BS
import qualified Data.ByteString.Base as BS(unsafeUseAsCStringLen)

data Word160 = Word160 !Word# !Word# !Word# !Word# !Word#
   deriving Show

unboxW::Word -> Word#
unboxW !(W# w) = w 

boxW::Word# -> Word
boxW !w = W# w 

unboxI::Int -> Int#
unboxI !(I# w) = w 

sXor !a !b = a `xor#`b
sAnd !a !b = a `and#` b
sOr !a !b = a `or#` b
sComp !a = (not# a)
sAdd !a !b = a `plusWord#` b

sRotateL::Word# -> Int# -> Word#
sRotateL !a !b = (uncheckedShiftL# a b) `sOr` (uncheckedShiftRL# a ((unboxI 32) -# b))

add5 (Word160 x0 x1 x2 x3 x4) (Word160 y0 y1 y2 y3 y4) = Word160 (x0 `sAdd` y0) (x1 `sAdd` y1) (x2 `sAdd` y2) (x3 `sAdd` y3) (x4 `sAdd` y4)

word64ToWord8s::Word64 -> [Word8]
word64ToWord8s ww = map (mkWord8 ww) $ reverse [0..7]
   where
  mkWord8::Word64 -> Int -> Word8
  mkWord8 ww ii = fromIntegral $ (shiftR ww $ ii * 8) .&. 0xff

mkTailStr::Int -> BS.ByteString
mkTailStr len = BS.concat [BS.pack oneAndZeros, BS.pack (word64ToWord8s $ fromIntegral ln)]
   where
  ln = 8 * len 
  oneAndZeros::[Word8] = (0x80::Word8):(replicate ((needzeros - 7) `div` 8) (0x00::Word8))
  needzeros = if(bits < 0) then bits + 512 else bits 
 where
bits = 447 - ((8 * ln) `mod` 512)

splitByN::Int -> BS.ByteString -> [BS.ByteString]
splitByN nn ll = st : (if (BS.null en) then [] else (splitByN nn en))
   where
  (st,en) = BS.splitAt nn ll

updateElem::(StorableArray Int Word) -> Int -> IO ()
updateElem a80 ii = do 
   x3 <-  unsafeRead a80 (ii - 3)
   x8 <-  unsafeRead a80 (ii - 8)
   x14 <- unsafeRead a80 (ii - 14)
   x16 <- unsafeRead a80 (ii - 16)
   let ux3 = unboxW x3 
   let ux8 = unboxW x8 
   let ux14 = unboxW x14 
   let ux16 = unboxW x16 
   unsafeWrite a80 ii (boxW (sRotateL (ux3 `sXor` ux8 `sXor` ux14 `sXor` ux16) (unboxI 1)))

data FFKK = FFKK !Word# !Word# 

ffkk::Int -> Word# -> Word# -> Word# -> FFKK
ffkk !ii !bb !cc !dd 
   | ii < 20 = do20 bb cc dd
   | ii < 40 = do40 bb cc dd
   | ii < 60 = do60 bb cc dd
   | otherwise = do80 bb cc dd

do20 !bb !cc !dd = FFKK ff kk
   where
  ff = (bb `sAnd` cc) `sOr` ((sComp bb) `sAnd` dd)
  kk = unboxW 0x5A827999
do40 !bb !cc !dd = FFKK ff kk
   where
  ff = bb `sXor` cc `sXor` dd
  kk = unboxW 0x6ED9EBA1
do60 !bb !cc !dd = FFKK ff kk
   where
  ff = (bb `sAnd` cc) `sOr` (bb `sAnd` dd) `sOr` (cc `sAnd` dd)
  kk = unboxW 0x8F1BBCDC
do80 !bb !cc !dd = FFKK ff kk
   where
  ff = bb `sXor` cc `sXor` dd
  kk = unboxW 0xCA62C1D6

temp ww (FFKK ff kk) aa ee = (sRotateL aa (unboxI 5)) `sAdd` ff `sAdd` ee `sAdd` kk `sAdd` ww 

hashElem::(StorableArray Int Word) -> (StorableArray Int Word) -> Int -> IO ()
hashElem a5 a80 ii = do
   aa <- unsafeRead a5 0
   bb <- unsafeRead a5 1
   cc <- unsafeRead a