This starts out with my being interested in darcs <-> git related issues.
Since git uses sha1 I wanted to have the ability to calculate sha1 in
an application where I was intending to use darcs as a back-end.

The performance gap is > * 30 between Haskell and  sha1sum.  That
seemed rather steep and so I started looking...

Using ghc 6.4.2

The following code is from SHA1:


-- {-# INLINE step #-}
step :: ABCDE -> BS.ByteString -> ABCDE
step abcde0@(ABCDE a b c d e) words = abcde5
    where s16 = get_word_32s words
          s80 = s16 ++ (zipWith4 f0) (drop 13 s80) (drop 8 s80) (drop 2 s80) s80
          f0 a b c d = rotL (a `xor` b `xor` c `xor` d) 1
          (s20_0, s60) = splitAt 20 s80
          (s20_1, s40) = splitAt 20 s60
          (s20_2, s20) = splitAt 20 s40
          (s20_3, _)   = splitAt 20 s20
          abcde1 = foldl (doit f1 0x5a827999) abcde0 s20_0
          abcde2 = foldl (doit f2 0x6ed9eba1) abcde1 s20_1
          abcde3 = foldl (doit f3 0x8f1bbcdc) abcde2 s20_2
          ABCDE a' b' c' d' e' = foldl (doit f2 0xca62c1d6) abcde3 s20_3
          f1 (XYZ x y z) = (x .&. y) .|. ((complement x) .&. z)
          f2 (XYZ x y z) = x `xor` y `xor` z
          f3 (XYZ x y z) = (x .&. y) .|. (x .&. z) .|. (y .&. z)
          abcde5 = ABCDE (a + a') (b + b') (c + c') (d + d') (e + e')

-- {-# INLINE get_word_32s #-}
get_word_32s :: BS.ByteString -> [Word32]
get_word_32s s = map f [0..15]
    where f i = foldl (+) 0 $ map (\n -> toEnum (fromEnum (BS.index s (i*4+n))) `shiftL` (8 * (3-n))) [0..3]

-- {-# INLINE doit #-}
doit :: (XYZ -> Word32) -> Word32 -> ABCDE -> Word32 -> ABCDE
doit f k (ABCDE a b c d e) w = ABCDE a' a (rotL b 30) c d
 where a' = rotL a 5 + f (XYZ b c d) + e + w + k

-- {-# INLINE rotL #-}
rotL :: Word32 -> Rotation -> Word32
rotL a s = shiftL a s .|. shiftL a (s-32)
 -- rotL a s = a `seq` rotate a  s

I want to focus on the rotL function --- get_word_32s might be faster with a rewrite.

Using -prof -auto-all and -P at runtime produced the following summary...

COST CENTRE                    MODULE               %time %alloc  ticks     bytes

get_word_32s                   MySHA1                39.3   35.0  10597 6952614960
rotL                           MySHA1                22.6   18.5   6089 3682275072
step                           MySHA1                21.2   24.1   5701 4783669848
doit                           MySHA1                16.2   21.8   4362 4332088320
As part of a larger program, but the sha1 portion was where 98% of the time went.
So
  • I tried using the built-in "rotate",
  • I tried inlining,
  • I tried using -fasm directly
  • I tried generating C with -O2,
  •  I tried using 'seq' in a number of places.
None of this made much difference.

Comment, the C code for just rotL would be
unsigned int
rotate(unsigned int a, int b)
{
  return (a << b) | (a >> (b - 32));
}

and the assembler at gcc -O2 is concise and straightforward:
    .file    "rotate.c"
    .text
    .p2align 4,,15
.globl rotate
    .type    rotate, @function
rotate:
    pushl    %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %edx
    movl    12(%ebp), %ecx
    popl    %ebp
    movl    %edx, %eax
    sall    %cl, %eax
    subl    $32, %ecx
    shrl    %cl, %edx
    orl    %edx, %eax
    ret
    .size    rotate, .-rotate
    .section    .note.GNU-stack,"",@progbits
    .ident    "GCC: (GNU) 3.3.5 (Debian 1:3.3.5-3)"
GHC's code on this part is nothing like competitive because of the need for closures --- I think.

1) What approaches to getting the compiler to optimize have I overlooked?  Or, how would
    you change the code to get the compiler to generate better code for rotL?

2) Am I missing an easy way to get strictness on the rotL function?

3) Is there something I am missing that keeps ghc's code from becoming about as
concise as gcc -O2? 

4) Is the code generator simply not able to handle this?

Thanks in advance,

George



_______________________________________________
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell

Reply via email to