Beginners Digest, Vol 44, Issue 6

beginners-request Sun, 05 Feb 2012 03:00:39 -0800

Send Beginners mailing list submissions to
        beginners@haskell.org

To subscribe or unsubscribe via the World Wide Web, visit
        http://www.haskell.org/mailman/listinfo/beginners
or, via email, send a message with subject or body 'help' to
        beginners-requ...@haskell.org


You can reach the person managing the list at
        beginners-ow...@haskell.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Beginners digest..."


Today's Topics:

   1. Re:  Performance of Prime Generator (Zhi-Qiang Lei)
   2.  numerical integration over lists (Thomas Engel)
   3. Re:  Is foldr slow? (Zhi-Qiang Lei)


----------------------------------------------------------------------

Message: 1
Date: Sun, 5 Feb 2012 13:35:52 +0800
From: Zhi-Qiang Lei <zhiqiang....@gmail.com>
Subject: Re: [Haskell-beginners] Performance of Prime Generator
To: Haskell Beginer <beginners@haskell.org>
Message-ID: <cabced26-fb06-400c-b7d2-044256ef9...@gmail.com>
Content-Type: text/plain; charset=iso-8859-1

Hi Guys,

I just find the STUArray you are using is strange for me. Is there any tutorial 
for that? Google didn't help. Thanks.

On Feb 5, 2012, at 10:30 AM, Daniel Fischer wrote:

> Sorry for the late reply, haven't checked my mail for a while.
> 
> On Tuesday 24 January 2012, 13:50:32, Ertugrul S?ylemez wrote:
>> Daniel Fischer <daniel.is.fisc...@googlemail.com> wrote:
>>>>> Well, thanks, so far I have tried wheel only and Sieve of
>>>>> Eratosthenes only approaches. They both failed. When the numbers
>>>>> is between 999900000 and 1000000000, they can take more than a
>>>>> hour on my laptop.
>>> 
>>> They shouldn't. But you have to use the right data structures.
>>> 
>>> For a prime sieve, you need unboxed mutable arrays if you want it to
>>> be fast. You can use STUArrays from Data.Array.ST or mutable unboxed
>>> vectors from Data.Vector.Mutable.Unboxed.
>> 
>> That's what I've used.  You find the code on hpaste [1].  It's a
>> carefully optimized Sieve of Eratosthenes and needs around 20 seconds
>> for the primes up to 10^9.  See the refinement in the annotation, which
>> I've added just now.  Before that it took around 35 seconds.
>> 
>> I considered that to be "too slow".
> 
> Right you are. Unless you need the primes to perform some time-consuming 
> calculations with them after sieving, that is too slow.
> 
> But let us compare it with a similar C implementation.
> 
>> 
>> [1] http://hpaste.org/56866
> 
> The interesting parts are:
> ========================
> soeST :: forall s. Int -> ST s (STUArray s Int Bool)
> soeST n = do
>    arr <- newArray (0, n) True
>    mapM_ (\i -> writeArray arr i False) [0, 1]
>    let n2 = floor (sqrt (fromIntegral n))
> 
>    let loop :: Int -> ST s ()
>        loop i | i > n2 = return ()
>        loop i = do
>            b <- readArray arr i
> 
>            let reset :: Int -> ST s ()
>                reset j | j > n = return ()
>                reset j = writeArray arr j False >> reset (j + i)
> 
>            when b (reset (i*i))
> 
>            loop (succ i)
> 
>    loop 2
>    return arr
> 
> soeCount :: Int -> Int
> soeCount = length . filter id . elems . soeA
> ========================
> 
> With ghc-7.2.2 and -O2, that took 16.8 seconds here to count the 50847534 
> primes up to 10^9. That's in the same ballpark as your time, maybe a bit 
> faster, depends on what 'around 20 seconds' means exactly.
> 
> Let's be unfriendly first:
> 
> Arracc.h:
> ============
> #include <stdint.h>
> 
> inline int readArray(uint64_t *arr, int n, int i);
> inline void writeArray(uint64_t *arr, int n, int i, int b);
> ============
> 
> Arracc.c:
> ============
> #include <stdlib.h>
> #include "Arracc.h"
> 
> inline int readArray(uint64_t *arr, int n, int i){
>    if (i < 0 || i > n){
>        perror("Error in array index\n");
>        exit(EXIT_FAILURE);
>    }
>    return (arr[i >> 6] >> (i & 63)) & 1;
> }
> 
> inline void writeArray(uint64_t *arr, int n, int i, int b){
>    if (i < 0 || i > n){
>        perror("Error in array index\n");
>        exit(EXIT_FAILURE);
>    }
>    if (b) {
>        arr[i >> 6] |= (1ull << (i&63));
>    } else {
>        arr[i >> 6] &= ~(1ull << (i&63));
>    }
> }
> ============
> 
> main.c:
> ============
> #include <stdlib.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <math.h>
> #include "Arracc.h"
> 
> uint64_t *soeA(int n);
> int pCount(uint64_t *arr, int n);
> 
> int main(int argc, char *argv[]){
>    int limit = (argc > 1) ? strtoul(argv[1],NULL,0) : 100000000;
>    uint64_t *sieve = soeA(limit);
>    printf("%d\n",pCount(sieve,limit));
>    free(sieve);
>    return EXIT_SUCCESS;
> }
> 
> uint64_t *soeA(int n){
>    int s = (n >> 6)+1;
>    uint64_t *arr = malloc(s*sizeof *arr);
>    if (arr == NULL){
>        perror("Allocation failed\n");
>        exit(EXIT_FAILURE);
>    }
>    int i, j, n2 = (int)sqrt(n);
>    for(i = 0; i < s; ++i){
>        arr[i] = -1;
>    }
>    writeArray(arr,n,0,0);
>    writeArray(arr,n,1,0);
>    for(i = 2; i <= n2; ++i){
>        if (readArray(arr,n,i)){
>            for(j = i*i; j <= n; j += i){
>                writeArray(arr,n,j,0);
>            }
>        }
>    }
>    return arr;
> }
> 
> int pCount(uint64_t *arr, int n){
>    int i, count = 0;
>    for(i = 0; i <= n; ++i){
>        if (readArray(arr,n,i)){
>            ++count;
>        }
>    }
>    return count;
> }
> ============
> 
> 
> $ gcc -c -O3 Arracc.c
> $ gcc -c -O3 main.c
> $ gcc -O3 main.o Arracc.o -lm
> 
> Takes 18.3 seconds. Oops, slower than the Haskell programme.
> 
> Okay, ghc does some cross-module inlining and that enables further 
> optimisations which we denied gcc here, so recompile everything with -flto 
> additionally. That brings the C version down to 12.8 seconds.
> 
> Much better, but still not overwhelming. gcc can do a little better with 
> everything in one file, 12.72 seconds.
> 
> We can squeeze out a little more by omitting the bounds checks for array 
> indexing, 12.63 seconds.
> 
> But if we do the same for our Haskell programme, replace 
> readArray/writeArray with unsafeRead/unsafeWrite, that becomes faster too, 
> 14.3 seconds.
> 
> I have to admit that I don't know why the bounds-checking costs very little 
> in C and a substantial amount in Haskell, but hey, who refuses free 
> optimisations.
> 
> Lesson 1: Omit pointless array bounds checks.
> An array bounds check is pointless if you have just checked the validity of 
> the index on the line above.
> 
> Where are we?
> 
> We have two not overwhelmingly fast programmes, Haskell (ghc-7.2.2) about 
> 13% slower than C (gcc-4.5.1).
> 
> Not a bad result for ghc, but the algorithm is less than optimal.
> 
> What's the deal?
> 
> We cross off multiples of primes 2549489372 times, and we need (10^9+1)/8 
> bytes of memory for the sieve.
> 
> Both numbers are higher than is good for us. Since the sieve is so large 
> and we cross off the multiples of each prime sequentially until we reach 
> the sieve limit, we have lots of cache misses. Bad for performance.
> And each crossing-off takes a couple of clock cycles,
> 
> arr[i >> 6] &= ~(1ull << (i&63));
> 
> shift index, fetch value, (index & 63), shift 1, complement, and with 
> value, store value.
> 
> I'm no CPU expert, so I don't know how many cycles that takes, but even 
> though some of the operations can be done in parallel on modern CPUs, it's 
> still several cycles.
> 
> Lesson 2: Avoid duplicate work and redundant data.
> It's easy to separate even and odd numbers, there aren't many even primes, 
> and it's unnecessary to mark even numbers as multiples of odd primes.
> 
> Separating the crossing-off of even and odd numbers, crossing off only odd 
> multiples of odd primes, reduces the number of crossings-off by almost 40% 
> and the running time for the C programme to 8.63 seconds (I haven't done 
> that for the Haskell programme).
> 
> If we go the small step further and eliminate the even numbers from the 
> sieve completely, we reduce the required memory by half and the crossings-
> off by almost 60% (the fraction of eliminated crossings-off slowly 
> decreases to 50% for higher limits, but as far as today's RAM takes us, it 
> remains close to 60%).
> 
> At the cost of very slightly more complicated code, we can thus reduce the 
> running time to 6.06 seconds (C) resp. 6.49 seconds (Haskell) (about 7% 
> slower than C, not bad at all).
> 
> That's a pretty good start, for some purposes it's already good enough.
> 
> If we need more, the next step is to eliminate multiples of 3 from the 
> sieve. That reduces the memory requirements by a further factor of 2/3, and 
> the number of crossings-off by a further (mumble, didn't do an exact 
> calculation, educated guess) roughly 40% (that fraction decreases to 1/3 
> for higher limits, but again very slowly).
> 
> The code becomes a bit more complicated again - more than in the previous 
> step, but still fairly straightforward.
> 
> The running time reduces by another factor of 0.65-0.7 or so. We'd be in 
> the region of 4-5 seconds then.
> 
> One can eliminate the multiples of further small primes, for each prime the 
> additional code complexity increases and the reduction in work/running time 
> decreases. At some point, the slowdown from the more complex code 
> annihilates the gain from the reduced work. I stopped after eliminating 
> multiples of 5, 7 might be worth it, but I'm not convinced.
> 
> Using a segmented sieve provides better locality at the cost of more 
> complex code. If the sieve is small enough to fit in the L2 cache and large 
> enough that some significant work can be done per segment, it's a net gain.
> 
> tl,dr: The naive algorithm we started from is too slow for higher limits, 
> to get goodish performance, it has to be tweaked heavily. But Haskell plays 
> in the same league as C (at least the C I've written).
> 
>> 
>>>> I haven't tried it, but an equivalent C implementation can easily
>>>> compute the first 10^9 prime numbers within seconds.
>>> 
>>> You mean the primes up to 10^9 here?
>> 
>> Yes, sorry.  And I was referring to the Sieve of Atkin there, but you
>> are right.  That one is only slightly faster.
> 
> Well, D.J. Bernstein's primegen is much faster. With the sieve size adapted 
> to my L1 cache, it counts the primes to 10^9 in 0.68 seconds.
> But it's only heavily optimised for primes to 2^32. Above that, my sieve 
> catches up (and overtakes it around 5*10^11, yay).
> 
>> 
>> 
>> Greets,
>> Ertugrul
> 
> Cheers,
> Daniel
> 
> _______________________________________________
> Beginners mailing list
> Beginners@haskell.org
> http://www.haskell.org/mailman/listinfo/beginners


Best regards,
Zhi-Qiang Lei
zhiqiang....@gmail.com




------------------------------

Message: 2
Date: Sun, 5 Feb 2012 09:19:02 +0100
From: Thomas Engel <thomas.eng...@gmx.net>
Subject: [Haskell-beginners] numerical integration over lists
To: beginners@haskell.org
Message-ID: <20120205081902.GA3461@siduxbox>
Content-Type: text/plain; charset=us-ascii

Hello,

I have two list (one with x values, the other with y values)
What I want to do is a numercial integration according the following formula:

Result x2 = Result x1 + ((y(x1) + y(x2))/2) * (x2 -x1)

and put the result in another list.

below my first try:

integriereListe::(a)->(a)->(a)
integriereListe [][]  = [0.0]
integriereListe (x:xs) (y:ys)   = ((y - y2) /2) * (x2 -x)
                                  where
                                  x2 = head xs
                                  y2 = head ys    

but I got the failure

 Couldn't match type `[t0]' with `[[t0]]'
    In the pattern: y : ys
    In an equation for `integriereListe':
        integriereListe (x : xs) (y : ys)
          = ((y - y2) / 2) * (x2 - x)
          where
              x2 = head xs
              y2 = head ys
                     

another problem is how get the result  x1 (see above)

any hints are welcome.

Thomas



------------------------------

Message: 3
Date: Sun, 5 Feb 2012 18:33:35 +0800
From: Zhi-Qiang Lei <zhiqiang....@gmail.com>
Subject: Re: [Haskell-beginners] Is foldr slow?
To: Haskell Beginer <beginners@haskell.org>
Message-ID: <5d37e09c-9d0e-4077-81fa-723a1ef75...@gmail.com>
Content-Type: text/plain; charset=iso-8859-1

Thank you very much. You are right. foldl performs much better here. I thought 
foldr always has a performance advantage due to its tail recursion.

On Feb 3, 2012, at 5:36 AM, Chadda? Fouch? wrote:

> 
> let go a b = b `minus` (multiplies a c)
> foldr go cs' (p:p':ps) ==> foldr go cs' (p':ps) `minus` multiplies p c
> ==> (minus needs his first argument to start removing stuff) (foldr go
> cs' ps `minus` multiplies p' c) `minus` multiplies p c
> And so on and so forth...
> In other words, contrary to your first version, this function must
> develop the entire list before making its first suppression...
> 
> Your version rather correspond to a foldl (be careful of strictness,
> using foldl it's pretty easy to get big thunks).
> 
> Foldr is very useful for functions that are lazy in their second
> argument like (||) , (&&),  (:) or others but if the function is
> strict in its second argument like yours (strict in b)...
> 
> -- 
> Jeda?


Best regards,
Zhi-Qiang Lei
zhiqiang....@gmail.com




------------------------------

_______________________________________________
Beginners mailing list
Beginners@haskell.org
http://www.haskell.org/mailman/listinfo/beginners


End of Beginners Digest, Vol 44, Issue 6
****************************************

Beginners Digest, Vol 44, Issue 6

Reply via email to