On Sunday, 03 January 2010, 09:54:37, Will Ness wrote:
>
> Exactly the point I tried to make. :)
>

>
> again, yes. :)
>

>
> yes.
>

>
> yes, that's what I meant - the cost of calling all the functions that - we
> know in advance will - have nothing to do eventually.
>

8-)

> >
> > But there's a lot of list construction and deconstruction necessary for
> > the Euler sieve.
>
> yes. Unless, of course, a smart compiler recognizes there's only one
> consumer for the values each multiples-list is producing, and keeps it
> stripped down to a generator function and its current value. I keep
> thinking a smart compiler could eliminate all these "span" calls and replace
> them with just some pointer manipulation...
>

Of course I'm no smart compiler, but I don't see how it could even be possible
to replace the span calls with pointer manipulation when dealing with lazily
generated (infinite, if we're really mean) lists. Even when you're dealing only
with strict finite lists, it's not trivial to do efficiently.
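(For reference, a sketch of the kind of span-splitting code being talked about - this is the postponed-filters formulation, my reconstruction for illustration, not Will's exact Euler-sieve code:

primesPF :: [Integer]
primesPF = 2 : 3 : sieve (tail primesPF) [5,7..]
  where
    -- span splits off the candidates below p*p: they are already prime,
    -- and the head of the remainder is p*p itself, which gets dropped
    sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
      where (h, ~(_:t)) = span (< p*p) xs

Each span call takes a list segment apart only to build it right back up - the construction/deconstruction cost mentioned above.)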

> > That may be more work than the multiple hits cause.
>
> so that too would make filters win; only _if_ the cheap primality test
> existed!..
>
> > > > the multiples of p with 2*p, you have
> > > > ...
> > >
> > > There are two questions here - where to start crossing off numbers, and
> > > when. If you'd start at 2*p, maybe the overall complexity would remain
> > > the same, but it'd add enormous overhead with all those duplicate
> > > multiples.
> >
> > The additional duplicate multiples aren't the problem. Sure, the numbers
> > having a prime divisor larger than the square root would be crossed off
> > one additional time, but that in itself isn't so bad. The additional
> > crossings off are O(bound), so they don't harm the time complexity. But
> > they can have effects which multiply the running time by a large
> > constant.
>
> yes, exactly what I wanted to say. :)
>

*g* I wondered what you meant by the distinction between where and when to cross off.

> > > No, the question is not where to start, but when. PQ might hide the
> > > problem until the memory blows up. Anything we add that won't have
> > > any chance of contributing to the final result is added for nothing
> > > and only drives the total cost up needlessly.
> >
> > Well, if you start crossing off at p^2 and feed your PQ from a separate
> > prime generator, the memory blow-up is far away. E.g., when the main
> > sieve is at 10^12, the PQ contains less than 80000 multiples. No space
> > problem in sight. At 10^14, the PQ's size is still less than 700000. 
> > That's noticeable, but there's still plenty of space.
>
> again, what I mean is, not _where_ I start crossing them off in a PQ, but
> _when_. The article's code starts crossing them off _at_ p^2 - by adding
> p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface
> until p^2 is considered for a prime; it'll lie dormant deep inside the
> queue's guts. When reaching 7919, a thousand (i.e. pi(7919)) entries
> will hang out inside the PQ - instead of just 24. A memory blowup. (This is
> of course fixed in Melissa's ZIP package.) Of course, due to the nature of
> a PQ, it might actually not hurt the performance for a while, depending on
> the particular PQ implementation. Complexity _and_ constant factors.
>

It will start to have an impact pretty soon. Assuming at least one of the
relevant PQ operations is Theta(log size), each composite between ~400 and
~400000 (rough estimates) will take something like twice as long to handle.
It will start to hurt really badly only a while later, though; as a
guesstimate, with more than a million primes in the PQ, memory will have a
hard time.
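(A rough back-of-envelope illustration of that factor, using the pi(x) ~ x/log x estimate - not from the thread, just to make the "twice as long" concrete: with premature insertion the queue holds about pi(n) entries when the sieve is at n, with postponed insertion only about pi(sqrt n), so a Theta(log size) operation costs about log (pi n) / log (pi (sqrt n)) times as much:

ratio :: Double -> Double
ratio n = log (piApprox n) / log (piApprox (sqrt n))
  where piApprox x = x / log x   -- crude x/log x prime-counting estimate

-- e.g. ratio 4e2 and ratio 4e5 both come out around 2.2,
-- i.e. roughly the factor of two claimed above.)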

> > If, on the other hand, you start crossing off at 2*p, when the main sieve
> > is at 10^7, the size of the PQ is > 650000, at 10^8, the size is more
> > than 5.5 million. That starts to become a memory problem rather soon.
>
> here you don't have a choice of when to add it - you have to add it at p
> itself - so the problem is clear. But even when you cross off at p^2, the
> question remains of when you add the p^2 entry into the PQ. That was my
> point.
>
> Postponed Filters code makes this clear, and thus hard to err on.
> Unfortunately, it wasn't present in the article.
>
> > > > > I was under the impression that the first shows O(n^2) approx., and the
> > > > > second one O(n^1.5) (for n primes produced).
> > > >
> > > > In Turner/postponed filters, things are really different. Actually,
> > > > Ms. O'Neill is right, it is a different algorithm. In the above, we
> > > > match each
> > >
> > > what _is_ different is divisibility testing vs composites removal,
> > > which follows from her in-depth analysis although it is never quite
> > > formulated in such words in her article. But nothing matters until the
> > > premature starting up is eliminated, and that key observation is
> > > missing from the article as well - worse, it is brushed off with the
> > > casual remark that it is "a pleasing but minor optimization". Which
> > > remark, as you show here, is true in the imperative, mutable-storage
> > > setting, but is made in an article about functional code, in regard to
> > > the functional code of Turner's sieve.
> >
> > I think that remark was meant to apply to composite removal, not Turner's
> > sieve.
>
> It is right there on page 2, right where Turner's sieve is presented and
> discussed. The only explanation that I see is that she thought of it in
> regard to the imperative code, just as her analysis concentrates only on
> calculation aspects of the imperative code itself.
>

To be fair, she writes:

"Let us first describe the original “by hand” sieve algorithm as practiced by
Eratosthenes. 
...
The starting point of p^2 is a pleasing but minor optimization, which can be 
made
because lower multiples will have already been crossed off when we found the 
primes
prior to p. 
... (This optimization does not affect the
time complexity of the sieve, however, so its absence from the code in Section 
1 is
not our cause for worry.)"

So it's in the context of the imperative code (although rather numbers on
paper than bits in RAM).
For imperative (be it array or paper), it is indeed minor; for her PQ sieve,
it can be considered minor from a theoretical point of view (it doesn't change
the time complexity), but practically, anything that changes performance by a
factor of two or more is only 'minor' for hopelessly inadequate algorithms
(whether a computation would take ten or twenty billion years is indeed a
minor difference; one hour or two is a major difference).
However, as you say, the important point is not whether p's multiples get
crossed off starting from 2*p or from p^2, but whether p's multiples get
inserted into the queue when you look at p or at p^2.
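(To make that where/when distinction concrete, here is a small sketch - mine, not the article's actual code, with Data.Map standing in for a real priority queue - in which p's multiples are entered only once the sieve reaches p^2:

import qualified Data.Map as M

-- keys: next composites to be crossed off; values: the step 2*p
primesPQ :: [Integer]
primesPQ = 2 : 3 : sieve 5 M.empty (tail primesPQ)
  where
    sieve x table ps@(pp:pt)
      -- pp's multiples enter the queue only now, at x == pp*pp ("when"),
      -- and they start at pp*pp ("where")
      | x == pp*pp = sieve (x+2) (enter (x + 2*pp) (2*pp) table) pt
      | otherwise  = case M.minViewWithKey table of
          Just ((c, step), rest) | c == x
                    -> sieve (x+2) (enter (c+step) step rest) ps
          _         -> x : sieve (x+2) table ps
    -- keep each composite under exactly one prime; on collision, step on
    enter c step table
      | c `M.member` table = enter (c+step) step table
      | otherwise          = M.insert c step table

Feeding the queue from the list itself keeps the sketch short; the separate feeder mentioned further up plays the same role in the real code.)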

> > But even then, the 'minor' isn't adequate considering her PQ approach
> > where, although it doesn't change time complexity (in theory), it changes
> > space complexity - and that may affect running time drastically. Probably
> > she was thinking of the typical array sieve when she made that remark.
>
> In short, that remark was not referring to what it seemingly refers to in
> the article, and as such was very confusing. It was a big STOP sign on the
> Postponed Filters - Euler's - Bird's merged multiples - tree-merging
> (with wheel) road of little steps, and was used as a justification for her
> to make a big leap across the chasm towards the PQ code.
>
> So to speak. :)
>
> > > So the key understanding is overlooked.
> > >
> > > And what we're after here is the insight anyway. Skipping steps of
> > > natural development does not contribute to garnering an insight.
>
> .. and that is of course a matter of personal preference. A genius is
> perfectly capable of making big leaps across chasms. Heck they might even
> be able to fly :)
>
> > > > In the postponed filters, we match each prime p with all primes <=
> > > > sqrt p (again, the composites contribute less). I think that gives a
> > > > complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log
> > > > n)) for n primes produced.
> > >
> > > Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged
> > > multiples removal around 1.25..1.17 .
> >
> > I should really stop doing those calculations in my head, it seems I'm
> > getting too old for that. Doing it properly, on paper, the cost for
> > identifying p_k is pi(sqrt p_k) [trial division by all primes less than
> > sqrt p_k].
> > p_k is approximately k*log k, so pi(sqrt p_k) is approximately
> > (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k).
> > Summing that, the cost for identifying the first n primes is
> > Theta(n^1.5/log n).
>
> that would correspond to what I've seen much better.
>

It should. Though empirical measurements don't always adhere exactly to the
theoretical results, the deviation is usually small (for large enough input).
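(One way to read off such empirical exponents - a generic recipe, not something from the article - is to time the production of n1 and n2 primes and take the local order of growth as

localExponent :: Double -> Double -> Double -> Double -> Double
localExponent n1 t1 n2 t2 = logBase (n2 / n1) (t2 / t1)

-- e.g. 1.0s for 10^6 primes and 2.7s for 2*10^6 primes gives
-- localExponent 1e6 1.0 2e6 2.7 ~ 1.43, i.e. "below 1.5" as observed.

For large enough inputs this settles close to the theoretical exponent, modulo the log factors.)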

> > Using a feeder, i.e. primes = 2:3:sieve feederprimes [5, 7 .. ], where
> > feederprimes are generated as above, to reduce memory pressure and GC
> > impact, that fits pretty well with my measurements of the time to find
> > p_k for several k between 500,000 and 20,000,000.
>
> The question of a memory blowup with the treefolding merge still remains.
> For some reason using its second copy for a feeder doesn't reduce the
> memory (as reported by a standalone compiled program; GHCi-reported values
> are useless) - it causes it to double instead...
>

I have a partial solution. The big problem is that the feeder holds on to the
beginning of comps while the main runner holds on to a later part. Thus the
entire segment between these two points must be in memory.
So have two lists of composites (yeah, you know that, but it didn't work so
far). But you have to force the compiler not to share them: enter -fno-cse.
The attached code does that (I've also expanded the wheel); it reduces the
memory requirements considerably (a small part is due to the larger wheel, a
factor of ~5 due to the non-sharing).
It still uses much more memory than the PQ, and while the PQ's memory
requirements grow very slowly, the tree-fold merge's still grow rather fast
(don't go much beyond the 10,000,000th prime); I'm not sure why.



{-# OPTIONS_GHC -O2 -fno-cse #-}
module V13Primes (primes) where

{-# SPECIALIZE primes :: () -> [Int] #-}
{-# SPECIALIZE primes :: () -> [Integer] #-}
primes :: Integral a => () -> [a]
primes () = 2:3:5:7:11:13:primes'
   where
    primes' = roll 17 wheel13 `minus` compos primes''
    primes'' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''

{-# SPECIALISE pmults :: Int -> ([Int],[Int]) #-}
{-# SPECIALISE pmults :: Integer -> ([Integer],[Integer]) #-}
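-- Multiples of p along the wheel: rollFrom p starts at p itself, so the first
-- product is p*p; the result is split into a one-element "known" part and the
-- lazy rest - the split-pair shape consumed by mergeSP below.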
pmults :: Integral a => a -> ([a],[a])
pmults p = case map (*p) (rollFrom p) of
            (x:xs) -> ([x],xs)

multip :: Integral a => [a] -> [([a],[a])]
multip ps = map pmults ps

{-# SPECIALISE compos :: [Int] -> [Int] #-}
{-# SPECIALISE compos :: [Integer] -> [Integer] #-}
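-- All composites with a prime factor in ps, in increasing order: the
-- individual multiples streams are merged pairwise and then tree-folded
-- (tfold below), so the denser streams of the smaller primes sit near the
-- top of the merge tree.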
compos :: Integral a => [a] -> [a]
compos ps = fst $ tfold mergeSP (pairwise mergeSP (multip ps))

rollFrom n = let x = (n-17) `mod` 30030
                 -- n *IS* coprime with {2,3,5,7,11,13}   !!!
                 (y,_) = span (< x) wheelNums
             in roll n $ drop (length y) wheel13

tfold f (a: ~(b: ~(c:xs)))
                     = (a `f` (b `f` c)) `f` tfold f (pairwise f xs)
pairwise f (x:y:ys)  = f x y : pairwise f ys

{-# SPECIALIZE mergeSP :: ([Int],[Int]) -> ([Int],[Int]) -> ([Int],[Int])  #-}
{-# SPECIALISE mergeSP :: ([Integer],[Integer]) -> ([Integer],[Integer]) -> ([Integer],[Integer])  #-}
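-- Merge two split pairs: the finished part a passes through unchanged,
-- spMerge merges b and c until one of them runs out, and whatever is left
-- over is merged lazily with d.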
mergeSP :: Integral a => ([a],[a]) -> ([a],[a]) -> ([a],[a])
mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c
                           in (a ++ bc, merge b' d)  -- `debug` "mergeSP " ++ show (take 14 a)
     where
      spMerge u@(x:xs) w@(y:ys) = case compare x y of
               LT -> (x:c,d) where ~(c,d) = spMerge xs w
               EQ -> (x:c,d) where ~(c,d) = spMerge xs ys
               GT -> (y:c,d) where ~(c,d) = spMerge u  ys
      spMerge u [] = ([], u)
      spMerge [] w = ([], w)

wheelNums :: Integral a => [a]
wheelNums  = roll 0 wheel13

roll :: Integral a => a -> [a] -> [a]
roll       = scanl (+)

{-# SPECIALIZE wheel :: [Int] #-}
{-# SPECIALISE wheel :: [Integer] #-}
wheel :: Integral a => [a]
wheel      = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
           4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel

{-# SPECIALISE wheel11 :: [Int] #-}
{-# SPECIALISE wheel11 :: [Integer] #-}
wheel11 :: Integral a => [a]
wheel11 = res
      where
        snms = scanl (+) 11 (take 47 wheel)
        nums = tail $ scanl (+) 11 (take 529 wheel)
        cops = nums `minus` map (*11) snms
        diffs = zipWith (-) (tail cops) cops
        res = foldr (:) res diffs

{-# SPECIALISE wheel13 :: [Int] #-}
{-# SPECIALISE wheel13 :: [Integer] #-}
wheel13 :: Integral a => [a]
wheel13 = res
      where
        snms = take 480 $ scanl (+) 13 wheel11
        nums = take (480*13+1) . tail $ scanl (+) 13 wheel11
        cops = nums `minus` map (*13) snms
        diffs = zipWith (-) (tail cops) cops
        res = foldr (:) res diffs

{-# SPECIALISE minus :: [Int] -> [Int] -> [Int] #-}
{-# SPECIALISE minus :: [Integer] -> [Integer] -> [Integer] #-}
minus :: Integral a => [a] -> [a] -> [a]
minus a@(x:xs) b@(y:ys) = case compare x y of
       LT -> x: xs `minus` b
       GT ->    a  `minus` ys
       EQ ->    xs `minus` ys
minus a b  = a

merge a@(x:xs) b@(y:ys) = case compare x y of
       LT -> x: merge xs b
       EQ -> x: merge xs ys
       GT -> y: merge a  ys
merge a b  = if null b then a else b
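
For completeness, a small driver one might compile against the attachment (my addition, not part of the original post; save the module as V13Primes.hs, build with ghc -O2 --make Main.hs, and the index just stays within the ~10,000,000-prime range mentioned above):

module Main (main) where

import V13Primes (primes)

-- prints the 10,000,000th prime via the Int specialisation
main :: IO ()
main = print (primes () !! (10000000 - 1) :: Int)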