Re: [Haskell-cafe] Re: FASTER primes
Am Samstag 16 Januar 2010 18:53:33 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Donnerstag 14 Januar 2010 08:25:48 schrieb Will Ness: > > > Daniel Fischer web.de> writes: > > > > Am Mittwoch 13 Januar 2010 10:43:42 schrieb Heinrich Apfelmus: > > > > > I wonder whether it's really the liveness of pair in > > > > > > > > > > mergeSP (a,b) pair > > > > > = let sm = spMerge b (fst pair) > > > > > in (a ++ fst sm, merge (snd sm) (snd pair)) > > > > > > > > > > that is responsible for the space leak, ... > > > > > > > > I think that is responsible. At least that's how I understand the > > > > core: > > > > > > > > mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d) > > > >where > > > > (bc, b') = spMerge b c > > > > spMerge ... > > > > > > That is equivalent to > > > > > > first (a++) . second (`merge`d) $ spMerge b c > > > > > > and Daniel's fix is equivalent to > > > > > > first (a++) $ spMerge b c d > > That should've been > > mergeSP (a,b) p = first(a++) . second(`merge`snd p) $ spMerge b (fst > p) > > and > > mergeSP (a,b) p = first(a++) $ spMerge b (fst p) (snd p) > > > The code fragments you've posted are essentially > > > mergeSP (a,b) p = let res = case p of (c,_) -> > case spMerge b c of (# x,y #) -> > (x,y) > in >(# (++) a (case res of (bc,_)-> bc) , > case res of (_,b') -> > case p of (_,d) -> merge b' d #) > > and > > mergeSP (a,b) p = let res = case p of (c,d) -> > case spMerge b c d of (# x,y #) -> > (x,y) > in >(# (++) a (case res of (bc,_)-> bc) , > case res of (_,b') -> b' #) > > > This looks like Haskell to me, with many detailes explicitely written > out, probaly serving as immediate input to the compiler - not its > output. So it can't say to us much about how this is actually > implemented on the lower level. (?) It's 'core', the intermediate language GHC uses and does its transformations upon. It's more or less a slimmed down version of Haskell. That came from -ddump-simpl, thus it's the output of the simplifier, after numerous transformation/optimisation passes. I think that is then fairly directly translated to assembly (via Cmm), the STG to STG and Cmm passes do not much optimisation anymore (I may be wrong, what know I about the compiler. However, I've found that the -ddump-simpl output corresponds well to the actual behaviour whenever I look.). I find that much more redable than the -fext-core output (http://www.haskell.org/ghc/docs/6.12.1/html/users_guide/ext-core.html "GHC can dump its optimized intermediate code (said to be in “Core” format) to a file as a side-effect of compilation."), which says the same, only more verbose and less readable. Of course, if I could read assembly, that would exactly reveal what happens. > > Your theory would certainly hold if the translation was done one-to-one > without any further code rearrangements. But it could've been further > re-arranged by the compiler at some later stage (is there one?) into an > equivalent of, e.g. > > > mergeSP (a,b) p = let (bc,b') = case p of (c,_) -> > case spMerge b c of (x,y) -> > (x,y) > in >(# (++) a bc , > case p of (_,d) -> merge b' d #) > > > and further, > > > mergeSP (a,b) p = let (c,d) = case p of (x,y) -> (x,y) > (bc,b') = case spMerge b c of (x,y) -> > (x,y) > in >(# (++) a bc , merge b' d #) > > > could it? This would take hold on /d/ and /c/ at the same time, right? It would. But AFAICT, after the simplifier is done, no such rearrangements occur anymore. > > What is that code that you've shown exactly? At which stage is it > produced and is it subject to any further manipulation? 
I'm no GHC expert either, so I don't know what it does when exactly. But after parsing and desugaring, there come a few iterations of the simplifier, intermingled with specialising, demand analysis, CSE, let-floating, worker-wrapper split, ... . At the end of all that, the Tidy Core is generated (part of which I posted). What happens from then on, well ...

> I apologise if these are obvious questions, I don't know anything about
> GHC. I also don't know what (# x,y #) means?

Unboxed tuple (pair in this case). That means, have the components for themselves, don't wrap them in a (,) constructor.

> One thing seems certain - we should not hold explicit references to the same
> entities in different parts of our code, to avoid space leaks with more
> confidence.

Except when it's good to share ;)

> To make code look as much tail-recursive as possible, so to speak.

Tail recursion isn't really important (for GHC at least, I think for lazy languages in general), due to the different evaluation model.
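For readers unfamiliar with the (# x,y #) notation discussed above, here is a minimal sketch of unboxed pairs at the source level (a hypothetical example, not code from the thread; it assumes the UnboxedTuples extension):

{-# LANGUAGE UnboxedTuples #-}
module UnboxedPairSketch where

-- Return two results without allocating a (,) box, the way GHC's
-- generated worker functions (like the $wmergeSP shown elsewhere in
-- this thread) do.
divModU :: Int -> Int -> (# Int, Int #)
divModU a b = (# a `div` b, a `mod` b #)

-- The caller deconstructs the unboxed pair with a case immediately;
-- unlike an ordinary tuple, it is never stored as a heap object.
useIt :: Int -> Int -> Int
useIt a b = case divModU a b of
              (# q, r #) -> q + r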
Re: [Haskell-cafe] Re: FASTER primes
Am Donnerstag 14 Januar 2010 08:25:48 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Mittwoch 13 Januar 2010 10:43:42 schrieb Heinrich Apfelmus: > > > I wonder whether it's really the liveness of pair in > > > > > > mergeSP (a,b) pair > > > = let sm = spMerge b (fst pair) > > > in (a ++ fst sm, merge (snd sm) (snd pair)) > > > > > > that is responsible for the space leak, for chances are that > > > Sparud's technique applies and pair is properly disposed of. > > > Rather, it could be that we need the stronger property that forcing > > > the second component will evaluate the first to NF. > > > > I think that is responsible. At least that's how I understand the > > core: > > > > mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d) > >where > > (bc, b') = spMerge b c > > spMerge ... > > That is equivalent to > > first (a++) . second (`merge`d) $ spMerge b c > > and Daniel's fix is equivalent to > > first (a++) $ spMerge b c d > > > Now, when compiler sees the first variant, it probably treats spMerge as > opaque. I.e. although in reality spMerge only contributes to the > first "channel" while it is progressively instantiated, and (`merge`d) > will only be called upon when spMerge's final clause is reached, that is > (most likely) not known to the compiler at this stage. When looking at > just the first expression itself, it has to assume that spMerge may > contribute to both channels (parts of a pair) while working, and so > can't know _when_ /d/ will get called upon to contribute to the data, as > it is consumed finally at access. > > So /d/ is gotten hold of prematurely, _before_ going into spMerge. No, the problem is that d itself is gotten hold of too late, it is accessed only indirectly via the pair, so we keep an unnecessary reference to c via d. > > The second variant passes the responsibility for actually accessing its > inputs to spMerge itself, and _it_ is clear about needing /d/ only in > the very end. The second variant is clear about not needing the _pair_ anymore once spMerge is entered. Thus d doesn't reference c anymore. > > Just a theory. :) > > Does that make sense? > ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
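For reference, a self-contained sketch of the two variants being compared, with the helpers reconstructed (an illustration of the difference, not the thread's exact code; as in the thread, the first list fed to spMerge is an infinite stream of multiples, so the "first list empty" case is omitted):

-- Leaky: the final merge with d happens outside spMerge, so the thunk in
-- the second component keeps the whole argument pair (c,d) alive until
-- the entire first component (a ++ bc) has been consumed.
mergeSPleaky :: Ord a => ([a],[a]) -> ([a],[a]) -> ([a],[a])
mergeSPleaky (a,b) ~(c,d) = (a ++ bc, merge b' d)
  where
    (bc, b') = spMerge b c
    spMerge u []              = ([], u)
    spMerge u@(x:xs) w@(y:ys) = case compare x y of
      LT -> x <: spMerge xs w
      EQ -> x <: spMerge xs ys
      GT -> y <: spMerge u  ys

-- Fixed: spMerge receives d itself and performs the final merge, so the
-- pair (c,d) can be dropped as soon as spMerge is entered.
mergeSPfixed :: Ord a => ([a],[a]) -> ([a],[a]) -> ([a],[a])
mergeSPfixed (a,b) ~(c,d) = (a ++ bc, m)
  where
    (bc, m) = spMerge b c d
    spMerge u []              zs = ([], merge u zs)
    spMerge u@(x:xs) w@(y:ys) zs = case compare x y of
      LT -> x <: spMerge xs w  zs
      EQ -> x <: spMerge xs ys zs
      GT -> y <: spMerge u  ys zs

-- Lazily cons onto the first component; the ~ keeps production incremental.
(<:) :: a -> ([a],[a]) -> ([a],[a])
x <: ~(vs, rest) = (x:vs, rest)

-- Ordinary duplicate-collapsing merge of two increasing lists.
merge :: Ord a => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge xs@(x:xt) ys@(y:yt) = case compare x y of
  LT -> x : merge xt ys
  EQ -> x : merge xt yt
  GT -> y : merge xs yt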
Re: [Haskell-cafe] Re: FASTER primes
Am Mittwoch 13 Januar 2010 10:43:42 schrieb Heinrich Apfelmus: > I wonder whether it's really the liveness of pair in > > mergeSP (a,b) pair > = let sm = spMerge b (fst pair) > in (a ++ fst sm, merge (snd sm) (snd pair)) > > that is responsible for the space leak, for chances are that Sparud's > technique applies and pair is properly disposed of. Rather, it could > be that we need the stronger property that forcing the second component > will evaluate the first to NF. I think that is responsible. At least that's how I understand the core: mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d) where (bc, b') = spMerge b c spMerge ... -- OldMerge.$wmergeSP :: [GHC.Types.Int] -> [GHC.Types.Int] -> ([GHC.Types.Int], [GHC.Types.Int]) -> (# [GHC.Types.Int], [GHC.Types.Int] #) GblId [Arity 3 Str: DmdType LLL] OldMerge.$wmergeSP = \ (ww_sny :: [GHC.Types.Int]) (ww1_snz :: [GHC.Types.Int]) (w_snB :: ([GHC.Types.Int], [GHC.Types.Int])) -> let { ds_so7 [ALWAYS Just D(SS)] :: ([GHC.Types.Int], [GHC.Types.Int]) LclId [Str: DmdType] ds_so7 = case w_snB of _ { (c_adj, _) -> case OldMerge.$wspMerge ww1_snz c_adj of _ { (# ww3_snH, ww4_snI #) -> (ww3_snH, ww4_snI) } } } in (# GHC.Base.++ @ GHC.Types.Int ww_sny (case ds_so7 of _ { (bc_ajQ, _) -> bc_ajQ }), case ds_so7 of _ { (_, b'_ajS) -> case w_snB of _ { (_, d_adk) -> OldMerge.merge b'_ajS d_adk } -- Here, in the second component of the result, -- we reference the entire pair to get the dorks } #) OldMerge.mergeSP :: ([GHC.Types.Int], [GHC.Types.Int]) -> ([GHC.Types.Int], [GHC.Types.Int]) -> ([GHC.Types.Int], [GHC.Types.Int]) GblId [Arity 2 Worker OldMerge.$wmergeSP Str: DmdType U(LL)Lm] OldMerge.mergeSP = __inline_me (\ (w_snw :: ([GHC.Types.Int], [GHC.Types.Int])) (w1_snB :: ([GHC.Types.Int], [GHC.Types.Int])) -> case w_snw of _ { (ww_sny, ww1_snz) -> case OldMerge.$wmergeSP ww_sny ww1_snz w1_snB of _ { (# ww3_snN, ww4_snO #) -> (ww3_snN, ww4_snO) } }) -- vs. mergeSP (a,b) ~(c,d) = (a ++ bc, m) where (bc,m) = spMerge b c d spMerge ... -- NewMerge.$wmergeSP :: [GHC.Types.Int] -> [GHC.Types.Int] -> ([GHC.Types.Int], [GHC.Types.Int]) -> (# [GHC.Types.Int], [GHC.Types.Int] #) GblId [Arity 3 Str: DmdType LLL] NewMerge.$wmergeSP = \ (ww_snB :: [GHC.Types.Int]) (ww1_snC :: [GHC.Types.Int]) (w_snE :: ([GHC.Types.Int], [GHC.Types.Int])) -> let { ds_soa [ALWAYS Just D(SS)] :: ([GHC.Types.Int], [GHC.Types.Int]) LclId [Str: DmdType] ds_soa = case w_snE of _ { (c_adj, d_adk) -> -- There's no reference to the pair after this case NewMerge.$wspMerge ww1_snC c_adj d_adk of _ { (# ww3_snK, ww4_snL #) -> (ww3_snK, ww4_snL) } } } in (# GHC.Base.++ @ GHC.Types.Int ww_snB (case ds_soa of _ { (bc_ajT, _) -> bc_ajT }), case ds_soa of _ { (_, b'_ajV) -> b'_ajV } #) NewMerge.mergeSP :: ([GHC.Types.Int], [GHC.Types.Int]) -> ([GHC.Types.Int], [GHC.Types.Int]) -> ([GHC.Types.Int], [GHC.Types.Int]) GblId [Arity 2 Worker NewMerge.$wmergeSP Str: DmdType U(LL)Lm] NewMerge.mergeSP = __inline_me (\ (w_snz :: ([GHC.Types.Int], [GHC.Types.Int])) (w1_snE :: ([GHC.Types.Int], [GHC.Types.Int])) -> case w_snz of _ { (ww_snB, ww1_snC) -> case NewMerge.$wmergeSP ww_snB ww1_snC w1_snE of _ { (# ww3_snQ, ww4_snR #) -> (ww3_snQ, ww4_snR) } }) -- ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: FASTER primes
Am Dienstag 12 Januar 2010 11:30:07 schrieb Heinrich Apfelmus: > Tricky stuff. It is known that pairs/records are prone to unwanted > retention, see for example the recent thread > > http://thread.gmane.org/gmane.comp.lang.haskell.cafe/66903/focus=67871 > > or > > Jan Sparud. Fixing some space leaks without a garbage collector. > http://citeseer.ist.psu.edu/240848.html "CiteSeerx is momentarily busy, Please try us again, in a few moments. We apologize for any inconvenience." :( tried and re-tried. Took http://bert.md.chalmers.se/pub/cs-reports/papers/sparud/spaceleak- fix.ps.gz finally > > It is exactly because these troubles that I'm advocating the original > VIP data structure that buries the dorks (that name is awesome :D) deep > inside the structure. :) > > > In any case, it appears to me that the lazy pattern match in mergeSP > is actually unnecessary! This is because mergeSP returns a pair > constructor immediately, so infinite nesting works even when the second > argument is demanded. In other words, > > mergeSP :: Integral a => People a -> People a -> People a > mergeSP (P a b) (P c d) = P (a ++ bc) (merge b' d) > where > P bc b' = spMerge b c > spMerge [] ys = P [] ys > spMerge xs [] = P [] xs > spMerge xs@(x:xt) ys@(y:yt) = case compare x y of > LT -> celebrate x (spMerge xt ys) > EQ -> celebrate x (spMerge xt yt) > GT -> celebrate y (spMerge xs yt) > > should work fine and do away with the memory issue because the pair is > now deconstructed immediately. No, <>. For the reason you stated below. In tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs , it must compute smartfold f xs too early to match it against P c d. The compiler can't see that smartfold mergeSP always returns a P [well, it might return _|_, mightn't it?]. > > Hm, a strict second argument might however mean that the tree produced > by tfold is demanded too early. > > > > Regards, > Heinrich Apfelmus > > -- > http://apfelmus.nfshost.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
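The People/VIP structure used above comes from earlier in the thread; a reconstruction for readers joining here (field names as used in the thread, definitions assumed):

-- "vips" are values already known to come next in order; "dorks" are the
-- rest of the stream, still to be merged.
data People a = P { vips :: [a], dorks :: [a] }

-- Put a value at the front of the VIP part, without touching the rest.
celebrate :: a -> People a -> People a
celebrate x p = P (x : vips p) (dorks p)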
Re: [Haskell-cafe] Re: FASTER primes
Am Freitag 08 Januar 2010 18:37:21 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Donnerstag 07 Januar 2010 11:43:44 schrieb Heinrich Apfelmus: > > > Will Ness wrote: > > > > > > Hm? In my world view, there is only reduction to normal form and I > > > don't see how "allocate its own storage" fits in there. Okasaki > > > having shown something to that end would be new to me? > > > > Perhaps what was meant was "storage must be allocated for each filter" > > (which, however, seems trivial). > > I still contend that in case of nested filters, where each filter has > only one consumer, even that isn't ultimately necessary (a chain of > pointers could be formed). That's because there's no "peek"s, only > "pull"s there. If the filters are used in merge situation, then there > will be some "peek"s so current value must be somehow stored, though > it's better to be made explicit and thus static (the place for it, I > mean, instead of the "rolling" cons cell). Such implementation technique > would prevent some extra gc. > Well, for each filter we need to store 1. a pointer to where in the wheel we are 2. the prime by which to multiply the steps of the wheel 3. the current value of the filter (which number to reject next) I don't see how to use less. It's not much, but it's not zero storage. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
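A hedged rendering of the three pieces of per-filter state listed above, just to make the accounting concrete (illustrative names only, not code from the thread):

-- 1. where in the wheel we are, 2. the prime scaling the wheel's steps,
-- 3. the current value of the filter (the next number to reject).
data FilterState a = FilterState
  { wheelRest  :: [a]  -- remaining (cyclic) wheel gaps from the current position
  , stepPrime  :: a    -- the prime p by which the wheel's steps are multiplied
  , nextReject :: a    -- next multiple of p that the filter will reject
  }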
Re: [Haskell-cafe] Re: FASTER primes
Am Samstag 09 Januar 2010 08:04:20 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Freitag 08 Januar 2010 19:45:47 schrieb Will Ness: > > > Daniel Fischer web.de> writes: > > > > It's not tail-recursive, the recursive call is inside a celebrate. > > It is (spMerge that is). No. "In computer science, tail recursion (or tail-end recursion) is a special case of recursion in which the last operation of the function, the tail call, is a recursive call." The last operation of spMerge is a call to celebrate or the pair constructor (be that P or (,)). Doesn't matter, though, as for lazy languages, tail recursion isn't very important. > It calls tail-recursive celebrate in a tail > position. What you've done, is to eliminate the outstanding context, by > moving it inward. Your detailed explanation is more clear than that. :) > > BTW when I run VIP code it is consistently slower than using just pairs, I can't reproduce that. Ceteris paribus, I get the exact same allocation and GC figures whether I use People or (,), running times identical enough (difference between People and (,) is smaller than the difference between runs of the same; the difference between the fastest and the slowest run of the two is less than 0.5%). I think it must be the other changes you made. > modified with wheel and feeder and all. So what's needed is to > re-implement your approach for pairs: > > mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d >in (a ++ bc, bd) > where > spMerge u [] d = ([], merge u d) > spMerge u@(x:xs) w@(y:ys) d = case compare x y of >LT -> consSP x $ spMerge xs w d >EQ -> consSP x $ spMerge xs ys d >GT -> consSP y $ spMerge u ys d > > consSP x ~(a,b) = (x:a,b) -- don't forget that magic `~` !!! I called that (<:). > > > BTW I'm able to eliminate sharing without a compiler switch by using > Yes, I can too. But it's easy to make a false step and trigger sharing. I can get a nice speedup (~15%, mostly due to much less garbage collecting) by doing the final merge in a function without unnecessarily wrapping the result in a pair (whose secondcomponent is ignored): -- Doesn't need -fno-cse anymore, -- but it needs -XScopedTypeVariables for the local type signatures primes :: forall a. Integral a => () -> [a] primes () = 2:3:5:7:11:13:calcPrimes 17 primes'' where calcPrimes s cs = rollFrom s `minus` compos cs bootstrap = 17:19:23:29:31:37:calcPrimes 41 bootstrap primes' = calcPrimes 17 bootstrap primes''= calcPrimes 17 primes' pmults :: a -> ([a],[a]) pmults p = case map (*p) (rollFrom p) of (x:xs) -> ([x],xs) multip :: [a] -> [([a],[a])] multip ps = map pmults ps compos :: [a] -> [a] compos ps = case pairwise mergeSP (multip ps) of ((a,b):cs) -> a ++ funMerge b (pairwise mergeSP cs) funMerge b (x:y:zs) = let (c,d) = mergeSP x y in mfun b c d (pairwise mergeSP zs) mfun u@(x:xs) w@(y:ys) d l = case compare x y of LT -> x:mfun xs w d l EQ -> x:mfun xs ys d l GT -> y:mfun u ys d l mfun u [] d l = funMerge (merge u d) l This uses a different folding structure again, which seems to give slightly better performance than the original tree-fold structure. In contrast to the VippyPrimes, it profits much from a larger allocation area, running with +RTS -A2M gives a >10% speedup for prime # 10M/20M, +RTS -A8M nearly 20%. -A16M and -A32M buy a little more, but in that range at least, it's not much (may be significant for larger targets). Still way slower than PQ, but the gap is narrowing. 
> mtwprimes () = 2:3:5:7:primes
>   where
>   primes = doPrimes 121 primes
>
> doPrimes n prs = let (h,t) = span (< n) $ rollFrom 11
>                  in h ++ t `diff` comps prs
> doPrimes2 n prs = let (h,t) = span (< n) $ rollFrom (12-1)
>                   in h ++ t `diff` comps prs
>
> mtw2primes () = 2:3:5:7:primes
>   where
>   primes  = doPrimes 26 primes2
>   primes2 = doPrimes2 121 primes2
>
> Using 'splitAt 26' in place of 'span (< 121)' didn't work though.
>
> How about them wheels? :)

Well, what about them?
___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
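To spell out the remark about the "magic `~`" in consSP (quoted earlier in this message): without the lazy pattern, matching the pair would force the recursive spMerge call before the head x can be handed out, so nothing is produced incrementally, and with the primes feeding back into their own composites this can spiral into a space blow-up or non-termination. A sketch of the two shapes:

-- With the lazy pattern, consSP returns the head before the tail pair exists.
consSP :: a -> ([a],[a]) -> ([a],[a])
consSP x ~(a,b) = (x:a, b)

-- Hypothetical strict variant: the pair from the recursive call must be
-- evaluated first, defeating the lazy, element-by-element production.
consSPstrict :: a -> ([a],[a]) -> ([a],[a])
consSPstrict x (a,b) = (x:a, b)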
Re: [Haskell-cafe] Re: FASTER primes
Am Freitag 08 Januar 2010 19:45:47 schrieb Will Ness: > Daniel Fischer web.de> writes: > > > > mergeSP :: Integral a => People a -> People a -> People a > > mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd) > > where > > mrgd = spMerge (dorks p1) (vips p2) (dorks p2) > > spMerge l1 [] l3 = P [] (merge l1 l3) > > spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of > > LT -> celebrate x (spMerge xs l2 l3) > > EQ -> celebrate x (spMerge xs ys l3) > > GT -> celebrate y (spMerge l1 ys l3) > > > > -- > > Hi Daniel, > > Is it so that you went back to my fold structure? Yes. > Was it better for really big numbers of primes? Yes, it is slower for <= 20 million primes (and some beyond), but faster for >= 50 million (and some before), according to the few tests I made. > > I had the following for ages (well, at least two weeks) but I thought it > was slower and took more memory (that was _before_ the 'no-share' and > 'feeder' stuff). I can see the only difference in that you've re-written > spMerge in a tail-recursive style with explicitly deconstructed parts; It's not tail-recursive, the recursive call is inside a celebrate. As it turns out, the important things are 1. a feeder and separate lists of multiples for the feeder and the runner, for the reasons detailed earlier (two-step feeding and larger wheel are pleasing but minor optimisations). 2. a strict pattern in tfold 3. moving the merge inside spMerge (you can have mergeSP (a,b) ~(c,d) = (a ++ bc, m) where (bc,m) = spMerge b c spMerge u [] = ([],merge u d) ... without exploding memory, but it's *much* slower than letting spMerge take three arguments) Now, I have to admit, the only point I _really_ understand is 1. (and why the three-argument spMerge is faster than the two-argument one taking the merge-partner from mergeSP's binding :). Why has mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c in (a ++ bc, merge b' d) a memory leak, but mergeSP (a,b) ~(c,d) = let (bc,m) = spMerge' b c d in (a ++ bc, m) not? Well, looking at the core for mergeSP, the fog clears somewhat. The former is translated roughly to mergeSP (a,b) pair = let sm = spMerge b (fst pair) in (a ++ fst sm, merge (snd sm) (snd pair)) It holds on to the pair until the result of the merge is demanded, that is until the entire (a ++ fst sm) is consumed. Only then is the pair released and can be collected. On top of that, as soon as a is consumed and (fst sm) [or bc] is demanded, spMerge forces the evaluation of (fst pair) [c]. After a few levels, the evaluated list will take more space than the thunk. It cannot yet be collected, because pair is still alive. The elements have to be duplicated for (fst sm), because they're intertwined with those of b. On the next level of merging, they have to be duplicated again. The latter is translated roughly to mergeSP (a,b) pair = let sm = spMerge' b (fst pair) (snd pair) in (a ++ fst sm, snd sm) The pair is released as soon as the result of the spMerge' is demanded, that is, when a is consumed. Then the elements of (fst pair) need not be duplicated and they can be discarded almost immediately after they have been produced [for small primes, multiples of larger primes live a little longer, but those are fewer]. So, no duplication, shorter lifespan => no leak. Having seen that, the question is, why can't the compiler see that deconstructing the pair when the first component is needed is better? The first component of the pair is used in no other place, so keeping the pair can't have any advantage, can it? 
And why does tfold f (a: ~(b: ~(c:xs))) = ... leak, but not tfold f (a:b:c:xs) = ... ? I guess it's similar. tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f ([pairwise f] xs) is tfold f (a:zs) = (a `f` (head zs `f` (head $ tail zs))) `f` tfold f (pairwise f $ drop 2 zs) the latter part holds on to the beginning of zs, leading again to data duplication and too long lifespans. > mine was relying on the compiler (8-L) to de-couple the two pipes and > recognize that the second just passes along the final result, unchanged. > > The two versions seem to me to be _exactly_ operationally equivalent. Well, they're not. The main difference is what we told the compiler _when_ to deconstruct the pairs. You said "at the latest possible moment", I said "as soon as we need the first component". It's not entirely obvious, but it is frequently said that understanding the space (and time) behaviour of lazy evaluation isn't always easy. > All this searching for the code better understood by the compiler is > _*very*_ frustrating, as it doesn't reflect on the semantics of the > code, or even the operational semantics of the code. :-[ > > Weren't the P's supposed to disappear completely in the com
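A sketch of the folding machinery under discussion, with pairwise reconstructed as assumed from earlier in the thread and tfold shown in both shapes (here the pairwise reduction is folded directly into the recursion; both functions assume an infinite input list, as in the thread):

-- Merge adjacent elements of the list of multiples streams.
pairwise :: (a -> a -> a) -> [a] -> [a]
pairwise f (x:y:ys) = f x y : pairwise f ys
pairwise _ xs       = xs

-- Lazy patterns: the outer node retains the start of the list until b and
-- c are demanded, leading to the duplication and long lifespans described.
tfoldLazy :: (a -> a -> a) -> [a] -> a
tfoldLazy f (a : ~(b : ~(c:xs))) =
  (a `f` (b `f` c)) `f` tfoldLazy f (pairwise f xs)

-- Ordinary patterns: the three heads are taken apart at once, and the rest
-- of the list is passed on without being held from the outside.
tfoldStrict :: (a -> a -> a) -> [a] -> a
tfoldStrict f (a:b:c:xs) =
  (a `f` (b `f` c)) `f` tfoldStrict f (pairwise f xs)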
Re: [Haskell-cafe] Re: FASTER primes
Am Donnerstag 07 Januar 2010 17:13:38 schrieb Daniel Fischer:
> compos :: [a] -> [a]
> compos = vips . itfold mergeSP . multip

Sigh! That's what I get for editing the code in the mail editor. I decided to change the really stupid 'itfold' to 'smartfold' and forgot this occurrence.
___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: FASTER primes
Am Mittwoch 06 Januar 2010 19:13:01 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Mittwoch 06 Januar 2010 00:09:07 schrieb Will Ness: > > > Daniel Fischer web.de> writes: > > > > Am Montag 04 Januar 2010 22:25:28 schrieb Daniel Fischer: > > > > Fix rfold: > > > > > > > > rfold f [x] = x > > > > rfold f xs = rfold f (pairwise f xs) > > > > > > > > and it's faster also for those. > > > > The memory is almost completely due to the tree-merging of the multiples > > for the fastest runner. While it produces faster than flat merging, the > > exponential growth of the trees makes a bad memory citizen. > > Isn't the number of nodes the same in any two trees with the same number of > leafs? Sure. And I don't know why it takes *that much* memory, but since a flat merge doesn't consume much memory, it must be the trees. > > BTW using > > compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP $ map pmults ps > > instead of > > compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP > $ pairwise mergeSP $ map pmults ps > > brings down memory consumption further by 25%..45% on 1..8 mln primes > produced, while slowing it down by about 0%..2% (that's after eliminating > the lazy pattern in tfold as per your advice). > So much? Wow. I forgot the numbers, but I had tried that too, I thought the memory gain wasn't worth the speed loss. Another thing that reduces memory usage is compos :: [a] -> [a] compos ps = case pairwise mergeSP (multip ps) of ((a,b):cs) -> a ++ funMerge b (nwise 1 mergeSP $ pairwise mergeSP cs) funMerge b (x:y:zs) = let (c,d) = mergeSP x y in mfun b c d zs mfun u@(x:xs) w@(y:ys) d l = case compare x y of LT -> x:mfun xs w d l EQ -> x:mfun xs ys d l GT -> y:mfun u ys d l mfun u [] d l = funMerge (merge u d) l That's about the same speed as the latter of the above here and uses about 25% less memory. Removing one or two 'pairwise's brings memory down further, but makes it slower. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: FASTER primes
Am Mittwoch 06 Januar 2010 00:09:07 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Montag 04 Januar 2010 22:25:28 schrieb Daniel Fischer: > > > memory still grows, but much slower, in my tests, due to the much > > > smaller GC time, it's a bit faster than the version with the original > > > tfold. > > > > Not for larger inputs (but not so large that the tree-fold dies OOM). > > Fix rfold: > > > > rfold f [x] = x > > rfold f xs = rfold f (pairwise f xs) > > > > and it's faster also for those. > > Niiice This is just great! :) > > I tried a two-step feed BTW (that's three separate sets of lists) , with > the original structure. It ran with same speed as your new version (10..20% > faster) but with the memory of the previous one :) (80M for 8 mil primes vs > the new one's 10M). The memory is almost completely due to the tree-merging of the multiples for the fastest runner. While it produces faster than flat merging, the exponential growth of the trees makes a bad memory citizen. With the nwise and rfold, a two-step (or even three-step) feeding doesn't gain nearly as much (I found about 1% speedup). > But your new structure is just great! I hoped there is > something better, that's why I posted it here in the first place. > I have two more goodies :) 1. now that the trees don't grow so fast, don't use lazy patterns in the definition of tfold: tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` tfold f xs gains something like 6-7% here (and uses a little less memory). 2. Now we have a big wheel, 5760 differences per period. Then dropping a few thousand elements from the wheel after calculating how many in rollFrom is slow: rollFrom n = go ((n-17) `rem` 30030) wheel13 where go r xxs@(x:xs) | r < x = roll n xxs | otherwise = go (r-x) xs gains another couple of percents for large targets (~1% for the 10M prime, ~2% for 20M, I expect that to stay in th region of 1.5-3% for larger targets). > 'pairwise' puts odd leafs higher on the right. It might be better if it was > so on the left, for the frequency of production is higher. Maybe. But how would you do it? I tried passing the length to rfold, so when there was an odd numberof trees in the list, it would move the first out of the recursion. Any possible gains in production have been more than eaten up by the control code (not a big difference, but it was there). > > > Thanks a lot for your comments! ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
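For concreteness, a hedged sketch of the wheel data assumed above (the thread's wheel13 is surely precomputed; this only shows what the 5760 gaps with period 30030 are, plus the quoted rollFrom reformatted; n is assumed to lie on the wheel, i.e. be coprime to 2..13, as it is where the thread uses it):

-- Numbers >= 17 coprime to 2,3,5,7,11,13; their gaps repeat with period
-- 30030 = 2*3*5*7*11*13, and there are 5760 of them per period.
wheel13 :: [Int]
wheel13 = cycle gaps
  where
    coprimes = [ n | n <- [17 .. 17 + 30030]
                   , all (\p -> n `mod` p /= 0) [2,3,5,7,11,13] ]
    gaps     = zipWith (-) (tail coprimes) coprimes   -- 5760 gaps

-- Generate candidates by repeatedly adding the gaps.
roll :: Int -> [Int] -> [Int]
roll n (d:ds) = n : roll (n + d) ds
roll n []     = [n]

-- rollFrom as quoted above: skip into the wheel instead of dropping
-- thousands of elements one by one.
rollFrom :: Int -> [Int]
rollFrom n = go ((n - 17) `rem` 30030) wheel13
  where
    go r xxs@(x:xs)
      | r < x     = roll n xxs
      | otherwise = go (r - x) xs
    go _ []       = []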
Re: [Haskell-cafe] Re: FASTER primes
Am Dienstag 05 Januar 2010 14:49:58 schrieb Will Ness: > > ... There are two attempts to eliminate 45. > > I would say there are two requests to not have 45 in the output. > Thers are many possible ways to phrase it. > > > I don't see any problem here. As Melissa (and yourself, I think) have > > > shown, double hits on multiples are few and far between. > > > > It's not a problem, it just means it's not Euler's sieve, because that > > attempts to eliminate each composite exactly once. > > yes I see now. My bad. Wasn't reading that wikipedia page with enough > attention to detail. It uses the modified (culled, so far) numbers to find > out the next multiples to be eliminated, Minor factual error, no big deal. > not just the prime itself like my code does. Which is operationally far better because it's much simpler :) > > You solution is indeed, exponential: > > euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks) > primes = 2:euler [3,5..] > > > primes >= 2:euler (a...@[3,5..]) >3:euler (bs@(tail as `minus` map (3*) as)) > 5:euler (cs@(tail bs `minus` map (5*) bs)) > > > There are two separate look-back pointers to /as/ in /bs/, and there are > two such in /cs/, to /bs/. The book-keeping machinery explodes. > > > > Also, it uses no filters, i.e. no individual number divisibility > > > testing. The "filters" refers first of all to testing an individual > > > number to decide whether to keep it in or not. > > > > Umm, the postponed filters says keep everything until p^2, then eliminate > > (filter out, remove) multiples of p in the remainder, after that, pick > > next prime. > > That's precisely what the above does. It doesn't do the filtering out by > > divisibility testing but by minus (hence more efficiently). I would say > > that's a variant of the postponed filters. > > Filter is usually (as in Haskell's 'filter') is about testing individual > elements by a predicate function. There is of course a filtering effect in > two lists elts' comparison that 'minus' performs, so that's debatable. Even > the PQ code performs "filtering" in this wider sense. > I understood the 'filters' in postponed filters in that wider sense. And I would also find it perfectly acceptable to say that the PQ code filters out the composites from the stream. Of course if you're used to using 'filter' only in the stricter sense, you wouldn't call the `minus` thing a variant of the postponed filters. > > not just primes - all the composites not yet removed, too. Between p and p^2, there are only primes left, fortunately. > So it can't even be implemented on shared mutable storage if we > want to delay the removal (as we must, in unbounded case) - Yes. And it's not nice in the bounded case either. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
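For reference, the `minus` referred to throughout the thread is the standard ordered-list difference; a reconstruction (both arguments assumed to be increasing lists):

minus :: Ord a => [a] -> [a] -> [a]
minus xs [] = xs
minus [] _  = []
minus xs@(x:xt) ys@(y:yt) = case compare x y of
  LT -> x : minus xt ys   -- x is not in ys: keep it
  EQ ->     minus xt yt   -- x is "crossed off": skip it
  GT ->     minus xs yt   -- this y is already behind: advance ys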
Re: [Haskell-cafe] Re: FASTER primes
Am Dienstag 05 Januar 2010 10:33:19 schrieb Will Ness: > Such a > system would probably have to distinguish, at the type level, between > [1..m] ; cycle [1..m] ; take n [1..m] ; etc. These would all be not just > fuctions, but parts of a type's (here list) behaviour with automatically > deduced semantics. > > What would such a type system be called? > Seriously complicated, I think. Don't get me wrong, a system with such capabilities would be A.W.E.S.O.M.E. I just can't see it happening anytime soon. > > I would imagine so. Do I get this fusion on lists for free from the > compiler, or do I have to recode for that? (haven't yet time to look into > the article mentioned). > Without optimisations, hardly ever if at all (needs a compiler expert to know). With optimisations, sometimes. But it doesn't always see the possibility where it would with recoding in the right style (dons is expert in that). With stream fusion, more often. But I think it would again be not very hard to hide opportunities for fusion by your coding style. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: FASTER primes
Will Ness skrev:
> Emil Axelsson chalmers.se> writes:
>>> For me, a real smart compiler is one that would take in e.g. (sum $ take
>>> n $ cycle $ [1..m]) and spill out a straight up math formula, inside a
>>> few ifs maybe (just an aside).
>>
>> (Also an aside, I couldn't resist...) Then I'm sure you'd say that
>> Feldspar [1] has a smart compiler :)
>
> but it didn't produce
>
>   f n m = if n < m then n*(n+1)/2
>           else let (q,r) = quotRem n m
>                in q*(m*(m+1)/2) + r*(r+1)/2
>
> :)

Ah, I see... Yes, that would be a very smart compiler :)

/ Emil
___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
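The formula quoted above, rendered as plain Haskell for anyone who wants to check it (assuming n >= 0 and m >= 1, since cycle [] is undefined):

-- Closed form for sum (take n (cycle [1..m])): q full cycles contribute
-- q * m*(m+1)/2, the leftover r elements contribute r*(r+1)/2.
sumTakeCycle :: Integer -> Integer -> Integer
sumTakeCycle n m
  | n < m     = (n * (n + 1)) `div` 2
  | otherwise = let (q, r) = n `quotRem` m
                in  q * ((m * (m + 1)) `div` 2) + (r * (r + 1)) `div` 2

-- e.g. sumTakeCycle 1000 7 == sum (take 1000 (cycle [1..7]))  -- both are 3997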
Re: [Haskell-cafe] Re: FASTER primes
Am Montag 04 Januar 2010 22:25:28 schrieb Daniel Fischer:
> compos ps = fst (tfold mergeSP $ nwise 1 mergeSP (pairwise mergeSP (multip ps)))
>
> tfold f (a: ~(b: ~(c:xs)))
>     = (a `f` (b `f` c)) `f` tfold f xs
>
> nwise k f xs = let (ys,zs) = splitAt k xs in rfold f ys : nwise (k+1) f zs
>
> rfold f [x] = x
> rfold f (x:xs) = x `f` rfold f xs
>
> memory still grows, but much slower, in my tests, due to the much smaller
> GC time, it's a bit faster than the version with the original tfold.

Not for larger inputs (but not so large that the tree-fold dies OOM). Fix rfold:

rfold f [x] = x
rfold f xs = rfold f (pairwise f xs)

and it's faster also for those.
___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: FASTER primes
Am Montag 04 Januar 2010 16:30:18 schrieb Will Ness: > Heinrich Apfelmus quantentunnel.de> writes: > > > > (I haven't followed the whole thread, but hopefully I have enough grasp > > of it to make a useful remark. :)) > > > > Concerning lists as producer/consumer, I think that's exactly what lazy > > evaluation is doing. Neither filter , map or span evaluate and > > store more list elements that strictly necessary. > > I laways suspected as much, but was once told that Chris Okasaki has shown > that any filter etc must allocate its own storage. With the peek/pull they > don't have to, if they are nested, and the deepest one from the real > storage gets pulled through some pointer chasing eventually. Span isn't so > easily compiled out too or is it? But that might be a minor point. > > For me, a real smart compiler is one that would take in e.g. (sum $ take n > $ cycle $ [1..m]) and spill out a straight up math formula, inside a few > ifs maybe (just an aside). Such things just aren't common enough. If you start letting the compiler look for patterns such as this which can be transformed into a simple formula, compile times would explode. > > Such a smart compiler might even be able to derive a well performing code > right from the Turner's sieve. :) > > > Sure, creating a list head only to immediately consume it is somewhat > > inefficient -- and the target of stream fusion[1] -- but this is an > > overhead of how list elements are stored, not how many. > > it might be equivalent to the (imagined) producer's storing its 'current' > value inside its frame. > > How much can we rely on the run-time to actually destroy all the > passed-over elements and not hang on to them for some time? If they're really passed over, they very probably won't survive the next GC. > Is this that compiler switch that Daniel mentioned? Is it reliable? > No, it's something different. Not entirely unrelated. The -fno-cse turns off Common Subexpression Elimination (rather sharing than elimination). That is, if you have f x = g (expensive x) y z (h (expensive x)) the compiler can see the common subexpression (expensive x) on the RHS and decide to share it, i.e. calculate it only once: f x = let e = expensive x in g e y z (h e) That may or may not be a good idea. If e is small (say an Int) and expensive isn't a blatant lie, it is a *very* good idea. But if e is a loong list that expensive x lazily produces and that is consumed by g and h in the right manner, it is often a bad idea because h might not start to consume before g has demanded a long prefix of the list and kaboom, Heap exhausted. If you turn on optimisations, the compiler does some looking for common subexpressions to share. There are some heuristics to decide when to share, but they're far from infallible. So, if you have a situation like above, and the compiler heuristics indicate that sharing would be good although it isn't, you're hosed. So you have the option to tell the compiler "try hard to optimise, but don't do CSE (because I know it would be bad here)" {- that can be done on a per-file basis, it would be cool if it could be done on a per-function basis -}. Now if you have a list producer and a consumer, without fusion, it goes like Consumer: Gimme Producer: creates cons cell, puts value, next cons cell (how to produce more) Consumer: takes value, does something with it, gimme more. Stream fusion is about eliminating the cons cell creating and value passing, to fuse production and consumption into a nice loop. 
That is of course impossible if the produced list is shared between two (or more) consumers. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
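A toy illustration of the sharing pitfall described above (hypothetical example, not from the thread): with the shared binding, the list produced by expensive stays live while g walks a long prefix of it, because the unevaluated (h e) still references its start.

-- Stand-in for a lazily produced, long list.
expensive :: Int -> [Int]
expensive x = [x ..]

g :: [Int] -> Int -> Int -> Int -> Int
g e y z w = sum (take 1000000 e) + y + z + w

h :: [Int] -> Int
h = head

-- Two separate productions: each list can be collected as it is consumed
-- (with optimisations on, GHC may CSE these back together, which is
-- precisely what -fno-cse prevents).
noShare :: Int -> Int
noShare x = g (expensive x) 0 0 (h (expensive x))

-- What CSE would produce: one shared list, retained in full while g
-- consumes it.
withShare :: Int -> Int
withShare x = let e = expensive x in g e 0 0 (h e)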
Re: [Haskell-cafe] Re: FASTER primes
Am Montag 04 Januar 2010 19:16:32 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Montag 04 Januar 2010 13:25:47 schrieb Will Ness: > > > Euler's sieve is > > > > > > sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..]) > > > where (h,t) = span (< p*p) xs > > > > Not quite. That's a variant of the postponed filters, it crosses off e.g. > > 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed > > by the first, so let's say 45 appears in two lists of numbers to be > > removed if present). > > there won't be any such. whatever is removed first, is just skipped second > (and third, etc). 45 does appear twice on the two multiples ists (3- and > 5-). But it is "removed" by first, and skipped by second. And there's no > removal here in the first place. There is no pre-filled storage. All there > is, is some lists comparison, and lists are just generators (I so keep > hoping). ((45:(offer 47 when demanded)) `minus` (45:(next will be 51 when demanded))) `minus` (45: (next will be 55 when demanded)) So there are two attempts to tell the generator to not output 45. To the second, it answers "I already knew that", but the request is made nevertheless. Lists sometimes are data structures occupying physical storage (cons cell, point to value, pointer to next...), sometimes generators (when asked for the next value, I'll answer this). I say a value is removed from a list regardless of whether it takes relinking pointers in a list of cons cells or it involves telling the generator not to give out that value. I could also say 'eliminated'. There are two attempts to eliminate 45. > > I don't see any problem here. As Melissa (and yourself, I think) have > shown, double hits on multiples are few and far between. It's not a problem, it just means it's not Euler's sieve, because that attempts to eliminate each composite exactly once. > > Also, it uses no filters, i.e. no individual number divisibility testing. > The "filters" refers first of all to testing an individual number to decide > whether to keep it in or not. Umm, the postponed filters says keep everything until p^2, then eliminate (filter out, remove) multiples of p in the remainder, after that, pick next prime. That's precisely what the above does. It doesn't do the filtering out by divisibility testing but by minus (hence more efficiently). I would say that's a variant of the postponed filters. > Euler's sieve removes multiples in advance, > so there's no testing and no filtering, only comparison. It follows the > _Postponed_ Filter's framework in postponing the action until the right > moment; the action itself is two lists comparison and skipping of the > equals (i.e. the "minus" action). > > > Euler's sieve is never attempting to remove a number more than once, > > that's > > How's that possible? http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Euler.27s_Sieve "Euler in his Proof of the Euler product formula for the Riemann zeta function came up with a version of the sieve of Eratosthenes, *better in the sense that each number was eliminated exactly once*." C) The number after the previous prime is also a prime. *Multiply each number in the list starting from this prime by this prime and discard the products*. > On wikipedia is says, it removes multiples of 3; then > multiples of 5; etc. So it just looks for each number it is removing, if it > is there, and if so, removes it. I would argue that looking whether to > remove or not, is part of attempting to remove. It can't have foresight, > right? 
> But it has :) By only trying to eliminates products of the prime p currently under consideration *with numbers (>= p) which have not yet been eliminated from the list*, it is known in advance that all these products are still in the list. When p is under consideration, the list contains (apart from the primes < p) precisely the numbers whose smallest prime factor is >= p. > > the outstanding feature of it. Unfortunately, that also makes it hard to > > implement efficiently. The problem is that you can't forget the primes > > between p and p^2, you must remember them for the removal of multiples of > > p later on. > > you're right, every formulation of these algorithms is always done in the > imperative, mutable-storage setting. They always speak of "removing" > numbers etc. sed s/remove/eliminate/ ? > > The more pressing problem with that code is its linear structure of course > which gets addressed by the tree-folding merge etc. > Which unfortunately introduces its own space problem :( > > An (inefficient but faithful) implementation would be > > > > euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks) > > I think it does exactly the same thing, computationally, No, in map (*p) ks, there are no numbers with a smaller prime factor than p, while in map (*p) [p, p+2 .. ] there are (for p > 3) > except it does so, > again, _prematurely_ exactly in Tu
Re: [Haskell-cafe] Re: FASTER primes
Am Montag 04 Januar 2010 18:08:42 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness: > > > The quesion of a memory blowup with the treefolding merge still > > > remains. For some reason using its second copy for a feeder doesn't > > > reduce the memory (as reported by standalone compiled program, GHCi > > > reported values are useless) - it causes it to increase twice. > > > > I have a partial solution. The big problem is that the feeder holds on to > > the beginning of comps while the main runner holds on to a later part. > > Thus the entire segment between these two points must be in memory. > > > > So have two lists of composites (yeah, you know that, but it didn't work > > so far). > > > > But you have to force the compiler not to share them: enter -fno-cse. > > The attached code does that (I've also expanded the wheel), it reduces > > the memory requirements much (a small part is due to the larger wheel, a > > factor of ~5 due to the non-sharing). > > I don't understand. What is there to be shared? Each multiples list is > consumed only at one point; there's nothing to be shared. Do you mean the > compiler still hangs on to them? If so, why?? Okay, let's look at the code. Your Primes.hs says {-# SPECIALIZE primes :: () -> [Int] #-} {-# SPECIALIZE primes :: () -> [Integer] #-} primes :: Integral a => () -> [a] primes () = 2:3:5:7:primes' where primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps (comps,_) = tfold mergeSP (pairwise mergeSP mults) mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes' fromList (x:xs) = ([x],xs) What's going on evaluating primes'? First, we know primes' is 11 : 13 : more. Then we can start looking at 'more'. For that, we need rollFrom 11 `minus` comps. rollFrom 11 is easy. What about comps? We need the start of primes' for that. Cool, we already know the first two elements, start tfolding, we know comps is 121:143:169:notYetKnown So the rollFrom 11 produces happily, minus doesn't veto anything for a while, then says no to 121, 143, 169. Now rollFrom 11 says "look at 173". Fine, what's next in comps? To find that out, we need the multiples of 17 and 19 (and 23 and 29, but those can still be deferred for a while). That gives ([289,323,361],whatever), then the mergeSP with the merged multiples of 11 and 13 produces (187:209:221:...) for notYetKnown. Everything's going on fine, but the problem is, when primes' produces primes in the region of n, it needs the comps in that region. Those contain the multiples of primes in the region of sqrt n, thus the small primes cannot yet be released. Bottom line, to produce p, all the primes from about sqrt p to p must be in memory. Oops. Enter the feeder. Say primes :: Integral a => () -> [a] primes () = 2:3:5:7:primes' where primes' = (rollFrom 11) `minus` comps primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps (comps,_) = tfold mergeSP (pairwise mergeSP mults) mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'' fromList (x:xs) = ([x],xs) Now primes'', the feeder of composites advances much more slowly and for primes' to produce p, primes'' must produce the next prime after sqrt p, thus holds on to all primes between sqrt (sqrt p) and sqrt p - not much of a problem. primes' doesn't hold on to smaller primes, great. But! primes' needs the part of comps around p. primes'' needs the part of comps around sqrt p. That means we must keep the part of the list of composites from sqrt p to p in memory. 
Even after removing the multiples of 2, 3, 5 and 7, it doesn't take long until there are far more composites between sqrt p and p than primes. Double oops. So we must make sure that the list of composites that primes' consumes is not the same as that which primes'' consumes. You can try primes :: Integral a => () -> [a] primes () = 2:3:5:7:primes' where primes' = (rollFrom 11) `minus` comps primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps2 (comps,_) = tfold mergeSP (pairwise mergeSP mults) mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'' (comps2,_) = tfold mergeSP (pairwise mergeSP mults2) mults2 = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'' fromList (x:xs) = ([x],xs) If you compile without optimisations, it works. It just is rather slow. Unfortunately, the optimiser sees that comps2 and comps are the same and thinks, "why should I produce it twice if they're the same?" - whoops, back to square two. Tell the compiler that you do *not* want it shared via -fno-cse and it isn't shared, you really have two distinct lists of composites, happiness ensues - almost. > > I used the switch; it didn't help at all. The only thing I can see is > different is that all my interim data which I named with inner vars you > moved out to the top level as functions. Is that what did the trick? Not really, the abo
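As an aside on how the flag is applied: it works per module, not per function (as lamented elsewhere in the thread), and it can be requested in the source file itself instead of on the command line. A minimal sketch (module name hypothetical):

{-# OPTIONS_GHC -fno-cse #-}
module PrimesNoCSE where

-- The primes () definition with its two separate composite streams
-- (comps and comps2) would go here, as shown above; with CSE disabled for
-- this module, GHC will not merge the two streams back into one shared list.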
Re: [Haskell-cafe] Re: FASTER primes
Am Montag 04 Januar 2010 13:25:47 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness: > > > Daniel Fischer web.de> writes: > > > > But there's a lot of list constructuion and deconstruction necessary > > > > for the Euler sieve. > > > > > > yes. Unless, of course, s smart compiler recognizes there's only one > > > consumer for the values each multiples-list is producing, and keeps it > > > stripped down to a generator function, and its current value. I keep > > > thinkig a smart compiler could eliminate all these "span" calls and > > > replace them with just some pointers manipulating... > > > > Of course I'm no smart compiler, but I don't see how it could be even > > possible to replace the span calls with pointer manipulation when dealing > > with lazily generated (infinite, if we're really mean) lists. Even when > > you're dealing only with strict finite lists, it's not trivial to do > > efficiently. > > I keep thinking that storage duplication with span, filter etc. is not > really necessary, and can be replaced with some pointer chasing - > especially when there's only one consumer present for the generated values. > > What I mean is thinking of lists in terms of produce/consumer paradigm, as > objects supporting the { pull, peek } interface, keeping the generator > inside that would produce the next value on 'pull' request and keep it > ready for any 'peek's. > > Euler's sieve is > > sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..]) > where (h,t) = span (< p*p) xs Not quite. That's a variant of the postponed filters, it crosses off e.g. 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed by the first, so let's say 45 appears in two lists of numbers to be removed if present). Euler's sieve is never attempting to remove a number more than once, that's the outstanding feature of it. Unfortunately, that also makes it hard to implement efficiently. The problem is that you can't forget the primes between p and p^2, you must remember them for the removal of multiples of p later on. An (inefficient but faithful) implementation would be euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks) Its performance is really horrible though. A much better implementation is primes = 2:3:euler hprs [] (scanl (+) 5 (cycle [2,4])) where hprs = 5:drop 3 primes euler (p:ps) acc cs = h ++ euler ps (tail (acc ++ h)) (t `minus` comps) where (h,t) = span (< p*p) cs comps = map (*p) (acc ++ cs) still really bad. I don't yet see how to eat the cake and have it here. > > > > I think that remark was meant to apply to composite removal, not > > > > Turner's sieve. > > > > > > It is right there on page 2, right when the Turner's sieve is presented > > > and discussed. The only explanation that I see is that she thought of > > > it in regards to the imperative code, just as her analysis concentrates > > > only on calculation aspects of the imperative code itself. > > > > To be fair, she writes: > > > > "Let us first describe the original “by hand” sieve algorithm as practiced > > by Eratosthenes. > > ... > > The starting point of p^2 is a pleasing but minor optimization, which can > > be made > > because lower multiples will have already been crossed off when we found > > the primes prior to p. > > ... (This optimization does not affect the time complexity of the sieve, > > however, so its absence from the code in Section 1 > > *is not our cause for worry*.)" > > A-HA! 
> But its absence from _*that*_ _*code*_ WAS the *major* cause for worry, as
> dealing with it worked wonders on its complexity and constant factors.

I think you're overinterpreting it. Sure, it's not perfectly worded, but her concern was primarily that it's not an Eratosthenes sieve but trial division (and woefully inefficient at that), I think. Now, because it's trial division and not Eratosthenes, the missing optimisation has a major impact, so its absence is a reinforcing consequence of the original cause for worry. I may be overdefending her, though; we can't really know what she meant.
___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
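A self-contained (and deliberately inefficient, as stated in this message) rendering of the faithful Euler sieve quoted earlier in this message, for readers who want to run it:

-- Each composite is eliminated exactly once: p is multiplied only by
-- numbers that survived all previous elimination rounds.
euler :: [Integer] -> [Integer]
euler ks@(p:rs) = p : euler (rs `minus` map (* p) ks)
euler []        = []

eulerPrimes :: [Integer]
eulerPrimes = 2 : euler [3,5..]   -- take 10 eulerPrimes == [2,3,5,7,11,13,17,19,23,29]

-- Ordered-list difference, as elsewhere in the thread.
minus :: Ord a => [a] -> [a] -> [a]
minus xs [] = xs
minus [] _  = []
minus xs@(x:xt) ys@(y:yt) = case compare x y of
  LT -> x : minus xt ys
  EQ ->     minus xt yt
  GT ->     minus xs yt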
Re: [Haskell-cafe] Re: FASTER primes
> For me, a real smart compiler is one that would take in e.g. (sum $ take n
> $ cycle $ [1..m]) and spill out a straight up math formula, inside a few
> ifs maybe (just an aside).

(Also an aside, I couldn't resist...)

Then I'm sure you'd say that Feldspar [1] has a smart compiler :)

The above expression written in Feldspar and the resulting C code can be found here:

http://hpaste.org/fastcgi/hpaste.fcgi/view?id=15592#a15593

The C code is somewhat complicated by the fact that Feldspar doesn't have infinite vectors. Feldspar usually works well on small examples like this one, but we're very much lacking bigger examples, so I can't advise you to use it for prime numbers just yet.

/ Emil

[1] http://feldspar.sourceforge.net/
___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: FASTER primes
Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness: > > Exactly the point I tried to make. :) > > > again, yes. :) > > > yes. > > > yes, that's what I meant - the cost of calling all the fuctions that - we > know in advance will - have nothing to do eventually. > 8-) > > > > But there's a lot of list constructuion and deconstruction necessary for > > the Euler sieve. > > yes. Unless, of course, s smart compiler recognizes there's only one > consumer for the values each multiples-list is producing, and keeps it > stripped down to a generator function, and its current value. I keep > thinkig a smart compiler could eliminate all these "span" calls and replace > them with just some pointers manipulating... > Of course I'm no smart compiler, but I don't see how it could be even possible to replace the span calls with pointer manipulation when dealing with lazily generated (infinite, if we're really mean) lists. Even when you're dealing only with strict finite lists, it's not trivial to do efficiently. > > That may be more work than the multiple hits cause. > > so that too would make filters win; only _if_ the cheap primality test > existed!.. > > > > > the multiples of p with 2*p, you have > > > > ... > > > > > > There are two questions here - where to start crossing off numbers, and > > > when. If you'd start at 2*p maybe the overall complexity would remain > > > the same but it'll add enormous overhead with all those duplicate > > > multiples. > > > > The additional duplicate multiples aren't the problem. Sure, the numbers > > having a prime divisor larger than the square root would be crossed off > > one additional time, but that isn't so much per se. The additional > > crossings off are O(bound), so they don't harm the time complexity. But > > they can have effects which multiply the running time by a large > > constant. > > yes, exactly what I wanted to say. :) > *g* I wondered what you meant by the distinction between where and when to cross off. > > > No, the question is not where to start, but when. PQ might hide the > > > problem until the memory blows up. Anything that we add that won't have > > > any chance of contributing to the final result, is added for nothing > > > and only drives the total cost up needlessly. > > > > Well, if you start crossing off at p^2 and feed your PQ from a separate > > prime generator, the memory blow-up is far away. E.g., when the main > > sieve is at 10^12, the PQ contains less than 8 multiples. No space > > problem in sight. At 10^14, the PQ's size is still less than 70. > > That's noticeable, but there's still plenty of space. > > again, what I mean is, not _where_ I start crossing them off in a PQ, but > _when_. The article's code starts crossing them off _at_ p^2 - by adding > p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface > until p^2 will be considered for a prime; it'll lay dormant deep inside the > queue's guts. When reaching 7919, the thousand (i.e. pi(7919) ) entries > will hang out inside the PQ - instead of just 24. A memory blowup. (this is > of course fixed in Melissa's ZIP package). Of course due to the nature of > PQ it might actually not hurt the performance for a while, depending on > partcular PQ implementation. Complexity _and_ constant factors. > It will start to have an impact pretty soon. Assuming at least one of the relevant PQ operations to be Theta(log size), each composite between ~400 and ~40 (rough estimates) will take something like twice as long to handle. 
It will start to hurt really badly only a while later, though, as a guesstimate, with more than a million primes in the PQ, memory will have a hard time. > > If, on the other hand, you start crossung off at 2*p, when the main sieve > > is at 10^7, the size of the PQ is > 65, at 10^8, the size is more > > than 5.5 million. That starts to become a memory problem rather soon. > > here you don't have a choice or when to add it - you have to add it at p > itself - so the problem is clear. But even when you cross at p^2, the > question remains, of when you add the p^2 entry into the PQ. That was my > point. > > Postponed Filters code makes this clear, and thus hard to err on. > Unfortunately, it wasn't present the article. > > > > > > I was under impression that the first shows O(n^2) approx., and the > > > > > second one O(n^1.5) (for n primes produced). > > > > > > > > In Turner/postponed filters, things are really different. Actually, > > > > Ms. O'Neill is right, it is a different algorithm. In the above, we > > > > match each > > > > > > what _is_ different is divisibility testing vs composites removal, > > > which follows from her in-depth analysis although is never quite > > > formulated in such words in her article. But nothing matters until the > > > premature starting up is eliminated, and that key observation is > > > missing for the article either - worse, it is brushed off with the > > > casual remark that it
Re: [Haskell-cafe] Re: FASTER primes
Am Samstag 02 Januar 2010 14:13:29 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Mittwoch 30 Dezember 2009 20:46:57 schrieb Will Ness: > > > Daniel Fischer web.de> writes: > > > > Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer: > > > > > > especially the claim that going by primes squares > > > > > > is "a pleasing but minor optimization", > > > > > > > > > > Which it is not. It is a major optimisation. It reduces the > > > > > algorithmic complexity *and* reduces the constant factors > > > > > significantly. > > > > > > > > D'oh! Thinko while computing sum (takeWhile (<= n) primes) without > > > > paper. It doesn't change the complexity, and the constant factors are > > > > reduced far less than I thought. > > > > > > I do not understand. Turner's sieve is > > > > > > primes = sieve [2..] > > >where > > > sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0] > > > > > > and the Postponed Filters is > > > > > > primes = 2: 3: sieve (tail primes) [5,7..] > > >where > > > sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0] > > > where (h,~(_:t)) = span (< p*p) xs > > > > > > Are you saying they both exhibit same complexity? > > > > No. They don't. > > But if you're looking at an imperative (mutable array) sieve (that's > > simpler to analyse because you don't have to take the book-keeping costs > > of your priority queue, heap or whatever into account), if you start > > crossing out > > The key question then, is _*WHEN*_ and not _*WHAT*_. As is clearly > demonstrated by the case of Turner/Postponed filters, the work that is done > (of crossing numbers off) is the same Crossing off is only part of the work. Most of the work is checking whether to cross off the number in this round. And Turner does a lot more of that than the postponed filters. > - _when_ it is actually done - but > Turner's starts out so _prematurely_ that it is busy doing nothing most of > the time. It's not doing nothing. It's just doing a lot of superfluous work. It is _achieveing_ nothing most of the time, though. Take 7919, the thousandth prime. The postponed filters decide to keep it when fitering out the multiples of 89, the twenty-fourth prime. Turner also divides it by all 975 primes in between. That is a lot of real but futile work. > Thus its function call overhead costs pile up enormously, > overstaging the actual calculation. It's not only the function calls. Division is expensive, too. > > So analyzing that calculation in the premature execution setting is missing > the point, although helpful after we fix this, with the Postponed Filters. > _Only then_ the finer points of this algorithm's analysis can be applied - > namely, of avoiding testing primes divisibility altogether. And _if_ a fast > cheap primality test were to have existed, the filtering versions would win Sorry, I can't follow. What's the point of a primality test here? Every number whose multiples we want to remove is prime, what's left to test? > over, because they progressively cull the input sequence so there would be > no double hits as we have when merging the multiples (whether from lists or > inside the PQ). But there's a lot of list constructuion and deconstruction necessary for the Euler sieve. That may be more work than the multiple hits cause. > > > the multiples of p with 2*p, you have > > > > sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes] > > > > crossings-out, that is Theta(bound*log (log bound)). 
If you eliminate > > multiples of some small primes a priori (wheel), you can reduce the > > constant factor significantly, but the complexity remains the same (you > > drop a few terms from the front of the sum and multiply the remaining > > terms with phi(n)/n, where n is the product of the excluded primes). > > > > If you start crossing out at p^2, the number is > > > > sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes]. > > > > The difference is basically sum (takeWhile (<= sqrt bound) primes), which > > I stupidly - I don't remember how - believed to cancel out the main term. > > It doesn't, it's O(bound/log bound), so the complexity is the same. > > > > Now if you take a stream of numbers from which you remove composites, > > having a priority queue of multiples of primes, things are a little > > different. If you start crossing out at 2*p, when you are looking at n, > > you have more multiples in your PQ than if you start crossing out at p^2 > > (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more > > expensive. But updating the PQ is O(log size), I believe, and log pi(n) > > is O(log pi(sqrt n)), so I think it shouldn't change the complexity here > > either. I think this would have complexity O(bound*log bound*log (log > > bound)). > > There are two questions here - where to start crossing off numbers, and > when. If you'd start at 2*p maybe the overall complexity would remain the > same but it'll add enormo
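Daniel's 7919 example above can be checked directly. Below is a rough count of the trial divisions a prime q is subjected to under each scheme, using the Postponed Filters primes list quoted above to drive the counts; divsTurner and divsPostponed are my names, and the counts ignore how 2 and the exact boundary are handled:

primes :: [Integer]
primes = 2 : 3 : sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
      where (h, ~(_:t)) = span (< p*p) xs

-- Turner divides q by every smaller prime; the postponed filters only by
-- the primes up to sqrt q (rough counts, for illustration only):
divsTurner, divsPostponed :: Integer -> Int
divsTurner    q = length (takeWhile (< q) primes)
divsPostponed q = length (takeWhile (\p -> p*p <= q) primes)

For q = 7919 these come out to 999 and 23 (give or take the handling of 2): about a thousand divisions against a couple of dozen, which is the futile work being described.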
Re: [Haskell-cafe] Re: FASTER primes
Am Mittwoch 30 Dezember 2009 20:46:57 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer: > > > > especially the claim that going by primes squares > > > > is "a pleasing but minor optimization", > > > > > > Which it is not. It is a major optimisation. It reduces the algorithmic > > > complexity *and* reduces the constant factors significantly. > > > > D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. > > It doesn't change the complexity, and the constant factors are reduced > > far less than I thought. > > I do not understand. Turner's sieve is > > primes = sieve [2..] >where > sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0] > > and the Postponed Filters is > > primes = 2: 3: sieve (tail primes) [5,7..] >where > sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0] > where (h,~(_:t)) = span (< p*p) xs > > Are you saying they both exhibit same complexity? No. They don't. But if you're looking at an imperative (mutable array) sieve (that's simpler to analyse because you don't have to take the book-keeping costs of your priority queue, heap or whatever into account), if you start crossing out the multiples of p with 2*p, you have sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes] crossings-out, that is Theta(bound*log (log bound)). If you eliminate multiples of some small primes a priori (wheel), you can reduce the constant factor significantly, but the complexity remains the same (you drop a few terms from the front of the sum and multiply the remaining terms with phi(n)/n, where n is the product of the excluded primes). If you start crossing out at p^2, the number is sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes]. The difference is basically sum (takeWhile (<= sqrt bound) primes), which I stupidly - I don't remember how - believed to cancel out the main term. It doesn't, it's O(bound/log bound), so the complexity is the same. Now if you take a stream of numbers from which you remove composites, having a priority queue of multiples of primes, things are a little different. If you start crossing out at 2*p, when you are looking at n, you have more multiples in your PQ than if you start crossing out at p^2 (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more expensive. But updating the PQ is O(log size), I believe, and log pi(n) is O(log pi(sqrt n)), so I think it shouldn't change the complexity here either. I think this would have complexity O(bound*log bound*log (log bound)). > I was under impression that the first shows O(n^2) approx., and the second > one O(n^1.5) (for n > primes produced). In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is right, it is a different algorithm. In the above, we match each prime only with its multiples (plus the next composite in the PQ version). In Turner's algorithm, we match each prime with all smaller primes (and each composite with all primes <= its smallest prime factor, but that's less work than the primes). That gives indeed a complexity of O(pi(bound)^2). In the postponed filters, we match each prime p with all primes <= sqrt p (again, the composites contribute less). I think that gives a complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
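The two sums can also simply be evaluated. A small sketch, reusing the Postponed Filters primes quoted above to feed them; the guard p*p <= bound replaces the floating-point takeWhile (<= sqrt bound), and the function names are mine:

primes :: [Integer]
primes = 2 : 3 : sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
      where (h, ~(_:t)) = span (< p*p) xs

-- crossings-out when starting at 2*p versus at p*p, per the sums above
crossFrom2p, crossFromSq :: Integer -> Integer
crossFrom2p bound = sum [bound `div` p - 1       | p <- takeWhile (\p -> p*p <= bound) primes]
crossFromSq bound = sum [bound `div` p - (p - 1) | p <- takeWhile (\p -> p*p <= bound) primes]

The difference per prime is (bound `div` p - 1) - (bound `div` p - (p-1)) = p - 2, so the total difference is essentially the sum of the primes up to sqrt bound - a lower-order term (a few percent of the whole at bound = 10^6), which is why the complexity is unchanged.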
Re: [Haskell-cafe] Re: FASTER primes
Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer: > > especially the claim that going by primes squares > > is "a pleasing but minor optimization", > > Which it is not. It is a major optimisation. It reduces the algorithmic > complexity *and* reduces the constant factors significantly. D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. It doesn't change the complexity, and the constant factors are reduced far less than I thought. The huge performance differences I had must have been due to cache misses (unless you use a segmented sieve, starting at the square reduces the number of cache misses significantly). ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: FASTER primes (was: Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve))
Am Mittwoch 30 Dezember 2009 01:23:32 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Gee, seems my mail server reads your posts very thoroughly today :) > > I hope it's not a bad thing. :) > It means, twenty minutes after I replied to the previous, I got your hours old post which mentions points I could have included in the aforementioned reply. Not really bad, but somewhat inconvenient. > > Am Dienstag 29 Dezember 2009 14:58:10 schrieb Will Ness: > > > If I realistically needed primes generated in a real life setting, I'd > > > probably had to use some C for that. > > > > If you need the utmost speed, then probably yes. If you can do with a > > little less, my STUArray bitsieves take about 35% longer than the > > equivalent C code and are roughly eight times faster than ONeillPrimes. > > I can usually live well with that. > > > Wow! That's fast! (that's the code from haskellwiki's primes page, right?) > No, it's my own code. Nothing elaborate, just sieving numbers 6k±1, twice as fast as the haskellwiki code (here) and uses only 1/3 the memory. For the record: -- {-# LANGUAGE BangPatterns #-} module SievePrimes where import Data.Array.Base (unsafeRead, unsafeWrite, unsafeAt) import Data.Array.ST import Data.Array.Unboxed import Control.Monad (when) import Data.Bits primesUpTo :: Integer -> [Integer] primesUpTo bound = 2:3:[fromIntegral (3*i + (if testBit i 0 then 4 else 5)) | i <- [0 .. bd], par `unsafeAt` i] where bd0 = 2*(bound `div` 6) - case bound `mod` 6 of { 0 -> 2; 5 -> 0; _ -> 1 } bd = fromInteger bd0 sr = floor (sqrt (fromIntegral bound)) sbd = 2*(sr `div` 6) - case sr `mod` 6 of { 0 -> 2; 5 -> 0; _ -> 1 } par :: UArray Int Bool par = runSTUArray $ do ar <- newArray (0,bd) True let tick i s1 s2 | bd < i= return () | otherwise = do p <- unsafeRead ar i when p (unsafeWrite ar i False) tick (i+s1) s2 s1 sift i | i > sbd = return ar | otherwise = do p <- unsafeRead ar i when p (let (!start,!s1,!s2) = if even i then (i*(3*i+10)+7,2*i+3,4*i+7) else (i*(3*i+8)+4,4*i+5,2*i+3) in tick start s1 s2) sift (i+1) sift 0 -- The index magic is admittedly not obvious, but if you take that on faith, the rest is pretty clear. To find the n-th prime, -- module Main (main) where import SievePrimes (primesUpTo) import System.Environment (getArgs) printNthPrime :: Int -> IO () printNthPrime n = print (n, primesUpTo (estim n) !! (n-1)) main = do args <- getArgs printNthPrime $ read $ head args estim :: Int -> Integer estim k | n < 13= 37 | n < 70= 349 | otherwise = round x where n = fromIntegral k :: Double x = n * (log n + log (log n) - 1 + 1.05 * log (log n) / log n) -- If I don't construct the prime list but count directly on the array, it's ~6% faster. > > > If OTOH we're talking about a tutorial > > > code that is to be as efficient as possible without loosing it clarity, > > > being a reflection of essentials of the problem, then any overly > > > complicated advanced Haskell wouldn't be my choice either. > > > > +1 > > Though perhaps we view mutable array code differently. In my view, it's > > neither advanced nor complicated. > > convoluted, then. Not using "higher level" concepts, like map and foldr, :) > or head.until isSingleton (pairwise merge).map wrap , that kind of thing. > :) > > BTW I think a really smart compiler should just get a specification, like > Turner's sieve, and just derive a good code from that by itself. Go ahead and write one. I would love such a beast. 
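One plausible reading of the head.until isSingleton (pairwise merge).map wrap pipeline mentioned just above is a bottom-up mergesort. A sketch spelling the helper names out the way I read them - my own interpretation, not code from the thread:

mergesortBU :: Ord a => [a] -> [a]
mergesortBU [] = []
mergesortBU xs = head . until isSingleton (pairwise merge) . map wrap $ xs
  where
    wrap x             = [x]                       -- each element becomes a one-element run
    isSingleton        = null . drop 1             -- only one run left?
    pairwise f (a:b:t) = f a b : pairwise f t      -- combine neighbouring runs
    pairwise _ t       = t
    merge a@(x:xt) b@(y:yt)
      | x <= y    = x : merge xt b
      | otherwise = y : merge a  yt
    merge a b     = a ++ b

For example, mergesortBU [3,1,2] gives [1,2,3].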
> > Another example would be > > qq n m = sum $ take n $ cycle [1..m] > > which should really get compiled into just a math formula, IMO. Now _that_ > I would call a good compiler. Dream on, dream on, with hope in your heart. > That way I really won't have to learn how to use STUArray`s you see. > > I've seen this question asked a lot, what should be a good programming > language? > > IMO, English (plus math where needed, and maybe some sketches by hand). :) > Maybe you'd like http://en.wikipedia.org/wiki/Shakespeare_(programming_language) ? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
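For what it is worth, the closed form behind qq is easy enough to write down: n `div` m complete cycles of [1..m], each summing to m*(m+1)/2, plus an initial segment of the next cycle. A sketch (qqClosed is my name for it):

qq, qqClosed :: Integer -> Integer -> Integer
qq n m = sum $ take (fromIntegral n) $ cycle [1..m]

-- q full cycles plus the first r numbers of the next one
qqClosed n m = q * (m*(m+1) `div` 2) + r*(r+1) `div` 2
  where (q, r) = n `divMod` m

For example, qq 10 3 and qqClosed 10 3 both give 19.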
Re: [Haskell-cafe] Re: FASTER primes
Am Mittwoch 30 Dezember 2009 01:04:34 schrieb Will Ness: > > > While I haven't detected that with the primes code, I find that in my > > ghci your code is approximately 2.5 times faster than ONeill or Bayer > > when interpreted (no difference in scaling observed), while when compiled > > with -O2, ONeill is approximately three times as fast as your code > > that was what I was getting at first too, before I've put into my code the > _type_signatures_ and the "specialize" _pragmas_ as per her file. Then it > was only 1.3x slower, when compiled (with about same asymptotics and memory > usage). > Specialising your code to Int makes it half as fast as ONeill here (as an executable). That is largely due to the fact that your code uses much more memory here (54MB vs. 2MB for the millionth prime), though, the MUT times have a ratio of about 1.5. Now an interesting question is, why does it use so much memory here? Can you send me your exact code so I can see how that behaves here? > >and twice as fast as Bayer as an executable, about twice as fast as your > > code and slightly slower than Bayer in ghci. > > see, this kind of inconsistencies is exactly why I was concentrating only > on one platform in measuring the speed - the interp'/GHCi combination. The problem with that is that one is primarily interested in speed for library functions, which are mostly used as compiled code. > Especially when developing and trying out several approaches, to test with > compiler just takes too long. :) And why should it give (sometimes) wildly > different readings when running inside GHCi or standalone ?? Good question. > > > And I have huge memory problems in ghci with your code. > > That may be due to my implementation of merge and minus, though. You > > wrote 'standard' and I coded the straightforward methods. > > Here's what I'm using (BTW I've put it on the primes haskellwiki page too). > The memory reported for interpreted is about half of PQ's (IIRC), and > compiled - the same: > > minus a@(x:xs) b@(y:ys) = case compare x y of > LT -> x: xs `minus` b > GT ->a `minus` ys > EQ ->xs `minus` ys > minus a b = a > > merge a@(x:xs) b@(y:ys) = case compare x y of > LT -> x: merge xs b > EQ -> x: merge xs ys > GT -> y: merge a ys > merge a b = if null b then a else b More or less the same that I wrote. > > What's wrong with mutable arrays? There are a lot of algorithms which can > > be easily and efficiently implemented using mutable unboxed arrays while > > a comparably efficient implementation without mutable arrays is hard. For > > those, I consider STUArrays the natural choice. Sieving primes falls into > > that category. > > It's just that the mutating code tends to be convoluted, like in the > example I mentioned of quicksort. One has to read the C code with good > attention to understand it. Convoluted is (often) an exaggeration. But I agree that the specification of 'what' is usually easier to understand than that of 'how'. > "Normal" Haskell is much more visually apparent, like > > primes = 2: 3: sieve (tail primes) [5,7..] >where > sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..]) > where (h,~(_:t)) = span (< q) xs > q = p*p > Yes. > or > > primes = 2: 3: sieve [] (tail primes) 5 >where > sieve fs (p:ps) x = [i | i<- [x,x+2..q-2], a!i] > ++ sieve ((2*p,q):fs') ps (q+2) > where > q = p*p > mults = [ [y+s,y+2*s..q] | (s,y)<- fs] > fs' = [ (s,last ms)| ((s,_),ms)<- zip fs mults] > a = accumArray (\a b->False) True (x,q-2) > [(i,()) | ms<- mults, i<- ms] > Umm, really? 
I'd think if you see what that does, you won't have difficulties with a mutable array sieve. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: FASTER primes (was: Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve))
Gee, seems my mail server reads your posts very thoroughly today :) Am Dienstag 29 Dezember 2009 14:58:10 schrieb Will Ness: > Eugene Kirpichov gmail.com> writes: > > 2009/12/29 Will Ness yahoo.com>: > > > Daniel Fischer web.de> writes: > > >> Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness: > > >> > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x > > >> > times faster than Priority Queue based code from Melissa O'Neill's > > >> > ZIP package mentioned at the haskellwiki/Prime_Numbers page, with > > >> > about half used memory reported, in producing 10,000 to 300,000 > > >> > primes. > > >> > > > >> > It is faster than BayerPrimes.hs from the ZIP package too, in the > > >> > tested range, at about 35 lines of code in total. > > >> > > >> That's nice. However, the important criterion is how compiled code > > >> (-O2) > > > > > > fares. Do the relations continue to hold? How does it compare to a > > > bitsieve? > > > > > > > > > Haven't gotten to that part yet. :) > > > > > > But why is it more important? Would that not tell us more about the > > > compiler performance than the code itself? > > > > If you mean "algorithmic complexity", you shouldn't care about a > > difference of 2.5x. > > It's not just at one point; the asymptotics are _the_same_ across the range > that I've tested (admittedly, somewhat narrow). I measure local behavior > simply as logBase in base of ratio of problem sizes, of the ratio of run > times. > > > If you mean "actual performance for a particular task", you should > > measure the performance in realistic conditions. Namely, if you're > > implementing a program that needs efficient generation of primes, > > won't you compile it with -O2? > > If I realistically needed primes generated in a real life setting, I'd > probably had to use some C for that. If you need the utmost speed, then probably yes. If you can do with a little less, my STUArray bitsieves take about 35% longer than the equivalent C code and are roughly eight times faster than ONeillPrimes. I can usually live well with that. > If OTOH we're talking about a tutorial > code that is to be as efficient as possible without loosing it clarity, > being a reflection of essentials of the problem, then any overly > complicated advanced Haskell wouldn't be my choice either. +1 Though perhaps we view mutable array code differently. In my view, it's neither advanced nor complicated. > And seeing that > this overly-complicated (IMO), steps-jumping PQ-based code was sold to us > as the only "faithful" rendering of the sieve, I wanted to see for myself > whether this really holds water. I can understand that very well. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
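The measure Will describes - logBase, in the base of the ratio of problem sizes, of the ratio of run times - is the usual empirical order of growth. A one-liner for it (the name and the example timings are made up):

-- if t ~ c * n^k, then k = logBase (n2/n1) (t2/t1)
empiricalOrder :: (Double, Double) -> (Double, Double) -> Double
empiricalOrder (n1, t1) (n2, t2) = logBase (n2 / n1) (t2 / t1)

For instance, empiricalOrder (100000, 1.2) (300000, 4.8) is about 1.26.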
Re: [Haskell-cafe] Re: FASTER primes
Am Dienstag 29 Dezember 2009 14:34:03 schrieb Will Ness: > Daniel Fischer web.de> writes: > > Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness: > > > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x > > > times faster than Priority Queue based code from Melissa O'Neill's ZIP > > > package mentioned at the haskellwiki/Prime_Numbers page, with > > > about half used memory reported, in producing 10,000 to 300,000 primes. > > > > > > It is faster than BayerPrimes.hs from the ZIP package too, in the > > > tested range, at about 35 lines of code in total. > > > > That's nice. However, the important criterion is how compiled code (-O2) > > fares. Do the relations continue to hold? How does it compare to a > bitsieve? > > > Haven't gotten to that part yet. :) > > But why is it more important? I thought the uppercase FASTER in the subject meant you were really interested in speed. If you're only interested in asymptotics, interpreted may be appropriate. However, it is possible that optimisation can change the perceived asymptotics of an algorithm (determined strictness eliminating thunks for example). While I haven't detected that with the primes code, I find that in my ghci your code is approximately 2.5 times faster than ONeill or Bayer when interpreted (no difference in scaling observed), while when compiled with -O2, ONeill is approximately three times as fast as your code and twice as fast as Bayer as an executable, about twice as fast as your code and slightly slower than Bayer in ghci. And I have huge memory problems in ghci with your code. That may be due to my implementation of merge and minus, though. You wrote 'standard' and I coded the straightforward methods. > Would that not tell us more about the compiler performance than the code > itself? Unless you write machine code or assembly, don't all performance tests tell us more about the compiler/interpreter performance than the code itself? That is, of course, with respect to algorithms with the same scaling behaviour. > > This code is just an endpoint (so far) in a short procession of natural > stepwise development of the famous classic Turner's sieve, That was sieve (x:xs) = x:sieve (filter ((/= 0) . (`mod` x)) xs) , was it? > through the > "postponed filters", through to Euler's sieve, the merging sieve (i.e. > Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted > to see where the simple "normal" (i.e. _beginner_-friendly) functional code > can get, in a natural way. Good. > > It's not about writing the fastest code in _advanced_ Haskell. It's about > having clear and simple code that can be understood at a glance - i.e. > contributes to our understanding of a problem - faithfully reflecting its > essential elements, and because of _that_, fast. It's kind of like _not_ > using mutable arrays in a quicksort. What's wrong with mutable arrays? There are a lot of algorithms which can be easily and efficiently implemented using mutable unboxed arrays while a comparably efficient implementation without mutable arrays is hard. For those, I consider STUArrays the natural choice. Sieving primes falls into that category. > > Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't > feel right to me somehow, I fully agree. > especially the claim that going by primes squares > is "a pleasing but minor optimization", Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant facors significantly. 
> what with the postponed filters > (which serves as the framework for all the other variants) achieving the > orders of magnitude speedup and cutting the Turner's O(n^2) right down to > O(n^1.5) just by doing that squares optimization (with the final version > hovering around 1.24..1.17 in the tested range). The Euler's sieve being a > special case of Eratosthenes's, too, doesn't let credence to claims that > only the PQ version is somehow uniquely authentic and "faithful" to it. I never found that claim convincing either. > > Turner's sieve should have been always looked at as just a specification, > not a code, anyway, and actually running it is ridiculous. Postponed > filters version, is the one to be used as a reference point of the basic > _code_, precisely because it _does_ use the primes squares optimization, > which _is_ essential to any basic sieve. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
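For reference, a minimal sketch of the tree-fold merging step of that progression - the general shape only, not Will's final code (which also adds the wheel). merge and minus are the standard list operations quoted elsewhere in the thread, trimmed here to the infinite-list case, and unionAll/pairs are illustrative names:

primesTF :: [Integer]
primesTF = 2 : 3 : ([5,7..] `minus` unionAll [[p*p, p*p+2*p ..] | p <- tail primesTF])
  where
    -- fold the streams of multiples in a tree shape: emit the head of the
    -- first stream before touching the rest, so the fold stays productive
    unionAll ((x:xs):t) = x : merge xs (unionAll (pairs t))
    pairs ((x:xs):ys:t) = (x : merge xs ys) : pairs t

    merge a@(x:xs) b@(y:ys) = case compare x y of
      LT -> x : merge xs b
      EQ -> x : merge xs ys
      GT -> y : merge a  ys

    minus a@(x:xs) b@(y:ys) = case compare x y of
      LT -> x : minus xs b
      EQ ->     minus xs ys
      GT ->     minus a  ys

The tree shape is what separates this from the plain linear, Richard Bird style merge: roughly speaking, a multiple of the k-th prime has to bubble up through about log k merge nodes rather than k of them.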
Re: [Haskell-cafe] Re: FASTER primes (was: Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve))
2009/12/29 Will Ness : > Daniel Fischer web.de> writes: > >> >> >> Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness: >> > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times >> > faster than Priority Queue based code from Melissa O'Neill's ZIP package >> > mentioned at the haskellwiki/Prime_Numbers page, with >> > about half used memory reported, in producing 10,000 to 300,000 primes. >> > >> > It is faster than BayerPrimes.hs from the ZIP package too, in the tested >> > range, at about 35 lines of code in total. >> >> That's nice. However, the important criterion is how compiled code (-O2) > fares. Do the relations continue to hold? How does it compare to a bitsieve? > > > Haven't gotten to that part yet. :) > > But why is it more important? Would that not tell us more about the compiler > performance than the code itself? > If you mean "algorithmic complexity", you shouldn't care about a difference of 2.5x. If you mean "actual performance for a particular task", you should measure the performance in realistic conditions. Namely, if you're implementing a program that needs efficient generation of primes, won't you compile it with -O2? > This code is just an endpoint (so far) in a short procession of natural > stepwise > development of the famous classic Turner's sieve, through the "postponed > filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) > and > on to the tree-fold merging, with wheel. I just wanted to see where the simple > "normal" (i.e. _beginner_-friendly) functional code can get, in a natural way. > > It's not about writing the fastest code in _advanced_ Haskell. It's about > having > clear and simple code that can be understood at a glance - i.e. contributes to > our understanding of a problem - faithfully reflecting its essential elements, > and because of _that_, fast. It's kind of like _not_ using mutable arrays in a > quicksort. > > Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel > right to me somehow, especially the claim that going by primes squares is "a > pleasing but minor optimization", what with the postponed filters (which > serves > as the framework for all the other variants) achieving the orders of magnitude > speedup and cutting the Turner's O(n^2) right down to O(n^1.5) just by doing > that squares optimization (with the final version hovering around 1.24..1.17 > in > the tested range). The Euler's sieve being a special case of Eratosthenes's, > too, doesn't let credence to claims that only the PQ version is somehow > uniquely > authentic and "faithful" to it. > > Turner's sieve should have been always looked at as just a specification, not > a > code, anyway, and actually running it is ridiculous. Postponed filters > version, > is the one to be used as a reference point of the basic _code_, precisely > because it _does_ use the primes squares optimization, which _is_ essential to > any basic sieve. > > >> ___ >> Haskell-Cafe mailing list >> Haskell-Cafe haskell.org >> http://www.haskell.org/mailman/listinfo/haskell-cafe >> > > > ___ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe > -- Eugene Kirpichov Web IR developer, market.yandex.ru ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe