Re: [Haskell-cafe] Re: FASTER primes

2010-01-16 Thread Daniel Fischer
On Saturday 16 January 2010 18:53:33, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Thursday 14 January 2010 08:25:48, Will Ness wrote:
> > > Daniel Fischer  web.de> writes:
> > > > On Wednesday 13 January 2010 10:43:42, Heinrich Apfelmus wrote:
> > > > > I wonder whether it's really the liveness of  pair  in
> > > > >
> > > > >   mergeSP (a,b) pair
> > > > >      = let sm = spMerge b (fst pair)
> > > > >        in (a ++ fst sm, merge (snd sm) (snd pair))
> > > > >
> > > > > that is responsible for the space leak, ...
> > > >
> > > > I think that is responsible. At least that's how I understand the
> > > > core:
> > > >
> > > > mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d)
> > > >where
> > > >   (bc, b') = spMerge b c
> > > >   spMerge ...
> > >
> > > That is equivalent to
> > >
> > >   first (a++) . second (`merge`d) $ spMerge b c
> > >
> > > and Daniel's fix is equivalent to
> > >
> > >   first (a++) $ spMerge b c d
>
> That should've been
>
>   mergeSP (a,b) p = first(a++) . second(`merge`snd p) $ spMerge b (fst p)
>
> and
>
>   mergeSP (a,b) p = first(a++) $ spMerge b (fst p) (snd p)
>
>
> The code fragments you've posted are essentially
>
>
>   mergeSP (a,b) p = let res = case p of (c,_) ->
> case spMerge b c of (# x,y #) ->
>   (x,y)
> in
>(# (++) a (case res of (bc,_)-> bc) ,
>   case res of (_,b') ->
> case p of (_,d) -> merge b' d  #)
>
> and
>
>   mergeSP (a,b) p = let res = case p of (c,d) ->
> case spMerge b c d of (# x,y #) ->
>   (x,y)
> in
>(# (++) a (case res of (bc,_)-> bc) ,
>   case res of (_,b') -> b' #)
>
>
> This looks like Haskell to me, with many details explicitly written
> out, probably serving as immediate input to the compiler - not its
> output. So it can't tell us much about how this is actually
> implemented on the lower level. (?)

It's 'core', the intermediate language GHC uses and does its 
transformations upon. It's more or less a slimmed down version of Haskell.

That came from -ddump-simpl, so it's the output of the simplifier, after 
numerous transformation/optimisation passes. I think that is then fairly 
directly translated to assembly (via Cmm); the STG-to-STG and Cmm passes don't 
do much optimisation any more (I may be wrong - what do I know about the 
compiler? - but I've found that the -ddump-simpl output corresponds well 
to the actual behaviour whenever I look).

I find that much more readable than the -fext-core output 
(http://www.haskell.org/ghc/docs/6.12.1/html/users_guide/ext-core.html 
"GHC can dump its optimized intermediate code (said to be in “Core” format) 
to a file as a side-effect of compilation."), which says the same, only 
more verbose and less readable.

Of course, if I could read assembly, that would exactly reveal what 
happens.

>
> Your theory would certainly hold if the translation was done one-to-one
> without any further code rearrangements. But it could've been further
> re-arranged by the compiler at some later stage (is there one?) into an
> equivalent of, e.g.
>
>
>   mergeSP (a,b) p = let (bc,b') = case p of (c,_) ->
> case spMerge b c of (x,y) ->
>   (x,y)
> in
>(# (++) a bc ,
>   case p of (_,d) -> merge b' d  #)
>
>
> and further,
>
>
>   mergeSP (a,b) p = let (c,d)   = case p of (x,y) -> (x,y)
> (bc,b') = case spMerge b c of (x,y) ->
>   (x,y)
> in
>(# (++) a bc , merge b' d  #)
>
>
> could it? This would take hold on /d/ and /c/ at the same time, right?

It would. But AFAICT, after the simplifier is done, no such rearrangements 
occur anymore.

>
> What is that code that you've shown exactly? At which stage is it
> produced and is it subject to any further manipulation?

I'm no GHC expert either, so I don't know what it does when exactly.
But after parsing and desugaring, there come a few iterations of the 
simplifier, intermingled with specialising, demand-analysis, CSE, let-
floating, worker-wrapper-split, ... .
At the end of all that, the Tidy Core is generated (part of which I 
posted). What happens from then on, well ...

> I apologise if
> these are obvious questions, I don't know anything about GHC. I also
> don't know what (# x,y #) means?

An unboxed tuple (a pair in this case). That means: pass the components around 
by themselves, don't wrap them in a (,) constructor.
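
For illustration, a minimal sketch (assuming GHC's UnboxedTuples extension; this is not code from the thread) of the difference:

{-# LANGUAGE UnboxedTuples #-}

-- Boxed: the result is wrapped in a (,) constructor allocated on the heap.
divModBoxed :: Int -> Int -> (Int, Int)
divModBoxed n d = (n `div` d, n `mod` d)

-- Unboxed: the two results are returned by themselves, no (,) cell is
-- allocated; an unboxed tuple can only be taken apart by a case.
divModUnboxed :: Int -> Int -> (# Int, Int #)
divModUnboxed n d = (# n `div` d, n `mod` d #)

useIt :: Int -> Int -> Int
useIt n d = case divModUnboxed n d of (# q, r #) -> q + r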

>
> One thing seems certain - we should not hold explicit references to same
> entities in different parts of our code, to avoid space leaks with more
> confidence.

Except when it's good to share ;)

> To make code look as much tail-recursive as possible, so to speak.

Tail recursion isn't really important (for GHC at least, and I think for lazy 
languages in general), due to the different evaluation model.

Re: [Haskell-cafe] Re: FASTER primes

2010-01-14 Thread Daniel Fischer
On Thursday 14 January 2010 08:25:48, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Wednesday 13 January 2010 10:43:42, Heinrich Apfelmus wrote:
> > > I wonder whether it's really the liveness of  pair  in
> > >
> > >   mergeSP (a,b) pair
> > >      = let sm = spMerge b (fst pair)
> > >        in (a ++ fst sm, merge (snd sm) (snd pair))
> > >
> > > that is responsible for the space leak, for chances are that
> > > Sparud's technique applies and  pair  is properly disposed of.
> > > Rather, it could be that we need the stronger property that forcing
> > > the second component will evaluate the first to NF.
> >
> > I think that is responsible. At least that's how I understand the
> > core:
> >
> > mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d)
> >where
> >   (bc, b') = spMerge b c
> >   spMerge ...
>
> That is equivalent to
>
>   first (a++) . second (`merge`d) $ spMerge b c
>
> and Daniel's fix is equivalent to
>
>   first (a++) $ spMerge b c d
>
>
> Now, when the compiler sees the first variant, it probably treats spMerge as
> opaque. I.e. although in reality spMerge only contributes to the
> first "channel" while it is progressively instantiated, and (`merge`d)
> will only be called upon when spMerge's final clause is reached, that is
> (most likely) not known to the compiler at this stage. When looking at
> just the first expression itself, it has to assume that spMerge may
> contribute to both channels (parts of a pair) while working, and so
> can't know _when_ /d/ will get called upon to contribute to the data, as
> it is consumed finally at access.
>
> So /d/ is gotten hold of prematurely, _before_ going into spMerge.

No, the problem is that d itself is gotten hold of too late: it is 
accessed only indirectly via the pair, so through that pair we keep an 
unnecessary reference to c.

>
> The second variant passes the responsibility for actually accessing its
> inputs to spMerge itself, and _it_ is clear about needing /d/ only in
> the very end.

The second variant is clear about not needing the _pair_ anymore once 
spMerge is entered. Thus d doesn't reference c anymore.
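
To make the contrast concrete, here is a self-contained sketch of the two variants under discussion (the names and the extra totality cases are mine, not from the thread; `merge` is assumed to be the usual merge of ordered lists):

-- Leaky: spMerge returns the leftover part of b, and the `merge` with d
-- happens outside, so the whole pair (c,d) stays reachable while
-- (a ++ bc) is being consumed.
mergeLeaky :: Ord a => ([a],[a]) -> ([a],[a]) -> ([a],[a])
mergeLeaky (a,b) ~(c,d) = (a ++ bc, merge b' d)
  where
    (bc, b') = spMerge b c
    spMerge u  []  = ([], u)
    spMerge [] w   = ([], w)
    spMerge u@(x:xt) w@(y:yt) = case compare x y of
      LT -> let (vs, r) = spMerge xt w  in (x:vs, r)
      EQ -> let (vs, r) = spMerge xt yt in (x:vs, r)
      GT -> let (vs, r) = spMerge u  yt in (y:vs, r)

-- Fixed: d is handed to spMerge up front, so once spMerge is entered the
-- pair itself is no longer referenced; d is only touched in the end cases.
mergeFixed :: Ord a => ([a],[a]) -> ([a],[a]) -> ([a],[a])
mergeFixed (a,b) ~(c,d) = (a ++ bc, m)
  where
    (bc, m) = spMerge b c d
    spMerge u  []  r = ([], merge u r)
    spMerge [] w   r = ([], merge w r)
    spMerge u@(x:xt) w@(y:yt) r = case compare x y of
      LT -> let (vs, s) = spMerge xt w  r in (x:vs, s)
      EQ -> let (vs, s) = spMerge xt yt r in (x:vs, s)
      GT -> let (vs, s) = spMerge u  yt r in (y:vs, s)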

>
> Just a theory. :)
>
> Does that make sense?
>




Re: [Haskell-cafe] Re: FASTER primes

2010-01-13 Thread Daniel Fischer
On Wednesday 13 January 2010 10:43:42, Heinrich Apfelmus wrote:
> I wonder whether it's really the liveness of  pair  in
>
>   mergeSP (a,b) pair
>      = let sm = spMerge b (fst pair)
>        in (a ++ fst sm, merge (snd sm) (snd pair))
>
> that is responsible for the space leak, for chances are that Sparud's
> technique applies and  pair  is properly disposed of. Rather, it could
> be that we need the stronger property that forcing the second component
> will evaluate the first to NF.

I think that is responsible. At least that's how I understand the core:

mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d)
   where
  (bc, b') = spMerge b c
  spMerge ...
--
OldMerge.$wmergeSP :: [GHC.Types.Int]
  -> [GHC.Types.Int]
  -> ([GHC.Types.Int], [GHC.Types.Int])
  -> (# [GHC.Types.Int], [GHC.Types.Int] #)
GblId
[Arity 3
 Str: DmdType LLL]
OldMerge.$wmergeSP =
  \ (ww_sny :: [GHC.Types.Int])
(ww1_snz :: [GHC.Types.Int])
(w_snB :: ([GHC.Types.Int], [GHC.Types.Int])) ->
let {
  ds_so7 [ALWAYS Just D(SS)] :: ([GHC.Types.Int], [GHC.Types.Int])
  LclId
  [Str: DmdType]
  ds_so7 =
case w_snB of _ { (c_adj, _) ->
case OldMerge.$wspMerge ww1_snz c_adj
of _ { (# ww3_snH, ww4_snI #) ->
(ww3_snH, ww4_snI)
}
} } in
(# GHC.Base.++
 @ GHC.Types.Int
 ww_sny
 (case ds_so7 of _ { (bc_ajQ, _) -> bc_ajQ }),
   case ds_so7 of _ { (_, b'_ajS) ->
   case w_snB of _ { (_, d_adk) -> OldMerge.merge b'_ajS d_adk }

-- Here, in the second component of the result,
-- we reference the entire pair to get the dorks

   } #)

OldMerge.mergeSP :: ([GHC.Types.Int], [GHC.Types.Int])
-> ([GHC.Types.Int], [GHC.Types.Int])
-> ([GHC.Types.Int], [GHC.Types.Int])
GblId
[Arity 2
 Worker OldMerge.$wmergeSP
 Str: DmdType U(LL)Lm]
OldMerge.mergeSP =
  __inline_me (\ (w_snw :: ([GHC.Types.Int], [GHC.Types.Int]))
 (w1_snB :: ([GHC.Types.Int], [GHC.Types.Int])) ->
 case w_snw of _ { (ww_sny, ww1_snz) ->
 case OldMerge.$wmergeSP ww_sny ww1_snz w1_snB
 of _ { (# ww3_snN, ww4_snO #) ->
 (ww3_snN, ww4_snO)
 }
 })
--

vs.

mergeSP (a,b) ~(c,d) = (a ++ bc, m)
   where
  (bc,m) = spMerge b c d
  spMerge ...
--
NewMerge.$wmergeSP :: [GHC.Types.Int]
  -> [GHC.Types.Int]
  -> ([GHC.Types.Int], [GHC.Types.Int])
  -> (# [GHC.Types.Int], [GHC.Types.Int] #)
GblId
[Arity 3
 Str: DmdType LLL]
NewMerge.$wmergeSP =
  \ (ww_snB :: [GHC.Types.Int])
(ww1_snC :: [GHC.Types.Int])
(w_snE :: ([GHC.Types.Int], [GHC.Types.Int])) ->
let {
  ds_soa [ALWAYS Just D(SS)] :: ([GHC.Types.Int], [GHC.Types.Int])
  LclId
  [Str: DmdType]
  ds_soa =
case w_snE of _ { (c_adj, d_adk) ->

  -- There's no reference to the pair after this

case NewMerge.$wspMerge ww1_snC c_adj d_adk
of _ { (# ww3_snK, ww4_snL #) ->
(ww3_snK, ww4_snL)
}
} } in
(# GHC.Base.++
 @ GHC.Types.Int
 ww_snB
 (case ds_soa of _ { (bc_ajT, _) -> bc_ajT }),
   case ds_soa of _ { (_, b'_ajV) -> b'_ajV } #)

NewMerge.mergeSP :: ([GHC.Types.Int], [GHC.Types.Int])
-> ([GHC.Types.Int], [GHC.Types.Int])
-> ([GHC.Types.Int], [GHC.Types.Int])
GblId
[Arity 2
 Worker NewMerge.$wmergeSP
 Str: DmdType U(LL)Lm]
NewMerge.mergeSP =
  __inline_me (\ (w_snz :: ([GHC.Types.Int], [GHC.Types.Int]))
 (w1_snE :: ([GHC.Types.Int], [GHC.Types.Int])) ->
 case w_snz of _ { (ww_snB, ww1_snC) ->
 case NewMerge.$wmergeSP ww_snB ww1_snC w1_snE
 of _ { (# ww3_snQ, ww4_snR #) ->
 (ww3_snQ, ww4_snR)
 }
 })
--


Re: [Haskell-cafe] Re: FASTER primes

2010-01-12 Thread Daniel Fischer
On Tuesday 12 January 2010 11:30:07, Heinrich Apfelmus wrote:
> Tricky stuff. It is known that pairs/records are prone to unwanted
> retention, see for example the recent thread
>
>   http://thread.gmane.org/gmane.comp.lang.haskell.cafe/66903/focus=67871
>
> or
>
>   Jan Sparud. Fixing some space leaks without a garbage collector.
>   http://citeseer.ist.psu.edu/240848.html

"CiteSeerx is momentarily busy, Please try us again, in a few moments.
We apologize for any inconvenience."

:( tried and re-tried.

Took http://bert.md.chalmers.se/pub/cs-reports/papers/sparud/spaceleak-fix.ps.gz finally.

>
> It is exactly because of these troubles that I'm advocating the original
> VIP data structure that buries the dorks (that name is awesome :D) deep
> inside the structure. :)
>
>
> In any case, it appears to me that the lazy pattern match in  mergeSP
> is actually unnecessary! This is because  mergeSP  returns a pair
> constructor immediately, so infinite nesting works even when the second
> argument is demanded. In other words,
>
>   mergeSP :: Integral a => People a -> People a -> People a
>   mergeSP (P a b) (P c d) = P (a ++ bc) (merge b' d)
>   where
> P bc b' = spMerge b c
> spMerge [] ys = P [] ys
> spMerge xs [] = P [] xs
> spMerge xs@(x:xt) ys@(y:yt) = case compare x y of
> LT -> celebrate x (spMerge xt ys)
> EQ -> celebrate x (spMerge xt yt)
> GT -> celebrate y (spMerge xs yt)
>
> should work fine and do away with the memory issue because the pair is
> now deconstructed immediately.
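
For reference, the flat People ("VIP") structure used in the code quoted here is never defined in these excerpts; judging only from the names that appear (P, vips, dorks, celebrate), a minimal sketch would be something like:

data People a = P { vips :: [a], dorks :: [a] }

-- cons a new VIP onto the front; lazy in the rest of the structure,
-- like the consSP that appears elsewhere in this thread
celebrate :: a -> People a -> People a
celebrate x p = P (x : vips p) (dorks p)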

No, it doesn't. For the reason you stated below.
In

tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs

, it must compute smartfold f xs too early to match it against P c d. The 
compiler can't see that smartfold mergeSP always returns a P [well, it 
might return _|_, mightn't it?].

>
> Hm, a strict second argument might however mean that the tree produced
> by  tfold  is demanded too early.
>

>
>
> Regards,
> Heinrich Apfelmus
>
> --
> http://apfelmus.nfshost.com




Re: [Haskell-cafe] Re: FASTER primes

2010-01-09 Thread Daniel Fischer
On Friday 08 January 2010 18:37:21, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Thursday 07 January 2010 11:43:44, Heinrich Apfelmus wrote:
> > > Will Ness wrote:
> > >
> > > Hm? In my world view, there is only reduction to normal form and I
> > > don't see how "allocate its own storage" fits in there. Okasaki
> > > having shown something to that end would be new to me?
> >
> > Perhaps what was meant was "storage must be allocated for each filter"
> > (which, however, seems trivial).
>
> I still contend that in the case of nested filters, where each filter has
> only one consumer, even that isn't ultimately necessary (a chain of
> pointers could be formed). That's because there are no "peek"s, only
> "pull"s there. If the filters are used in a merge situation, then there
> will be some "peek"s, so the current value must be stored somehow, though
> it's better made explicit and thus static (the place for it, I
> mean, instead of the "rolling" cons cell). Such an implementation technique
> would prevent some extra GC.
>

Well, for each filter we need to store

1. a pointer to where in the wheel we are
2. the prime by which to multiply the steps of the wheel
3. the current value of the filter (which number to reject next)

I don't see how to use less. It's not much, but it's not zero storage.


Re: [Haskell-cafe] Re: FASTER primes

2010-01-09 Thread Daniel Fischer
On Saturday 09 January 2010 08:04:20, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Friday 08 January 2010 19:45:47, Will Ness wrote:
> > > Daniel Fischer  web.de> writes:
> >
> > It's not tail-recursive, the recursive call is inside a celebrate.
>
> It is (spMerge that is).

No.
"In computer science, tail recursion (or tail-end recursion) is a special 
case of recursion in which the last operation of the function, the tail 
call, is a recursive call."

The last operation of spMerge is a call to celebrate or the pair 
constructor (be that P or (,)). Doesn't matter, though, as for lazy 
languages, tail recursion isn't very important.
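
A tiny illustration of why that is (not from the thread): under lazy evaluation, "guarded" recursion such as celebrate's, where the recursive call sits under a constructor, streams its result and uses no stack even though it is not a tail call.

-- Not tail recursive, yet take k (upFrom 0) runs in constant stack space,
-- because each cons cell is returned before the recursive call is forced.
upFrom :: Integer -> [Integer]
upFrom n = n : upFrom (n + 1)
-- take 3 (upFrom 5) == [5,6,7]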

> It calls tail-recursive celebrate in a tail
> position. What you've done is to eliminate the outstanding context by
> moving it inward. Your detailed explanation is clearer than that. :)
>
> BTW when I run VIP code it is consistently slower than using just pairs,

I can't reproduce that. Ceteris paribus, I get the exact same allocation 
and GC figures whether I use People or (,), running times identical enough 
(difference between People and (,) is smaller than the difference between 
runs of the same; the difference between the fastest and the slowest run of 
the two is less than 0.5%). I think it must be the other changes you made.

> modified with wheel and feeder and all. So what's needed is to
> re-implement your approach for pairs:
>
>  mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
>in (a ++ bc, bd)
>  where
>   spMerge u [] d = ([], merge u d)
>   spMerge u@(x:xs) w@(y:ys) d = case compare x y of
>LT -> consSP x $ spMerge xs w  d
>EQ -> consSP x $ spMerge xs ys d
>GT -> consSP y $ spMerge u  ys d
>
>  consSP x ~(a,b) = (x:a,b)   -- don't forget that magic `~` !!!

I called that (<:).
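
As a small aside (not from the thread), the `~` really is essential: without it, the pair coming back from the recursive spMerge call is forced before anything is produced, so the result cannot stream. A minimal sketch:

consLazy, consStrict :: a -> ([a], b) -> ([a], b)
consLazy   x ~(a, b) = (x:a, b)   -- head is available before the pair is forced
consStrict x  (a, b) = (x:a, b)   -- forces the incoming pair first

-- With the lazy version, even a self-referential use keeps producing:
onesGood :: ([Int], ())
onesGood = consLazy 1 onesGood    -- take 3 (fst onesGood) == [1,1,1]
-- The strict version would loop forever on the same definition:
-- onesBad = consStrict 1 onesBad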

>
>
> BTW I'm able to eliminate sharing without a compiler switch by using
>

Yes, I can too. But it's easy to make a false step and trigger sharing.

I can get a nice speedup (~15%, mostly due to much less garbage collecting) 
by doing the final merge in a function without unnecessarily wrapping the 
result in a pair (whose second component is ignored):

-- Doesn't need -fno-cse anymore,
-- but it needs -XScopedTypeVariables for the local type signatures

primes :: forall a. Integral a => () -> [a]
primes () = 2:3:5:7:11:13:calcPrimes 17 primes''
   where
calcPrimes s cs = rollFrom s `minus` compos cs
bootstrap   = 17:19:23:29:31:37:calcPrimes 41 bootstrap
primes' = calcPrimes 17 bootstrap
primes''= calcPrimes 17 primes'

pmults :: a -> ([a],[a])
pmults p = case map (*p) (rollFrom p) of
(x:xs) -> ([x],xs)

multip :: [a] -> [([a],[a])]
multip ps = map pmults ps

compos :: [a] -> [a]
compos ps = case pairwise mergeSP (multip ps) of
((a,b):cs) -> a ++ funMerge b (pairwise mergeSP cs)

funMerge b (x:y:zs) = let (c,d) = mergeSP x y
in mfun b c d (pairwise mergeSP zs)

mfun u@(x:xs) w@(y:ys) d  l = case compare x y of
LT -> x:mfun xs w d l
EQ -> x:mfun xs ys d l
GT -> y:mfun u ys d l
mfun u [] d l = funMerge (merge u d) l


This uses a different folding structure again, which seems to give slightly 
better performance than the original tree-fold structure. In contrast to 
the VippyPrimes, it profits much from a larger allocation area, running 
with +RTS -A2M gives a >10% speedup for prime # 10M/20M, +RTS -A8M nearly 
20%. -A16M and -A32M buy a little more, but in that range at least, it's 
not much (may be significant for larger targets).
Still way slower than PQ, but the gap is narrowing.
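
For orientation, a hedged usage sketch; the excerpt above assumes `rollFrom`, `minus`, `merge`, `mergeSP` and `pairwise` from the rest of the program, which are not shown here:

main :: IO ()
main = do
  print (take 10 (primes ()) :: [Int])      -- [2,3,5,7,11,13,17,19,23,29]
  print (primes () !! (10^7 - 1) :: Int)    -- the 10-millionth prime, as timed above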

>
>  mtwprimes () = 2:3:5:7:primes
>where
> primes = doPrimes 121 primes
>
>  doPrimes n prs = let (h,t) = span (< n) $ rollFrom 11
>   in h ++ t `diff` comps prs
>  doPrimes2 n prs = let (h,t) = span (< n) $ rollFrom (12-1)
>in h ++ t `diff` comps prs
>
>  mtw2primes () = 2:3:5:7:primes
>where
> primes  = doPrimes 26 primes2
> primes2 = doPrimes2 121 primes2
>
>
> Using 'splitAt 26' in place of 'span (< 121)' didn't work though.
>
>
> How about them wheels? :)
>

Well, what about them?



Re: [Haskell-cafe] Re: FASTER primes

2010-01-08 Thread Daniel Fischer
On Friday 08 January 2010 19:45:47, Will Ness wrote:
> Daniel Fischer  web.de> writes:

> >
> > mergeSP :: Integral a => People a -> People a -> People a
> > mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
> >   where
> > mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
> > spMerge l1 [] l3 = P [] (merge l1 l3)
> > spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
> > LT -> celebrate x (spMerge xs l2 l3)
> > EQ -> celebrate x (spMerge xs ys l3)
> > GT -> celebrate y (spMerge l1 ys l3)
> >
> > --
>
> Hi Daniel,
>
> Is it so that you went back to my fold structure?

Yes.

> Was it better for really big numbers of primes?

Yes, it is slower for <= 20 million primes (and some beyond), but faster 
for >= 50 million (and some before), according to the few tests I made.

>
> I had the following for ages (well, at least two weeks) but I thought it
> was slower and took more memory (that was _before_ the 'no-share' and
> 'feeder' stuff). The only difference I can see is that you've re-written
> spMerge in a tail-recursive style with explicitly deconstructed parts;

It's not tail-recursive, the recursive call is inside a celebrate.

As it turns out, the important things are

1. a feeder and separate lists of multiples for the feeder and the runner, 
for the reasons detailed earlier (two-step feeding and larger wheel are 
pleasing but minor optimisations).

2. a strict pattern in tfold

3. moving the merge inside spMerge (you can have

mergeSP (a,b) ~(c,d) = (a ++ bc, m)
   where
  (bc,m) = spMerge b c
  spMerge u [] = ([],merge u d)
  ...

without exploding memory, but it's *much* slower than letting spMerge take 
three arguments)

Now, I have to admit, the only point I _really_ understand is 1. (and why 
the three-argument spMerge is faster than the two-argument one taking the 
merge-partner from mergeSP's binding :).

Why has

mergeSP (a,b) ~(c,d)
   = let (bc,b') = spMerge b c in (a ++ bc, merge b' d)

a memory leak, but

mergeSP (a,b) ~(c,d)
   = let (bc,m) = spMerge' b c d in (a ++ bc, m)

not?

Well, looking at the core for mergeSP, the fog clears somewhat. The former 
is translated roughly to 

mergeSP (a,b) pair
   = let sm = spMerge b (fst pair) 
 in (a ++ fst sm, merge (snd sm) (snd pair))

It holds on to the pair until the result of the merge is demanded, that is 
until the entire (a ++ fst sm) is consumed. Only then is the pair released 
and can be collected. On top of that, as soon as a is consumed and (fst sm) 
[or bc] is demanded, spMerge forces the evaluation of (fst pair) [c]. After 
a few levels, the evaluated list will take more space than the thunk. It 
cannot yet be collected, because pair is still alive. The elements have to 
be duplicated for (fst sm), because they're intertwined with those of b.
On the next level of merging, they have to be duplicated again.

The latter is translated roughly to

mergeSP (a,b) pair
   = let sm = spMerge' b (fst pair) (snd pair)
 in (a ++ fst sm, snd sm)

The pair is released as soon as the result of the spMerge' is demanded, 
that is, when a is consumed. Then the elements of (fst pair) need not be 
duplicated and they can be discarded almost immediately after they have 
been produced [for small primes, multiples of larger primes live a little 
longer, but those are fewer].

So, no duplication, shorter lifespan => no leak.
Having seen that, the question is, why can't the compiler see that 
deconstructing the pair when the first component is needed is better? The 
first component of the pair is used in no other place, so keeping the pair 
can't have any advantage, can it?

And why does

tfold f (a: ~(b: ~(c:xs))) = ...

leak, but not

tfold f (a:b:c:xs) = ...

?

I guess it's similar.

tfold f (a: ~(b: ~(c:xs))) 
   = (a `f` (b `f` c)) `f` tfold f ([pairwise f] xs)

is

tfold f (a:zs) 
   = (a `f` (head zs `f` (head $ tail zs))) 
`f` tfold f (pairwise f $ drop 2 zs)

the latter part holds on to the beginning of zs, leading again to data 
duplication and too long lifespans.

> mine was relying on the compiler (8-L) to de-couple the two pipes and
> recognize that the second just passes along the final result, unchanged.
>
> The two versions seem to me to be _exactly_ operationally equivalent.

Well, they're not. The main difference is what we told the compiler _when_ 
to deconstruct the pairs. You said "at the latest possible moment", I said 
"as soon as we need the first component".

It's not entirely obvious, but it is frequently said that understanding the 
space (and time) behaviour of lazy evaluation isn't always easy.

> All this searching for the code better understood by the compiler is
> _*very*_ frustrating, as it doesn't reflect on the semantics of the
> code, or even the operational semantics of the code.  :-[
>
> Weren't the P's supposed to disappear completely in the compiled code?

Re: [Haskell-cafe] Re: FASTER primes

2010-01-07 Thread Daniel Fischer
On Thursday 07 January 2010 17:13:38, Daniel Fischer wrote:
>     compos :: [a] -> [a]
>     compos = vips . itfold mergeSP . multip

Sigh! That's what I get for editing the code in the mail editor. I decided 
to change the really stupid 'itfold' to 'smartfold' and forgot this 
occurrence.


Re: [Haskell-cafe] Re: FASTER primes

2010-01-06 Thread Daniel Fischer
On Wednesday 06 January 2010 19:13:01, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Wednesday 06 January 2010 00:09:07, Will Ness wrote:
> > > Daniel Fischer  web.de> writes:
> > > > On Monday 04 January 2010 22:25:28, Daniel Fischer wrote:
> > > > Fix rfold:
> > > >
> > > > rfold f [x] = x
> > > > rfold f xs = rfold f (pairwise f xs)
> > > >
> > > > and it's faster also for those.
> >
> > The memory is almost completely due to the tree-merging of the multiples
> > for the fastest runner. While it produces faster than flat merging, the
> > exponential growth of the trees makes it a bad memory citizen.
>
> Isn't the number of nodes the same in any two trees with the same number of
> leaves?

Sure. And I don't know why it takes *that much* memory, but since a flat merge
doesn't consume much memory, it must be the trees.

>
> BTW using
>
>  compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP $ map pmults ps
>
> instead of
>
>  compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP
>  $ pairwise mergeSP $ map pmults ps
>
> brings down memory consumption further by 25%..45% on 1..8 mln primes
> produced, while slowing it down by about 0%..2% (that's after eliminating
> the lazy pattern in tfold as per your advice).
>
So much? Wow. I forgot the numbers, but I had tried that too, I thought the memory gain 
wasn't worth the speed loss. Another thing that reduces memory usage is


compos :: [a] -> [a]
compos ps = case pairwise mergeSP (multip ps) of
((a,b):cs) -> a ++ funMerge b (nwise 1 mergeSP $ pairwise mergeSP cs)

funMerge b (x:y:zs) = let (c,d) = mergeSP x y
in mfun b c d zs

mfun u@(x:xs) w@(y:ys) d  l = case compare x y of
LT -> x:mfun xs w d l
EQ -> x:mfun xs ys d l
GT -> y:mfun u ys d l
mfun u [] d l = funMerge (merge u d) l

That's about the same speed as the latter of the above here and uses about 25% less 
memory. Removing one or two 'pairwise's brings memory down further, but makes it slower.




Re: [Haskell-cafe] Re: FASTER primes

2010-01-05 Thread Daniel Fischer
On Wednesday 06 January 2010 00:09:07, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Monday 04 January 2010 22:25:28, Daniel Fischer wrote:
> > > memory still grows, but much slower, in my tests, due to the much
> > > smaller GC time, it's a bit faster than the version with the original
> > > tfold.
> >
> > Not for larger inputs (but not so large that the tree-fold dies OOM).
> > Fix rfold:
> >
> > rfold f [x] = x
> > rfold f xs = rfold f (pairwise f xs)
> >
> > and it's faster also for those.
>
> Niiice This is just great!  :)
>
> I tried a two-step feed BTW (that's three separate sets of lists), with
> the original structure. It ran at the same speed as your new version (10..20%
> faster) but with the memory of the previous one :) (80M for 8 mil primes vs
> the new one's 10M).

The memory is almost completely due to the tree-merging of the multiples for the fastest 
runner. While it produces faster than flat merging, the exponential growth of the trees 
makes it a bad memory citizen.
With the nwise and rfold, a two-step (or even three-step) feeding doesn't gain nearly as 
much (I found about 1% speedup).

> But your new structure is just great! I hoped there is
> something better, that's why I posted it here in the first place.
>

I have two more goodies :)
1. now that the trees don't grow so fast, don't use lazy patterns in the 
definition of tfold:

tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` tfold f xs

gains something like 6-7% here (and uses a little less memory).

2. Now we have a big wheel, 5760 differences per period. Then, in rollFrom, dropping 
a few thousand elements from the wheel after calculating how many to drop is slow:

rollFrom n = go ((n-17) `rem` 30030) wheel13
  where
go r xxs@(x:xs)
| r < x = roll n xxs
| otherwise = go (r-x) xs

gains another couple of percent for large targets (~1% for the 10M prime, ~2% for 20M; I 
expect that to stay in the region of 1.5-3% for larger targets).
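
wheel13 and roll are not shown in these excerpts; for orientation, here is the same idea on the much smaller 2-3-5 wheel, as a sketch with my own names:

-- The gaps between consecutive numbers coprime to 2*3*5 = 30, repeated forever.
wheel235 :: [Int]
wheel235 = cycle [4,2,4,2,4,6,2,6]

-- Roll the wheel from a starting point: add up the gaps one by one.
roll235 :: Int -> [Int] -> [Int]
roll235 n (d:ds) = n : roll235 (n + d) ds
roll235 n []     = [n]             -- unreachable for the cyclic wheel

spin235 :: [Int]
spin235 = roll235 7 wheel235
-- take 10 spin235 == [7,11,13,17,19,23,29,31,37,41]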

> 'pairwise' puts odd leaves higher on the right. It might be better if it were
> so on the left, since the frequency of production is higher there.

Maybe. But how would you do it? I tried passing the length to rfold, so when there was an 
odd number of trees in the list, it would move the first out of the recursion. Any possible 
gains in production have been more than eaten up by the control code (not a big 
difference, but it was there).

>
>
> Thanks a lot for your comments!




Re: [Haskell-cafe] Re: FASTER primes

2010-01-05 Thread Daniel Fischer
On Tuesday 05 January 2010 14:49:58, Will Ness wrote:
> > ... There are two attempts to eliminate 45.
>
> I would say there are two requests to not have 45 in the output.
>
There are many possible ways to phrase it.

> > > I don't see any problem here. As Melissa (and yourself, I think) have
> > > shown, double hits on multiples are few and far between.
> >
> > It's not a problem, it just means it's not Euler's sieve, because that
> > attempts to eliminate each composite exactly once.
>
> yes, I see now. My bad. I wasn't reading that Wikipedia page with enough
> attention to detail. It uses the modified (culled so far) numbers to find
> out the next multiples to be eliminated,

Minor factual error, no big deal.

> not just the prime itself like my code does.

Which is operationally far better because it's much simpler :)

>
> Your solution is indeed exponential:
>
>   euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
>   primes = 2:euler [3,5..]
>
>
>   primes
>    = 2:euler (as@[3,5..])
>        3:euler (bs@(tail as `minus` map (3*) as))
>          5:euler (cs@(tail bs `minus` map (5*) bs))
>
>
> There are two separate look-back pointers to /as/ in /bs/, and there are
> two such in /cs/, to /bs/. The book-keeping machinery explodes.
>
> > > Also, it uses no filters, i.e. no individual number divisibility
> > > testing. The "filters" refers first of all to testing an individual
> > > number to decide whether to keep it in or not.
> >
> > Umm, the postponed filters says keep everything until p^2, then eliminate
> > (filter out, remove) multiples of p in the remainder, after that, pick
> > next prime.
> > That's precisely what the above does. It doesn't do the filtering out by
> > divisibility testing but by minus (hence more efficiently). I would say
> > that's a variant of the postponed filters.
>
> Filter usually (as in Haskell's 'filter') is about testing individual
> elements by a predicate function. There is of course a filtering effect in
> the two-list element comparison that 'minus' performs, so that's debatable. Even
> the PQ code performs "filtering" in this wider sense.
>

I understood the 'filters' in postponed filters in that wider sense. And I would also find 
it perfectly acceptable to say that the PQ code filters out the composites from the 
stream. Of course if you're used to using 'filter' only in the stricter sense, you 
wouldn't call the `minus` thing a variant of the postponed filters.


>
> not just primes - all the composites not yet removed, too.

Between p and p^2, there are only primes left, fortunately.

> So it can't even be implemented on shared mutable storage if we
> want to delay the removal (as we must, in unbounded case) -

Yes. And it's not nice in the bounded case either.



Re: [Haskell-cafe] Re: FASTER primes

2010-01-05 Thread Daniel Fischer
On Tuesday 05 January 2010 10:33:19, Will Ness wrote:
> Such a
> system would probably have to distinguish, at the type level, between
> [1..m] ; cycle [1..m] ; take n [1..m] ; etc. These would all be not just
> fuctions, but parts of a type's (here list) behaviour with automatically
> deduced semantics.
>
> What would such a type system be called?
>

Seriously complicated, I think.
Don't get me wrong, a system with such capabilities would be A.W.E.S.O.M.E.
I just can't see it happening anytime soon.


>
> I would imagine so. Do I get this fusion on lists for free from the
> compiler, or do I have to recode for that? (haven't yet time to look into
> the article mentioned).
>
Without optimisations, hardly ever if at all (needs a compiler expert to know).
With optimisations, sometimes. But it doesn't always see the possibility where it would 
with recoding in the right style (dons is expert in that).
With stream fusion, more often. But I think it would again be not very hard to hide 
opportunities for fusion by your coding style.



Re: [Haskell-cafe] Re: FASTER primes

2010-01-05 Thread Emil Axelsson

Will Ness wrote:

Emil Axelsson  chalmers.se> writes:

For me, a real smart compiler is one that would take in e.g. (sum $ take n $ 
cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs 
maybe (just an aside).

(Also an aside, I couldn't resist...)

Then I'm sure you'd say that Feldspar [1] has a smart compiler :)



but it didn't produce

f n m = if n < m then n*(n+1)/2 else
let (q,r)=quotRem n m
in q*(m*(m+1)/2) + r*(r+1)/2

:)


Ah, I see... Yes, that would be a very smart compiler :)

/ Emil




Re: [Haskell-cafe] Re: FASTER primes

2010-01-04 Thread Daniel Fischer
On Monday 04 January 2010 22:25:28, Daniel Fischer wrote:
> compos ps = fst (tfold mergeSP $ nwise 1 mergeSP (pairwise mergeSP (multip ps)))
>
> tfold f (a: ~(b: ~(c:xs)))
>  = (a `f` (b `f` c)) `f` tfold f xs
>
> nwise k f xs = let (ys,zs) = splitAt k xs in rfold f ys : nwise (k+1) f zs
>
> rfold f [x] = x
> rfold f (x:xs) = x `f` rfold f xs
>
> memory still grows, but much slower, in my tests, due to the much smaller
> GC time, it's a bit faster than the version with the original tfold.

Not for larger inputs (but not so large that the tree-fold dies OOM).
Fix rfold:

rfold f [x] = x
rfold f xs = rfold f (pairwise f xs)

and it's faster also for those.
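
(`pairwise` is used throughout this thread but not defined in these excerpts; from its use here, combining adjacent elements with f and halving the list each pass, the intended definition is presumably:)

pairwise :: (a -> a -> a) -> [a] -> [a]
pairwise f (x:y:rest) = f x y : pairwise f rest
pairwise _ xs         = xs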


Re: [Haskell-cafe] Re: FASTER primes

2010-01-04 Thread Daniel Fischer
On Monday 04 January 2010 16:30:18, Will Ness wrote:
> Heinrich Apfelmus  quantentunnel.de> writes:
> >
> > (I haven't followed the whole thread, but hopefully I have enough grasp
> > of it to make a useful remark. :))
> >
> > Concerning lists as producer/consumer, I think that's exactly what lazy
> > evaluation is doing. Neither  filter ,  map  or  span  evaluate and
> > store more list elements that strictly necessary.
>
> I always suspected as much, but was once told that Chris Okasaki has shown
> that any filter etc must allocate its own storage. With the peek/pull they
> don't have to, if they are nested, and the deepest one from the real
> storage gets pulled through some pointer chasing eventually. Span isn't so
> easily compiled out too or is it? But that might be a minor point.
>
> For me, a real smart compiler is one that would take in e.g. (sum $ take n
> $ cycle $ [1..m]) and spill out a straight up math formula, inside a few
> ifs maybe (just an aside).

Such things just aren't common enough. If you start letting the compiler look for patterns 
such as this which can be transformed into a simple formula, compile times would explode.

>
> Such a smart compiler might even be able to derive a well performing code
> right from the Turner's sieve. :)
>
> > Sure, creating a list head only to immediately consume it is somewhat
> > inefficient -- and the target of stream fusion[1] -- but this is an
> > overhead of how list elements are stored, not how many.
>
> it might be equivalent to the (imagined) producer's storing its 'current'
> value inside its frame.
>
> How much can we rely on the run-time to actually destroy all the
> passed-over elements and not hang on to them for some time?

If they're really passed over, they very probably won't survive the next GC.

> Is this that compiler switch that Daniel mentioned? Is it reliable?
>
No, it's something different. Not entirely unrelated.
The -fno-cse turns off Common Subexpression Elimination (rather sharing than 
elimination).
That is, if you have

f x = g (expensive x) y z (h (expensive x))

the compiler can see the common subexpression (expensive x) on the RHS and 
decide to share it, i.e. calculate it only once:

f x = let e = expensive x in g e y z (h e)

That may or may not be a good idea. If e is small (say an Int) and expensive isn't a 
blatant lie, it is a *very* good idea. But if e is a loong list that expensive x lazily 
produces and that is consumed by g and h in the right manner, it is often a bad idea 
because h might not start to consume before g has demanded a long prefix of the list and 
kaboom, Heap exhausted.

If you turn on optimisations, the compiler does some looking for common subexpressions 
to share. There are some heuristics to decide when to share, but they're far from 
infallible. So, if you have a situation like above, and the compiler heuristics indicate 
that sharing would be good although it isn't, you're hosed. So you have the option to tell 
the compiler "try hard to optimise, but don't do CSE (because I know it would be bad 
here)" {- that can be done on a per-file basis, it would be cool if it could be done on a 
per-function basis -}.

Now if you have a list producer and a consumer, without fusion, it goes like
Consumer: Gimme
Producer: creates cons cell, puts value, next cons cell (how to produce more)
Consumer: takes value, does something with it, gimme more.

Stream fusion is about eliminating the cons cell creation and value passing, to fuse 
production and consumption into a nice loop. That is of course impossible if the produced 
list is shared between two (or more) consumers.
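
As an illustration (not from this thread) of the kind of loop such fusion is after, here is sum (map (*2) [1..n]) written out by hand, with production and consumption fused and no cons cells ever built:

sumDoubled :: Int -> Int
sumDoubled n = go 1 0
  where
    -- one accumulating loop instead of generating, mapping and summing a list
    go i acc
      | i > n     = acc
      | otherwise = go (i + 1) (acc + 2 * i)
-- sumDoubled 4 == 20 == sum (map (*2) [1..4])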



Re: [Haskell-cafe] Re: FASTER primes

2010-01-04 Thread Daniel Fischer
On Monday 04 January 2010 19:16:32, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Monday 04 January 2010 13:25:47, Will Ness wrote:
> > > Euler's sieve is
> > >
> > >  sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
> > >   where (h,t) = span (< p*p) xs
> >
> > Not quite. That's a variant of the postponed filters, it crosses off e.g.
> > 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed
> > by the first, so let's say 45 appears in two lists of numbers to be
> > removed if present).
>
> there won't be any such. Whatever is removed first is just skipped second
> (and third, etc). 45 does appear twice on the two multiples lists (the 3- and
> the 5-list). But it is "removed" by the first, and skipped by the second. And there's no
> removal here in the first place. There is no pre-filled storage. All there
> is, is some comparison of lists, and lists are just generators (I so keep
> hoping).

((45:(offer 47 when demanded)) `minus` (45:(next will be 51 when demanded)))
  `minus` (45:(next will be 55 when demanded))

So there are two attempts to tell the generator to not output 45. To the second, it 
answers "I already knew that", but the request is made nevertheless.

Lists sometimes are data structures occupying physical storage (cons cell, pointer to value, 
pointer to next...), sometimes generators (when asked for the next value, I'll answer 
this). I say a value is removed from a list regardless of whether that takes relinking 
pointers in a list of cons cells or it involves telling the generator not to give out that 
value. I could also say 'eliminated'. There are two attempts to eliminate 45.

>
> I don't see any problem here. As Melissa (and yourself, I think) have
> shown, double hits on multiples are few and far between.

It's not a problem, it just means it's not Euler's sieve, because that attempts to 
eliminate each composite exactly once.

>
> Also, it uses no filters, i.e. no individual number divisibility testing.
> The "filters" refers first of all to testing an individual number to decide
> whether to keep it in or not.

Umm, the postponed filters says keep everything until p^2, then eliminate (filter out, 
remove) multiples of p in the remainder, after that, pick next prime.
That's precisely what the above does. It doesn't do the filtering out by divisibility 
testing but by minus (hence more efficiently). I would say that's a variant of the 
postponed filters.

> Euler's sieve removes multiples in advance,
> so there's no testing and no filtering, only comparison. It follows the
> _Postponed_ Filter's framework in postponing the action until the right
> moment; the action itself is two lists comparison and skipping of the
> equals (i.e. the "minus" action).
>
> > Euler's sieve is never attempting to remove a number more than once,
> > that's
>
> How's that possible?

http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Euler.27s_Sieve

"Euler in his Proof of the Euler product formula for the Riemann zeta function 
came up 
with a version of the sieve of Eratosthenes, *better in the sense that each 
number was 
eliminated exactly once*."

C) The number after the previous prime is also a prime. *Multiply each number in the list 
starting from this prime by this prime and discard the products*.

> On Wikipedia it says, it removes multiples of 3; then
> multiples of 5; etc. So it just looks for each number it is removing, if it
> is there, and if so, removes it. I would argue that looking whether to
> remove or not, is part of attempting to remove. It can't have foresight,
> right?
>

But it has :) By only trying to eliminate products of the prime p currently under 
consideration *with numbers (>= p) which have not yet been eliminated from the list*, it 
is known in advance that all these products are still in the list.

When p is under consideration, the list contains (apart from the primes < p) precisely the 
numbers whose smallest prime factor is >= p.

> > the outstanding feature of it. Unfortunately, that also makes it hard to
> > implement efficiently. The problem is that you can't forget the primes
> > between p and p^2, you must remember them for the removal of multiples of
> > p later on.
>
> you're right, every formulation of these algorithms is always done in the
> imperative, mutable-storage setting. They always speak of "removing"
> numbers etc.

sed s/remove/eliminate/ ?

>
> The more pressing problem with that code is its linear structure of course
> which gets addressed by the tree-folding merge etc.
>

Which unfortunately introduces its own space problem :(

> >  An (inefficient but faithful) implementation would be
> >
> > euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
>
> I think it does exactly the same thing, computationally,

No, in map (*p) ks, there are no numbers with a smaller prime factor than p, while in 
map (*p) [p, p+2 .. ] there are (for p > 3).

> except it does so,
> again,  _prematurely_  exactly in Tu

Re: [Haskell-cafe] Re: FASTER primes

2010-01-04 Thread Daniel Fischer
On Monday 04 January 2010 18:08:42, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Sunday 03 January 2010 09:54:37, Will Ness wrote:
> > > The quesion of a memory blowup with the treefolding merge still
> > > remains. For some reason using its second copy for a feeder doesn't
> > > reduce the memory (as reported by standalone compiled program, GHCi
> > > reported values are useless) - it causes it to increase twice.
> >
> > I have a partial solution. The big problem is that the feeder holds on to
> > the beginning of comps while the main runner holds on to a later part.
> > Thus the entire segment between these two points must be in memory.
> >
> > So have two lists of composites (yeah, you know that, but it didn't work
> > so far).
> >
> > But you have to force the compiler not to share them: enter -fno-cse.
> > The attached code does that (I've also expanded the wheel), it reduces
> > the memory requirements much (a small part is due to the larger wheel, a
> > factor of ~5 due to the non-sharing).
>
> I don't understand. What is there to be shared? Each multiples list is
> consumed only at one point; there's nothing to be shared. Do you mean the
> compiler still hangs on to them? If so, why??

Okay, let's look at the code.
Your Primes.hs says


{-# SPECIALIZE primes :: () -> [Int] #-}
{-# SPECIALIZE primes :: () -> [Integer] #-}
primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    (comps,_)  = tfold mergeSP (pairwise mergeSP mults)
    mults   = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'
    fromList (x:xs) = ([x],xs)

What's going on evaluating primes'?
First, we know primes' is 11 : 13 : more.
Then we can start looking at 'more'. For that, we need 
rollFrom 11 `minus` comps.
rollFrom 11 is easy. What about comps? We need the start of primes' for that. Cool, we 
already know the first two elements, start tfolding, we know comps is 
121:143:169:notYetKnown

So the rollFrom 11 produces happily, minus doesn't veto anything for a while, then says no 
to 121, 143, 169. Now rollFrom 11 says "look at 173". Fine, what's next in comps?
To find that out, we need the multiples of 17 and 19 (and 23 and 29, but those can still 
be deferred for a while). That gives ([289,323,361],whatever), then the mergeSP with the 
merged multiples of 11 and 13 produces (187:209:221:...) for notYetKnown.
Everything's going on fine, but the problem is, when primes' produces primes in the region 
of n, it needs the comps in that region. Those contain the multiples of primes in the 
region of sqrt n, thus the small primes cannot yet be released.
Bottom line, to produce p, all the primes from about sqrt p to p must be in memory. Oops.

Enter the feeder. Say

primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes' = (rollFrom 11) `minus` comps
    primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    (comps,_)  = tfold mergeSP (pairwise mergeSP mults)
    mults   = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    fromList (x:xs) = ([x],xs)

Now primes'', the feeder of composites, advances much more slowly, and for primes' to 
produce p, primes'' must produce the next prime after sqrt p, thus holds on to all primes 
between sqrt (sqrt p) and sqrt p - not much of a problem. primes' doesn't hold on to 
smaller primes, great.
But! primes' needs the part of comps around p. primes'' needs the part of comps around 
sqrt p. That means we must keep the part of the list of composites from sqrt p to p in 
memory. Even after removing the multiples of 2, 3, 5 and 7, it doesn't take long until 
there are far more composites between sqrt p and p than primes. Double oops.

So we must make sure that the list of composites that primes' consumes is not the same as 
that which primes'' consumes.

You can try


primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes' = (rollFrom 11) `minus` comps
    primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps2
    (comps,_)  = tfold mergeSP (pairwise mergeSP mults)
    mults   = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    (comps2,_)  = tfold mergeSP (pairwise mergeSP mults2)
    mults2   = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    fromList (x:xs) = ([x],xs)

If you compile without optimisations, it works. It just is rather slow. Unfortunately, the 
optimiser sees that comps2 and comps are the same and thinks, "why should I produce it 
twice if they're the same?" - whoops, back to square two.
Tell the compiler that you do *not* want it shared via -fno-cse and it isn't shared, you 
really have two distinct lists of composites, happiness ensues - almost.

>
> I used the switch; it didn't help at all. The only thing I can see that is
> different is that all my interim data, which I named with inner vars, you
> moved out to the top level as functions. Is that what did the trick?

Not really, the abo

Re: [Haskell-cafe] Re: FASTER primes

2010-01-04 Thread Daniel Fischer
On Monday 04 January 2010 13:25:47, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > On Sunday 03 January 2010 09:54:37, Will Ness wrote:
> > > Daniel Fischer  web.de> writes:
> > > > But there's a lot of list construction and deconstruction necessary
> > > > for the Euler sieve.
> > >
> > > yes. Unless, of course, a smart compiler recognizes there's only one
> > > consumer for the values each multiples-list is producing, and keeps it
> > > stripped down to a generator function, and its current value. I keep
> > > thinking a smart compiler could eliminate all these "span" calls and
> > > replace them with just some pointer manipulation...
> >
> > Of course I'm no smart compiler, but I don't see how it could be even
> > possible to replace the span calls with pointer manipulation when dealing
> > with lazily generated (infinite, if we're really mean) lists. Even when
> > you're dealing only with strict finite lists, it's not trivial to do
> > efficiently.
>
> I keep thinking that storage duplication with span, filter etc. is not
> really necessary, and can be replaced with some pointer chasing -
> especially when there's only one consumer present for the generated values.
>
> What I mean is thinking of lists in terms of the producer/consumer paradigm, as
> objects supporting the { pull, peek } interface, keeping the generator
> inside that would produce the next value on 'pull' request and keep it
> ready for any 'peek's.
>
> Euler's sieve is
>
>  sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
>   where (h,t) = span (< p*p) xs

Not quite. That's a variant of the postponed filters, it crosses off e.g. 45 twice, once 
as 3*15 and once as 5*9 (okay, it has already been removed by the first, so let's say 45 
appears in two lists of numbers to be removed if present).
Euler's sieve is never attempting to remove a number more than once, that's the 
outstanding feature of it. Unfortunately, that also makes it hard to implement 
efficiently. The problem is that you can't forget the primes between p and p^2, you must 
remember them for the removal of multiples of p later on.

 An (inefficient but faithful) implementation would be

euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)

Its performance is really horrible though. A much better implementation is

primes = 2:3:euler hprs [] (scanl (+) 5 (cycle [2,4]))
  where
hprs = 5:drop 3 primes
euler (p:ps) acc cs = h ++ euler ps (tail (acc ++ h)) (t `minus` comps)
  where
(h,t) = span (< p*p) cs
comps = map (*p) (acc ++ cs)

still really bad. I don't yet see how to eat the cake and have it here.
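
(The `minus` and `merge` combinators used throughout this thread are nowhere defined in these excerpts; the standard definitions on ordered, increasing lists, which the code here presumably assumes, are:)

-- ordered-list difference: drop from xs every element that also occurs in ys
minus :: Ord a => [a] -> [a] -> [a]
minus xs@(x:xt) ys@(y:yt) = case compare x y of
  LT -> x : minus xt ys
  EQ ->     minus xt yt
  GT ->     minus xs yt
minus xs _ = xs

-- ordered merge; an element occurring in both lists is kept only once
merge :: Ord a => [a] -> [a] -> [a]
merge xs@(x:xt) ys@(y:yt) = case compare x y of
  LT -> x : merge xt ys
  EQ -> x : merge xt yt
  GT -> y : merge xs yt
merge xs [] = xs
merge [] ys = ys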

> > > > I think that remark was meant to apply to composite removal, not
> > > > Turner's sieve.
> > >
> > > It is right there on page 2, right when the Turner's sieve is presented
> > > and discussed. The only explanation that I see is that she thought of
> > > it in regards to the imperative code, just as her analysis concentrates
> > > only on calculation aspects of the imperative code itself.
> >
> > To be fair, she writes:
> >
> > "Let us first describe the original “by hand” sieve algorithm as practiced
> > by Eratosthenes.
> > ...
> > The starting point of p^2 is a pleasing but minor optimization, which can
> > be made
> > because lower multiples will have already been crossed off when we found
> > the primes prior to p.
> > ... (This optimization does not affect the time complexity of the sieve,
> > however, so its absence from the code in Section 1
> > *is not our cause for worry*.)"
>
> A-HA!
>
> But its absence from _*that*_ _*code*_ WAS the *major* cause for worry, as
> dealing with it worked wonders on its complexity and constant factors.

I think you're overinterpreting it. Sure, it's not perfectly worded, but her concern was 
primarily that it's not an Eratosthenes sieve but trial division (and woefully inefficient 
at that), I think. Now, because it's trial division and not Eratosthenes, it has a major 
impact, thus the absence of the optimisation is a reinforcing consequence of the original 
cause for worry.

I may be overdefending her, though, we can't really know what she meant.


Re: [Haskell-cafe] Re: FASTER primes

2010-01-04 Thread Emil Axelsson
For me, a real smart compiler is one that would take in e.g. (sum $ take n $ 
cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs 
maybe (just an aside). 


(Also an aside, I couldn't resist...)

Then I'm sure you'd say that Feldspar [1] has a smart compiler :)

The above expression written in Feldspar and the resulting C code can be 
found here:


  http://hpaste.org/fastcgi/hpaste.fcgi/view?id=15592#a15593

The C code is somewhat complicated by the fact that Feldspar doesn't 
have infinite vectors.


Feldspar usually works well on small examples like this one, but we're 
very much lacking bigger examples, so I can't advise you to use it for 
prime numbers just yet.


/ Emil


[1] http://feldspar.sourceforge.net/




Re: [Haskell-cafe] Re: FASTER primes

2010-01-03 Thread Daniel Fischer
On Sunday 03 January 2010 09:54:37, Will Ness wrote:
>
> Exactly the point I tried to make. :)
>

>
> again, yes. :)
>

>
> yes.
>

>
> yes, that's what I meant - the cost of calling all the fuctions that - we
> know in advance will - have nothing to do eventually.
>

8-)

> >
> > But there's a lot of list construction and deconstruction necessary for
> > the Euler sieve.
>
> yes. Unless, of course, a smart compiler recognizes there's only one
> consumer for the values each multiples-list is producing, and keeps it
> stripped down to a generator function, and its current value. I keep
> thinking a smart compiler could eliminate all these "span" calls and replace
> them with just some pointer manipulation...
>

Of course I'm no smart compiler, but I don't see how it could be even possible to replace 
the span calls with pointer manipulation when dealing with lazily generated (infinite, if 
we're really mean) lists. Even when you're dealing only with strict finite lists, it's not 
trivial to do efficiently.

> > That may be more work than the multiple hits cause.
>
> so that too would make filters win; only _if_ the cheap primality test
> existed!..
>
> > > > the multiples of p with 2*p, you have
> > > > ...
> > >
> > > There are two questions here - where to start crossing off numbers, and
> > > when. If you'd start at 2*p maybe the overall complexity would remain
> > > the same but it'll add enormous overhead with all those duplicate
> > > multiples.
> >
> > The additional duplicate multiples aren't the problem. Sure, the numbers
> > having a prime divisor larger than the square root would be crossed off
> > one additional time,  but that isn't so much per se. The additional
> > crossings off are O(bound), so they don't harm the time complexity. But
> > they can have effects which multiply the running time by a large
> > constant.
>
> yes, exactly what I wanted to say. :)
>

*g* I wondered what you meant by the distinction between where and when to 
cross off.

> > > No, the question is not where to start, but when. PQ might hide the
> > > problem until the memory blows up. Anything that we add that won't have
> > > any chance of contributing to the final result, is added for nothing
> > > and only drives the total cost up needlessly.
> >
> > Well, if you start crossing off at p^2 and feed your PQ from a separate
> > prime generator, the memory blow-up is far away. E.g., when the main
> > sieve is at 10^12, the PQ contains less than 8 multiples. No space
> > problem in sight. At 10^14, the PQ's size is still less than 70. 
> > That's noticeable, but there's still plenty of space.
>
> again, what I mean is, not _where_ I start crossing them off in a PQ, but
> _when_. The article's code starts crossing them off _at_ p^2 - by adding
> p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface
> until p^2 will be considered for a prime; it'll lay dormant deep inside the
> queue's guts. When reaching 7919, the thousand (i.e. pi(7919) ) entries
> will hang out inside the PQ - instead of just 24. A memory blowup. (this is
> of course fixed in Melissa's ZIP package). Of course due to the nature of
> PQ it might actually not hurt the performance for a while, depending on
> particular PQ implementation. Complexity _and_ constant factors.
>

It will start to have an impact pretty soon. Assuming at least one of the relevant
PQ operations is Theta(log size), each composite between ~400 and ~40 (rough
estimates) will take something like twice as long to handle. It will start to hurt
really badly only a while later, though; as a guesstimate, with more than a million
primes in the PQ, memory will have a hard time.
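
For concreteness, here is a rough sketch of that scheme - mine, not the article's
or the ZIP package's code - where the queue is fed from a separate, recursively
defined primes list and a prime p only enters the queue once the main scan reaches
p^2, so the queue holds about pi(sqrt n) entries instead of pi(n). Data.Map stands
in for a real priority queue, and primesPQ/push are made-up names:

--
import qualified Data.Map as M

primesPQ :: [Integer]
primesPQ = 2 : 3 : sieve 5 (tail primesPQ) M.empty
  where
    sieve x ps@(p:pt) m
      | x == p*p  = sieve (x+2) pt (push (x + 2*p) (2*p) m)
      | Just ss <- M.lookup x m
                  = sieve (x+2) ps (foldr (\s -> push (x+s) s) (M.delete x m) ss)
      | otherwise = x : sieve (x+2) ps m
    -- push k s: schedule the next multiple k of a prime whose step is s
    push k s = M.insertWith (++) k [s]
--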

> > If, on the other hand, you start crossing off at 2*p, when the main sieve
> > is at 10^7, the size of the PQ is > 65, at 10^8, the size is more
> > than 5.5 million. That starts to become a memory problem rather soon.
>
> here you don't have a choice of when to add it - you have to add it at p
> itself - so the problem is clear. But even when you cross at p^2, the
> question remains, of when you add the p^2 entry into the PQ. That was my
> point.
>
> Postponed Filters code makes this clear, and thus hard to err on.
> Unfortunately, it wasn't present in the article.
>
> > > > > I was under impression that the first shows O(n^2) approx., and the
> > > > > second one O(n^1.5) (for n primes produced).
> > > >
> > > > In Turner/postponed filters, things are really different. Actually,
> > > > Ms. O'Neill is right, it is a different algorithm. In the above, we
> > > > match each
> > >
> > > what _is_ different is divisibility testing vs composites removal,
> > > which follows from her in-depth analysis although it is never quite
> > > formulated in such words in her article. But nothing matters until the
> > > premature starting up is eliminated, and that key observation is
> > > missing from the article too - worse, it is brushed off with the
> > > casual remark that it 

Re: [Haskell-cafe] Re: FASTER primes

2010-01-02 Thread Daniel Fischer
On Saturday 02 January 2010 14:13:29, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > Am Mittwoch 30 Dezember 2009 20:46:57 schrieb Will Ness:
> > > Daniel Fischer  web.de> writes:
> > > > Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer:
> > > > > > especially the claim that going by primes squares
> > > > > > is "a pleasing but minor optimization",
> > > > >
> > > > > Which it is not. It is a major optimisation. It reduces the
> > > > > algorithmic complexity *and* reduces the constant factors
> > > > > significantly.
> > > >
> > > > D'oh! Thinko while computing sum (takeWhile (<= n) primes) without
> > > > paper. It doesn't change the complexity, and the constant factors are
> > > > reduced far less than I thought.
> > >
> > > I do not understand. Turner's sieve is
> > >
> > >   primes = sieve [2..]
> > >where
> > > sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]
> > >
> > > and the Postponed Filters is
> > >
> > >   primes = 2: 3: sieve (tail primes) [5,7..]
> > >where
> > > sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
> > >   where (h,~(_:t)) = span (< p*p) xs
> > >
> > > Are you saying they both exhibit same complexity?
> >
> > No. They don't.
> > But if you're looking at an imperative (mutable array) sieve (that's
> > simpler to analyse because you don't have to take the book-keeping costs
> > of your priority queue, heap or whatever into account), if you start
> > crossing out
>
> The key question then, is _*WHEN*_ and not _*WHAT*_. As is clearly
> demonstrated by the case of Turner/Postponed filters, the work that is done
> (of crossing numbers off) is the same

Crossing off is only part of the work. Most of the work is checking whether to
cross off the number in this round. And Turner does a lot more of that than the
postponed filters.

> - _when_ it is actually done - but
> Turner's starts out so _prematurely_ that it is busy doing nothing most of
> the time.

It's not doing nothing. It's just doing a lot of superfluous work. It is
_achieving_ nothing most of the time, though.
Take 7919, the thousandth prime. The postponed filters decide to keep it when
filtering out the multiples of 89, the twenty-fourth prime. Turner also divides it
by all 975 primes in between. That is a lot of real but futile work.

> Thus its function call overhead costs pile up enormously,
> overstaging the actual calculation.

It's not only the function calls. Division is expensive, too.

>
> So analyzing that calculation in the premature execution setting is missing
> the point, although helpful after we fix this, with the Postponed Filters.
> _Only then_ the finer points of this algorithm's analysis can be applied -
> namely, of avoiding testing primes divisibility altogether. And _if_ a fast
> cheap primality test were to have existed, the filtering versions would win

Sorry, I can't follow. What's the point of a primality test here? Every number
whose multiples we want to remove is prime, what's left to test?

> over, because they progressively cull the input sequence so there would be
> no double hits as we have when merging the multiples (whether from lists or
> inside the PQ).

But there's a lot of list construction and deconstruction necessary for the Euler
sieve. That may be more work than the multiple hits cause.

>
> > the multiples of p with 2*p, you have
> >
> > sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]
> >
> > crossings-out, that is Theta(bound*log (log bound)). If you eliminate
> > multiples of some small primes a priori (wheel), you can reduce the
> > constant factor significantly, but the complexity remains the same (you
> > drop a few terms from the front of the sum and multiply the remaining
> > terms with phi(n)/n, where n is the product of the excluded primes).
> >
> > If you start crossing out at p^2, the number is
> >
> > sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].
> >
> > The difference is basically sum (takeWhile (<= sqrt bound) primes), which
> > I stupidly - I don't remember how - believed to cancel out the main term.
> > It doesn't, it's O(bound/log bound), so the complexity is the same.
> >
> > Now if you take a stream of numbers from which you remove composites,
> > having a priority queue of multiples of primes, things are a little
> > different. If you start crossing out at 2*p, when you are looking at n,
> > you have more multiples in your PQ than if you start crossing out at p^2
> > (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more
> > expensive. But updating the PQ is O(log size), I believe, and log pi(n)
> > is O(log pi(sqrt n)), so I think it shouldn't change the complexity here
> > either. I think this would have complexity O(bound*log bound*log (log
> > bound)).
>
> There are two questions here - where to start crossing off numbers, and
> when. If you'd start at 2*p maybe the overall complexity would remain the
> same but it'll add enormous overhead with all those duplicate multiples.

Re: [Haskell-cafe] Re: FASTER primes

2009-12-30 Thread Daniel Fischer
On Wednesday 30 December 2009 20:46:57, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer:
> > > > especially the claim that going by primes squares
> > > > is "a pleasing but minor optimization",
> > >
> > > Which it is not. It is a major optimisation. It reduces the algorithmic
> > > complexity *and* reduces the constant factors significantly.
> >
> > D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper.
> > It doesn't change the complexity, and the constant factors are reduced
> > far less than I thought.
>
> I do not understand. Turner's sieve is
>
>   primes = sieve [2..]
>where
> sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]
>
> and the Postponed Filters is
>
>   primes = 2: 3: sieve (tail primes) [5,7..]
>where
> sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
>   where (h,~(_:t)) = span (< p*p) xs
>
> Are you saying they both exhibit same complexity?

No. They don't.
But if you're looking at an imperative (mutable array) sieve (that's simpler to
analyse because you don't have to take the book-keeping costs of your priority
queue, heap or whatever into account), if you start crossing out the multiples of p
with 2*p, you have

sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]

crossings-out, that is Theta(bound*log (log bound)). If you eliminate multiples of
some small primes a priori (wheel), you can reduce the constant factor
significantly, but the complexity remains the same (you drop a few terms from the
front of the sum and multiply the remaining terms with phi(n)/n, where n is the
product of the excluded primes).

If you start crossing out at p^2, the number is

sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].

The difference is basically sum (takeWhile (<= sqrt bound) primes), which I
stupidly - I don't remember how - believed to cancel out the main term.
It doesn't; it's O(bound/log bound), so the complexity is the same.
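
Just to put numbers on it, both sums are easy to evaluate directly against the
postponed filters' primes list quoted above (a quick check of mine; I test
p*p <= bound instead of p <= sqrt bound to stay within Integer, and
crossFrom2p/crossFromP2 are made-up names):

--
primes :: [Integer]
primes = 2 : 3 : sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
      where (h, ~(_:t)) = span (< p*p) xs

crossFrom2p, crossFromP2 :: Integer -> Integer
crossFrom2p bound = sum [bound `div` p - 1     | p <- takeWhile (\q -> q*q <= bound) primes]
crossFromP2 bound = sum [bound `div` p - (p-1) | p <- takeWhile (\q -> q*q <= bound) primes]
-- the difference per prime is p - 2, so the total difference is essentially the
-- sum of the primes up to sqrt bound, i.e. about bound/log bound
--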

Now if you take a stream of numbers from which you remove composites, having a
priority queue of multiples of primes, things are a little different.
If you start crossing out at 2*p, when you are looking at n, you have more
multiples in your PQ than if you start crossing out at p^2 (about pi(n/2) vs.
pi(sqrt n)), so updating the PQ will be more expensive. But updating the PQ is
O(log size), I believe, and log pi(n) is O(log pi(sqrt n)), so I think it shouldn't
change the complexity here either. I think this would have complexity
O(bound*log bound*log (log bound)).

> I was under impression that the first shows O(n^2) approx., and the second
> one O(n^1.5) (for n primes produced).

In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is
right, it is a different algorithm. In the above, we match each prime only with its
multiples (plus the next composite in the PQ version). In Turner's algorithm, we
match each prime with all smaller primes (and each composite with all primes <= its
smallest prime factor, but that's less work than the primes). That gives indeed a
complexity of O(pi(bound)^2).
In the postponed filters, we match each prime p with all primes <= sqrt p (again,
the composites contribute less). I think that gives a complexity of
O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced.
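
A crude way to watch those two growth rates is to count, for each prime up to a
bound, roughly one division per filter it passes through (my sketch, reusing the
primes list from the previous snippet; turnerWork/postponedWork are made-up names):

--
turnerWork, postponedWork :: Integer -> Int
turnerWork    bound = sum [ length (takeWhile (< q) primes)              -- all smaller primes
                          | q <- takeWhile (<= bound) primes ]
postponedWork bound = sum [ length (takeWhile (\p -> p*p <= q) primes)   -- primes up to sqrt q
                          | q <- takeWhile (<= bound) primes ]
--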

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: FASTER primes

2009-12-30 Thread Daniel Fischer
On Tuesday 29 December 2009 20:16:59, Daniel Fischer wrote:
> > especially the claim that going by primes squares
> > is "a pleasing but minor optimization",
>
> Which it is not. It is a major optimisation. It reduces the algorithmic
> complexity *and* reduces the constant factors significantly.

D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper.
It doesn't change the complexity, and the constant factors are reduced far less
than I thought. The huge performance differences I had must have been due to cache
misses (unless you use a segmented sieve, starting at the square reduces the number
of cache misses significantly).
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: FASTER primes (was: Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve))

2009-12-29 Thread Daniel Fischer
On Wednesday 30 December 2009 01:23:32, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > Gee, seems my mail server reads your posts very thoroughly today :)
>
> I hope it's not a bad thing. :)
>

It means that twenty minutes after I replied to the previous one, I got your
hours-old post, which mentions points I could have included in the aforementioned
reply.

Not really bad, but somewhat inconvenient.

> > Am Dienstag 29 Dezember 2009 14:58:10 schrieb Will Ness:
> > > If I realistically needed primes generated in a real life setting, I'd
> > > probably had to use some C for that.
> >
> > If you need the utmost speed, then probably yes. If you can do with a
> > little less, my STUArray bitsieves take about 35% longer than the
> > equivalent C code and are roughly eight times faster than ONeillPrimes.
> > I can usually live well with that.
>
>
> Wow! That's fast! (that's the code from haskellwiki's primes page, right?)
>

No, it's my own code. Nothing elaborate, just sieving numbers 6k±1, twice as fast
as the haskellwiki code (here) and uses only 1/3 the memory. For the record:

--
{-# LANGUAGE BangPatterns #-}
module SievePrimes where

import Data.Array.Base (unsafeRead, unsafeWrite, unsafeAt)
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad (when)
import Data.Bits

-- Bit i of the array stands for the number 3*i+5 if i is even and 3*i+4 if i
-- is odd, i.e. the numbers 6k±1 from 5 upwards.
primesUpTo :: Integer -> [Integer]
primesUpTo bound
    = 2:3:[fromIntegral (3*i + (if testBit i 0 then 4 else 5))
          | i <- [0 .. bd], par `unsafeAt` i]
  where
    bd0 = 2*(bound `div` 6) - case bound `mod` 6 of { 0 -> 2; 5 -> 0; _ -> 1 }
    bd  = fromInteger bd0
    sr  = floor (sqrt (fromIntegral bound))
    sbd = 2*(sr `div` 6) - case sr `mod` 6 of { 0 -> 2; 5 -> 0; _ -> 1 }
    par :: UArray Int Bool
    par = runSTUArray $ do
        ar <- newArray (0,bd) True
        let -- cross off i, i+s1, i+s1+s2, ... alternating the two steps
            tick i s1 s2
              | bd < i    = return ()
              | otherwise = do
                  p <- unsafeRead ar i
                  when p (unsafeWrite ar i False)
                  tick (i+s1) s2 s1
            -- for every surviving index up to sqrt bound, start ticking at p^2
            sift i
              | i > sbd   = return ar
              | otherwise = do
                  p <- unsafeRead ar i
                  when p $
                      let (!start,!s1,!s2)
                            = if even i then (i*(3*i+10)+7, 2*i+3, 4*i+7)
                                        else (i*(3*i+8)+4,  4*i+5, 2*i+3)
                      in tick start s1 s2
                  sift (i+1)
        sift 0
--

The index magic is admittedly not obvious, but if you take that on faith, the rest
is pretty clear. To find the n-th prime,

--
module Main (main) where

import SievePrimes (primesUpTo)
import System.Environment (getArgs)

printNthPrime :: Int -> IO ()
printNthPrime n = print (n, primesUpTo (estim n) !! (n-1))

main :: IO ()
main = do
    args <- getArgs
    printNthPrime $ read $ head args

-- upper estimate for the n-th prime, so the sieved list reaches far enough
estim :: Int -> Integer
estim k
    | n < 13    = 37
    | n < 70    = 349
    | otherwise = round x
  where
    n = fromIntegral k :: Double
    x = n * (log n + log (log n) - 1 + 1.05 * log (log n) / log n)
--

If I don't construct the prime list but count directly on the array, it's ~6% faster.
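
As a quick sanity check in GHCi (my illustration, not part of the original message):

  ghci> primesUpTo 50
  [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47]
  ghci> length (primesUpTo 100)
  25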

> > > If OTOH we're talking about a tutorial
> > > code that is to be as efficient as possible without losing its clarity,
> > > being a reflection of essentials of the problem, then any overly
> > > complicated advanced Haskell wouldn't be my choice either.
> >
> > +1
> > Though perhaps we view mutable array code differently. In my view, it's
> > neither advanced nor complicated.
>
> convoluted, then. Not using "higher level" concepts, like map and foldr, :)
> or head.until isSingleton (pairwise merge).map wrap , that kind of thing.
> :)
>
> BTW I think a really smart compiler should just get a specification, like
> Turner's sieve, and just derive a good code from that by itself.

Go ahead and write one. I would love such a beast.

>
> Another example would be
>
>   qq n m = sum $ take n $ cycle [1..m]
>
> which should really get compiled into just a math formula, IMO. Now _that_
> I would call a good compiler.

Dream on, dream on, with hope in your heart.
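
For what it's worth, the closed form such a compiler would have to discover is easy
enough to write by hand - a sketch with a made-up name, assuming n >= 0 and m > 0:

  qqClosed :: Integer -> Integer -> Integer
  qqClosed n m = q * (m*(m+1) `div` 2) + r*(r+1) `div` 2
    where (q, r) = n `divMod` m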

> That way I really won't have to learn how to use STUArray`s you see.
>
> I've seen this question asked a lot, what should be a good programming
> language?
>
> IMO, English (plus math where needed, and maybe some sketches by hand). :)
>

Maybe you'd like 
http://en.wikipedia.org/wiki/Shakespeare_(programming_language) ?

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: FASTER primes

2009-12-29 Thread Daniel Fischer
On Wednesday 30 December 2009 01:04:34, Will Ness wrote:
>
> > While I haven't detected that with the primes code, I find that in my
> > ghci your code is approximately 2.5 times faster than ONeill or Bayer
> > when interpreted (no difference in scaling observed), while when compiled
> > with -O2, ONeill is approximately three times as fast as your code
>
> that was what I was getting at first too, before I've put into my code the
> _type_signatures_ and the "specialize" _pragmas_ as per her file. Then it
> was only 1.3x slower, when compiled (with about same asymptotics and memory
> usage).
>

Specialising your code to Int makes it half as fast as ONeill here (as an
executable).
That is largely due to the fact that your code uses much more memory here (54MB vs.
2MB for the millionth prime), though; the MUT times have a ratio of about 1.5.

Now an interesting question is: why does it use so much memory here?
Can you send me your exact code so I can see how it behaves here?

> >and twice as fast as Bayer as an executable, about twice as fast as your
> > code and slightly slower than Bayer in ghci.
>
> see, this kind of inconsistency is exactly why I was concentrating only
> on one platform in measuring the speed - the interp'/GHCi combination.

The problem with that is that one is primarily interested in speed for library
functions, which are mostly used as compiled code.

> Especially when developing and trying out several approaches, to test with
> compiler just takes too long. :) And why should it give (sometimes) wildly
> different readings when running inside GHCi or standalone ??

Good question.

>
> > And I have huge memory problems in ghci with your code.
> > That may be due to my implementation of merge and minus, though. You
> > wrote 'standard' and I coded the straightforward methods.
>
> Here's what I'm using (BTW I've put it on the primes haskellwiki page too).
> The memory reported for interpreted is about half of PQ's (IIRC), and
> compiled - the same:
>
>  minus a@(x:xs) b@(y:ys) = case compare x y of
>         LT -> x: xs `minus` b
>         GT ->    a  `minus` ys
>         EQ ->    xs `minus` ys
>  minus a b  = a
>
>  merge a@(x:xs) b@(y:ys) = case compare x y of
>         LT -> x: merge xs b
>         EQ -> x: merge xs ys
>         GT -> y: merge a  ys
>  merge a b  = if null b then a else b

More or less the same as what I wrote.
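
As a quick check that these behave as expected on infinite lists (my example, not
from the thread):

  ghci> take 10 ([1..] `minus` [2,4..])
  [1,3,5,7,9,11,13,15,17,19]
  ghci> take 10 ([1,3..] `merge` [2,4..])
  [1,2,3,4,5,6,7,8,9,10]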

> > What's wrong with mutable arrays? There are a lot of algorithms which can
> > be easily and efficiently implemented using mutable unboxed arrays while
> > a comparably efficient implementation without mutable arrays is hard. For
> > those, I consider STUArrays the natural choice. Sieving primes falls into
> > that category.
>
> It's just that the mutating code tends to be convoluted, like in the
> example I mentioned of quicksort. One has to read the C code with good
> attention to understand it.

Convoluted is (often) an exaggeration. But I agree that the specification of 'what'
is usually easier to understand than that of 'how'.

> "Normal" Haskell is much more visually apparent, like
>
>   primes = 2: 3: sieve (tail primes) [5,7..]
>    where
>     sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
>       where (h,~(_:t)) = span (< q) xs
>             q          = p*p
>

Yes.

> or
>
>   primes = 2: 3: sieve [] (tail primes) 5
>    where
>     sieve fs (p:ps) x = [i | i <- [x,x+2..q-2], a!i]
>                         ++ sieve ((2*p,q):fs') ps (q+2)
>      where
>       q     = p*p
>       mults = [ [y+s,y+2*s..q] | (s,y) <- fs]
>       fs'   = [ (s,last ms)    | ((s,_),ms) <- zip fs mults]
>       a     = accumArray (\a b -> False) True (x,q-2)
>                          [(i,()) | ms <- mults, i <- ms]
>

Umm, really?
I'd think if you see what that does, you won't have difficulties with a mutable
array sieve.



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: FASTER primes (was: Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve))

2009-12-29 Thread Daniel Fischer
Gee, seems my mail server reads your posts very thoroughly today :)

On Tuesday 29 December 2009 14:58:10, Will Ness wrote:
> Eugene Kirpichov  gmail.com> writes:
> > 2009/12/29 Will Ness  yahoo.com>:
> > > Daniel Fischer  web.de> writes:
> > >> Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness:
> > >> > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x
> > >> > times faster than Priority Queue based code from Melissa O'Neill's
> > >> > ZIP package mentioned at the haskellwiki/Prime_Numbers page, with
> > >> > about half used memory reported, in producing 10,000 to 300,000
> > >> > primes.
> > >> >
> > >> > It is faster than BayerPrimes.hs from the ZIP package too, in the
> > >> > tested range, at about 35 lines of code in total.
> > >>
> > >> That's nice. However, the important criterion is how compiled code
> > >> (-O2)
> > >
> > > fares. Do the relations continue to hold? How does it compare to a
> > > bitsieve?
> > >
> > >
> > > Haven't gotten to that part yet. :)
> > >
> > > But why is it more important? Would that not tell us more about the
> > > compiler performance than the code itself?
> >
> > If you mean "algorithmic complexity", you shouldn't care about a
> > difference of 2.5x.
>
> It's not just at one point; the asymptotics are _the_same_ across the range
> that I've tested (admittedly, somewhat narrow). I measure local behavior
> simply as logBase in base of ratio of problem sizes, of the ratio of run
> times.
>
> > If you mean "actual performance for a particular task", you should
> > measure the performance in realistic conditions. Namely, if you're
> > implementing a program that needs efficient generation of primes,
> > won't you compile it with -O2?
>
> If I realistically needed primes generated in a real life setting, I'd
> probably had to use some C for that.

If you need the utmost speed, then probably yes. If you can do with a little less,
my STUArray bitsieves take about 35% longer than the equivalent C code and are
roughly eight times faster than ONeillPrimes. I can usually live well with that.

> If OTOH we're talking about a tutorial
> code that is to be as efficient as possible without loosing it clarity,
> being a reflection of essentials of the problem, then any overly
> complicated advanced Haskell wouldn't be my choice either.

+1
Though perhaps we view mutable array code differently. In my view, it's neither 
advanced 
nor complicated.

> And seeing that
> this overly-complicated (IMO), steps-jumping PQ-based code was sold to us
> as the only "faithful" rendering of the sieve, I wanted to see for myself
> whether this really holds water.

I can understand that very well.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: FASTER primes

2009-12-29 Thread Daniel Fischer
On Tuesday 29 December 2009 14:34:03, Will Ness wrote:
> Daniel Fischer  web.de> writes:
> > Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness:
> > > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x
> > > times faster than Priority Queue based code from Melissa O'Neill's ZIP
> > > package mentioned at the haskellwiki/Prime_Numbers page, with
> > > about half used memory reported, in producing 10,000 to 300,000 primes.
> > >
> > > It is faster than BayerPrimes.hs from the ZIP package too, in the
> > > tested range, at about 35 lines of code in total.
> >
> > That's nice. However, the important criterion is how compiled code (-O2)
>
> fares. Do the relations continue to hold? How does it compare to a
> bitsieve?
>
>
> Haven't gotten to that part yet. :)
>
> But why is it more important?

I thought the uppercase FASTER in the subject meant you were really interested in
speed.
If you're only interested in asymptotics, interpreted may be appropriate.
However, it is possible that optimisation can change the perceived asymptotics of
an algorithm (determined strictness eliminating thunks, for example).
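
A standard example of that effect, unrelated to the primes code: the lazy left fold
below builds a long chain of thunks when interpreted, while with -O2 the strictness
analyser usually makes it behave like its strict cousin (foldl' from Data.List).

  import Data.List (foldl')

  sumLazy, sumStrict :: [Integer] -> Integer
  sumLazy   = foldl  (+) 0   -- linear space unless the compiler makes it strict
  sumStrict = foldl' (+) 0   -- constant space by construction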

While I haven't detected that with the primes code, I find that in my ghci your
code is approximately 2.5 times faster than ONeill or Bayer when interpreted (no
difference in scaling observed), while when compiled with -O2, ONeill is
approximately three times as fast as your code and twice as fast as Bayer as an
executable, and about twice as fast as your code and slightly slower than Bayer in
ghci.
And I have huge memory problems in ghci with your code.
That may be due to my implementation of merge and minus, though. You wrote
'standard' and I coded the straightforward methods.

> Would that not tell us more about the compiler performance than the code 
> itself?

Unless you write machine code or assembly, don't all performance tests tell us more
about the compiler/interpreter performance than the code itself?
That is, of course, with respect to algorithms with the same scaling behaviour.

>
> This code is just an endpoint (so far) in a short procession of natural
> stepwise development of the famous classic Turner's sieve,

That was

sieve (x:xs) = x:sieve (filter ((/= 0) . (`mod` x)) xs)

, was it?

> through the
> "postponed filters", through to Euler's sieve, the merging sieve (i.e.
> Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted
> to see where the simple "normal" (i.e. _beginner_-friendly) functional code
> can get, in a natural way.

Good.

>
> It's not about writing the fastest code in _advanced_ Haskell. It's about
> having clear and simple code that can be understood at a glance - i.e.
> contributes to our understanding of a problem - faithfully reflecting its
> essential elements, and because of _that_, fast. It's kind of like _not_
> using mutable arrays in a quicksort.

What's wrong with mutable arrays? There are a lot of algorithms which can be easily
and efficiently implemented using mutable unboxed arrays while a comparably
efficient implementation without mutable arrays is hard. For those, I consider
STUArrays the natural choice. Sieving primes falls into that category.

>
> Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't
> feel right to me somehow,

I fully agree.

> especially the claim that going by primes squares
> is "a pleasing but minor optimization",

Which it is not. It is a major optimisation. It reduces the algorithmic complexity
*and* reduces the constant factors significantly.

> what with the postponed filters
> (which serves as the framework for all the other variants) achieving the
> orders of magnitude speedup and cutting the Turner's O(n^2) right down to
> O(n^1.5) just by doing that squares optimization (with the final version
> hovering around 1.24..1.17 in the tested range). The Euler's sieve being a
> special case of Eratosthenes's, too, doesn't lend credence to claims that
> only the PQ version is somehow uniquely authentic and "faithful" to it.

I never found that claim convincing either.

>
> Turner's sieve should have been always looked at as just a specification,
> not a code, anyway, and actually running it is ridiculous. Postponed
> filters version, is the one to be used as a reference point of the basic
> _code_, precisely because it _does_ use the primes squares optimization,
> which _is_ essential to any basic sieve.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: FASTER primes (was: Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve))

2009-12-29 Thread Eugene Kirpichov
2009/12/29 Will Ness :
> Daniel Fischer  web.de> writes:
>
>>
>>
>> Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness:
>> > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times
>> > faster than Priority Queue based code from Melissa O'Neill's ZIP package
>> > mentioned at the haskellwiki/Prime_Numbers page, with
>> > about half used memory reported, in producing 10,000 to 300,000 primes.
>> >
>> > It is faster than BayerPrimes.hs from the ZIP package too, in the tested
>> > range, at about 35 lines of code in total.
>>
>> That's nice. However, the important criterion is how compiled code (-O2)
> fares. Do the relations continue to hold? How does it compare to a bitsieve?
>
>
> Haven't gotten to that part yet. :)
>
> But why is it more important? Would that not tell us more about the compiler
> performance than the code itself?
>

If you mean "algorithmic complexity", you shouldn't care about a
difference of 2.5x.
If you mean "actual performance for a particular task", you should
measure the performance in realistic conditions. Namely, if you're
implementing a program that needs efficient generation of primes,
won't you compile it with -O2?

> This code is just an endpoint (so far) in a short procession of natural 
> stepwise
> development of the famous classic Turner's sieve, through the "postponed
> filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) 
> and
> on to the tree-fold merging, with wheel. I just wanted to see where the simple
> "normal" (i.e. _beginner_-friendly) functional code can get, in a natural way.
>
> It's not about writing the fastest code in _advanced_ Haskell. It's about 
> having
> clear and simple code that can be understood at a glance - i.e. contributes to
> our understanding of a problem - faithfully reflecting its essential elements,
> and because of _that_, fast. It's kind of like _not_ using mutable arrays in a
> quicksort.
>
> Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel
> right to me somehow, especially the claim that going by primes squares is "a
> pleasing but minor optimization", what with the postponed filters (which 
> serves
> as the framework for all the other variants) achieving the orders of magnitude
> speedup and cutting the Turner's O(n^2) right down to O(n^1.5) just by doing
> that squares optimization (with the final version hovering around 1.24..1.17 
> in
> the tested range). The Euler's sieve being a special case of Eratosthenes's,
> too, doesn't let credence to claims that only the PQ version is somehow 
> uniquely
> authentic and "faithful" to it.
>
> Turner's sieve should have been always looked at as just a specification, not 
> a
> code, anyway, and actually running it is ridiculous. Postponed filters 
> version,
> is the one to be used as a reference point of the basic _code_, precisely
> because it _does_ use the primes squares optimization, which _is_ essential to
> any basic sieve.
>
>
>> ___
>> Haskell-Cafe mailing list
>> Haskell-Cafe  haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>
>
>
> ___
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>



-- 
Eugene Kirpichov
Web IR developer, market.yandex.ru
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe