On 15 May 2016 at 01:35, David Feuer <david.fe...@gmail.com> wrote:
> Well, a few weeks ago Bertram Felgenhauer came up with a version of IO
> that acts more like lazy ST. That could be just the thing. He placed it
> in the public domain/CC0 and told me I could put it up on Hackage if I
> want. I'll try to do that this week, but no promises. I could forward
> his email if you just want to try it out.

That's exactly what I was thinking about. Can you please forward it to
me? I will try it out.

Thanks,
Harendra
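[Bertram's code is not reproduced in this thread, but the usual trick
behind a lazy-ST-like IO is unsafeInterleaveIO, which defers an action
until its result is demanded. A minimal sketch of the idea follows; the
name lazyIOList is illustrative, not Bertram's actual API:

    import System.IO.Unsafe (unsafeInterleaveIO)

    -- Rebuild a list in IO, deferring each recursive step with
    -- unsafeInterleaveIO so the tail is only computed when demanded.
    -- This recovers the streaming behaviour of the pure version; the
    -- usual lazy-IO caveats about effect ordering apply.
    lazyIOList :: [a] -> IO [a]
    lazyIOList []       = return []
    lazyIOList (x : xs) = do
        rest <- unsafeInterleaveIO (lazyIOList xs)
        return (x : rest)

    main :: IO ()
    main = print . take 10 =<< lazyIOList [1 ..]

Note that main terminates even though the input is infinite, because
only the first ten cells are ever forced.]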
> On May 14, 2016 2:31 PM, "Harendra Kumar" <harendra.ku...@gmail.com>
> wrote:
>
>> The difference seems to be entirely due to memory pressure. At list
>> size 1000 the pure version and the IO version perform equally well,
>> but as the size of the list increases the pure version scales
>> linearly while the IO version degrades super-linearly. Here are the
>> execution times per list element, in ns, as the list size increases:
>>
>>     Size of list    Pure     IO
>>     1000             8.7     8.3
>>     10000            8.7    18
>>     100000           8.8    63
>>     1000000          9.3   786
>>
>> This seems to be due to increased GC activity in the IO case. The GC
>> stats for list size 1 million are:
>>
>>     IO case:    %GC time  66.1%  (61.1% elapsed)
>>     Pure case:  %GC time   2.6%   (3.3% elapsed)
>>
>> Not sure if there is a way to write this code in the IO monad that
>> reduces this overhead.
>>
>> -harendra
>>
>> On 14 May 2016 at 17:10, Harendra Kumar <harendra.ku...@gmail.com>
>> wrote:
>> >
>> > You are right about the way IO code is generated because of the
>> > ordering semantics. The IO version builds the list entirely, in a
>> > recursive fashion, before returning it, while the pure code builds
>> > it lazily. I wrote very simple versions to keep things simple:
>> >
>> > Pure version:
>> >
>> >     f [] = []
>> >     f (x : xs) = x : f xs
>> >
>> > time 11.08 ms (10.86 ms .. 11.34 ms)
>> > (measured for a million elements in the list)
>> >
>> >     104,041,264 bytes allocated in the heap
>> >          16,728 bytes copied during GC
>> >          35,992 bytes maximum residency (1 sample(s))
>> >          21,352 bytes maximum slop
>> >               1 MB total memory in use (0 MB lost due to fragmentation)
>> >
>> > IO version:
>> >
>> >     f [] = return []
>> >     f (x : xs) = do
>> >         rest <- f xs
>> >         return $ x : rest
>> >
>> > time 79.66 ms (75.49 ms .. 82.55 ms)
>> >
>> >     208,654,560 bytes allocated in the heap
>> >     121,045,336 bytes copied during GC
>> >      27,679,344 bytes maximum residency (8 sample(s))
>> >         383,376 bytes maximum slop
>> >              66 MB total memory in use (0 MB lost due to fragmentation)
>> >
>> > Even though this is a small program that does little real work, and
>> > therefore magnifies even small differences, I am not sure the
>> > roughly 7.5x slowdown can be explained entirely by recursive vs.
>> > lazy building of the list. I am wondering if there is anything here
>> > worth investigating and improving further.
>> >
>> > -harendra
>> >
>> > On 12 May 2016 at 05:41, Dan Doel <dan.d...@gmail.com> wrote:
>> > >
>> > > On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar
>> > > <harendra.ku...@gmail.com> wrote:
>> > > > Thanks Dan, that helped. I did notice and suspect the update
>> > > > frame and the unboxed tuple, but given my limited knowledge
>> > > > about ghc/core/stg/cmm I was not sure what was going on. In
>> > > > fact, I thought that the intermediate tuple should get optimized
>> > > > out, since it is required only because of the real-world token,
>> > > > which is not real. But it might be difficult to see that at this
>> > > > level?
>> > >
>> > > The token exists as far as the STG level is concerned, I think,
>> > > because it is the only thing ensuring that things happen in the
>> > > right order. And the closure must be built to have properly formed
>> > > STG. So optimizing away the closure building would have to happen
>> > > at a level lower than STG, and I guess there is no such
>> > > optimization. I'm not sure how easy it would be to do.
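[To make Dan's point about the token concrete, here is a hand-desugared
sketch of the IO version, written against IO's actual representation in
GHC (newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #)));
variable names are illustrative:

    {-# LANGUAGE MagicHash, UnboxedTuples #-}

    import GHC.IO (IO (..))

    -- The case on the unboxed tuple forces the recursive call to run
    -- to completion, threading the state token s0/s1 through it,
    -- before (x : rest) can be returned, so the entire output list is
    -- built before the caller sees any of it.
    f :: [a] -> IO [a]
    f []       = IO (\s0 -> (# s0, [] #))
    f (x : xs) = IO (\s0 ->
        case f xs of
          IO m -> case m s0 of
            (# s1, rest #) -> (# s1, x : rest #))

    main :: IO ()
    main = f [1, 2, 3 :: Int] >>= print

The pure version has no token to thread, so (:) can be allocated with
an unevaluated thunk for its tail.]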
>> > >
>> > > > What you are saying may be true for the current implementation,
>> > > > but in theory can we eliminate the intermediate closure?
>> > >
>> > > I don't know. I'm a bit skeptical that building this one closure
>> > > is the only thing causing your code to be a factor of 5 slower.
>> > > For instance, another difference in the core is that the
>> > > recursive call corresponding to the result s2 happens before the
>> > > additional closure is allocated. That is the case statement that
>> > > unpacks the unboxed tuple. And the whole loop happens this way,
>> > > so it is actually building a structure corresponding to the
>> > > entire output list in memory rather eagerly.
>> > >
>> > > By contrast, your pure function is able to act in a streaming
>> > > fashion, if consumed properly, where only enough of the result is
>> > > built to keep driving the rest of the program. It probably runs
>> > > in constant space, while your IO-based loop has a footprint
>> > > linear in the size of the input list (in addition to having
>> > > slightly more overhead per character because of the one extra
>> > > thunk), because it is a more eager program.
>> > >
>> > > And an asymptotically larger memory footprint is more likely to
>> > > explain a 5x slowdown than allocating one extra thunk for each
>> > > list concatenation, I think. (Of course, it could also be some
>> > > other factor as well.)
>> > >
>> > > You should probably run with +RTS -s (compiling with -rtsopts)
>> > > and see whether the IO version has a much larger maximum
>> > > residency.
>> > >
>> > > Anyhow, this is fundamental to writing the algorithm using IO.
>> > > It's an algorithm that is a sequence of steps happening in order,
>> > > the number of steps is proportional to the input list, and part
>> > > of those steps is putting stuff in memory for the results.
>> > >
>> > > > So why is that? Why can't we generate (++) inline similarly to
>> > > > (:)? How do we make this decision? Is there a theoretical
>> > > > reason for this, or is it just an implementation artifact?
>> > >
>> > > The difference between these two is that (++) is a function call,
>> > > and (:) is a constructor. Constructors are a special case and
>> > > don't need all the provisions that functions in general do. The
>> > > best way to learn what the differences are is probably to read
>> > > the paper about the STG machine.
>> > >
>> > > Hope this is a more fruitful lead, but it may be that there's not
>> > > a lot that can be done, without doing some baroque things to your
>> > > algorithm, because of the necessity of its using IO.
>> > >
>> > > -- Dan
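[For comparison, lazy ST (Control.Monad.ST.Lazy) already exhibits the
streaming behaviour that David's lazy-ST-like IO aims to recover: the
same monadic loop produces its result on demand. A small sketch:

    import Control.Monad.ST.Lazy (ST, runST)

    -- Identical code to the IO version, but lazy ST's bind defers the
    -- state threading, so the result list is produced incrementally
    -- as it is consumed.
    f :: [a] -> ST s [a]
    f []       = return []
    f (x : xs) = do
        rest <- f xs
        return (x : rest)

    main :: IO ()
    main = print (take 5 (runST (f [1 :: Int ..])))  -- prints [1,2,3,4,5]

Consuming a prefix of an infinite input terminates here, which the
strict IO version cannot do.]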
_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users