On 15 May 2016 at 01:35, David Feuer <david.fe...@gmail.com> wrote:
> Well, a few weeks ago Bertram Felgenhauer came up with a version of IO
> that acts more like lazy ST. That could be just the thing. He placed it
> in the public domain/CC0 and told me I could put it up on Hackage if I
> want. I'll try to do that this week, but no promises. I could forward
> his email if you just want to try it out.

That's exactly what I was thinking about. Can you please forward it to
me? I will try it out.

Thanks,
Harendra
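[Bertram's code is not reproduced in this thread, but the usual trick
behind a lazy-ST-like IO is unsafeInterleaveIO, which defers an action
until its result is demanded. A minimal sketch of the idea follows; the
name lazyIOList is illustrative, not Bertram's actual API:

    import System.IO.Unsafe (unsafeInterleaveIO)

    -- Rebuild a list in IO, deferring each recursive step with
    -- unsafeInterleaveIO so the tail is only computed when demanded.
    -- This recovers the streaming behaviour of the pure version; the
    -- usual lazy-IO caveats about effect ordering apply.
    lazyIOList :: [a] -> IO [a]
    lazyIOList []       = return []
    lazyIOList (x : xs) = do
        rest <- unsafeInterleaveIO (lazyIOList xs)
        return (x : rest)

    main :: IO ()
    main = print . take 10 =<< lazyIOList [1 ..]

Note that main terminates even though the input is infinite, because
only the first ten cells are ever forced.]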
> On May 14, 2016 2:31 PM, "Harendra Kumar" <harendra.ku...@gmail.com>
> wrote:
>
>> The difference seems to be entirely due to memory pressure. At list
>> size 1000 the pure version and the IO version perform equally well,
>> but as the size of the list increases the pure version scales
>> linearly while the IO version degrades super-linearly. Here are the
>> execution times per list element, in ns, as the list size increases:
>>
>>     Size of list    Pure     IO
>>     1000             8.7     8.3
>>     10000            8.7    18
>>     100000           8.8    63
>>     1000000          9.3   786
>>
>> This seems to be due to increased GC activity in the IO case. The GC
>> stats for list size 1 million are:
>>
>>     IO case:    %GC time  66.1%  (61.1% elapsed)
>>     Pure case:  %GC time   2.6%   (3.3% elapsed)
>>
>> Not sure if there is a way to write this code in the IO monad that
>> reduces this overhead.
>>
>> -harendra
>>
>> On 14 May 2016 at 17:10, Harendra Kumar <harendra.ku...@gmail.com>
>> wrote:
>> >
>> > You are right about the way IO code is generated because of the
>> > ordering semantics. The IO version builds the list entirely, in a
>> > recursive fashion, before returning it, while the pure code builds
>> > it lazily. I wrote very simple versions to keep things simple:
>> >
>> > Pure version:
>> >
>> >     f [] = []
>> >     f (x : xs) = x : f xs
>> >
>> > time 11.08 ms (10.86 ms .. 11.34 ms)
>> > (measured for a million elements in the list)
>> >
>> >     104,041,264 bytes allocated in the heap
>> >          16,728 bytes copied during GC
>> >          35,992 bytes maximum residency (1 sample(s))
>> >          21,352 bytes maximum slop
>> >               1 MB total memory in use (0 MB lost due to fragmentation)
>> >
>> > IO version:
>> >
>> >     f [] = return []
>> >     f (x : xs) = do
>> >         rest <- f xs
>> >         return $ x : rest
>> >
>> > time 79.66 ms (75.49 ms .. 82.55 ms)
>> >
>> >     208,654,560 bytes allocated in the heap
>> >     121,045,336 bytes copied during GC
>> >      27,679,344 bytes maximum residency (8 sample(s))
>> >         383,376 bytes maximum slop
>> >              66 MB total memory in use (0 MB lost due to fragmentation)
>> >
>> > Even though this is a small program that does little real work, and
>> > therefore magnifies even small differences, I am not sure the
>> > roughly 7.5x slowdown can be explained entirely by recursive vs.
>> > lazy building of the list. I am wondering if there is anything here
>> > worth investigating and improving further.
>> >
>> > -harendra
>> >
>> > On 12 May 2016 at 05:41, Dan Doel <dan.d...@gmail.com> wrote:
>> > >
>> > > On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar
>> > > <harendra.ku...@gmail.com> wrote:
>> > > > Thanks Dan, that helped. I did notice and suspect the update
>> > > > frame and the unboxed tuple, but given my limited knowledge
>> > > > about ghc/core/stg/cmm I was not sure what was going on. In
>> > > > fact, I thought that the intermediate tuple should get optimized
>> > > > out, since it is required only because of the real-world token,
>> > > > which is not real. But it might be difficult to see that at this
>> > > > level?
>> > >
>> > > The token exists as far as the STG level is concerned, I think,
>> > > because it is the only thing ensuring that things happen in the
>> > > right order. And the closure must be built to have properly formed
>> > > STG. So optimizing away the closure building would have to happen
>> > > at a level lower than STG, and I guess there is no such
>> > > optimization. I'm not sure how easy it would be to do.
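[To make Dan's point about the token concrete, here is a hand-desugared
sketch of the IO version, written against IO's actual representation in
GHC (newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #)));
variable names are illustrative:

    {-# LANGUAGE MagicHash, UnboxedTuples #-}

    import GHC.IO (IO (..))

    -- The case on the unboxed tuple forces the recursive call to run
    -- to completion, threading the state token s0/s1 through it,
    -- before (x : rest) can be returned, so the entire output list is
    -- built before the caller sees any of it.
    f :: [a] -> IO [a]
    f []       = IO (\s0 -> (# s0, [] #))
    f (x : xs) = IO (\s0 ->
        case f xs of
          IO m -> case m s0 of
            (# s1, rest #) -> (# s1, x : rest #))

    main :: IO ()
    main = f [1, 2, 3 :: Int] >>= print

The pure version has no token to thread, so (:) can be allocated with
an unevaluated thunk for its tail.]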
>> > >
>> > > > What you are saying may be true for the current implementation,
>> > > > but in theory can we eliminate the intermediate closure?
>> > >
>> > > I don't know. I'm a bit skeptical that building this one closure
>> > > is the only thing causing your code to be a factor of 5 slower.
>> > > For instance, another difference in the core is that the
>> > > recursive call corresponding to the result s2 happens before the
>> > > additional closure is allocated. That is the case statement that
>> > > unpacks the unboxed tuple. And the whole loop happens this way,
>> > > so it is actually building a structure corresponding to the
>> > > entire output list in memory rather eagerly.
>> > >
>> > > By contrast, your pure function is able to act in a streaming
>> > > fashion, if consumed properly, where only enough of the result is
>> > > built to keep driving the rest of the program. It probably runs
>> > > in constant space, while your IO-based loop has a footprint
>> > > linear in the size of the input list (in addition to having
>> > > slightly more overhead per character because of the one extra
>> > > thunk), because it is a more eager program.
>> > >
>> > > And an asymptotically larger memory footprint is more likely to
>> > > explain a 5x slowdown than allocating one extra thunk for each
>> > > list concatenation, I think. (Of course, it could also be some
>> > > other factor as well.)
>> > >
>> > > You should probably run with +RTS -s (compiling with -rtsopts)
>> > > and see whether the IO version has a much larger maximum
>> > > residency.
>> > >
>> > > Anyhow, this is fundamental to writing the algorithm using IO.
>> > > It's an algorithm that is a sequence of steps happening in order,
>> > > the number of steps is proportional to the input list, and part
>> > > of those steps is putting stuff in memory for the results.
>> > >
>> > > > So why is that? Why can't we generate (++) inline similarly to
>> > > > (:)? How do we make this decision? Is there a theoretical
>> > > > reason for this, or is it just an implementation artifact?
>> > >
>> > > The difference between these two is that (++) is a function call,
>> > > and (:) is a constructor. Constructors are a special case and
>> > > don't need all the provisions that functions in general do. The
>> > > best way to learn what the differences are is probably to read
>> > > the paper about the STG machine.
>> > >
>> > > Hope this is a more fruitful lead, but it may be that there's not
>> > > a lot that can be done, without doing some baroque things to your
>> > > algorithm, because of the necessity of its using IO.
>> > >
>> > > -- Dan
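[For comparison, lazy ST (Control.Monad.ST.Lazy) already exhibits the
streaming behaviour that David's lazy-ST-like IO aims to recover: the
same monadic loop produces its result on demand. A small sketch:

    import Control.Monad.ST.Lazy (ST, runST)

    -- Identical code to the IO version, but lazy ST's bind defers the
    -- state threading, so the result list is produced incrementally
    -- as it is consumed.
    f :: [a] -> ST s [a]
    f []       = return []
    f (x : xs) = do
        rest <- f xs
        return (x : rest)

    main :: IO ()
    main = print (take 5 (runST (f [1 :: Int ..])))  -- prints [1,2,3,4,5]

Consuming a prefix of an infinite input terminates here, which the
strict IO version cannot do.]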
_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users