Lev Walkin wrote:
Simon Marlow wrote:
Lev Walkin wrote:

I wondered why would a contemporary GHC 6.8.3 exhibit such a leak?
After all, the technique was known in 2000 (and afir by Wadler in '87)
and one would assume Joe English's reference to "most other Haskell
systems" ought to mean GHC.

Thanks for this nice example - Don Stewart pointed me to it, and Simon PJ and I just spent some time this morning diagnosing it.

Incedentally, with GHC 6.8 you can just run the program with "+RTS -hT" to get a basic space profile, there's no need to compile it for profiling - this is tremendously useful for quick profiling jobs. And in this case we see the the heap is filling up with (:) and Tree constructors, no thunks.

Here's the short story: GHC does have the space leak optimisation you refer to, and it is working correctly, but it doesn't cover all the cases you might want it to cover. In particular, optimisations sometimes interact badly with the space leak avoidance, and that's what is happening here. We've known about the problem for some time, but this is the first time I've seen a nice small example that demonstrates it.

    -- Lazily build a tree out of a sequence of tree-building events
    build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
    build (Start str : es) =
            let (es', subnodes) = build es
                (spill, siblings) = build es'
            in (spill, (Tree str subnodes : siblings))
    build (Leaf str : es) =
            let (spill, siblings) = build es
            in (spill, Tree str [] : siblings)
    build (Stop : es) = (es, [])
    build [] = ([], [])

[skip]

We don't know of a good way to fix this problem. I'm going to record this example in a ticket for future reference, though.

Simon,

is there a way, perhaps, to rewrite this expression to avoid leaks?
An ad-hoc will do, perhaps split in two modules to avoid intramodular
optimizations?

Tried to avoid this misoptimization by using explicit fst, and
it worked on my synthesized input (probably benefiting of CSE):

build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
build (Start str : es) =
        let (_, subnodes) = build es
            (spill, siblings) = build . fst . build $ es
        in (spill, (Tree str subnodes : siblings))
build (Leaf str : es) =
        let (spill, siblings) = build es
        in (spill, Tree str [] : siblings)
build (Stop : es) = (es, [])
build [] = ([], [])

However, while this solution works on a synthesized input (cycle [...]),
it still has memory leak when taken into HXML environment which
operates on files (why?).

Only when I also added Ketil Malde's `par` based hack I finally
was able to parse the big XML file without a space leak. Here's
the diff to HXML 0.2:

======================================================================
--- TreeBuild.hs.old    2008-09-19 17:01:30.000000000 -0700
+++ TreeBuild.hs        2008-09-19 17:04:15.000000000 -0700
@@ -20,6 +20,7 @@
 import XMLParse
 import XML
 import Tree
+import Control.Parallel

 --
 -- TODO: add basic error-checks: matching end-tags, ensure input exhausted
@@ -43,8 +44,9 @@
        addTree t es            = let (s,es') = build es in pair (cons t s) es'
        build []                = pair nil []
        build (e:es) = case e of
-           StartEvent gi atts  -> let (c,es') = build es
-                                  in addNode (ELNode gi atts) c es'
+           StartEvent gi atts  -> let (c, es') = build es
+                                      sbl = build . snd . build $ es
+                                  in sbl `par` (cons (tree (ELNode gi atts) c) 
(fst sbl), snd sbl)
            EndEvent _          -> pair nil es
            EmptyEvent gi atts  -> addLeaf (ELNode gi atts) es
            TextEvent s         -> addLeaf (TXNode s) es
=======================================================================

With that, a 45 mb XML is parsed in constant space in

G4 1.5GHz: 1 minute 48 seconds, taking 16 mb RAM
Pentium D 2x3.0GHz: 12 seconds, taking 9 mb RAM

Compared to 0.2s `wc -l`.

If you
  * remove `par` from there or
  * replace (build . snd . build $ es) with just (es') or
  * forget to specify -threaded (-smp) during ghc compilation
then the space leak will exhibit itself again.

However, removing -threaded will still make this code run without leak
on synthesized input (StartEvent "" [] : cycle [TextEvent ""]).

I believe there's a way to get rid of `par`, perhaps by wrapping
this tree building thing into a optimization-unfriendly monad?
But I don't know how to approach this. Any help?

--
Lev Walkin
[EMAIL PROTECTED]
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to