Last week, Dmitry Astapov posted a message to haskell-cafe describing a HaXml program that was running out of memory on large inputs. I've done some investigation on this; here's what I've discovered so far:
+ The HaXml parser is overly eager; it doesn't produce a value until the entire input has been read. The heap profile for Dmitry's program shows heap usage climbing steadily up to a large peak as the document is being parsed, then a sharp drop as output begins, then climbs again as the output is produced. I suspect that this behaviour is typical of most HaXml programs. I replaced the parser with HXML (recently announced here); this flattens out the first peak, but the second one persists. (HXML uses an incremental parser -- instead of parsing a Document, it parses individual Events (start-tag, character data, end-tag, et cetera) and assembles them into a tree with a separate function.) + nhc98's heap profiler is fantastic. + Under nhc98 (v1.10), 'putStr' doesn't seem to like overly-long strings. This leads to trouble since the HaXml main driver uses a single call to 'hPutStrLn' to write the entire output document. Replacing this with 'mapM_ putChar' (or using GHC 5.02), fixes this problem, but the program still leaks. + The nhc98 heap profiler is amazingly useful. + HaXml uses the Hughes & Peyton Jones Pretty printer to format the output. This appears to have a slow leak. (Later tests show that the peak consists of "void" heap cells from the Pretty module, mostly suspended calls to Prelude.Int- and Prelude.Int+.) Figuring that it's overkill anyway I replaced it with a simpler serializer. + After this change there's *still* a space fault, which with the help of nhc's profiler (which, BTW, is great), I tracked down to the HaXml 'cat' combinator: cat :: [a -> [b]] -> (a -> [b]) cat fs = \e -> concat [f e | f <- fs] Earlier I had reported that rewriting Dmitry's test case to use the HXML internal representation directly instead of converting to HaXml's representation fixed the space leak. As it turns out, I didn't implement the 'cat' combinator, but instead used a binary concatenation operator: (+++) :: (a -> [b]) -> (a -> [b]) -> (a -> [b]) f +++ g = \x -> f x ++ g x and my implementation of mkElem had the signature mkElem :: String -> CFilter -> CFilter instead of HaXml's mkElem :: String -> [CFilter] -> CFilter. which uses 'cat' to process the second argument. Backporting this to HaXml: mkElem' name cf = \x -> [CElem (Elem h [] (cf x))] and modifying Dmitry's test case to use mkElem' and +++ instead of mkElem and cat finally fixes the problem. I'm still not sure *why* this does the trick -- hopefully somebody smarter can explain that -- but the modified program runs in bounded space, no matter how large the input file is. (It even works in Hugs, which I found surprising, since the HXML tree builder has a known problem when run with Hugs' garbage collector.) --Joe English [EMAIL PROTECTED] _______________________________________________ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell