Hi Jason,
Jason Dagit wrote:
> After almost two weeks of poking at darcs doing various benchmarks and
> profiles I've realized that optimizing Haskell programs is no easy
> task. I've been following the advice of numerous people from the
> haskell irc channel and learned a lot about darcs in the process. I've
> also been using this nifty library that Ian created for this purpose to
> get a measure of the non-mmap memory usage:
> http://urchin.earth.li/darcs/ian/memory
>
> Potentially useful information about darcs:
> 1) It uses a slightly modified version of FastPackedString.
> 2) It can read files with or without mmap (a compile-time option).
>
> =Experiments and Findings=
>
> I have a summary of some of my experimentation with darcs here:
> http://codersbase.com/index.php/Darcs_performance
You can get a quick picture of heap usage with +RTS -Sstderr, by the
way. To find out what's actually in that heap, you'll need heap
profiling (as you know).
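As a point of reference, `-Sstderr` can be tried on any small program; this toy is hypothetical and unrelated to darcs, and note that recent GHCs also need `-rtsopts` at compile time before `+RTS` flags are accepted:

```haskell
-- Build: ghc --make GCStats.hs -o gcstats    (add -rtsopts on newer GHCs)
-- Run:   ./gcstats +RTS -Sstderr
-- -Sstderr prints one line per GC (bytes allocated, bytes copied,
-- live bytes), so you can watch whether GHC's own heap is growing.
main :: IO ()
main = print (length (filter even [1 .. 1000000 :: Int]))
```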
> Basically what I have found is that reading the original file does
> not cause a spike in memory usage, nor does writing the patch. This
> would seem to imply that the memory spikes during application of the
> patch. Modifying darcs to read the patch file and print just the
> first line of the patch gives some interesting results. The memory
> usage according to Ian's memory tool stays very low, at about 150kb max,
> but requesting the first line of the patch appears to make darcs read
> the entire patch! Darcs will literally grind away for, say, 30 minutes
> just to print the first line.
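That behaviour is consistent with a strict step hidden somewhere in the read path. A minimal sketch (not darcs code) of the difference: with a genuinely lazy read, taking the first line touches only the first chunk -- it even works on infinite input -- whereas any hidden full traversal (a length, a strict parse) forces the entire file first.

```haskell
-- Hypothetical illustration, not darcs code.
firstLine :: String -> String
firstLine = takeWhile (/= '\n')

main :: IO ()
main = do
  -- Lazy: returns instantly even though the "patch" is infinite.
  putStrLn (firstLine ("hunk ./foo 1\n" ++ cycle "-old\n+new\n"))
  -- A strict pass such as   length s `seq` firstLine s   would force
  -- the whole input -- on this infinite string it would never return.
```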
> On a side note, I've tried turning off mmap and running some of the above
> experiments. Ian's tool reports the same memory usage, and top still
> reports large amounts of memory used. Does ghc use mmap to allocate
> memory instead of malloc? Even if it does, this shouldn't be a problem
> for Ian's tool as long as it maps it anonymously.
Yes, GHC's heap is mmap()'d anonymously. You really need to find out
whether the space leak is mmap()'d by GHC's runtime, or by darcs itself
- +RTS -Sstderr or profiling will tell you about GHC's memory usage.
> =Questions=
>
> So far I've been tracking this performance problem by reading the output
> of ghc --show-iface and -ddump-simpl for strictness information, using
> the ghc profiler (although that makes already-bad performance much
> worse), Ian's memory tool, and a lot of experiments and guesswork with
> program modifications. Is there a better way?
I'd start by using heap profiling to track down what the space leak
consists of, and hopefully to give you enough information to diagnose
it. Let's see some heap profiles!
Presumably the space leak is just as visible with smaller patches, so
you don't need the full 300M patch to investigate it.
I don't usually resort to -ddump-simpl until I'm optimising the inner
loop; use profiling to find out where the inner loops actually *are* first.
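For a self-contained way to practice that workflow, here is a classic space leak (unrelated to darcs) with the profiling incantation; `-auto-all` is the flag of GHCs of this era (newer ones spell it `-fprof-auto`):

```haskell
import Data.List (foldl')

-- Classic space leak: foldl builds a chain of a million (+) thunks,
-- which shows up as a growing pyramid in the heap profile; the strict
-- foldl' stays flat.
--   ghc -prof -auto-all --make Leak.hs
--   ./Leak +RTS -hc
--   hp2ps -c Leak.hp     # produces Leak.ps
main :: IO ()
main = do
  print (foldl  (+) 0 [1 .. 1000000 :: Integer])   -- leaks
  print (foldl' (+) 0 [1 .. 1000000 :: Integer])   -- constant space
```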
> Are there tools or techniques that can help me understand why the memory
> consumption peaks when applying a patch? Is it foolish to think that
> lazy evaluation is the right approach?
Since you asked: I've never been that keen on mixing laziness and I/O,
and your experiences have strengthened that conviction. If you want strict
control over resource usage, laziness is always going to be problematic.
Sure, it's great if you can get it right: the code is shorter and runs
in small constant space. But can you guarantee that it'll still have
the same memory behaviour with the next version of the compiler? With a
different compiler?
If you want guarantees about resource usage, which you clearly do, then
IMHO you should just program the I/O explicitly and avoid laziness.
It'll be a pain in the short term, but a win in the long term.
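A sketch of what that explicit style might look like, using today's Data.ByteString (a descendant of FastPackedString); `foldChunks` is a hypothetical helper, not a darcs function:

```haskell
import qualified Data.ByteString as B
import System.IO

-- Strict, chunked fold over a file: memory is bounded by the 64K
-- chunk size no matter how large the file is, and there is no
-- laziness to reason about.
foldChunks :: (a -> B.ByteString -> a) -> a -> FilePath -> IO a
foldChunks f z path = withFile path ReadMode (go z)
  where
    go acc h = do
      chunk <- B.hGet h 65536          -- read at most 64K, strictly
      if B.null chunk
        then return acc
        else let acc' = f acc chunk
             in acc' `seq` go acc' h   -- keep the accumulator strict too

-- Example: count bytes in constant space.
countBytes :: FilePath -> IO Int
countBytes = foldChunks (\n c -> n + B.length c) 0
```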
> I'm looking for advice or help in optimizing darcs in this case. I
> guess this could be viewed as a challenge for people who felt the
> micro-benchmarks of the shootout were unfair to Haskell. Can we
> demonstrate that Haskell provides good performance in the real world
> when working with large files? Ideally, darcs could easily work with a
> patch that is 10GB in size using only a few megs of RAM if need be,
> doing so in about the time it takes to read the file once or twice and
> gzip it.
I'd love to help you look into it, but I don't really have the time.
I'm happy to help out with advice where possible, though.
Cheers,
Simon
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe