There now follows a small grab-bag of miscellaneous but related thoughts. (Which probably means that this post will spawn a 2000-message thread discussing one tiny side-issue, rather than the main thrust of the message...)


First of all, benchmarking. We have The Great Language Shootout, which tests how fast solutions implemented in various programming languages can be, given unbounded amounts of time, energy, expertise, and compiler modifications. Which is interesting, from a certain point of view.

In another sense, it's less interesting. For example, a C program can be as fast as any Haskell program, since compiling Haskell entails transforming it *into* C. (Unless you're seriously going to suggest that GHC's native code generator is any match for the might of a half-decent C compiler...)

That got me thinking about a more interesting question: Not *how fast* can it go, but *how easily* can it go fast? People have written fast C programs and fast Haskell programs, but how easy is it to write a "typical" program and have it go fast, without spending years optimising it?

With that in mind, I set off with a plan to do some small-scale benchmarking. Pick a problem, solve it in both Haskell and C++, in the "simplest, most obvious way", and then apply "the simplest, most obvious optimisations". Measure performance and compare.

There are benchmarks that just output some data. ("Find all the prime numbers less than 1000...") I dislike these for a couple of reasons, not least because you can precompute the correct answer and just write a program which outputs that. (!) So I decided that all my benchmark programs should take a variable-sized data file as input, and save the results as output.

There's an old maxim of running benchmarks multiple times, to ensure that whatever time you got wasn't just a fluke. But then, maybe the first run pulls the input file into the file cache, and subsequent runs go faster. If your input data was randomly generated, perhaps you chose a particularly optimal or pessimal data set by accident. So I decided that each test run should use a different input file of approximately the same characteristics (size, random distribution, etc.)
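
For instance, a fresh uniformly distributed input file for each run could be generated with something like the following sketch (my own, not the actual test harness; the skewed distributions would need a weighted choice instead):

  import qualified Data.ByteString as B
  import System.Random (randomRIO)

  -- Write n uniformly distributed random bytes to the given file.
  writeRandomFile :: FilePath -> Int -> IO ()
  writeRandomFile path n = do
    ws <- mapM (\_ -> fmap fromIntegral (randomRIO (0, 255 :: Int))) [1 .. n]
    B.writeFile path (B.pack ws)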

So I ended up picking benchmarks like "sort the lines of this file into ascending order", "count the number of unique words in this file", "produce a histogram of byte values", "compute some statistics of this list of numbers", etc. The tasks are all extremely simple, so that there is some hope of it being possible to also implement them in C++.
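
To give a flavour, the "simplest, most obvious" Haskell for the line-sorting task might be something like this (my sketch, with made-up file names):

  import Data.List (sort)

  -- Read everything, sort the lines, write them back out.
  main :: IO ()
  main = readFile "Input.txt" >>= writeFile "Output.txt" . unlines . sort . lines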



One benchmark turned out to be particularly interesting: I call it "byte histogram". The task is simple:
- Open a binary input file.
- Read a stream of bytes from it.
- Count how many times each of the 256 possible byte values appears.
The test inputs are binary files of various sizes, some with a uniform distribution, some with variously skewed distributions (so that some bytes have a vastly higher count than others).

Assuming we have some suitable import statements, it's quite easy to do this:

  bytes <- BS.readFile "Input.bin"
  let out = BS.foldl' (\ m byte -> Map.insertWith (+) byte 1 m) Map.empty bytes
  writeFile "Output.csv" (unlines $ map (\(k,v) -> show k ++ "," ++ show v) $ Map.toAscList out)
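
(The imports I'm assuming here are roughly these; the aliases are my choice:)

  import qualified Data.ByteString.Lazy as BS
  import qualified Data.Map as Map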

(There is some slight trickiness if you want bytes with zero frequency to still be listed.)

All of this *works* perfectly - i.e., it produces the correct answers. It's also astronomically slow, and it's very easy to make it eat many gigabytes of RAM while it runs. (If the program starts actually swapping, then you will truly know what "slow" is!)

OK, so what happens if we replace Data.Map with Data.IntMap? Well, it goes slightly faster, and consumes slightly less RAM. Inputs which didn't quite complete before will run to completion now. But performance is still abysmal.
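
(The change is a one-liner; something like this sketch, where `histogram` and the `IM` alias are my own names:)

  import qualified Data.ByteString.Lazy as BS
  import qualified Data.IntMap as IM

  -- The same fold with IntMap: the Word8 keys just become Ints.
  histogram :: BS.ByteString -> IM.IntMap Int
  histogram = BS.foldl' (\ m byte -> IM.insertWith (+) (fromIntegral byte) 1 m) IM.empty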

Enough with this tree silliness. What happens if you use an array? Given that the entire program (apart from the last 0.02% of it) spends its time notionally mutating the data, a mutable array would be the logical choice. Dial up an IOArray Word8 Int and the structure of the program changes slightly, but it's still only a handful of lines of code. And the performance? Still glacial.
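
(Concretely, I have in mind something of this shape; the file names and details are my own sketch rather than the exact benchmark code, and it relies on the inc helper shown below:)

  import qualified Data.ByteString.Lazy as BS
  import Data.Array.IO (IOArray, newArray, readArray, writeArray, getAssocs)
  import Data.Word (Word8)

  main :: IO ()
  main = do
    bytes  <- BS.readFile "Input.bin"
    counts <- newArray (0, 255) 0 :: IO (IOArray Word8 Int)
    mapM_ (inc counts) (BS.unpack bytes)   -- inc is the per-byte update defined below
    assocs <- getAssocs counts
    writeFile "Output.csv" (unlines [ show k ++ "," ++ show v | (k, v) <- assocs ])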

In particular, suppose we have

  inc :: IOArray Word8 Int -> Word8 -> IO ()
  inc array byte = do
    count <- readArray array byte
    let count' = count + 1
    writeArray array byte count'

in the main loop. The program is giving us the right answer, but using absurd amounts of time and space to get it. Now watch this:

  inc :: IOArray Word8 Int -> Word8 -> IO ()
  inc array byte = do
    count <- readArray array byte
    let count' = count + 1
    count' `seq` writeArray array byte count'

And now, suddenly, memory usage becomes constant, regardless of input size, and run-time is slashed. The program goes from taking 50 seconds to process a certain file to taking only 0.02 seconds. And from taking 400 MB of RAM to taking... less RAM than I can actually measure properly.

If we now replace the IOArray with an IOUArray then the seq call becomes superfluous, and there's a small performance increase, but nothing major.
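
With an unboxed array the elements are plain machine Ints rather than pointers to possibly-unevaluated thunks, so the obvious code is already the right code. A sketch:

  import Data.Array.IO (IOUArray, readArray, writeArray)
  import Data.Word (Word8)

  inc :: IOUArray Word8 Int -> Word8 -> IO ()
  inc array byte = do
    count <- readArray array byte
    writeArray array byte (count + 1)   -- unboxed Int: the write itself forces the new count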



None of this should be any surprise to anybody who actually knows how Haskell works. For anybody not yet expert enough to see the problem: The "inc" function doesn't read the array, see that a given byte has a count of 5, and overwrite that with a 6. It sees a 5 and overwrites it with a 5+1. By the end of the main loop, every single array cell contains 1+(1+(1+(1+(1+(1+...))))), which takes a stackload of RAM to hold.

That explains why RAM usage is so absurd. On top of all that, creating all this data takes time, and the spiralling RAM usage makes the GC go crazy trying to find garbage to get rid of (when of course there is none). And finally, at the end, you have to unwind all these expressions. (Curiously, I haven't seen a stack overflow yet...)

Adding the seq call forces 5+1 to actually turn into 6, not eventually but *right now*, and so each array cell ends up actually containing a fully-evaluated integer. (No doubt this change in strictness also triggers an avalanche of optimisation opportunities, but the basic point is that we're no longer uselessly delaying evaluation.)

As I say, none of this is a surprise to anybody who actually knows how to get performance out of Haskell. The strict version is actually faster than the C++ version. (I imagine I'm doing something horribly wrong with C++, of course...) Not drastically faster, but faster. Which probably just means that ByteString is doing nicer buffering than whatever the C++ standard library is called.

The important observation is this: One tiny, almost insignificant change can transform a program from taking 50 seconds and 400 MB of RAM into one that takes 0.02 seconds and 0.1 MB of RAM. And, at least in this case, the simpler version is the slow one.

To say that Haskell is "slow" is both a little bit vague, and not really backed up by facts. In about 5 minutes flat, I managed to write a Haskell program that's very simple, and yet faster than a comparably simple C++ program (and C++ is supposed to be "fast"). So it's not that Haskell is "slow". It's that Haskell is *tricky*. Tiny, tiny little changes that look innocuous can have vast effects on performance. And this is a nice little example of that effect.

As we (probably?) all know, strictness is a subtle and fickle thing. Consider:

  or :: Bool -> Bool -> Bool
  or True  True  = True
  or True  False = True
  or False True  = True
  or False False = False

Overlooking the obvious name clash, this seems like a perfectly legitimate definition for the logical-OR function. But now try this:

  xs `contains` x = foldr or False $ map (x ==) xs

A nice, pretty Haskell definition. But oh dear - this does not have the performance you would expect at all! But if you change the definition above to the apparently equivalent

  or True  _ = True
  or False x = x

then "contains" becomes much faster. If you saw these two functions side by side, you might well wonder what the hell the difference is. You might even wonder if the compiler would transform one to the other (although I'm not sure in which direction). But there is, in fact, a very, very big difference: one is lazier than the other.

For counting bytes, lazy = bad. For searching for matches, lazy = good. Similarly, if you take the byte histogram program and replace the lazy ByteString with a strict one, RAM usage spikes sharply.

In all, laziness is a tricky, tricky little hobbitses. I wrote a C++ program, in the obvious way, and it was very, very fast. I wrote a Haskell program in several comparably obvious ways, and they were all cripplingly slow. Until I added the magic "seq", and suddenly got blinding speed. So while it is not correct to say "Haskell is slow", one could justifiably say "C++ is fast more reliably than Haskell".



Consider for a moment the original implementation with Data.Map. Adding a seq or two here will do no good at all; seq only reduces to WHNF. What we want is NF, and I can see no way at all of achieving that. (Other than doing something very expensive like looking up every possible key, which is far more operations than are actually necessary.)
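
(A tiny demonstration, separate from the benchmark: the seq below succeeds happily, because it only has to reach the outermost node of the tree, while the unevaluated value stored inside survives until something finally inspects it.)

  import qualified Data.Map as Map

  main :: IO ()
  main = do
    let m = Map.insertWith (+) (0 :: Int) 1 (Map.singleton 0 (undefined :: Int))
    m `seq` putStrLn "the Map itself is now in WHNF"   -- succeeds: seq stops at the root node
    print (Map.lookup 0 m)                             -- only now does the thunked value blow up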

I wrote my own little BST algorithm, customised to the fact that there are always exactly 256 keys of type Word8. It was also nauseatingly slow. And then I did something: I inserted a few strictness annotations on the constructor fields. And (as you already worked out), performance increased drastically.
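
(The change was essentially this sort of thing - an illustration with my own names rather than the actual benchmark tree:)

  import Data.Word (Word8)

  -- Strictness annotations on every field: whenever a node is rebuilt, the
  -- new count and the rebuilt child are forced, so (assuming the main fold
  -- is itself strict, e.g. BS.foldl') no thunks accumulate in the tree.
  data CountTree = Leaf
                 | Node !Word8 !Int !CountTree !CountTree

  bump :: Word8 -> CountTree -> CountTree
  bump b Leaf = Node b 1 Leaf Leaf
  bump b (Node k n l r)
    | b == k    = Node k (n + 1) l r    -- (n + 1) is evaluated here, thanks to !Int
    | b <  k    = Node k n (bump b l) r
    | otherwise = Node k n l (bump b r)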

I can do that if *I* define the data structure. But if it's in a library, I am powerless to make it more strict (or less, if that were what I wanted).

That got me thinking... What would happen if, instead of "Integer", we had two types, "evaluated Integer" and "possibly unevaluated Integer"? What if the strictness or otherwise of a data structure were exposed at the type level?

The general idea here is that anything declared as having type "evaluated Integer" can never contain an unevaluated expression. It must always contain an actual integer. This of course implies that you might be able to unbox it or do other interesting things with it, but it also says something about evaluation.

The principal question is "what happens if you assign an unevaluated Integer to an evaluated Integer?" Is that legal? Do you have to explicitly evaluate it first? Or does that happen implicitly? Presumably "evaluated Integer" means evaluated to WHNF only - which for an integer is the same thing as NF, but for some more complex structure it might not be.

If the type system worked like this, I could say "Data.Map with evaluated integer keys and possibly unevaluated integer values", and it would hold its values lazily. Or I could say "Data.Map with evaluated integer keys and evaluated integer values", and the structure would now magically be strict. Which is to say, the type system would ensure that any data passed into the structure was evaluated first, and the compiler could do any applicable optimisations based on the assumption of the data being already evaluated. (For primitive types, that might be unboxing or passing in registers. For ADTs, it might just mean skipping an evaluation check.)

Currently, if you want a strict list, you have to implement one yourself. But is that strict in the spine, or the elements, or what? You have to implement every combination that you want. And, let us not forget, then go implement all the functions in Data.List over your new data structure. Not fun.
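
(By way of illustration, these are the sorts of combinations you end up hand-writing today; the names are mine:)

  data SpineStrictList a   = SNil | SCons a  !(SpineStrictList a)    -- strict spine, lazy elements
  data ElementStrictList a = ENil | ECons !a (ElementStrictList a)   -- lazy spine, strict elements
  data StrictList a        = TNil | TCons !a !(StrictList a)         -- strict in everything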

If, on the other hand, strictness were in the type system, data structures would be *parameterised* over the level of strictness required.

I have no idea what the syntax for that would look like, and backwards compatibility looks like a nightmare. You would need to answer questions like "what does it mean for a function to return an evaluated type?" and "is evaluation implicit or explicit?" But it seems like a rather interesting idea.

(Let us not forget: *Everything* is trivial for the person who doesn't have to write the implementation...)
