Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)
On Thu, Jun 2, 2011 at 7:52 AM, Johan Tibell wrote: > I've decided to stick it in a blog post, add some pictures, and > elaborate some more (e.g. provide numbers for all containers and for > Text, so people can refer to the post when needed). I ended up writing two blog posts: Computing the size of a HashMap http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html Memory footprints of some common data types http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)
On Thu, Jun 2, 2011 at 5:10 AM, Jason Dagit wrote: > One of the cool things about SO is that you can answer your own > question. For example, you might do that if you're anticipating an > FAQ. I think asking this question on SO and reposting your answer > from this thread would be great. Good to know. I've decided to stick it in a blog post, add some pictures, and elaborate some more (e.g. provide numbers for all containers and for Text, so people can refer to the post when needed). -- Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)
On Wed, Jun 1, 2011 at 8:33 AM, Johan Tibell wrote: > On Wed, Jun 1, 2011 at 4:24 PM, Aleksandar Dimitrov > wrote: >> One additional thought: it might be interesting to provide this outside of >> this >> mailing list, perhaps as a documentation addendum to unordered-containers, >> since >> it really explains the size needs for HashMaps of ByteStrings to folks >> without >> knowledge of higher arcane wizardry. > > I think it would be a good answer on StackOverflow, but no one asked > me this question there. I could list the size overhead of a HashMap in > the docs, but I'm about to change it so I won't do so until after the > change. I don't know how big guarantees I want to make in the docs > either, as it might constrain future improvements to the > implementation. Perhaps in an addendum, like you said. One of the cool things about SO is that you can answer your own question. For example, you might do that if you're anticipating an FAQ. I think asking this question on SO and reposting your answer from this thread would be great. Jason ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)
On Wed, Jun 1, 2011 at 4:24 PM, Aleksandar Dimitrov wrote: > One additional thought: it might be interesting to provide this outside of > this > mailing list, perhaps as a documentation addendum to unordered-containers, > since > it really explains the size needs for HashMaps of ByteStrings to folks without > knowledge of higher arcane wizardry. I think it would be a good answer on StackOverflow, but no one asked me this question there. I could list the size overhead of a HashMap in the docs, but I'm about to change it so I won't do so until after the change. I don't know how big guarantees I want to make in the docs either, as it might constrain future improvements to the implementation. Perhaps in an addendum, like you said. > I ended up using Data.Text over SmallString, also because I need to do other > operations on the words (case folding, mass matching, and possibly more) and > Text seemed more attractive for these tasks, but I'll try using SmallString, > too. Text uses 6 words per value, plus the size of the UTF-16 encoded content. There's a Google Summer of Code project this year to convert Text to UTF-8, which should decrease the space usage. In addition, Text values aren't pinned on the heap, unlike ByteStrings, so they should be nicer to the GC. The lowest overhead string type you could imagine (given how GHC implements ByteArray#) would have a 4 word overhead. Text trades 2 extra words to support efficient slicing (just like ByteString). When Text uses UTF-8 internally it should be possible to convert Text to/from SmallString in O(1) time as the underlying ByteArray# could just be wrapped in a SmallString constructor instead of a Text constructor. This means that you could freely convert HashMap keys to SmallString to save some space. > I think, overall, Text will probably use more memory than ByteString, but in > my > particular case, the problem wasn't the size of the data structure, but the > fact > that it seemed to retain chunks of the original input file. Given that ByteString uses 3 words more than Text you'll probably use about the same (or slightly less) amount of space, given your string lengths. > To me, strict byte strings were a high-performance black box I didn't think > about. I thought, if I store stuff as a bytestring, and a strict one, no less, > there's *no* way I could expect to perform any better by switching the string > type. Turns out that was an unjustified assumption. You only have to know very little to use ByteString. Knowing that it does slicing by sharing the underlying buffer (just like Java's Strings) is unfortunately one of them. You can use the 'copy' function to make sure the underlying storage matches the ByteString's length. > The wealth of string types in Haskell land seems kind of confusing to the > newbie I sympathize. My rule of thumb is: use Text for Unicode data and ByteString for (large) byte blobs. Cheers, Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)
Hello Johan, On Wed, Jun 01, 2011 at 08:52:04AM +0200, Johan Tibell wrote: > I thought it'd be educational to do some back-of-the-envelope > calculations to see how much memory we'd expect to use to store words > in a HashMap ByteString Int. Thank you for your writeup, which is very informative! You should really think about writing that tutorial you mentioned earlier, it would probably be a boon for many. One additional thought: it might be interesting to provide this outside of this mailing list, perhaps as a documentation addendum to unordered-containers, since it really explains the size needs for HashMaps of ByteStrings to folks without knowledge of higher arcane wizardry. > For example, the average English word length is 5 characters, if you > include stop words. We'd expect to use > > (9 + 9 + 2) * 8 + 5 = 165 > > bytes per unique word in our input corpus, on a 64-bit machine. Plus > any GC overhead. This is probably more than one would expect. Using an sqlite database, I committed 1/16 of my whole corpus (which is in German) in ten separate steps and counted the amount of unigrams contained therein: 2,207,369; Assuming German affection for longwords, and the fact that the data contained umlauts in UTF-8, let's up the average word length a bit: say, 15 (I pulled that out of my ass, obviously.) This brings us to 2207369 * 180 bytes, or ~400MB hash table. > You could try using the lower overhead SmallString type from the > smallstring package [6]. It has an overhead of 4 words per string > (instead of 9 like ByteString). There's some runtime overhead involved > in converting a value (i.e. Text) to a SmallString. I don't know if > this overhead will be noticeable in your program. I ended up using Data.Text over SmallString, also because I need to do other operations on the words (case folding, mass matching, and possibly more) and Text seemed more attractive for these tasks, but I'll try using SmallString, too. I think, overall, Text will probably use more memory than ByteString, but in my particular case, the problem wasn't the size of the data structure, but the fact that it seemed to retain chunks of the original input file. > P.S. If your program indeed has a space leak this won't help you, but > it's a good way to figure out if your program uses a reasonable amount > of memory. Yes, it is indeed! It turned out I was leaking space, since ByteStrings were holding references to iteratee-read chunks (I have no idea what happened with the program you gave me, that did not depend on iteratee. I can recall it having a worse performance on memory, which right now seems counterintuitive. I'll try to check it again later this week.) But even without leaking space, this will provide me (and hopefully others) with a good estimate on how well they can expect unordered-containers to perform. To me, strict byte strings were a high-performance black box I didn't think about. I thought, if I store stuff as a bytestring, and a strict one, no less, there's *no* way I could expect to perform any better by switching the string type. Turns out that was an unjustified assumption. The wealth of string types in Haskell land seems kind of confusing to the newbie :-( Thanks again for your hard work on unordered-containers and for explaining this stuff! Regards and best wishes, Aleks signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)
Hi Aleksandar, I thought it'd be educational to do some back-of-the-envelope calculations to see how much memory we'd expect to use to store words in a HashMap ByteString Int. First, lets start by looking at how much memory one ByteString uses. Here's the definition of ByteString [1]: data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) {-# UNPACK #-} !Int-- offset {-# UNPACK #-} !Int-- length The two Int fields are used to support O(1) slicing of ByteStrings. We also need the definitions of ForeignPtr [2] and Int. data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents data ForeignPtrContents = PlainForeignPtr !(IORef (Finalizers, [IO ()])) | MallocPtr (MutableByteArray# RealWorld) !(IORef (Finalizers, [IO ()])) | PlainPtr (MutableByteArray# RealWorld) data Int = I# Int# The UNPACK indicates to the compiler that it should unpack the contents of a constructor field into the constructor itself, removing a level of indirection. We'll end up with a structure that looks like this: data ByteString = PS Addr# ForeignPtrContents Int#-- offset Int#-- length To compute the size of a ByteString, count the number of constructor fields and add one (for the constructor itself). This is how many words the value is going to use. In this case it's 5 words. In addition we need to add the size of the ForeignPtrContents constructor, which happens to be a PlainPtr in this case, so we add two more words. Finally we need to look at the definition of MutableByteArray# [3], which is implemented by a C struct named StgArrWords: typedef struct { StgHeader header; StgWordbytes; StgWordpayload[FLEXIBLE_ARRAY]; } StgArrWords; The StgHeader takes one word (when compiling with profiling turned off) so StgArrWords takes 2 words, plus the actual payload. If we add it all up we get 9 words, plus the size of the payload (i.e. the length of a word encoded using UTF-8 in your case). Now lets look at the definition of HashMap [4]: data HashMap k v = Bin {-# UNPACK #-} !SuffixMask !(HashMap k v) !(HashMap k v) | Tip {-# UNPACK #-} !Hash {-# UNPACK #-} !(FL.FullList k v) | Nil We also need the definition of FullList [5]: data FullList k v = FL !k v !(List k v) data List k v = Nil | Cons !k v !(List k v) For the sake of this discussion lets assume the tree is perfectly balanced. For a HashMap of size N this means that we have N leaves (Tip) and N-1 interior nodes (Bin). Each Bin takes 4 words. The size of the Tip depends on the number of hash collisions. These are quite rare so lets assume that the FullList only has one element. Also, the Nil constructor is free as it can be shared by all instances in the program. After applying the UNPACK pragmas the Tip constructor looks like: | Tip Int# !k v !(List k v) This takes another 5 words. Now when we know the overhead of both Bin and Tip we can compute the overhead per key/value pair as: (5N + 4(N-1)) / N = 9 - 4/N ~= 9 words. Given that an Int (not an Int#) takes 2 words, we can approximate the memory cost of a key/value pair in a HashMap ByteString Int as (9 + 9 + 2) * MachineWordSize + AvgByteStringLength bytes. For example, the average English word length is 5 characters, if you include stop words. We'd expect to use (9 + 9 + 2) * 8 + 5 = 165 bytes per unique word in our input corpus, on a 64-bit machine. Plus any GC overhead. This is probably more than one would expect. I'm working on switching HashMap to use another data structure, a Hash Array Mapped Trie, in its implementation. This will bring the overhead down from 9 words to about 4 words. You could try using the lower overhead SmallString type from the smallstring package [6]. It has an overhead of 4 words per string (instead of 9 like ByteString). There's some runtime overhead involved in converting a value (i.e. Text) to a SmallString. I don't know if this overhead will be noticeable in your program. 1. http://code.haskell.org/bytestring/Data/ByteString/Internal.hs 2. https://github.com/ghc/packages-base/blob/master/GHC/ForeignPtr.hs 3. https://github.com/ghc/ghc/blob/master/includes/rts/storage/Closures.h 4. https://github.com/tibbe/unordered-containers/blob/master/Data/HashMap/Common.hs 5. https://github.com/tibbe/unordered-containers/blob/master/Data/FullList/Lazy.hs 6. http://hackage.haskell.org/package/smallstring P.S. If your program indeed has a space leak this won't help you, but it's a good way to figure out if your program uses a reasonable amount of memory. Cheers, Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe