Hi Aleksandar, I thought it'd be educational to do some back-of-the-envelope calculations to see how much memory we'd expect to use to store words in a HashMap ByteString Int. First, lets start by looking at how much memory one ByteString uses. Here's the definition of ByteString [1]:
data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) {-# UNPACK #-} !Int -- offset {-# UNPACK #-} !Int -- length The two Int fields are used to support O(1) slicing of ByteStrings. We also need the definitions of ForeignPtr [2] and Int. data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents data ForeignPtrContents = PlainForeignPtr !(IORef (Finalizers, [IO ()])) | MallocPtr (MutableByteArray# RealWorld) !(IORef (Finalizers, [IO ()])) | PlainPtr (MutableByteArray# RealWorld) data Int = I# Int# The UNPACK indicates to the compiler that it should unpack the contents of a constructor field into the constructor itself, removing a level of indirection. We'll end up with a structure that looks like this: data ByteString = PS Addr# ForeignPtrContents Int# -- offset Int# -- length To compute the size of a ByteString, count the number of constructor fields and add one (for the constructor itself). This is how many words the value is going to use. In this case it's 5 words. In addition we need to add the size of the ForeignPtrContents constructor, which happens to be a PlainPtr in this case, so we add two more words. Finally we need to look at the definition of MutableByteArray# [3], which is implemented by a C struct named StgArrWords: typedef struct { StgHeader header; StgWord bytes; StgWord payload[FLEXIBLE_ARRAY]; } StgArrWords; The StgHeader takes one word (when compiling with profiling turned off) so StgArrWords takes 2 words, plus the actual payload. If we add it all up we get 9 words, plus the size of the payload (i.e. the length of a word encoded using UTF-8 in your case). Now lets look at the definition of HashMap [4]: data HashMap k v = Bin {-# UNPACK #-} !SuffixMask !(HashMap k v) !(HashMap k v) | Tip {-# UNPACK #-} !Hash {-# UNPACK #-} !(FL.FullList k v) | Nil We also need the definition of FullList [5]: data FullList k v = FL !k v !(List k v) data List k v = Nil | Cons !k v !(List k v) For the sake of this discussion lets assume the tree is perfectly balanced. For a HashMap of size N this means that we have N leaves (Tip) and N-1 interior nodes (Bin). Each Bin takes 4 words. The size of the Tip depends on the number of hash collisions. These are quite rare so lets assume that the FullList only has one element. Also, the Nil constructor is free as it can be shared by all instances in the program. After applying the UNPACK pragmas the Tip constructor looks like: | Tip Int# !k v !(List k v) This takes another 5 words. Now when we know the overhead of both Bin and Tip we can compute the overhead per key/value pair as: (5N + 4(N-1)) / N = 9 - 4/N ~= 9 words. Given that an Int (not an Int#) takes 2 words, we can approximate the memory cost of a key/value pair in a HashMap ByteString Int as (9 + 9 + 2) * MachineWordSize + AvgByteStringLength bytes. For example, the average English word length is 5 characters, if you include stop words. We'd expect to use (9 + 9 + 2) * 8 + 5 = 165 bytes per unique word in our input corpus, on a 64-bit machine. Plus any GC overhead. This is probably more than one would expect. I'm working on switching HashMap to use another data structure, a Hash Array Mapped Trie, in its implementation. This will bring the overhead down from 9 words to about 4 words. You could try using the lower overhead SmallString type from the smallstring package [6]. It has an overhead of 4 words per string (instead of 9 like ByteString). There's some runtime overhead involved in converting a value (i.e. Text) to a SmallString. I don't know if this overhead will be noticeable in your program. 1. http://code.haskell.org/bytestring/Data/ByteString/Internal.hs 2. https://github.com/ghc/packages-base/blob/master/GHC/ForeignPtr.hs 3. https://github.com/ghc/ghc/blob/master/includes/rts/storage/Closures.h 4. https://github.com/tibbe/unordered-containers/blob/master/Data/HashMap/Common.hs 5. https://github.com/tibbe/unordered-containers/blob/master/Data/FullList/Lazy.hs 6. http://hackage.haskell.org/package/smallstring P.S. If your program indeed has a space leak this won't help you, but it's a good way to figure out if your program uses a reasonable amount of memory. Cheers, Johan _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe