Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

2011-06-09 Thread Johan Tibell
On Thu, Jun 2, 2011 at 7:52 AM, Johan Tibell  wrote:
> I've decided to stick it in a blog post, add some pictures, and
> elaborate some more (e.g. provide numbers for all containers and for
> Text, so people can refer to the post when needed).

I ended up writing two blog posts:

Computing the size of a HashMap
http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html

Memory footprints of some common data types
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html

Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

2011-06-01 Thread Johan Tibell
On Thu, Jun 2, 2011 at 5:10 AM, Jason Dagit  wrote:
> One of the cool things about SO is that you can answer your own
> question.  For example, you might do that if you're anticipating an
> FAQ.  I think asking this question on SO and reposting your answer
> from this thread would be great.

Good to know.

I've decided to stick it in a blog post, add some pictures, and
elaborate some more (e.g. provide numbers for all containers and for
Text, so people can refer to the post when needed).

-- Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

2011-06-01 Thread Jason Dagit
On Wed, Jun 1, 2011 at 8:33 AM, Johan Tibell  wrote:
> On Wed, Jun 1, 2011 at 4:24 PM, Aleksandar Dimitrov
>  wrote:
>> One additional thought: it might be interesting to provide this outside of 
>> this
>> mailing list, perhaps as a documentation addendum to unordered-containers, 
>> since
>> it really explains the size needs for HashMaps of ByteStrings to folks 
>> without
>> knowledge of higher arcane wizardry.
>
> I think it would be a good answer on StackOverflow, but no one asked
> me this question there. I could list the size overhead of a HashMap in
> the docs, but I'm about to change it so I won't do so until after the
> change. I don't know how big guarantees I want to make in the docs
> either, as it might constrain future improvements to the
> implementation. Perhaps in an addendum, like you said.

One of the cool things about SO is that you can answer your own
question.  For example, you might do that if you're anticipating an
FAQ.  I think asking this question on SO and reposting your answer
from this thread would be great.

Jason

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

2011-06-01 Thread Johan Tibell
On Wed, Jun 1, 2011 at 4:24 PM, Aleksandar Dimitrov
 wrote:
> One additional thought: it might be interesting to provide this outside of 
> this
> mailing list, perhaps as a documentation addendum to unordered-containers, 
> since
> it really explains the size needs for HashMaps of ByteStrings to folks without
> knowledge of higher arcane wizardry.

I think it would be a good answer on StackOverflow, but no one asked
me this question there. I could list the size overhead of a HashMap in
the docs, but I'm about to change it so I won't do so until after the
change. I don't know how big guarantees I want to make in the docs
either, as it might constrain future improvements to the
implementation. Perhaps in an addendum, like you said.

> I ended up using Data.Text over SmallString, also because I need to do other
> operations on the words (case folding, mass matching, and possibly more) and
> Text seemed more attractive for these tasks, but I'll try using SmallString,
> too.

Text uses 6 words per value, plus the size of the UTF-16 encoded
content. There's a Google Summer of Code project this year to convert
Text to UTF-8, which should decrease the space usage. In addition,
Text values aren't pinned on the heap, unlike ByteStrings, so they
should be nicer to the GC. The lowest overhead string type you could
imagine (given how GHC implements ByteArray#) would have a 4 word
overhead. Text trades 2 extra words to support efficient slicing (just
like ByteString).

When Text uses UTF-8 internally it should be possible to convert Text
to/from SmallString in O(1) time as the underlying ByteArray# could
just be wrapped in a SmallString constructor instead of a Text
constructor. This means that you could freely convert HashMap keys to
SmallString to save some space.

> I think, overall, Text will probably use more memory than ByteString, but in 
> my
> particular case, the problem wasn't the size of the data structure, but the 
> fact
> that it seemed to retain chunks of the original input file.

Given that ByteString uses 3 words more than Text you'll probably use
about the same (or slightly less) amount of space, given your string
lengths.

> To me, strict byte strings were a high-performance black box I didn't think
> about. I thought, if I store stuff as a bytestring, and a strict one, no less,
> there's *no* way I could expect to perform any better by switching the string
> type. Turns out that was an unjustified assumption.

You only have to know very little to use ByteString. Knowing that it
does slicing by sharing the underlying buffer (just like Java's
Strings) is unfortunately one of them. You can use the 'copy' function
to make sure the underlying storage matches the ByteString's length.

> The wealth of string types in Haskell land seems kind of confusing to the 
> newbie

I sympathize. My rule of thumb is: use Text for Unicode data and
ByteString for (large) byte blobs.

Cheers,
Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

2011-06-01 Thread Aleksandar Dimitrov
Hello Johan,

On Wed, Jun 01, 2011 at 08:52:04AM +0200, Johan Tibell wrote:
> I thought it'd be educational to do some back-of-the-envelope
> calculations to see how much memory we'd expect to use to store words
> in a HashMap ByteString Int.

Thank you for your writeup, which is very informative! You should really think
about writing that tutorial you mentioned earlier, it would probably be a boon
for many.

One additional thought: it might be interesting to provide this outside of this
mailing list, perhaps as a documentation addendum to unordered-containers, since
it really explains the size needs for HashMaps of ByteStrings to folks without
knowledge of higher arcane wizardry.

> For example, the average English word length is 5 characters, if you
> include stop words. We'd expect to use
> 
> (9 + 9 + 2) * 8 + 5 = 165
> 
> bytes per unique word in our input corpus, on a 64-bit machine. Plus
> any GC overhead. This is probably more than one would expect.

Using an sqlite database, I committed 1/16 of my whole corpus (which is in
German) in ten separate steps and counted the amount of unigrams contained
therein: 2,207,369; Assuming German affection for longwords, and the fact that
the data contained umlauts in UTF-8, let's up the average word length a bit:
say, 15 (I pulled that out of my ass, obviously.) This brings us to 2207369 *
180 bytes, or ~400MB hash table.

> You could try using the lower overhead SmallString type from the
> smallstring package [6]. It has an overhead of 4 words per string
> (instead of 9 like ByteString). There's some runtime overhead involved
> in converting a value (i.e. Text) to a SmallString. I don't know if
> this overhead will be noticeable in your program.

I ended up using Data.Text over SmallString, also because I need to do other
operations on the words (case folding, mass matching, and possibly more) and
Text seemed more attractive for these tasks, but I'll try using SmallString,
too.

I think, overall, Text will probably use more memory than ByteString, but in my
particular case, the problem wasn't the size of the data structure, but the fact
that it seemed to retain chunks of the original input file.

> P.S. If your program indeed has a space leak this won't help you, but
> it's a good way to figure out if your program uses a reasonable amount
> of memory.

Yes, it is indeed! It turned out I was leaking space, since ByteStrings were
holding references to iteratee-read chunks (I have no idea what happened with
the program you gave me, that did not depend on iteratee. I can recall it having
a worse performance on memory, which right now seems counterintuitive. I'll try
to check it again later this week.)

But even without leaking space, this will provide me (and hopefully others) with
a good estimate on how well they can expect unordered-containers to perform.

To me, strict byte strings were a high-performance black box I didn't think
about. I thought, if I store stuff as a bytestring, and a strict one, no less,
there's *no* way I could expect to perform any better by switching the string
type. Turns out that was an unjustified assumption.

The wealth of string types in Haskell land seems kind of confusing to the newbie
:-(

Thanks again for your hard work on unordered-containers and for explaining this
stuff!

Regards and best wishes,
Aleks


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

2011-05-31 Thread Johan Tibell
Hi Aleksandar,

I thought it'd be educational to do some back-of-the-envelope
calculations to see how much memory we'd expect to use to store words
in a HashMap ByteString Int. First, lets start by looking at how much
memory one ByteString uses. Here's the definition of ByteString [1]:

data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8)
 {-# UNPACK #-} !Int-- offset
 {-# UNPACK #-} !Int-- length

The two Int fields are used to support O(1) slicing of ByteStrings.

We also need the definitions of ForeignPtr [2] and Int.

data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents

data ForeignPtrContents
= PlainForeignPtr !(IORef (Finalizers, [IO ()]))
| MallocPtr  (MutableByteArray# RealWorld) !(IORef
(Finalizers, [IO ()]))
| PlainPtr   (MutableByteArray# RealWorld)

data Int = I# Int#

The UNPACK indicates to the compiler that it should unpack the
contents of a constructor field into the constructor itself, removing
a level of indirection. We'll end up with a structure that looks like
this:

data ByteString = PS Addr# ForeignPtrContents
 Int#-- offset
 Int#-- length

To compute the size of a ByteString, count the number of constructor
fields and add one (for the constructor itself). This is how many
words the value is going to use. In this case it's 5 words. In
addition we need to add the size of the ForeignPtrContents
constructor, which happens to be a PlainPtr in this case, so we add
two more words.

Finally we need to look at the definition of MutableByteArray# [3],
which is implemented by a C struct named StgArrWords:

typedef struct {
StgHeader  header;
StgWordbytes;
StgWordpayload[FLEXIBLE_ARRAY];
} StgArrWords;

The StgHeader takes one word (when compiling with profiling turned
off) so StgArrWords takes 2 words, plus the actual payload.

If we add it all up we get 9 words, plus the size of the payload (i.e.
the length of a word encoded using UTF-8 in your case).

Now lets look at the definition of HashMap [4]:

data HashMap k v
= Bin {-# UNPACK #-} !SuffixMask
  !(HashMap k v)
  !(HashMap k v)
| Tip {-# UNPACK #-} !Hash
  {-# UNPACK #-} !(FL.FullList k v)
| Nil

We also need the definition of FullList [5]:

data FullList k v = FL !k v !(List k v)
data List k v = Nil | Cons !k v !(List k v)

For the sake of this discussion lets assume the tree is perfectly
balanced. For a HashMap of size N this means that we have N leaves
(Tip) and N-1 interior nodes (Bin). Each Bin takes 4 words.

The size of the Tip depends on the number of hash collisions. These
are quite rare so lets assume that the FullList only has one element.
Also, the Nil constructor is free as it can be shared by all instances
in the program. After applying the UNPACK pragmas the Tip constructor
looks like:

| Tip Int# !k v !(List k v)

This takes another 5 words.

Now when we know the overhead of both Bin and Tip we can compute the
overhead per key/value pair as: (5N + 4(N-1)) / N = 9 - 4/N ~= 9
words.

Given that an Int (not an Int#) takes 2 words, we can approximate the
memory cost of a key/value pair in a HashMap ByteString Int as

(9 + 9 + 2) * MachineWordSize + AvgByteStringLength

bytes.

For example, the average English word length is 5 characters, if you
include stop words. We'd expect to use

(9 + 9 + 2) * 8 + 5 = 165

bytes per unique word in our input corpus, on a 64-bit machine. Plus
any GC overhead. This is probably more than one would expect.

I'm working on switching HashMap to use another data structure, a Hash
Array Mapped Trie, in its implementation. This will bring the overhead
down from 9 words to about 4 words.

You could try using the lower overhead SmallString type from the
smallstring package [6]. It has an overhead of 4 words per string
(instead of 9 like ByteString). There's some runtime overhead involved
in converting a value (i.e. Text) to a SmallString. I don't know if
this overhead will be noticeable in your program.

1. http://code.haskell.org/bytestring/Data/ByteString/Internal.hs
2. https://github.com/ghc/packages-base/blob/master/GHC/ForeignPtr.hs
3. https://github.com/ghc/ghc/blob/master/includes/rts/storage/Closures.h
4. 
https://github.com/tibbe/unordered-containers/blob/master/Data/HashMap/Common.hs
5. 
https://github.com/tibbe/unordered-containers/blob/master/Data/FullList/Lazy.hs
6. http://hackage.haskell.org/package/smallstring

P.S. If your program indeed has a space leak this won't help you, but
it's a good way to figure out if your program uses a reasonable amount
of memory.

Cheers,
Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe