Dear haskell-cafe, I've got an anagram-finder ("puzzler") that uses a "dictionary" datatype, which in turn uses a trie. In src/hs/Main.hs, I create a new dictionary from a word list (a file containing one word per line) and perform a query on it in order to force it to actually load something from disk. By problem is that though my word list is just shy of 7MB, puzzler occupies nearly 1GB of resident memory as soon as the dictionary is loaded. I'm using GHC 6.8.3 and Gtk2hs 0.9.13 on OS X 10.4.11.
To see for yourself, check out the code with git and run it: $ git clone git://github.com/dbueno/puzzler.git $ git checkout -b origin/bsbench $ cabal configure # needs gtk2hs $ cabal build $ ./dist/build/puzzler/puzzler # look at the resident memory Retainer profiling (attachment "puzzler-hr.pdf") shows that `insertWith' on my trie is responsible for most of the memory. As you will see, in `makeDictionary' I try to strictly apply the combining function to combat these retainers. But the problem doesn't go away. More strangely: puzzler has a second executable, puzzler-test, which runs unit tests, including loading the exact same anagram query. However, puzzler-test never uses more than 250MB or so. The dictionary (in module Puzzler.Anagram) is an array of bytestrings paired with a Trie mapping bytestrings to sets of indices in the array. > data Dictionary = Dictionary > { dictWords :: Words > , sortWords :: Trie IntSet } > type Words = Array Int ByteString The Trie is a custom bytestring trie datatype. Each level is mapped to by the character it represents. If that character is a terminal character for some word (Right (t,x)), then that word maps to x. Otherwise no word ends at that character, but there may be longer words, so the trie continues with (Left t). > newtype Trie a = Trie { unTrie :: IntMap (Either (Trie a) (Trie a, a)) } > deriving (Eq, Ord, Show) The following is where _I think_ the error lies (src/hs/Puzzler/Anagram.hs): > -- | Creates an anagram dictionary from a file of words, one per line. The > -- words may be compound, as long as there is one per line. > createDictionary :: FilePath -> IO Dictionary > createDictionary path = (makeDictionary . lines) `liftM` readFile path > > -- | Makes an anagram dictionary from a list of words. > makeDictionary :: [ByteString] -> Dictionary > makeDictionary ws = Dictionary > { dictWords = dw > , sortWords = go end Trie.empty } > where > dw = listArray (0, length ws - 1) ws ; (begin, end) = bounds dw > go i t | seq t $ False = undefined > | i < begin = t > | otherwise = go (i-1) > -- $ Trie.insertWith Set.union (BS.sort (dw!i)) > (Set.singleton i) t > $ insertWith' Set.union (BS.sort (dw!i)) > (Set.singleton i) t > > insertWith' f k v m | seq k $ False = undefined > insertWith' f k v m = case Trie.lookup k m of > Nothing -> Trie.insert k v m > Just s -> (Trie.insert k $! f v s) m I've been scouring haskell-cafe for similar issues, and all seem to have been solved by a function similar to my `insertWith''. Uncommenting `Trie.insertWith' and commenting out the rest doesn't appear to differ noticeably in memory consumption. The implementation of `Trie.insertWith' is: > insertWith :: (a -> a -> a) -> ByteString -> a -> Trie a -> Trie a > insertWith f bs x t@(Trie m) = case uncons bs of > Nothing -> t > Just (c, cs) -> Trie (Map.alter myAlter (ord c) m) > where > myAlter Nothing = Just $ cons (insertWith f cs x > empty, x) > myAlter (Just (Left t)) = Just $ cons (insertWith f cs x t, x) > myAlter (Just (Right (t,x'))) = > Just $ Right (insertWith f cs x t, if isLast then f x x' else > x') > > -- If `cs' is empty then `c' is the last char of the word to insert; > -- that is, we're done. > cons (t, x) = if isLast then Right (t, x) else Left t > isLast = BS.null cs Any ideas? Denis
puzzler-hr.pdf
Description: Adobe PDF document
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe