2008/11/18 kenny lu <[EMAIL PROTECTED]>: > Dear all, > > I am trying to implement the python-style dictionary in Haskell. > > Python dictionary is a data structure that maps one key to one value. > For instance, a python dictionary > d = {'a':1, 'b':2 } > maps key 'a' to 1, 'b' to 2. > Python dictionary allows for update. e.g. the statement > d['a'] = 3 > changes the value pointed by 'a' from 1 to 3. > > Internally, python dictionary is implemented using hash table. > > My first attempt is to use Data.HashTable. However it was immediately > abandoned, as I realize the memory usage is unreasonably huge. > > ... > > I tested all three implementations by building a dictionary of size 1000000. > The result shows that the Map and the Trie approaches handle collision well, > but > the IntMap approach does not. > > > Here is a comparison of memory usage > > Map : 345 MB > IntMap : 146 MB > Trie : 282 MB > Python : 94 MB > > Here is a comparison of execution time (on an intel dual core 2.0G) > > Map: 26 sec > IntMap: 9 sec > Trie: 12 sec > Python: 2.24 sec > > > The above number shows that my implementations of python style dictionary > are space/time in-efficient as compared to python. > > Can some one point out what's wrong with my implementations? >
First of all, you use Strings. That's a very bad thing when you care about memory restrictions. Fire up ghci type something like this: > let aas = replicate (1024*1024*10) 'a' > -- 22 Mb memory usage > length aas > 10485760 > -- 270 Mb memory usage 10 Mb string caused as much as 250 Mb increase in ghci's memory consumption. My guess? Use hashtables with ByteStrings. I rewrote part of your code. Results are quite promising. Haskell: 121 Mb total memory in use INIT time 0.02s ( 0.00s elapsed) MUT time 0.84s ( 1.00s elapsed) GC time 1.97s ( 2.02s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.83s ( 3.02s elapsed) %GC time 69.6% (66.8% elapsed) Python: $ time python dict.py 256 real 0m2.278s user 0m2.233s sys 0m0.078s memory: 101 Mb (as reported by Windows' Task Manager). The code: --- cut --- import qualified Data.ByteString.Lazy.Char8 as BS import Data.Int import Data.Bits (...) parse_a_line_BS :: BS.ByteString -> (BS.ByteString,Int) parse_a_line_BS line = case BS.words line of [key,val] -> (key,(read . BS.unpack) val) _ -> error " parse error. " main :: IO () main = do dict <- HT.new (==) hashByteString indata <- (map parse_a_line_BS `fmap` BS.lines `fmap` BS.readFile "in.txt") mapM_ (\ (k,v) -> HT.insert dict k v) indata HT.lookup dict (BS.pack "key256") >>= \v -> case v of Just vv -> putStrLn (show vv) Nothing -> putStrLn ("Not found") -- derived from Data.HashTable.hashString hashByteString :: BS.ByteString -> Int32 hashByteString = BS.foldl' f golden where f m c = fromIntegral (ord c) * magic + hashInt32 m magic = 0xdeadbeef hashInt32 :: Int32 -> Int32 hashInt32 x = mulHi x golden + x mulHi :: Int32 -> Int32 -> Int32 mulHi a b = fromIntegral (r `shiftR` 32) where r :: Int64 r = fromIntegral a * fromIntegral b golden :: Int32 golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32 --- cut --- I had to rewrite hashString to work for ByteStrings - basically it's just using different foldl'. All best Christopher Skrzętnicki
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe