Claus Reinke wrote:
At first guess it sounds like you're holding on to too much: if not the
whole stream, then perhaps bits within each chunk.

It is possible.

I split the string into lines, then map some functions over each line to parse the data, and finally call toU to convert the result to a UArr.

Just to make sure (code fragments or, better, reduced examples
would make it easier to see what the discussion is about): are you forcing the UArr to be constructed before putting it into the Map?


parse handle = do
  contents <- S.hGetContents handle
  let v = map singleton' $ ratings contents
  let m = foldl1' (unionWith appendU) v
  v `seq` return $! m

  where
    -- Build an IntMap with a single movie rating
    singleton' :: (Word32, Word8) -> MovieRatings
    singleton' (id, rate) =
      singleton (fromIntegral id) (singletonU $ pairS (id, rate))
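
Regarding the question above about forcing: as far as I can see the UArr is not explicitly forced before it goes into the map. One possible way to force it would be something like the following untested sketch (it assumes uvector exports lengthU; demanding the length should build the array first):

    -- Sketch only, not the code I am running: demand the array's length
    -- so the UArr is built before being stored in the IntMap.
    singletonForced :: (Word32, Word8) -> MovieRatings
    singletonForced (id, rate) =
      let arr = singletonU $ pairS (id, rate)
      in lengthU arr `seq` singleton (fromIntegral id) arr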

This function gets called once for each file, with:

r <- mapM parse' [1..17770]
let movieRatings = foldl1' (unionWith appendU) r
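
Incidentally, mapM builds the whole list of 17770 intermediate maps before the fold starts, so they are all alive at once. A rough sketch of merging them as each file is parsed instead (it assumes parse' :: Int -> IO MovieRatings as above; note that `seq` only forces the map to WHNF, not the appended arrays):

-- foldM is from Control.Monad; empty and unionWith are from Data.IntMap.
loadAll :: IO MovieRatings
loadAll = foldM step empty [1..17770]
  where
    step acc n = do
      m <- parse' n
      let acc' = unionWith appendU acc m
      acc' `seq` return acc'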



The `ratings` function parses each line of the file and returns a tuple.
For each line of the file I build an IntMap, then merge them together;
the IntMaps are then further merged in the main function.
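
For reference, the line parser is roughly of this shape (a reconstructed sketch, not the actual code; it assumes a simple comma-separated "id,rating" line format, which may not match the real files, and that S is strict Data.ByteString.Char8):

-- Sketch of a line parser along the lines described above.
ratings :: S.ByteString -> [(Word32, Word8)]
ratings = map parseLine . S.lines
  where
    parseLine l =
      case S.readInt l of
        Just (i, rest) ->
          case S.readInt (S.drop 1 rest) of
            Just (r, _) -> (fromIntegral i, fromIntegral r)
            Nothing     -> error "malformed rating"
        Nothing -> error "malformed line"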

NOTE that the memory usage is the same if I remove array concatenation.


There are 100,000,000 ratings, so I create 100,000,000 arrays containing only one element.
However, memory usage is already 1 GB after just 800 files.


The data type is:

type Rating = Word32 :*: Word8
type MovieRatings = IntMap (UArr Rating) -- UArr from uvector
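
Just to make the intended merging concrete, here is a tiny made-up example of two single-rating maps being combined (the values are arbitrary):

-- Illustrative only: two ratings ending up in one UArr under key 1.
example :: MovieRatings
example = unionWith appendU
            (singleton 1 (singletonU (pairS (10 :: Word32, 5 :: Word8))))
            (singleton 1 (singletonU (pairS (20 :: Word32, 3 :: Word8))))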


Code is here: http://haskell.mperillo.ath.cx/netflix-0.0.1.tar.gz
but it is an old version (where I used lazy ByteString).



Thanks,
Manlio Perillo


