Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Since frequency counts are an important use of map-like data structures, I did a brief test of the available options. First using regular strings for input, and Data.Map.fromListWith - i.e. the operational bit being: freqs :: [String] -> M.Map String Int freqs = M.fromListWith (+) . map (,1) This runs on a 14M corpus consisting of King James Bible, collected works of Shakespeare, and War and Peace. ./freqstr1 +RTS -s 5,093,386,368 bytes allocated in the heap 2,682,667,904 bytes copied during GC 261,110,368 bytes maximum residency (20 sample(s)) 9,018,000 bytes maximum slop 623 MB total memory in use (10 MB lost due to fragmentation) ./freqstr1 +RTS -s 21.43s user 0.78s system 99% cpu 22.285 total Kinda expensive, 250MB to store word frequencies of 14MB text. Now, changing to freqs :: [String] -> M.Map String Int freqs = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty i.e. using strict insertion, avoiding the buildup of lazy thunks for the counts. ./freqstr2 +RTS -s -- strings, using strict insertion 4,754,110,096 bytes allocated in the heap 2,089,527,240 bytes copied during GC 27,039,112 bytes maximum residency (66 sample(s)) 613,192 bytes maximum slop 80 MB total memory in use (2 MB lost due to fragmentation) ./freqstr2 +RTS -s 17.48s user 0.13s system 99% cpu 17.665 total This reduced maximam memory consumption to one tenth, still bigger than input corpus, but clearly not too bad. A bit faster, too, in spite of probably doing more work. Using ByteStrings instead, first fromListWith: ./freq +RTS -s (Just 77432,113931) 3,880,059,568 bytes allocated in the heap 1,467,507,808 bytes copied during GC 174,573,752 bytes maximum residency (14 sample(s)) 8,222,600 bytes maximum slop 385 MB total memory in use (6 MB lost due to fragmentation) ./freq +RTS -s 14.26s user 0.49s system 99% cpu 14.798 total About half the memroy of Strings, and 25% faster. With strict insertion: ./freq2 +RTS -s -- map using strict insertion 3,761,614,312 bytes allocated in the heap 849,806,000 bytes copied during GC 23,950,328 bytes maximum residency (35 sample(s)) 2,376,904 bytes maximum slop 58 MB total memory in use (1 MB lost due to fragmentation) ./freq2 +RTS -s 11.14s user 0.13s system 99% cpu 11.295 total Parallel to the String case, this is a lot more frugal with memory, and 30% faster. Now, I tried Data.HashMap from the hashmap library: ./freqH1 +RTS -s-- hashmap using fromListWith 4,552,922,784 bytes allocated in the heap 2,990,287,536 bytes copied during GC 401,247,912 bytes maximum residency (14 sample(s)) 42,098,016 bytes maximum slop 957 MB total memory in use (15 MB lost due to fragmentation) ./freqH1 +RTS -s 15.68s user 1.53s system 99% cpu 17.277 total ./freqH2 +RTS -s -- hashmap using foldl' insertWith 4,518,146,968 bytes allocated in the heap 2,986,973,352 bytes copied during GC 394,502,832 bytes maximum residency (14 sample(s)) 41,020,040 bytes maximum slop 957 MB total memory in use (15 MB lost due to fragmentation) ./freqH2 +RTS -s 15.86s user 1.62s system 99% cpu 17.537 total HashMap doesn't provide a strict insertWith, so this is similar to the lazy insertions above. A bit worse, actually, probably due to the overhead of hashing. Then, I discovered that Johan's hashmap is a different library, and thought I'd try that too for completeness. ./freqHS +RTS -s -- hashmap strict (unordered-containers) 2,628,628,752 bytes allocated in the heap 945,571,872 bytes copied during GC 26,635,744 bytes maximum residency (32 sample(s)) 2,433,504 bytes maximum slop 66 MB total memory in use (1 MB lost due to fragmentation) ./freqHS +RTS -s 6.90s user 0.16s system 99% cpu 7.082 total Memory residency like the other strict versions, but really fast, probably due to faster comparisons of hash values vs comparisons of strings. Conclusion: make sure you are using a strict map, and if your keys are strings or otherwise have expensive comparisons, unordered-containers is the library for you. -k PS: I also tried mapping 'copy' on the input words to avoid storing large slices of the input, but it only worsened things: ./freqHS3 +RTS -s (Just 77432,113931) 3,109,585,024 bytes allocated in the heap 936,724,184 bytes copied during GC 87,831,888 bytes maximum residency (19 sample(s)) 8,835,440 bytes maximum slop 164 MB total memory in use (3 MB lost due to fragmentation) ./freqHS3 +RTS -s 12.71s user 0.31s system 99% cpu 13.060 total Perhaps if you managed to only copy new words it would look better? PPS: I tried to be careful juggling the results around, but there's always the possiblity of a mistake. Caveat lector! (Or should that be 'cave scriptor'?) PPPS: There are some small interface annoyances around, it'd be nice
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Just out of interest, did you try reading the input as plain old Strings? They may be unfashionable these days, and have their own type of badness in space and time performance, but might perhaps be a win over ByteStrings for extremely large datasets.Regards, MalcolmOn 01 Jun, 2011,at 02:49 PM, Aleksandar Dimitrov wrote:Hi John, > I think the issue is data sharing, as Brandon mentioned. A bytestring > consists of an offset, length, and a pointer. You're using a chunk size of > 64k, which means the generated bytestrings all have a pointer to that 64k of > data. Suppose there's one new word in that 64k, and it's near the beginning > of the chunk. Even though the word may only be a few characters long, it'll > reference the entire chunk and keep it from being GC'd. This seems to be the issue indeed! If the bytestrings on the hash map are holding references to the chunks, it is clear that we're going to consume memory scaling with the size of the input file, in case there are *any* new chunks generated. As I understand it, this what not a problem with the multiple copies of War and Peace, because all byte strings were already found on the hash table! On reading new input, old entries were found on the hash table, so only old chunks were kept in memory, the new ones could be gc'ed. In *realistic* data, however, the Long Tail is the reason that, after a while, some chunks of input would only be kept because there were a few byte strings referencing them. New words are rare, but they need not occur more frequently than once per chunk, in order to keep the whole chunk in memory, even though the chunk was mostly useless (most other data in this chunk would already be on the hash map.) > There are a few solutions to this. The first is to make a copy of the > bytestring so only the required data is retained. In my experiments this > wasn't helpful, but it would depend on your corpus. The second is to start > with smaller chunks. Using a chunk size of 1024 worked fairly well for me. > If your corpus is similar to natural language, I think it'll probably work > better for you as well. I think I solved this problem elegantly: I used Data.Text as hash map keys, instead of Data.ByteString. See the modified program below: > import qualified Data.Iteratee as I > import Data.Iteratee.Char > import Data.Iteratee.IO > > import qualified Data.HashMap.Strict as T > > import Data.Ord (comparing) > import Data.List (sortBy) > import System.Environment (getArgs) > > import qualified Data.ByteString as S > import qualified Data.Text as X > import Data.Text.Encoding (decodeUtf8) > > type Wordcounts = T.HashMap X.Text Int > > f' :: Monad m => I.Iteratee S.ByteString m Wordcounts > f' = I.joinI $ (enumLinesBS I.><> I.filter (not.S.null)) $ I.foldl' > (\t s -> T.insertWith (+) (convert s) 1 t) T.empty > where convert = decodeUtf8 > > main :: IO () > main = getArgs >>= fileDriverVBuf 65536 f'.head >>>= mapM_ (\(w,c)-> putStrLn $ X.unpack w ++ "\t" ++ show c).sortM > where sortM = sortBy (comparing snd) . T.toList Initial benchmarks on the realistic 100MB Gutenberg corpus I downloaded with my script yesterday report the following: htop says 120M memory residency towards the end of the life-cycle. <> 19MB avg, 44MB max residency, 133M in use (which is similar to what htop told me) and the heap profile clearly shows allocation and deallocation of the bytestrings: http://imgur.com/U1nyw (I'm attributing the ruggedness of the profile with the frequent little spikes to the iteratee chunk allocation and deallocation.) It seems I can't get rid of 50% of the heap being lag state. I don't quite understand yet what that is. I also don't know what INHERENT_USE is. But in any case, I now have a program that I can reasonably expect to run if I could fit the input file into memory. I might try to implement an analogous program in C++ or Java, just to see whether that would do better or similarly in terms of memory consumption. > Note that Johan's Ngram code also only keeps the minimum required data, > giving it a good memory profile. I didn't notice this last night because I > was testing with different data, and unfortunately the peculiar distribution > of that data masked this problem. This is kind of the big problem here: whether or not you'll see the particular behaviour I was so upset about seemed to depend on the corpus' distributional properties. In any case, I would like to thank you all for helping me understand and address the issue. I probably still have a long way to go in terms of understanding space-behaviour of Haskell programs, but at least I don't see ByteStrings as this black box of efficiency anymore, but actually understand how they're structured, and what they're good at, and what they aren't good at. (take home lesson: Data.Text is really nice. Also: if iteratee has a space leak, I probably didn't hit it, really. But: if reading byte-strings, one should mind the pointers tha
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Hi Aleks, On Wed, Jun 1, 2011 at 12:14 AM, Aleksandar Dimitrov wrote: > I implemented your method, with these minimal changes (i.e. just using a main > driver in the same file.) > >> countUnigrams :: Handle -> IO (M.Map S.ByteString Int) >> countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty >> >> main :: IO () >> main = do (f:_) <- getArgs >> openFile f ReadMode >>= countUnigrams >>= print . M.toList > > It seems to perform about 3x worse than the iteratee method in terms of time, > and worse in terms of space :-( On Brandon's War & Peace example, hGetLine > uses > 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for > the > small file, and around 2 minutes for the large file. That's curious. I chatted with Duncan Coutts today and he mentioned that hGetLine can be a bit slow as it needs to take a lock in every read and causes some copying, which could explain why it's slower than iteratee which works in blocks. However, I don't understand why it uses more memory. The ByteStrings that are returned by hGetLine should have an underlying storage of the same size as the ByteString (as reported by length). You can try to verify this by calling 'copy' on the ByteString before inserting it. It looks like hGetLine needs some love. > I also tried sprinkling strictness annotations throughout your above code, > but I > failed to produce good results :-( The strictness of the code I gave should be correct. The problem should be elsewhere. > I, unfortunately, don't really have any contact to "the elders," apart from > what > I read on their respective blogs… You and everyone else. :) I just spent enough time talking to people on IRC, reading good code, blogs and mailing list posts. I think Bryan described the process pretty well in his CUFP keynote: http://www.serpentine.com/blog/2009/09/23/video-of-my-cufp-keynote/ Cheers, Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Aleksandar Dimitrov writes: > Now, here's some observations: on a 75M input file (minuscule, compared to > what > I actually need) this program will eat 30M of heap space (says profiling) and > return in 14 secs. > > I have two problems with that: a) that's too much heap space, b) the actual > memory > residency is *much* worse. 30M isn't a lot these days. How does it scale? > ad b) average memory residency is at 38MB (this is OK, given heap consumption) > but max residency is at 130MB, which is unacceptable to me (remember that I > need > to run this on files *much* bigger than just 75M.) I think max residency (depending on how you measure) can be twice the heap size due to using copying GC. If you run short of memory, the runtime will switch to compacting GC which will be slower but use less memory. > I have tried and tried again to avoid writing programs in Haskell that would > leak space like BP likes to leak oil. I know the feeling. I think making a frequency table ought to be as simple as ... M.fromListWith (+) . map (,1) . words =<< readFile ... or at worst ... foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty . words =<< readFile... (which evaluates things strictly, and at least in my small tests, seem to use quite a bit less heap space). -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Hi John, > I think the issue is data sharing, as Brandon mentioned. A bytestring > consists of an offset, length, and a pointer. You're using a chunk size of > 64k, which means the generated bytestrings all have a pointer to that 64k of > data. Suppose there's one new word in that 64k, and it's near the beginning > of the chunk. Even though the word may only be a few characters long, it'll > reference the entire chunk and keep it from being GC'd. This seems to be the issue indeed! If the bytestrings on the hash map are holding references to the chunks, it is clear that we're going to consume memory scaling with the size of the input file, in case there are *any* new chunks generated. As I understand it, this what not a problem with the multiple copies of War and Peace, because all byte strings were already found on the hash table! On reading new input, old entries were found on the hash table, so only old chunks were kept in memory, the new ones could be gc'ed. In *realistic* data, however, the Long Tail is the reason that, after a while, some chunks of input would only be kept because there were a few byte strings referencing them. New words are rare, but they need not occur more frequently than once per chunk, in order to keep the whole chunk in memory, even though the chunk was mostly useless (most other data in this chunk would already be on the hash map.) > There are a few solutions to this. The first is to make a copy of the > bytestring so only the required data is retained. In my experiments this > wasn't helpful, but it would depend on your corpus. The second is to start > with smaller chunks. Using a chunk size of 1024 worked fairly well for me. > If your corpus is similar to natural language, I think it'll probably work > better for you as well. I think I solved this problem elegantly: I used Data.Text as hash map keys, instead of Data.ByteString. See the modified program below: > import qualified Data.Iteratee as I > import Data.Iteratee.Char > import Data.Iteratee.IO > > import qualified Data.HashMap.Strict as T > > import Data.Ord (comparing) > import Data.List (sortBy) > import System.Environment (getArgs) > > import qualified Data.ByteString as S > import qualified Data.Text as X > import Data.Text.Encoding (decodeUtf8) > > type Wordcounts = T.HashMap X.Text Int > > f' :: Monad m => I.Iteratee S.ByteString m Wordcounts > f' = I.joinI $ (enumLinesBS I.><> I.filter (not.S.null)) $ I.foldl' > (\t s -> T.insertWith (+) (convert s) 1 t) T.empty > where convert = decodeUtf8 > > main :: IO () > main = getArgs >>= fileDriverVBuf 65536 f'.head >>>= mapM_ (\(w,c)-> putStrLn $ X.unpack w ++ "\t" ++ show c).sortM > where sortM = sortBy (comparing snd) . T.toList Initial benchmarks on the realistic 100MB Gutenberg corpus I downloaded with my script yesterday report the following: htop says 120M memory residency towards the end of the life-cycle. <> 19MB avg, 44MB max residency, 133M in use (which is similar to what htop told me) and the heap profile clearly shows allocation and deallocation of the bytestrings: http://imgur.com/U1nyw (I'm attributing the ruggedness of the profile with the frequent little spikes to the iteratee chunk allocation and deallocation.) It seems I can't get rid of 50% of the heap being lag state. I don't quite understand yet what that is. I also don't know what INHERENT_USE is. But in any case, I now have a program that I can reasonably expect to run if I could fit the input file into memory. I might try to implement an analogous program in C++ or Java, just to see whether that would do better or similarly in terms of memory consumption. > Note that Johan's Ngram code also only keeps the minimum required data, > giving it a good memory profile. I didn't notice this last night because I > was testing with different data, and unfortunately the peculiar distribution > of that data masked this problem. This is kind of the big problem here: whether or not you'll see the particular behaviour I was so upset about seemed to depend on the corpus' distributional properties. In any case, I would like to thank you all for helping me understand and address the issue. I probably still have a long way to go in terms of understanding space-behaviour of Haskell programs, but at least I don't see ByteStrings as this black box of efficiency anymore, but actually understand how they're structured, and what they're good at, and what they aren't good at. (take home lesson: Data.Text is really nice. Also: if iteratee has a space leak, I probably didn't hit it, really. But: if reading byte-strings, one should mind the pointers that byte-strings actually are.) Regards, Aleks signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
That sounds like a plausible reason why naive copying explodes space. Something like string interning would be good here... and since you're hashing already... Edward Excerpts from Daniel Fischer's message of Wed Jun 01 06:46:24 -0400 2011: > On Wednesday 01 June 2011 12:28:28, John Lato wrote: > > There are a few solutions to this. The first is to make a copy of the > > bytestring so only the required data is retained. In my experiments > > this wasn't helpful, but it would depend on your corpus. The second is > > to start with smaller chunks. > > The third, check whether the word is already known, and *make a copy if > not*. That should only keep the required parts (including the currently > processed chunk) in memory. > ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
On Wednesday 01 June 2011 12:28:28, John Lato wrote: > There are a few solutions to this. The first is to make a copy of the > bytestring so only the required data is retained. In my experiments > this wasn't helpful, but it would depend on your corpus. The second is > to start with smaller chunks. The third, check whether the word is already known, and *make a copy if not*. That should only keep the required parts (including the currently processed chunk) in memory. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
On Wednesday 01 June 2011 12:13:54, John Lato wrote: > > From: Brandon Moore > > > > > > I was worried data sharing might mean your keys > > retain entire 64K chunks of the input. However, it > > seems enumLines depends on the StringLike ByteString > > instance, which just converts to and from String. > > That can't be efficient, but I suppose it avoids excessive sharing. > > That's true for 'enumLines', however the OP is using 'enumLinesBS', > which operates on bytestrings directly. > > Data sharing certainly could be an issue here. I tried performing > Data.ByteString.copy before inserting the key into the map, but that > used more memory. I don't have an explanation for this; it's not what > I would expect. If you don't copy, the small ByteStrings (the words) point into the large chunk, which peacefully rests in memory. A few ripples come from the chunk boundaries, where more often than not a word will have parts in both chunks, a new ByteString is allocated for those. A bit of space is wasted for the Constructor of the large chunk and for the word-separators ('\n' here), and at the boundaries. Not a big deal, the newlines cost one byte per text-word, but the constructors cost five or so machine words per text-word. If you copy each small ByteString from the chunk, a) the chunk stays in memory until it's completely copied [doesn't matter much with chunk sizes of a few k] and b) you get memory fragmentation because of alignment requirements. > > The other parameter which affects sharing is the chunk size. I got a > much better memory profile when using a chunksize of 1024 instead of > 65536. > > Oddly enough, when using the large chunksize I saw lower memory usage > from Data.Map, but with the small chunksize Data.HashMap has a > significant advantage. > > John Lato ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
On Wed, Jun 1, 2011 at 12:55 AM, Aleksandar Dimitrov < aleks.dimit...@googlemail.com> wrote: > On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote: > > > None of these leak space for me (all compiled with ghc-7.0.3 -O2). > > Performance was pretty comparable for every version, although > Aleksander's > > original did seem to have a very small edge. > > How big were your input corpora? > Today I was using multiple copies of War & Peace, as Brandon specified. Total size is about 90M. > > So it seems that I can't get rid of a factor of around 3x the input file > size. > Luckily, the dependency seems to be linear. Here's some profiling: > > < residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT > (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>> > ../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr 50.44s user 1.50s > system 99% cpu 52.064 total > > ghc itself reports 38MB avg (can live with that,) and 140MB max (too much.) > > Redirecting the program's output to a file will yield a mere 2.2M for the > data > gathered by the above script. Since those 2.2M of data are all I care > about, why > do I need so much more RAM to compute them? > > Are my demands unreasonable? > I think the issue is data sharing, as Brandon mentioned. A bytestring consists of an offset, length, and a pointer. You're using a chunk size of 64k, which means the generated bytestrings all have a pointer to that 64k of data. Suppose there's one new word in that 64k, and it's near the beginning of the chunk. Even though the word may only be a few characters long, it'll reference the entire chunk and keep it from being GC'd. There are a few solutions to this. The first is to make a copy of the bytestring so only the required data is retained. In my experiments this wasn't helpful, but it would depend on your corpus. The second is to start with smaller chunks. Using a chunk size of 1024 worked fairly well for me. If your corpus is similar to natural language, I think it'll probably work better for you as well. Note that Johan's Ngram code also only keeps the minimum required data, giving it a good memory profile. I didn't notice this last night because I was testing with different data, and unfortunately the peculiar distribution of that data masked this problem. John L ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
> > From: Brandon Moore > > > I was worried data sharing might mean your keys > retain entire 64K chunks of the input. However, it > seems enumLines depends on the StringLike ByteString > instance, which just converts to and from String. > That can't be efficient, but I suppose it avoids excessive sharing. That's true for 'enumLines', however the OP is using 'enumLinesBS', which operates on bytestrings directly. Data sharing certainly could be an issue here. I tried performing Data.ByteString.copy before inserting the key into the map, but that used more memory. I don't have an explanation for this; it's not what I would expect. The other parameter which affects sharing is the chunk size. I got a much better memory profile when using a chunksize of 1024 instead of 65536. Oddly enough, when using the large chunksize I saw lower memory usage from Data.Map, but with the small chunksize Data.HashMap has a significant advantage. John Lato ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote: > I can't reproduce the space leak here. I tried Aleksander's original code, > my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy > bytestring version. I unfortunately can't post the actual corpus here, because it's copyrighted. But there's plenty of ways to retrieve large amounts of data from the Internet. See below. > f' :: Monad m => I.Iteratee S.ByteString m Wordcounts > f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $ I.foldl' (\t s > -> T.insertWith (+) s 1 t) T.empty Neat, folding the specialised function into the enumeratee is nifty! One can tell that my experience with iteratee/enumerator has been only a day's worth for now :-\ > None of these leak space for me (all compiled with ghc-7.0.3 -O2). > Performance was pretty comparable for every version, although Aleksander's > original did seem to have a very small edge. How big were your input corpora? Here's an absolutely evil shell script that is going to make me donate money to project Gutenberg. It will gather a corpus that is still very small, but at least realistic in its distributional properties. +++ scrape.sh #!/bin/sh textfile=all_text.txt touch $textfile text=0 size=0 for i in `seq 10 300`; do wget -q "http://www.gutenberg.org/files/$i/$i.zip"; if [ -f $i.zip ]; then unzip -qq $i.zip tr -sc '[[:alpha:]]' '\n' < $i.txt >> $textfile text=`dc -e "$text 1 + p"` size=`du -h $textfile | cut -f 1` rm -f $i.zip $i.txt fi echo -n "\rFound $text of $i texts, total $size." done echo "\rFound $text texts, total $size" +++ It'll take a while to run, and the resulting corpus is just a fraction of what I'm using, but it serves well to illustrate the problem. If you want, you can increase the amount of data gathered by simply tweaking the numerical range in the seq statement. (If you make it gobble up more bandwidth, it might be polite to put a sleep somewhere in the inner loop. I'd host the resulting file myself, but I don't know if there aren't any conditions on that.) If you did not tweak the script, it should've gathered some 100MB of data. Running my unmodified original program, htop records 320MB of RAM usage towards the end (*without* profiling being enabled.) So it seems that I can't get rid of a factor of around 3x the input file size. Luckily, the dependency seems to be linear. Here's some profiling: <> ../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr 50.44s user 1.50s system 99% cpu 52.064 total ghc itself reports 38MB avg (can live with that,) and 140MB max (too much.) Redirecting the program's output to a file will yield a mere 2.2M for the data gathered by the above script. Since those 2.2M of data are all I care about, why do I need so much more RAM to compute them? Are my demands unreasonable? > I'd be happy to help track down a space leak in iteratee, but for now I'm > not seeing one. Thank you for your offer! Maybe I'm just seeing ghosts, and there is no space leak. But I really do think that the program is eating too much RAM. Regards, Aleks signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
From: "Edward Z. Yang" > > Hello Aleksandar, > > It is possible that the iteratees library is space leaking; I recall some > recent discussion to this effect. Your example seems simple enough that > you might recompile with a version of iteratees that has -auto-all enabled. > Unfortunately, it's not really a safe bet to assume your libraries are > leak free, and if you've pinpointed it down to a single line, and there > doesn't seem a way to squash the leak, I'd bet it's the library's fault. > > Edward > I can't reproduce the space leak here. I tried Aleksander's original code, my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy bytestring version. my iteratee version (only f' has changed from Aleksander's code): f' :: Monad m => I.Iteratee S.ByteString m Wordcounts f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $ I.foldl' (\t s -> T.insertWith (+) s 1 t) T.empty my lazy bytestring version > import Data.Iteratee.Char > import Data.List (foldl')import Data.Char (toLower) > > import Data.Ord (comparing) > import Data.List (sortBy) > import System.Environment (getArgs) > import qualified Data.ByteString.Lazy.Char8 as L > import qualified Data.HashMap.Strict as T > > f'2 = foldl' (\t s -> T.insertWith (+) s 1 t) T.empty . filter (not . L.null) . L.lines > > main2 :: IO () > main2 = getArgs >>= L.readFile .head >>= print . T.keys . f'2 None of these leak space for me (all compiled with ghc-7.0.3 -O2). Performance was pretty comparable for every version, although Aleksander's original did seem to have a very small edge. As someone already pointed out, keep in mind that this will use a lot of memory anyway, unless there's a lot of repetition of words. I'd be happy to help track down a space leak in iteratee, but for now I'm not seeing one. Best, John Lato ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
On Tue, May 31, 2011 at 02:13:14PM -0400, Edward Z. Yang wrote: > It is possible that the iteratees library is space leaking; I recall some > recent discussion to this effect. Your example seems simple enough that > you might recompile with a version of iteratees that has -auto-all enabled. If I understand you correctly, you imply that I should try compiling iteratee with profiling, no? I did install the iteratee library with profiling support (I have the cabal profiling flag globally set in my cabal config,) but my profiles so far seem to be blaming LAGging ByteStrings and HashMaps. I, unfortunately, do not know how I would test iteratee itself for a space leak here. > Unfortunately, it's not really a safe bet to assume your libraries are > leak free, and if you've pinpointed it down to a single line, and there > doesn't seem a way to squash the leak, I'd bet it's the library's fault. Since my knowledge of Haskell, and, in particular, high-performance Haskell, is very lacking, my current m.o. is to blame myself :-) It might be iteratee, but unfortunately, I have not found something that gives me better performance than iteratee yet. Regards, Aleks signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Hi Johan, > Here's how I would do it: I implemented your method, with these minimal changes (i.e. just using a main driver in the same file.) > countUnigrams :: Handle -> IO (M.Map S.ByteString Int) > countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty > > main :: IO () > main = do (f:_) <- getArgs > openFile f ReadMode >>= countUnigrams >>= print . M.toList It seems to perform about 3x worse than the iteratee method in terms of time, and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the small file, and around 2 minutes for the large file. For the large file, the code above starts consuming around 2.5GB of RAM, so it clearly has a space leak somewhere. Where, I don't know. If you want to try it out, here's a short command line to make a test corpus the way Brandon made one: +++ wget 'http://www.gutenberg.org/files/2600/2600.zip'; unzip 2600.zip; touch wnp100.txt; for i in {1..100}; do echo -n "$i "; cat 2600.txt >> wnp100.txt; done; echo "Done. +++ Note, that, as I detailed in my prior email to Brandon, even if you do end up with a (supposedly) non-leaking program for this example corpus, that doesn't mean it'll scale well to real world data. I also tried sprinkling strictness annotations throughout your above code, but I failed to produce good results :-( > We definitely need more accessible material on how to reliably write > fast Haskell code. There are those among us who can, but it shouldn't > be necessary to learn it in the way they did (i.e. by lots of > tinkering, learning from the elders, etc). I'd like to write a 60 (or > so) pages tutorial on the subject, but haven't found the time. I'd be an eager reader :-) Please do announce it on -cafe or the "usual places" should you ever come around to writing it! I, unfortunately, don't really have any contact to "the elders," apart from what I read on their respective blogs… > In addition to RWH, perhaps the slides from the talk on > high-performance Haskell I gave could be useful: > > > http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html Thanks, I'll give it a look later tomorrow! Regards, Aleks PS: Sorry I didn't answer you in #haskell, I ended up having to go afk for a short while. Thanks for all your help! signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
On Tue, May 31, 2011 at 11:43:27AM -0700, Brandon Moore wrote: > I can't reproduce heap usage growing with the > size of the input file. > > I made a word list from Project Gutenberg's > copy of "War and Peace" by > > tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt > > Using 1, 25, or 1000 repetitions of this ~3MB wordlist > shows about 100MB of address space used according > to top, and no more than 5MB or so of haskell heap > used according to the memory profile, with a flat > memory profile. This will lead to very small variance in your data. The effect I'm experiencing is so small in this case, that it's barely observable (but it is, see below.) > Is your memory usage growing with the size of the input > file, or the size of the histogram? While the histogram is naturally growing with the size of the input file, memory usage seems to be proportional mainly to the histogram. It is clear that, due to the effect of the Long Tail, the histogram is going to constantly grow in a real setting, as opposed to just replicating the same data. In your test case, the histogram is *not* growing with the size of the input file. The memory usage is proportional to the histogram, which is proportional to the file size. That is not the problem. The problem is, that, compared to the size of the histogram, the memory consumption is *inadequately* high. Here's some more data, using your Tolstoi example: du file.txt 344Mfile.txt <> As you can see, memory residency is at 8 MB avg, 10 MB max. This is the War&Peace file, replicated 100 times. Let's look at the result for the file *without* first replicating it 100 times: du words.txt 3.0Mwords.txt <> 4.8MB avg, 9.1 MB max. It seems input file size *does* have an effect of sorts, but it's negligible. What is more interesting is this: the file is a whopping 3MB big. How on earth does the program consume almost 5 MB *on average*? This is *not* constant memory usage. This is memory usage trivial enough to not be worth a fuss for small inputs, but unfortunately, it gets larger as soon as you increase the file size (in a realistic fashion: i.e. you'll also increase the unigram count.) > I was worried data sharing might mean your keys > retain entire 64K chunks of the input. However, it > seems enumLines depends on the StringLike ByteString > instance, which just converts to and from String. Ouch, that sounds like something worth fixing. > The other thing that occurs to me is that the total size of > your keys would also be approximately the size of the input > file if you were using plain text without each word split onto > a separate line. Well, I am not. The corpus is a word-per-line corpus, I'm reading a word per line, and adding that to my hash map. This should never result in a data structure even close to the size of the original corpus. It could be, in a very unrealistic worst case scenario. But even a corpus of 30GB of Poe and Heidegger isn't going to make that happen. Furthermore, mine is not such a scenario at all. As I said, if you reduce the corpus to a set of words (i.e. a set of unigrams) you get a 40MB file from a 1.4GB corpus. Why is it, that in order to create that 40MB file from a 1.4GB corpus, my trivial little program needs somewhere north of 6-8 GB of RAM? In this trivial example for War and Peace, why is it that in order to create the unigram table for War and Peace, which ends up being a mere 201KB big, we're chomping through 5MB on average, and nearly 10MB max? That's at least 25 times more than we actually *should* have (yes, I know that the RTS is somewhere there, too, but I think it's not a problem to ignore that for now.) Regards, Aleks signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Wait, do ByteStrings show up on a heap profile, if the space is allocated with malloc? Anyway, I think my tests still show that the memory used by the process doesn't grow simply by adding more data, if you are no longer added keys to the map. - Original Message - > From: Brandon Moore > To: Aleksandar Dimitrov ; > "haskell-cafe@haskell.org" > Cc: > Sent: Tuesday, May 31, 2011 1:43 PM > Subject: Re: [Haskell-cafe] How on Earth Do You Reason about Space? > > I can't reproduce heap usage growing with the > size of the input file. > > I made a word list from Project Gutenberg's > copy of "War and Peace" by > > tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt > > Using 1, 25, or 1000 repetitions of this ~3MB wordlist > shows about 100MB of address space used according > to top, and no more than 5MB or so of haskell heap > used according to the memory profile, with a flat > memory profile. > > > Is your memory usage growing with the size of the input > file, or the size of the histogram? > > I was worried data sharing might mean your keys > retain entire 64K chunks of the input. However, it > seems enumLines depends on the StringLike ByteString > instance, which just converts to and from String. > That can't be efficient, but I suppose it avoids excessive sharing. > > The other thing that occurs to me is that the total size of > your keys would also be approximately the size of the input > file if you were using plain text without each word split onto > a separate line. > > Brandon > ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
I can't reproduce heap usage growing with the size of the input file. I made a word list from Project Gutenberg's copy of "War and Peace" by tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt Using 1, 25, or 1000 repetitions of this ~3MB wordlist shows about 100MB of address space used according to top, and no more than 5MB or so of haskell heap used according to the memory profile, with a flat memory profile. Is your memory usage growing with the size of the input file, or the size of the histogram? I was worried data sharing might mean your keys retain entire 64K chunks of the input. However, it seems enumLines depends on the StringLike ByteString instance, which just converts to and from String. That can't be efficient, but I suppose it avoids excessive sharing. The other thing that occurs to me is that the total size of your keys would also be approximately the size of the input file if you were using plain text without each word split onto a separate line. Brandon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Hello Aleksandar, It is possible that the iteratees library is space leaking; I recall some recent discussion to this effect. Your example seems simple enough that you might recompile with a version of iteratees that has -auto-all enabled. Unfortunately, it's not really a safe bet to assume your libraries are leak free, and if you've pinpointed it down to a single line, and there doesn't seem a way to squash the leak, I'd bet it's the library's fault. Edward ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
Hi Aleksandar, On Tue, May 31, 2011 at 6:10 PM, Aleksandar Dimitrov wrote: > Say, we have an input file that contains a word per line. I want to find all > unigrams (unique words) in that file, and associate with them the amount of > times they occurred in the file. This would allow me, for example, to make a > list of word frequencies in a given text. Here's how I would do it: {-# LANGUAGE BangPatterns #-} module Ngram (countUnigrams) where import qualified Data.ByteString as S import qualified Data.HashMap.Strict as M import System.IO foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a foldLines f z0 h = go z0 where go !z = do eof <- hIsEOF h if eof then return z else do line <- S.hGetLine h go $ f z line {-# INLINE foldLines #-} -- Example use countUnigrams :: IO (M.HashMap S.ByteString Int) countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty stdin > RANT > > I have tried and tried again to avoid writing programs in Haskell that would > leak space like BP likes to leak oil. However, I have yet to produce a single > instance of a program that would do anything at all and at the same time > consume > less memory than there is actual data in the input file. > > It is very disconcerting to me that I seem to be unable, even after quite some > practice, to identify space leaks in trivial programs like the above. I know > of > no good resource to educate myself in that regard. I have read the GHC manual, > RWH's chapter on profiling, also "Inside T5"'s recent series on the Haskell > heap, but no dice. Even if I can clearly see the exact line where at least > some > of the leaking happens (as I can in this case,) it seems impossible for me to > prevent it. > > *thank you very much* for reading this far. This is probably a mostly useless > email anyhow, I just had to get it off my chest. Maybe, just maybe, someone > among you will have a crucial insight that will save Haskell for me :-) But > currently, I see no justification to not start my next project in Lua, Python > or > Java. Sure, Haskell's code is pretty, and it's fun, but if I can't actually > *run* it, why bother? (Yes, this isn't the first time I've ran into this > problem …) We definitely need more accessible material on how to reliably write fast Haskell code. There are those among us who can, but it shouldn't be necessary to learn it in the way they did (i.e. by lots of tinkering, learning from the elders, etc). I'd like to write a 60 (or so) pages tutorial on the subject, but haven't found the time. In addition to RWH, perhaps the slides from the talk on high-performance Haskell I gave could be useful: http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html Cheers, Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
> In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created > too _early_. (Drag are those that are kept for longer than necessary.) Lots > of Lag generally means your program is too strict - it is forcing structure > long before it needs to. To fix it, you need to make things lazier. My first > suspicion would fall on ByteString. Indeed, thank you, I mixed those up. I cannot use lazy byte strings here, because of the way Data.Iteratee.Char's enumLinesBS works (it takes strict byte strings.) The only other strictness in there is ($!) and foldl'. The latter is necessary for the program to even run (i.e. not run out of stack space.) The strict application in step's argument seems necessary, since without it, the program consumes 1200 MB of RAM (on my 75MB test data,) and takes very very long. The hb profile indicates that a lot of data is allocated up front, and then gradually eliminated. Interestingly, removing ($!) here seemed to *introduce* unnecessary strictness. Here's the hb profile without ($!): http://imgur.com/Ex7Pd I don't understand what is happening here :-\ I only just started using iteratees. Regards, Aleks signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
ad a) heap consumption is too high for two reasons: firstly, the actual data I care about is much less than there's data on the heap. Secondly, about half the heap space is in LAG state. Here are profiles that will illustrate this: http://imgur.com/wBWmJ&XN1mW
Re: [Haskell-cafe] How on Earth Do You Reason about Space?
On Tue, May 31, 2011 at 06:10:00PM +0200, Aleksandar Dimitrov wrote: > ad a) heap consumption is too high for two reasons: firstly, the actual data I > care about is much less than there's data on the heap. Secondly, about half > the > heap space is in LAG state. Here are profiles that will illustrate this: > http://imgur.com/wBWmJ&XN1mW
[Haskell-cafe] How on Earth Do You Reason about Space?
Dear Cafe, (Excuse the probably very ranty email; I am, unfortunately, at the end of my wits, and I hope that as fellow programmers, you will understand that this is among the most dreadful situations for our kind to be in.) Say, we have an input file that contains a word per line. I want to find all unigrams (unique words) in that file, and associate with them the amount of times they occurred in the file. This would allow me, for example, to make a list of word frequencies in a given text. Simple enough task. Here's an implementation using iteratees (lazy IO is evil) and unordered-containers' Data.HashMap.Strict, which enforces WHNF in values and keys: > import qualified Data.ByteString.Char8 as S > import qualified Data.Iteratee as I > import Data.Iteratee.IO > > import qualified Data.HashMap.Strict as T > > import Data.Iteratee.Char > import Data.List (foldl') > import Data.Char (toLower) > > import Data.Ord (comparing) > import Data.List (sortBy) > import System.Environment (getArgs) > > type Wordcounts = T.HashMap S.ByteString Int > > f' :: Monad m => I.Iteratee S.ByteString m Wordcounts > f' = I.joinI $ enumLinesBS (I.liftI $ step T.empty) > where step t (I.Chunk str) = I.liftI (step $! foldl' maybeIncrement t str) > step t stream = I.idone t stream > maybeIncrement t s > | s == S.empty = t > | otherwise= {-# SCC "m-I-other" #-} T.insertWith (+) s 1 t > > main :: IO () > main = getArgs >>= fileDriverVBuf 65536 f'.head >>= print.prettyList > where prettyList = -- sortBy (comparing snd) . T.toList >T.keys Some lines are empty, and I don't want them to be recorded, so that's why maybeIncrement is necessary. hpaste of this code: http://hpaste.org/47300/spaceleak (ignore convert, that's yet another issue.) Now, here's some observations: on a 75M input file (minuscule, compared to what I actually need) this program will eat 30M of heap space (says profiling) and return in 14 secs. I have two problems with that: a) that's too much heap space, b) the actual memory residency is *much* worse. ad b) average memory residency is at 38MB (this is OK, given heap consumption) but max residency is at 130MB, which is unacceptable to me (remember that I need to run this on files *much* bigger than just 75M.) <> In fact, htop reports total memory residency of the program at around 320 MB at the end of its life-cycle (compiled and ran without profiling options.) I tried running this program on a 1.4GB file, and had to kill it when at 3.5GB memory consumption, my computer started paging. The actual hash-map, however, shouldn't be much bigger than in the 75MB case (I expect about twice the size,) since the input is natural language. I redirected the output of the program (showing a list of assoc pairs that were in the hash map) to a file, and that file measured 11MB in the case of a 75MB corpus, and 40MB when I ran the program on a 1.4GB corpus (I had to use a machine with 24GB of RAM to be able to run this.) ad a) heap consumption is too high for two reasons: firstly, the actual data I care about is much less than there's data on the heap. Secondly, about half the heap space is in LAG state. Here are profiles that will illustrate this: http://imgur.com/wBWmJ&XN1mW