Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-02 Thread Ketil Malde

Since frequency counts are an important use of map-like data structures,
I did a brief test of the available options.  First using regular
strings for input, and Data.Map.fromListWith - i.e. the operational bit being:

  freqs :: [String] -> M.Map String Int
  freqs = M.fromListWith (+) . map (,1)

This runs on a 14M corpus consisting of King James Bible, collected
works of Shakespeare, and War and Peace.

./freqstr1 +RTS -s 
   5,093,386,368 bytes allocated in the heap
   2,682,667,904 bytes copied during GC
 261,110,368 bytes maximum residency (20 sample(s))
   9,018,000 bytes maximum slop
 623 MB total memory in use (10 MB lost due to fragmentation)
./freqstr1 +RTS -s  21.43s user 0.78s system 99% cpu 22.285 total

Kinda expensive, 250MB to store word frequencies of 14MB text.

Now, changing to 

  freqs :: [String] -> M.Map String Int
  freqs = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty

i.e. using strict insertion, avoiding the buildup of lazy thunks for the
counts.

./freqstr2 +RTS -s  -- strings, using strict insertion
   4,754,110,096 bytes allocated in the heap
   2,089,527,240 bytes copied during GC
  27,039,112 bytes maximum residency (66 sample(s))
 613,192 bytes maximum slop
  80 MB total memory in use (2 MB lost due to fragmentation)
./freqstr2 +RTS -s  17.48s user 0.13s system 99% cpu 17.665 total

This reduced maximum memory consumption to one tenth, still bigger than the
input corpus, but clearly not too bad.  A bit faster, too, in spite of
probably doing more work.

Using ByteStrings instead, first fromListWith:

./freq +RTS -s
(Just 77432,113931)
   3,880,059,568 bytes allocated in the heap
   1,467,507,808 bytes copied during GC
 174,573,752 bytes maximum residency (14 sample(s))
   8,222,600 bytes maximum slop
 385 MB total memory in use (6 MB lost due to fragmentation)
./freq +RTS -s  14.26s user 0.49s system 99% cpu 14.798 total

About half the memory of the String version, and 25% faster.  With strict insertion:

./freq2 +RTS -s   -- map using strict insertion
   3,761,614,312 bytes allocated in the heap
 849,806,000 bytes copied during GC
  23,950,328 bytes maximum residency (35 sample(s))
   2,376,904 bytes maximum slop
  58 MB total memory in use (1 MB lost due to fragmentation)
./freq2 +RTS -s  11.14s user 0.13s system 99% cpu 11.295 total

Parallel to the String case, this is a lot more frugal with memory, and
30% faster.  Now, I tried Data.HashMap from the hashmap library:

./freqH1 +RTS -s   -- hashmap using fromListWith
   4,552,922,784 bytes allocated in the heap
   2,990,287,536 bytes copied during GC
 401,247,912 bytes maximum residency (14 sample(s))
  42,098,016 bytes maximum slop
 957 MB total memory in use (15 MB lost due to fragmentation)
./freqH1 +RTS -s  15.68s user 1.53s system 99% cpu 17.277 total

./freqH2 +RTS -s   -- hashmap using foldl' insertWith
   4,518,146,968 bytes allocated in the heap
   2,986,973,352 bytes copied during GC
 394,502,832 bytes maximum residency (14 sample(s))
  41,020,040 bytes maximum slop
 957 MB total memory in use (15 MB lost due to fragmentation)
./freqH2 +RTS -s  15.86s user 1.62s system 99% cpu 17.537 total

HashMap doesn't provide a strict insertWith, so this is similar to the
lazy insertions above.  A bit worse, actually, probably due to the
overhead of hashing.

Then I discovered that Johan's HashMap lives in a different library
(unordered-containers, not the hashmap package), and thought I'd try that too
for completeness.

./freqHS +RTS -s  -- hashmap strict (unordered-containers)
   2,628,628,752 bytes allocated in the heap
 945,571,872 bytes copied during GC
  26,635,744 bytes maximum residency (32 sample(s))
   2,433,504 bytes maximum slop
  66 MB total memory in use (1 MB lost due to fragmentation)

./freqHS +RTS -s  6.90s user 0.16s system 99% cpu 7.082 total

Memory residency like the other strict versions, but really fast,
probably due to faster comparisons of hash values vs comparisons of
strings. 
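
For reference, the strict unordered-containers variant boils down to something
like this (a sketch only; the exact freqHS source wasn't included here, and
reading whitespace-separated words from stdin is my simplification):

  import qualified Data.ByteString.Char8 as B
  import qualified Data.HashMap.Strict as HM
  import Data.List (foldl')

  -- strict insertion: counts are forced as they are bumped
  freqs :: [B.ByteString] -> HM.HashMap B.ByteString Int
  freqs = foldl' (\m w -> HM.insertWith (+) w 1 m) HM.empty

  main :: IO ()
  main = do
    m <- (freqs . B.words) `fmap` B.getContents   -- corpus on stdin
    print (HM.lookup (B.pack "the") m, HM.size m)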

Conclusion: make sure you are using a strict map, and if your keys are
strings or otherwise have expensive comparisons, unordered-containers is
the library for you.

-k

PS: I also tried mapping 'copy' on the input words to avoid storing
large slices of the input, but it only worsened things:

./freqHS3 +RTS -s 
(Just 77432,113931)
   3,109,585,024 bytes allocated in the heap
 936,724,184 bytes copied during GC
  87,831,888 bytes maximum residency (19 sample(s))
   8,835,440 bytes maximum slop
 164 MB total memory in use (3 MB lost due to fragmentation)
./freqHS3 +RTS -s  12.71s user 0.31s system 99% cpu 13.060 total

Perhaps if you managed to only copy new words it would look better?

PPS: I tried to be careful juggling the results around, but there's
always the possibility of a mistake.  Caveat lector!  (Or should that be
'cave scriptor'?)

PPPS: There are some small interface annoyances around, it'd be nice

Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread malcolm.wallace
Just out of interest, did you try reading the input as plain old Strings?  They
may be unfashionable these days, and have their own type of badness in space and
time performance, but might perhaps be a win over ByteStrings for extremely large
datasets.

Regards,
Malcolm

On 01 Jun, 2011, at 02:49 PM, Aleksandar Dimitrov wrote:

Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Johan Tibell
Hi Aleks,

On Wed, Jun 1, 2011 at 12:14 AM, Aleksandar Dimitrov
 wrote:
> I implemented your method, with these minimal changes (i.e. just using a main
> driver in the same file.)
>
>> countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
>> countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty
>>
>> main :: IO ()
>> main = do (f:_) <- getArgs
>>           openFile f ReadMode >>= countUnigrams >>= print . M.toList
>
> It seems to perform about 3x worse than the iteratee method in terms of time,
> and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses
> 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the
> small file, and around 2 minutes for the large file.

That's curious. I chatted with Duncan Coutts today and he mentioned
that hGetLine can be a bit slow as it needs to take a lock in every
read and causes some copying, which could explain why it's slower than
iteratee which works in blocks. However, I don't understand why it
uses more memory. The ByteStrings that are returned by hGetLine should
have an underlying storage of the same size as the ByteString (as
reported by length). You can try to verify this by calling 'copy' on
the ByteString before inserting it.
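
For concreteness, that check would look something like this (a sketch; it reuses
foldLines and the imports from the code quoted above, and S.copy is
Data.ByteString.copy):

  countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
  countUnigrams = foldLines (\ m s -> M.insertWith (+) (S.copy s) 1 m) M.empty

If maximum residency drops noticeably with the copy in place, the extra memory
was indeed coming from lines sharing a larger underlying buffer.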

It looks like hGetLine needs some love.

> I also tried sprinkling strictness annotations throughout your above code,
> but I failed to produce good results :-(

The strictness of the code I gave should be correct. The problem
should be elsewhere.

> I, unfortunately, don't really have any contact to "the elders," apart from what
> I read on their respective blogs…

You and everyone else. :) I just spent enough time talking to people
on IRC, reading good code, blogs and mailing list posts. I think Bryan
described the process pretty well in his CUFP keynote:

http://www.serpentine.com/blog/2009/09/23/video-of-my-cufp-keynote/

Cheers,
Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Ketil Malde
Aleksandar Dimitrov  writes:

> Now, here's some observations: on a 75M input file (minuscule, compared to what
> I actually need) this program will eat 30M of heap space (says profiling) and
> return in 14 secs.
>
> I have two problems with that: a) that's too much heap space, b) the actual
> memory residency is *much* worse.

30M isn't a lot these days.  How does it scale?

> ad b) average memory residency is at 38MB (this is OK, given heap consumption)
> but max residency is at 130MB, which is unacceptable to me (remember that I
> need to run this on files *much* bigger than just 75M.)

I think max residency (depending on how you measure) can be twice the
heap size due to using copying GC.  If you run short of memory, the
runtime will switch to compacting GC which will be slower but use less
memory. 
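
For reference, both behaviours can also be requested explicitly with RTS flags
(a sketch; './yourprog' stands in for whatever binary you're measuring):

  ./yourprog +RTS -s -c      -- always collect the oldest generation by compaction
  ./yourprog +RTS -s -M1g    -- cap the heap at 1 GB; compaction kicks in near the cap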

> I have tried and tried again to avoid writing programs in Haskell that would
> leak space like BP likes to leak oil.

I know the feeling.  I think making a frequency table ought to be as
simple as 

   ... M.fromListWith (+) . map (,1) . words =<< readFile ...

or at worst

  ... foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty . words =<< readFile...

(which evaluates things strictly, and at least in my small tests, seem
to use quite a bit less heap space).
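
Spelled out as a complete program, that second version looks roughly like this
(a sketch with the elided pieces filled in; "corpus.txt" is a placeholder):

  import qualified Data.Map as M
  import Data.List (foldl')

  main :: IO ()
  main = do
    ws <- words `fmap` readFile "corpus.txt"
    -- insertWith' forces the count on every bump, so no thunk chain builds up
    print . M.toList $ foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty ws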

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Aleksandar Dimitrov
Hi John,
> I think the issue is data sharing, as Brandon mentioned.  A bytestring
> consists of an offset, length, and a pointer.  You're using a chunk size of
> 64k, which means the generated bytestrings all have a pointer to that 64k of
> data.  Suppose there's one new word in that 64k, and it's near the beginning
> of the chunk.  Even though the word may only be a few characters long, it'll
> reference the entire chunk and keep it from being GC'd.

This seems to be the issue indeed! If the bytestrings on the hash map are
holding references to the chunks, it is clear that we're going to consume memory
scaling with the size of the input file, in case there are *any* new chunks
generated.

As I understand it, this was not a problem with the multiple copies of War and
Peace, because all byte strings were already found in the hash table! On reading
new input, old entries were found in the hash table, so only old chunks were
kept in memory; the new ones could be GC'd.

In *realistic* data, however, the Long Tail means that, after a while, some
chunks of input are kept alive only because a few byte strings reference them.
New words are rare, but a single new word per chunk is enough to keep the whole
chunk in memory, even though the chunk is mostly useless (most other data in
that chunk would already be in the hash map.)

> There are a few solutions to this.  The first is to make a copy of the
> bytestring so only the required data is retained.  In my experiments this
> wasn't helpful, but it would depend on your corpus.  The second is to start
> with smaller chunks.  Using a chunk size of 1024 worked fairly well for me.
>  If your corpus is similar to natural language, I think it'll probably work
> better for you as well.

I think I solved this problem elegantly: I used Data.Text as hash map keys,
instead of Data.ByteString. See the modified program below:

> import qualified Data.Iteratee as I
> import Data.Iteratee.Char
> import Data.Iteratee.IO
> 
> import qualified Data.HashMap.Strict as T
> 
> import Data.Ord (comparing)
> import Data.List (sortBy)
> import System.Environment (getArgs)
> 
> import qualified Data.ByteString as S
> import qualified Data.Text as X
> import Data.Text.Encoding (decodeUtf8)
> 
> type Wordcounts = T.HashMap X.Text Int
> 
> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $
>        I.foldl' (\t s -> T.insertWith (+) (convert s) 1 t) T.empty
>   where convert = decodeUtf8
> 
> main :: IO ()
> main = getArgs >>= fileDriverVBuf 65536 f' . head
>          >>= mapM_ (\(w,c) -> putStrLn $ X.unpack w ++ "\t" ++ show c) . sortM
>   where sortM = sortBy (comparing snd) . T.toList

Initial benchmarks on the realistic 100MB Gutenberg corpus I downloaded with my
script yesterday report the following: htop says 120M memory residency towards
the end of the life-cycle.

<>

19MB avg, 44MB max residency, 133M in use (which is similar to what htop told
me), and the heap profile clearly shows allocation and deallocation of the
bytestrings: http://imgur.com/U1nyw (I'm attributing the ruggedness of the
profile, the frequent little spikes, to the iteratee chunk allocation and
deallocation.) It seems I can't get rid of 50% of the heap being in LAG state. I
don't quite understand yet what that is. I also don't know what INHERENT_USE is.

But in any case, I now have a program that I can reasonably expect to run if I
could fit the input file into memory. I might try to implement an analogous
program in C++ or Java, just to see whether that would do better or similarly in
terms of memory consumption.

> Note that Johan's Ngram code also only keeps the minimum required data,
> giving it a good memory profile.   I didn't notice this last night because I
> was testing with different data, and unfortunately the peculiar distribution
> of that data masked this problem.

This is kind of the big problem here: whether or not you'll see the particular
behaviour I was so upset about seemed to depend on the corpus' distributional
properties.

In any case, I would like to thank you all for helping me understand and address
the issue. I probably still have a long way to go in terms of understanding
space-behaviour of Haskell programs, but at least I don't see ByteStrings as
this black box of efficiency anymore, but actually understand how they're
structured, and what they're good at, and what they aren't good at.

(take home lesson: Data.Text is really nice. Also: if iteratee has a space leak,
I probably didn't hit it, really. But: if reading byte-strings, one should mind
the pointers that byte-strings actually are.)

Regards,
Aleks


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Edward Z. Yang
That sounds like a plausible reason why naive copying explodes space.
Something like string interning would be good here... and since you're
hashing already...

Edward

Excerpts from Daniel Fischer's message of Wed Jun 01 06:46:24 -0400 2011:
> On Wednesday 01 June 2011 12:28:28, John Lato wrote:
> > There are a few solutions to this.  The first is to make a copy of the
> > bytestring so only the required data is retained.  In my experiments
> > this wasn't helpful, but it would depend on your corpus.  The second is
> > to start with smaller chunks.
> 
> The third, check whether the word is already known, and *make a copy if not*.
> That should only keep the required parts (including the currently processed
> chunk) in memory.
> 

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Daniel Fischer
On Wednesday 01 June 2011 12:28:28, John Lato wrote:
> There are a few solutions to this.  The first is to make a copy of the
> bytestring so only the required data is retained.  In my experiments
> this wasn't helpful, but it would depend on your corpus.  The second is
> to start with smaller chunks.

The third, check whether the word is already known, and *make a copy if 
not*. That should only keep the required parts (including the currently 
processed chunk) in memory.
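
A sketch of that third option, using the strict HashMap from the program under
discussion (insertWord is a made-up name):

  import qualified Data.ByteString as S
  import qualified Data.HashMap.Strict as T

  insertWord :: T.HashMap S.ByteString Int -> S.ByteString -> T.HashMap S.ByteString Int
  insertWord m w = case T.lookup w m of
    Just n  -> T.insert w (n + 1) m     -- word already known: reuse the old key
    Nothing -> T.insert (S.copy w) 1 m  -- new word: copy it, so the key stops
                                        -- pinning the chunk it was sliced from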

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Daniel Fischer
On Wednesday 01 June 2011 12:13:54, John Lato wrote:
> > From: Brandon Moore 
> > 
> > 
> > I was worried data sharing might mean your keys
> > retain entire 64K chunks of the input. However, it
> > seems enumLines depends on the StringLike ByteString
> > instance, which just converts to and from String.
> > That can't be efficient, but I suppose it avoids excessive sharing.
> 
> That's true for 'enumLines', however the OP is using 'enumLinesBS',
> which operates on bytestrings directly.
> 
> Data sharing certainly could be an issue here.  I tried performing
> Data.ByteString.copy before inserting the key into the map, but that
> used more memory.  I don't have an explanation for this; it's not what
> I would expect.

If you don't copy, the small ByteStrings (the words) point into the large 
chunk, which peacefully rests in memory. A few ripples come from the chunk 
boundaries, where more often than not a word will have parts in both 
chunks, a new ByteString is allocated for those.
A bit of space is wasted for the Constructor of the large chunk and for the 
word-separators ('\n' here), and at the boundaries. Not a big deal, the 
newlines cost one byte per text-word, but the constructors cost five or so 
machine words per text-word.

If you copy each small ByteString from the chunk, a) the chunk stays in 
memory until it's completely copied [doesn't matter much with chunk sizes 
of a few k] and b) you get memory fragmentation because of alignment 
requirements.

> 
> The other parameter which affects sharing is the chunk size.  I got a
> much better memory profile when using a chunksize of 1024 instead of 
> 65536.
> 
> Oddly enough, when using the large chunksize I saw lower memory usage
> from Data.Map, but with the small chunksize Data.HashMap has a
> significant advantage.
> 
> John Lato

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread John Lato
On Wed, Jun 1, 2011 at 12:55 AM, Aleksandar Dimitrov <
aleks.dimit...@googlemail.com> wrote:

> On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
>
> > None of these leak space for me (all compiled with ghc-7.0.3 -O2).
> > Performance was pretty comparable for every version, although Aleksander's
> > original did seem to have a very small edge.
>
> How big were your input corpora?
>

Today I was using multiple copies of War & Peace, as Brandon specified.
Total size is about 90M.


>
> So it seems that I can't get rid of a factor of around 3x the input file
> size.
> Luckily, the dependency seems to be linear. Here's some profiling:
>
> < residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT
> (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>>
> ../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s
> system 99% cpu 52.064 total
>
> ghc itself reports 38MB avg (can live with that,) and 140MB max (too much.)
>
> Redirecting the program's output to a file will yield a mere 2.2M for the data
> gathered by the above script. Since those 2.2M of data are all I care about,
> why do I need so much more RAM to compute them?
>
> Are my demands unreasonable?
>

I think the issue is data sharing, as Brandon mentioned.  A bytestring
consists of an offset, length, and a pointer.  You're using a chunk size of
64k, which means the generated bytestrings all have a pointer to that 64k of
data.  Suppose there's one new word in that 64k, and it's near the beginning
of the chunk.  Even though the word may only be a few characters long, it'll
reference the entire chunk and keep it from being GC'd.
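
A tiny illustration of that sharing (a sketch, not from the original mail): the
words produced by splitting a chunk are zero-copy slices of it, whereas
Data.ByteString.copy yields an independent string holding only the word itself.

import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  let chunk  = B.pack (unwords (replicate 10000 "known") ++ " rareword")
      slice  = last (B.words chunk)  -- a slice: shares storage with the whole chunk
      copied = B.copy slice          -- a fresh buffer holding only "rareword"
  print (slice == copied, B.length chunk, B.length copied)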

There are a few solutions to this.  The first is to make a copy of the
bytestring so only the required data is retained.  In my experiments this
wasn't helpful, but it would depend on your corpus.  The second is to start
with smaller chunks.  Using a chunk size of 1024 worked fairly well for me.
 If your corpus is similar to natural language, I think it'll probably work
better for you as well.

Note that Johan's Ngram code also only keeps the minimum required data,
giving it a good memory profile.   I didn't notice this last night because I
was testing with different data, and unfortunately the peculiar distribution
of that data masked this problem.

John L
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread John Lato
>
> From: Brandon Moore 
>
>
> I was worried data sharing might mean your keys
> retain entire 64K chunks of the input. However, it
> seems enumLines depends on the StringLike ByteString
> instance, which just converts to and from String.
> That can't be efficient, but I suppose it avoids excessive sharing.


That's true for 'enumLines', however the OP is using 'enumLinesBS', which
operates on bytestrings directly.

Data sharing certainly could be an issue here.  I tried performing
Data.ByteString.copy before inserting the key into the map, but that used
more memory.  I don't have an explanation for this; it's not what I would
expect.

The other parameter which affects sharing is the chunk size.  I got a much
better memory profile when using a chunksize of 1024 instead of  65536.

Oddly enough, when using the large chunksize I saw lower memory usage from
Data.Map, but with the small chunksize Data.HashMap has a significant
advantage.

John Lato
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
> I can't reproduce the space leak here.  I tried Aleksander's original code,
> my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy
> bytestring version.

I unfortunately can't post the actual corpus here, because it's copyrighted. But
there's plenty of ways to retrieve large amounts of data from the Internet. See
below.


> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $ I.foldl' (\t s
> -> T.insertWith (+) s 1 t) T.empty

Neat, folding the specialised function into the enumeratee is nifty! One can
tell that my experience with iteratee/enumerator has been only a day's worth for
now :-\

> None of these leak space for me (all compiled with ghc-7.0.3 -O2).
> Performance was pretty comparable for every version, although Aleksander's
> original did seem to have a very small edge.

How big were your input corpora?

Here's an absolutely evil shell script that is going to make me donate money to
project Gutenberg. It will gather a corpus that is still very small, but at
least realistic in its distributional properties.

+++ scrape.sh

#!/bin/sh

textfile=all_text.txt

touch $textfile

text=0
size=0
for i in `seq 10 300`; do
wget -q "http://www.gutenberg.org/files/$i/$i.zip";
if [ -f $i.zip ]; then
unzip -qq $i.zip
tr -sc '[[:alpha:]]' '\n' < $i.txt >> $textfile
text=`dc -e "$text 1 + p"`
size=`du -h $textfile | cut -f 1`
rm -f $i.zip $i.txt
fi
echo -n "\rFound $text of $i texts, total $size."
done
echo "\rFound $text texts, total $size"

+++

It'll take a while to run, and the resulting corpus is just a fraction of what
I'm using, but it serves well to illustrate the problem. If you want, you can
increase the amount of data gathered by simply tweaking the numerical range in
the seq statement. (If you make it gobble up more bandwidth, it might be polite
to put a sleep somewhere in the inner loop. I'd host the resulting file myself,
but I don't know if there aren't any conditions on that.)

If you did not tweak the script, it should've gathered some 100MB of data.

Running my unmodified original program, htop records 320MB of RAM usage towards
the end (*without* profiling being enabled.)

So it seems that I can't get rid of a factor of around 3x the input file size.
Luckily, the dependency seems to be linear. Here's some profiling:

<>
../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s 
system 99% cpu 52.064 total

ghc itself reports 38MB avg (can live with that,) and 140MB max (too much.)

Redirecting the program's output to a file will yield a mere 2.2M for the data
gathered by the above script. Since those 2.2M of data are all I care about, why
do I need so much more RAM to compute them?

Are my demands unreasonable?

> I'd be happy to help track down a space leak in iteratee, but for now I'm
> not seeing one.

Thank you for your offer! Maybe I'm just seeing ghosts, and there is no space
leak. But I really do think that the program is eating too much RAM.

Regards,
Aleks


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread John Lato
From: "Edward Z. Yang" 

>
> Hello Aleksandar,
>
> It is possible that the iteratees library is space leaking; I recall some
> recent discussion to this effect.  Your example seems simple enough that
> you might recompile with a version of iteratees that has -auto-all enabled.
> Unfortunately, it's not really a safe bet to assume your libraries are
> leak free, and if you've pinpointed it down to a single line, and there
> doesn't seem a way to squash the leak, I'd bet it's the library's fault.
>
> Edward
>

I can't reproduce the space leak here.  I tried Aleksander's original code,
my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy
bytestring version.

my iteratee version (only f' has changed from Aleksander's code):

f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $ I.foldl' (\t s
-> T.insertWith (+) s 1 t) T.empty

my lazy bytestring version

> import Data.Iteratee.Char
> import Data.List (foldl')
> import Data.Char (toLower)
>
> import Data.Ord (comparing)
> import Data.List (sortBy)
> import System.Environment (getArgs)
> import qualified Data.ByteString.Lazy.Char8 as L
> import qualified Data.HashMap.Strict as T
>
> f'2 = foldl' (\t s -> T.insertWith (+) s 1 t) T.empty
>       . filter (not . L.null) . L.lines
>
> main2 :: IO ()
> main2 = getArgs >>= L.readFile . head >>= print . T.keys . f'2

None of these leak space for me (all compiled with ghc-7.0.3 -O2).
Performance was pretty comparable for every version, although Aleksander's
original did seem to have a very small edge.

As someone already pointed out, keep in mind that this will use a lot of
memory anyway, unless there's a lot of repetition of words.

I'd be happy to help track down a space leak in iteratee, but for now I'm
not seeing one.

Best,
John Lato
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
On Tue, May 31, 2011 at 02:13:14PM -0400, Edward Z. Yang wrote:
> It is possible that the iteratees library is space leaking; I recall some
> recent discussion to this effect.  Your example seems simple enough that
> you might recompile with a version of iteratees that has -auto-all enabled.

If I understand you correctly, you imply that I should try compiling iteratee
with profiling, no? I did install the iteratee library with profiling support (I
have the cabal profiling flag globally set in my cabal config,) but my profiles
so far seem to be blaming LAGging ByteStrings and HashMaps.

I, unfortunately, do not know how I would test iteratee itself for a space leak
here.

> Unfortunately, it's not really a safe bet to assume your libraries are
> leak free, and if you've pinpointed it down to a single line, and there
> doesn't seem a way to squash the leak, I'd bet it's the library's fault.

Since my knowledge of Haskell, and, in particular, high-performance Haskell, is
very lacking, my current m.o. is to blame myself :-) It might be iteratee, but
unfortunately, I have not found something that gives me better performance than
iteratee yet.

Regards,
Aleks


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
Hi Johan,

> Here's how I would do it:

I implemented your method, with these minimal changes (i.e. just using a main
driver in the same file.)

> countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
> countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty
> 
> main :: IO ()
> main = do (f:_) <- getArgs
>   openFile f ReadMode >>= countUnigrams >>= print . M.toList

It seems to perform about 3x worse than the iteratee method in terms of time,
and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses
1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the
small file, and around 2 minutes for the large file.

For the large file, the code above starts consuming around 2.5GB of RAM,
so it clearly has a space leak somewhere. Where, I don't know.

If you want to try it out, here's a short command line to make a test corpus the
way Brandon made one:

+++

wget 'http://www.gutenberg.org/files/2600/2600.zip';
unzip 2600.zip;
touch wnp100.txt;
for i in {1..100}; do echo -n "$i "; cat 2600.txt >> wnp100.txt; done;
echo "Done.

+++

Note that, as I detailed in my prior email to Brandon, even if you do end up
with a (supposedly) non-leaking program for this example corpus, that doesn't
mean it'll scale well to real-world data.

I also tried sprinkling strictness annotations throughout your above code, but I
failed to produce good results :-(

> We definitely need more accessible material on how to reliably write
> fast Haskell code. There are those among us who can, but it shouldn't
> be necessary to learn it in the way they did (i.e. by lots of
> tinkering, learning from the elders, etc). I'd like to write a 60 (or
> so) pages tutorial on the subject, but haven't found the time.

I'd be an eager reader :-) Please do announce it on -cafe or the "usual places"
should you ever come around to writing it!

I, unfortunately, don't really have any contact to "the elders," apart from what
I read on their respective blogs…

> In addition to RWH, perhaps the slides from the talk on
> high-performance Haskell I gave could be useful:
> 
> 
> http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html

Thanks, I'll give it a look later tomorrow!

Regards,
Aleks

PS: Sorry I didn't answer you in #haskell, I ended up having to go afk for a
short while. Thanks for all your help!


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
On Tue, May 31, 2011 at 11:43:27AM -0700, Brandon Moore wrote:
> I can't reproduce heap usage growing with the
> size of the input file.
> 
> I made a word list from Project Gutenberg's
> copy of "War and Peace" by
> 
> tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
> 
> Using 1, 25, or 1000 repetitions of this ~3MB wordlist
> shows about 100MB of address space used according
> to top, and no more than 5MB or so of haskell heap
> used according to the memory profile, with a flat
> memory profile.

This will lead to very small variance in your data. The effect I'm experiencing
is so small in this case, that it's barely observable (but it is, see below.)

> Is your memory usage growing with the size of the input
> file, or the size of the histogram?

While the histogram is naturally growing with the size of the input file, memory
usage seems to be proportional mainly to the histogram. It is clear that, due to
the effect of the Long Tail, the histogram is going to constantly grow in a real
setting, as opposed to just replicating the same data.  In your test case, the
histogram is *not* growing with the size of the input file.

The memory usage is proportional to the histogram, which is proportional to the
file size. That is not the problem. The problem is that, compared to the size
of the histogram, the memory consumption is *inadequately* high. Here's some
more data, using your Tolstoi example:

du file.txt
344M    file.txt

<>

As you can see, memory residency is at 8 MB avg, 10 MB max. This is the
War&Peace file, replicated 100 times. Let's look at the result for the file
*without* first replicating it 100 times:

du words.txt
3.0M    words.txt

<>

4.8MB avg, 9.1 MB max. It seems input file size *does* have an effect of sorts,
but it's negligible. What is more interesting is this: the file is a whopping
3MB big. How on earth does the program consume almost 5 MB *on average*? This is
*not* constant memory usage. This is memory usage trivial enough to not be worth
a fuss for small inputs, but unfortunately, it gets larger as soon as you
increase the file size (in a realistic fashion: i.e. you'll also increase the
unigram count.)

> I was worried data sharing might mean your keys
> retain entire 64K chunks of the input. However, it
> seems enumLines depends on the StringLike ByteString
> instance, which just converts to and from String.

Ouch, that sounds like something worth fixing.

> The other thing that occurs to me is that the total size of
> your keys would also be approximately the size of the input
> file if you were using plain text without each word split onto
> a separate line.

Well, I am not. The corpus is a word-per-line corpus, I'm reading a word per
line, and adding that to my hash map. This should never result in a data
structure even close to the size of the original corpus.

It could be, in a very unrealistic worst-case scenario. But even a corpus of
30GB of Poe and Heidegger isn't going to make that happen. Furthermore, mine is
not such a scenario at all. As I said, if you reduce the corpus to a set of words
(i.e. a set of unigrams) you get a 40MB file from a 1.4GB corpus. Why is it that,
in order to create that 40MB file from a 1.4GB corpus, my trivial little program
needs somewhere north of 6-8 GB of RAM?

In this trivial example for War and Peace, why is it that in order to create the
unigram table for War and Peace, which ends up being a mere 201KB big, we're
chomping through 5MB on average, and nearly 10MB max? That's at least 25 times
more than we actually *should* have (yes, I know that the RTS is somewhere
there, too, but I think it's not a problem to ignore that for now.)

Regards,
Aleks


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Brandon Moore
Wait, do ByteStrings show up on a heap profile, if the space is
allocated with malloc?

Anyway, I think my tests still show that the memory used by the
process doesn't grow simply by adding more data, if you
are no longer adding keys to the map.



- Original Message -
> From: Brandon Moore 
> To: Aleksandar Dimitrov ; 
> "haskell-cafe@haskell.org" 
> Cc: 
> Sent: Tuesday, May 31, 2011 1:43 PM
> Subject: Re: [Haskell-cafe] How on Earth Do You Reason about Space?
> 
> I can't reproduce heap usage growing with the
> size of the input file.
> 
> I made a word list from Project Gutenberg's
> copy of "War and Peace" by
> 
> tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
> 
> Using 1, 25, or 1000 repetitions of this ~3MB wordlist
> shows about 100MB of address space used according
> to top, and no more than 5MB or so of haskell heap
> used according to the memory profile, with a flat
> memory profile.
> 
> 
> Is your memory usage growing with the size of the input
> file, or the size of the histogram?
> 
> I was worried data sharing might mean your keys
> retain entire 64K chunks of the input. However, it
> seems enumLines depends on the StringLike ByteString
> instance, which just converts to and from String.
> That can't be efficient, but I suppose it avoids excessive sharing.
> 
> The other thing that occurs to me is that the total size of
> your keys would also be approximately the size of the input
> file if you were using plain text without each word split onto
> a separate line.
> 
> Brandon
>

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Brandon Moore
I can't reproduce heap usage growing with the
size of the input file.

I made a word list from Project Gutenberg's
copy of "War and Peace" by

tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt

Using 1, 25, or 1000 repetitions of this ~3MB wordlist
shows about 100MB of address space used according
to top, and no more than 5MB or so of haskell heap
used according to the memory profile, with a flat
memory profile.


Is your memory usage growing with the size of the input
file, or the size of the histogram?

I was worried data sharing might mean your keys
retain entire 64K chunks of the input. However, it
seems enumLines depends on the StringLike ByteString
instance, which just converts to and from String.
That can't be efficient, but I suppose it avoids excessive sharing.

The other thing that occurs to me is that the total size of
your keys would also be approximately the size of the input
file if you were using plain text without each word split onto
a separate line.

Brandon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Edward Z. Yang
Hello Aleksandar,

It is possible that the iteratees library is space leaking; I recall some
recent discussion to this effect.  Your example seems simple enough that
you might recompile with a version of iteratees that has -auto-all enabled.
Unfortunately, it's not really a safe bet to assume your libraries are
leak free, and if you've pinpointed it down to a single line, and there
doesn't seem a way to squash the leak, I'd bet it's the library's fault.
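
A sketch of how one might do that with cabal-install of this era (the program
and corpus names below are just examples):

cabal install iteratee --reinstall --enable-library-profiling --ghc-options=-auto-all
ghc -O2 -prof -auto-all --make wordfreq.hs
./wordfreq corpus.txt +RTS -p -hc    -- time profile plus heap-by-cost-centre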

Edward

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Johan Tibell
Hi Aleksandar,

On Tue, May 31, 2011 at 6:10 PM, Aleksandar Dimitrov
 wrote:
> Say, we have an input file that contains a word per line. I want to find all
> unigrams (unique words) in that file, and associate with them the amount of
> times they occurred in the file. This would allow me, for example, to make a
> list of word frequencies in a given text.

Here's how I would do it:

{-# LANGUAGE BangPatterns #-}
module Ngram (countUnigrams) where

import qualified Data.ByteString as S
import qualified Data.HashMap.Strict as M
import System.IO

foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a
foldLines f z0 h = go z0
  where
go !z = do
eof <- hIsEOF h
if eof
then return z
else do
line <- S.hGetLine h
go $ f z line
{-# INLINE foldLines #-}

-- Example use
countUnigrams :: IO (M.HashMap S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty stdin

> RANT
>
> I have tried and tried again to avoid writing programs in Haskell that would
> leak space like BP likes to leak oil. However, I have yet to produce a single
> instance of a program that would do anything at all and at the same time
> consume less memory than there is actual data in the input file.
>
> It is very disconcerting to me that I seem to be unable, even after quite some
> practice, to identify space leaks in trivial programs like the above. I know of
> no good resource to educate myself in that regard. I have read the GHC manual,
> RWH's chapter on profiling, also "Inside T5"'s recent series on the Haskell
> heap, but no dice. Even if I can clearly see the exact line where at least some
> of the leaking happens (as I can in this case,) it seems impossible for me to
> prevent it.
>
> *thank you very much* for reading this far. This is probably a mostly useless
> email anyhow, I just had to get it off my chest. Maybe, just maybe, someone
> among you will have a crucial insight that will save Haskell for me :-) But
> currently, I see no justification to not start my next project in Lua, Python or
> Java. Sure, Haskell's code is pretty, and it's fun, but if I can't actually
> *run* it, why bother?  (Yes, this isn't the first time I've run into this
> problem …)

We definitely need more accessible material on how to reliably write
fast Haskell code. There are those among us who can, but it shouldn't
be necessary to learn it in the way they did (i.e. by lots of
tinkering, learning from the elders, etc). I'd like to write a 60 (or
so) pages tutorial on the subject, but haven't found the time.

In addition to RWH, perhaps the slides from the talk on
high-performance Haskell I gave could be useful:


http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html

Cheers,
Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
> In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created
> too _early_.  (Drag are those that are kept for longer than necessary.)  Lots
> of Lag generally means your program is too strict - it is forcing structure
> long before it needs to.  To fix it, you need to make things lazier.  My first
> suspicion would fall on ByteString.

Indeed, thank you, I mixed those up. I cannot use lazy byte strings here,
because of the way Data.Iteratee.Char's enumLinesBS works (it takes strict
byte strings.)

The only other strictness in there is ($!) and foldl'. The latter is necessary
for the program to even run (i.e. not run out of stack space.)

The strict application in step's argument seems necessary, since without it, the
program consumes 1200 MB of RAM (on my 75MB test data,) and takes very, very long.
The hb profile indicates that a lot of data is allocated up front, and then
gradually eliminated. Interestingly, removing ($!) here seemed to *introduce*
unnecessary strictness. Here's the hb profile without ($!):
http://imgur.com/Ex7Pd I don't understand what is happening here :-\ I only just
started using iteratees.
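
For reference, the -hb graphs being discussed come out of a profiled build
roughly like this (a sketch; the binary must be compiled with -prof, e.g.
-prof -auto-all, and 'prog'/'input.txt' are placeholders):

./prog input.txt +RTS -hb -p
hp2ps -c prog.hp    -- renders the biography (lag/drag/void/use) profile to prog.ps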

Regards,
Aleks


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread malcolm.wallace
ad a) heap consumption is too high for two reasons: firstly, the actual data I
care about is much less than there's data on the heap. Secondly, about half the
heap space is in LAG state. Here are profiles that will illustrate this:
http://imgur.com/wBWmJ&XN1mW

Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
On Tue, May 31, 2011 at 06:10:00PM +0200, Aleksandar Dimitrov wrote:
> ad a) heap consumption is too high for two reasons: firstly, the actual data I
> care about is much less than there's data on the heap. Secondly, about half the
> heap space is in LAG state. Here are profiles that will illustrate this:
> http://imgur.com/wBWmJ&XN1mW

[Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
Dear Cafe,

(Excuse the probably very ranty email; I am, unfortunately, at the end of my
wits, and I hope that as fellow programmers, you will understand that this is
among the most dreadful situations for our kind to be in.)

Say, we have an input file that contains a word per line. I want to find all
unigrams (unique words) in that file, and associate with them the amount of
times they occurred in the file. This would allow me, for example, to make a
list of word frequencies in a given text.

Simple enough task. Here's an implementation using iteratees (lazy IO is evil)
and unordered-containers' Data.HashMap.Strict, which enforces WHNF in values
and keys:

> import qualified Data.ByteString.Char8 as S
> import qualified Data.Iteratee as I
> import Data.Iteratee.IO
> 
> import qualified Data.HashMap.Strict as T
> 
> import Data.Iteratee.Char
> import Data.List (foldl')
> import Data.Char (toLower)
> 
> import Data.Ord (comparing)
> import Data.List (sortBy)
> import System.Environment (getArgs)
> 
> type Wordcounts = T.HashMap S.ByteString Int
> 
> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ enumLinesBS (I.liftI $ step T.empty)
>   where step t (I.Chunk str) = I.liftI (step $! foldl' maybeIncrement t str)
>         step t stream        = I.idone t stream
>         maybeIncrement t s
>           | s == S.empty = t
>           | otherwise    = {-# SCC "m-I-other" #-} T.insertWith (+) s 1 t
> 
> main :: IO ()
> main = getArgs >>= fileDriverVBuf 65536 f' . head >>= print . prettyList
>   where prettyList = -- sortBy (comparing snd) . T.toList
>                      T.keys

Some lines are empty, and I don't want them to be recorded, so that's why
maybeIncrement is necessary.
hpaste of this code: http://hpaste.org/47300/spaceleak (ignore convert, that's
yet another issue.)

Now, here's some observations: on a 75M input file (minuscule, compared to what
I actually need) this program will eat 30M of heap space (says profiling) and
return in 14 secs.

I have two problems with that: a) that's too much heap space, b) the actual
memory residency is *much* worse.

ad b) average memory residency is at 38MB (this is OK, given heap consumption)
but max residency is at 130MB, which is unacceptable to me (remember that I need
to run this on files *much* bigger than just 75M.)

<>

In fact, htop reports total memory residency of the program at around 320 MB at
the end of its life-cycle (compiled and ran without profiling options.) I tried
running this program on a 1.4GB file, and had to kill it when at 3.5GB memory
consumption, my computer started paging. The actual hash-map, however, shouldn't
be much bigger than in the 75MB case (I expect about twice the size,) since the
input is natural language. I redirected the output of the program (showing a
list of assoc pairs that were in the hash map) to a file, and that file measured
11MB in the case of a 75MB corpus, and 40MB when I ran the program on a 1.4GB
corpus (I had to use a machine with 24GB of RAM to be able to run this.)

ad a) heap consumption is too high for two reasons: firstly, the actual data I
care about is much less than there's data on the heap. Secondly, about half the
heap space is in LAG state. Here are profiles that will illustrate this:
http://imgur.com/wBWmJ&XN1mW