Beginners Digest, Vol 88, Issue 25

beginners-request Tue, 27 Oct 2015 06:58:26 -0700

Send Beginners mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://mail.haskell.org/cgi-bin/mailman/listinfo/beginners
or, via email, send a message with subject or body 'help' to
        [email protected]


You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Beginners digest..."


Today's Topics:

   1.  Space consumption and strictness in strict map (John Lusk)
   2. Re:  Space consumption and strictness in strict   map
      (David McBride)


----------------------------------------------------------------------

Message: 1
Date: Tue, 27 Oct 2015 09:38:40 -0400
From: John Lusk <[email protected]>
To: [email protected]
Subject: [Haskell-beginners] Space consumption and strictness in
        strict map
Message-ID:
        <cajqkmbyftwqaxapkysjsamnkb5_mchk4fq68itci2as4vvc...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hello, [email protected].  I posted this a few days ago, but it looks
like attachments get pulled out by Mailman, whether or not they are small
text files ( :>( ), so I'm trying again, sans attachment.
_________________________________________

Hey, I have a problem I "solved" by sprinkling !s around like pixie dust,
but I'd like to know what's going on.  My program is below, in all its
messy glory, but the whole ball of wax is at
https://github.com/JohnL4/PassphraseGenerator (mostFrequent.hs).  It takes
as input one of the 2012 Google ngram (specifically, 1-gram) raw data
files.  The expected input format is documented in several places.  (For
testing, I took the first million lines.)

The bang that worked is on the 3rd argument of 'wordCount', the map.  I
tried commenting out the "(Map.insertWith (+) ngram matchCount aMap)" part
(and just returning the input map).  When I did that, boy was space
consumption small, so that line is part of the problem.

So, what's going on without the bang?

I guess I'm just building up a bunch of thunks w/that Map.insertWith call,
but what kind of thunks? Unevaluated calls to splitOn and (!!) and read and
(+) and lines?

I thought a strict Map would avoid that.  I guess WHNF isn't enough?
Looking at https://wiki.haskell.org/Weak_head_normal_form, I don't *think* I
have any constructors in there, but profiling with -hd shows me that (:) is
the most frequently occurring closure. Where is *that* coming from?  Is the
occurrence of (:) so high because it's either a built-in function applied
to too few arguments or a lambda abstraction? Which one?  (Also, is "lambda
abstraction" the same as "lambda expresson"?)

And then... what does that bang on the map argument do?  Does it force
evaluation of the passed argument all the way down to primitives, so that
we truly get a data structure containing only strings and ints (and no
thunks)?  What does that do to time complexity?  Does it have to traverse
the entire map looking for thunks, even though I only added one at some
random location in the map?

Is there a better way?

I guess I need to force strictness somewhere else, but I'm not sure how.  I
tried using seq (a little half-heartedly), and ($!), but I guess I did it
wrong and only wound up with more thunks to seq and id ($!), right?  So, I
wound up with lazy strictness?

Thanks for any help.

John.

My code:

-- | Reads stdin and prints the n most-frequent words occurring after year
y to stdout, along with
-- | their counts.

{-# LANGUAGE BangPatterns #-}

import GHC.IO.Handle (hPutStr)
import GHC.IO.Handle.FD (stderr)
import Data.Char (toLower)
import Data.List (intersperse, sortBy)
import qualified Data.Map.Strict as Map
import Data.List.Split (splitOn)

{-
Test data:
let filetext = "aaa\t1900\t2\naaa\t1950\t3\nbbb\t1950\t5\nbbb_VERB\t1980\t9"
-}

-- | Number of most-frequent words we want.
n :: Int
n = 15000

-- | Earliest year we want to count while building frequency table
y :: Int
y = 1950

main :: IO()
main = do
{-
Read line, split on tabs, take first three fields, which are: 1-gram
(word), year, match-count.
If year >= y:
   Trim parts of speech (POS) from 1-gram (leading, trailing known
fragments delimited by "_")
      Known fragments: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET
(determiners and articles), ADP
      (prepositions, postpositions), NUM, CONJ, PRT (particles), X
(miscellaneous)

   Split on "_", discard expected known fragments, complain if there are
more than one fragments left.  (Note that I
   have verified that each ngram has 0 or 1 trailing attributes, so the
check for an unexpected number of parts
   is unnecessary.)

   Find word in dictionary and add match-count to that entry.
At end of input, sort dictionary entries by match-counts (descending) and
take first n entries for output.
-}
  allInput <- getContents
  putStrLn (concat (intersperse "\n" (map fst (take n (sortBy
countDescending
                                                       (Map.toList
(wordCounts y (lines allInput) Map.empty)))))))
  hPutStr stderr "Done.\n"

-- | Orders inputs by 2nd element (count), descending
countDescending :: (String,Int) -> (String,Int) -> Ordering
countDescending (_, countA) (_, countB)
  | countA < countB  = GT
  | countA == countB = EQ
  | otherwise        = LT

-- | Returns map of all counts (summed) for words occurring on or after
given year.
wordCounts :: Int               -- ^ Year
           -> [String]          -- ^ Lines in form
"word\tyear\tcount\totherStuffWeDontCareAbout"
           -> Map.Map String Int -- ^ Input map
           -> Map.Map String Int -- ^ Output map
wordCounts _ [] aMap = aMap
wordCounts aYear (aLine:restLines) !aMap =
  let fields      = splitOn "\t" aLine
      ngramParts  = splitOn "_" (fields!!0)
      ngram       = map toLower (ngramParts!!0) -- "ngram" is the same as
"word", in this case.
      year        = read (fields!!1) :: Int
      matchCount  = read (fields!!2) :: Int
  in if (year < aYear)
     then (wordCounts aYear restLines aMap)
     else (wordCounts aYear restLines
           (Map.insertWith (+) ngram matchCount aMap)
           -- aMap
           )
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
<http://mail.haskell.org/pipermail/beginners/attachments/20151027/d7a28da2/attachment-0001.html>

------------------------------

Message: 2
Date: Tue, 27 Oct 2015 09:57:40 -0400
From: David McBride <[email protected]>
To: The Haskell-Beginners Mailing List - Discussion of primarily
        beginner-level topics related to Haskell <[email protected]>
Subject: Re: [Haskell-beginners] Space consumption and strictness in
        strict  map
Message-ID:
        <can+tr41yp3kzvf9jnbjgzram0sepkoktaegrgspyjm_hoz8...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Not 100% on this, but I would say the map is strict in keys and values as
it says in the haddocks, but it is still an immutable structure.  If you
don't force the map while you are building it, you end up with unevaluated
thunks, just like you would with +.

insertWith + x y (insertWith + x y (insertWith + x y ...)))

On the bright side because it is strict, the addition you are doing on the
values of this map are being evaluated, so the + you are using is not also
leaving behind thunks like value + value + value.  I'd say the reason (:)
is so common is because the arguments to insertWith both come from a list
that is not being cleaned up until you force the map.

On Tue, Oct 27, 2015 at 9:38 AM, John Lusk <[email protected]> wrote:

> Hello, [email protected].  I posted this a few days ago, but it looks
> like attachments get pulled out by Mailman, whether or not they are small
> text files ( :>( ), so I'm trying again, sans attachment.
> _________________________________________
>
> Hey, I have a problem I "solved" by sprinkling !s around like pixie dust,
> but I'd like to know what's going on.  My program is below, in all its
> messy glory, but the whole ball of wax is at
> https://github.com/JohnL4/PassphraseGenerator (mostFrequent.hs).  It
> takes as input one of the 2012 Google ngram (specifically, 1-gram) raw data
> files.  The expected input format is documented in several places.  (For
> testing, I took the first million lines.)
>
> The bang that worked is on the 3rd argument of 'wordCount', the map.  I
> tried commenting out the "(Map.insertWith (+) ngram matchCount aMap)"
> part (and just returning the input map).  When I did that, boy was space
> consumption small, so that line is part of the problem.
>
> So, what's going on without the bang?
>
> I guess I'm just building up a bunch of thunks w/that Map.insertWith
> call, but what kind of thunks? Unevaluated calls to splitOn and (!!) and
> read and (+) and lines?
>
> I thought a strict Map would avoid that.  I guess WHNF isn't enough?
> Looking at https://wiki.haskell.org/Weak_head_normal_form, I don't *think* I
> have any constructors in there, but profiling with -hd shows me that (:) is
> the most frequently occurring closure. Where is *that* coming from?  Is
> the occurrence of (:) so high because it's either a built-in function
> applied to too few arguments or a lambda abstraction? Which one?  (Also, is
> "lambda abstraction" the same as "lambda expresson"?)
>
> And then... what does that bang on the map argument do?  Does it force
> evaluation of the passed argument all the way down to primitives, so that
> we truly get a data structure containing only strings and ints (and no
> thunks)?  What does that do to time complexity?  Does it have to traverse
> the entire map looking for thunks, even though I only added one at some
> random location in the map?
>
> Is there a better way?
>
> I guess I need to force strictness somewhere else, but I'm not sure how.
> I tried using seq (a little half-heartedly), and ($!), but I guess I did
> it wrong and only wound up with more thunks to seq and id ($!), right?
> So, I wound up with lazy strictness?
>
> Thanks for any help.
>
> John.
>
> My code:
>
> -- | Reads stdin and prints the n most-frequent words occurring after year
> y to stdout, along with
> -- | their counts.
>
> {-# LANGUAGE BangPatterns #-}
>
> import GHC.IO.Handle (hPutStr)
> import GHC.IO.Handle.FD (stderr)
> import Data.Char (toLower)
> import Data.List (intersperse, sortBy)
> import qualified Data.Map.Strict as Map
> import Data.List.Split (splitOn)
>
> {-
> Test data:
> let filetext =
> "aaa\t1900\t2\naaa\t1950\t3\nbbb\t1950\t5\nbbb_VERB\t1980\t9"
> -}
>
> -- | Number of most-frequent words we want.
> n :: Int
> n = 15000
>
> -- | Earliest year we want to count while building frequency table
> y :: Int
> y = 1950
>
> main :: IO()
> main = do
> {-
> Read line, split on tabs, take first three fields, which are: 1-gram
> (word), year, match-count.
> If year >= y:
>    Trim parts of speech (POS) from 1-gram (leading, trailing known
> fragments delimited by "_")
>       Known fragments: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET
> (determiners and articles), ADP
>       (prepositions, postpositions), NUM, CONJ, PRT (particles), X
> (miscellaneous)
>
>    Split on "_", discard expected known fragments, complain if there are
> more than one fragments left.  (Note that I
>    have verified that each ngram has 0 or 1 trailing attributes, so the
> check for an unexpected number of parts
>    is unnecessary.)
>
>    Find word in dictionary and add match-count to that entry.
> At end of input, sort dictionary entries by match-counts (descending) and
> take first n entries for output.
> -}
>   allInput <- getContents
>   putStrLn (concat (intersperse "\n" (map fst (take n (sortBy
> countDescending
>                                                        (Map.toList
> (wordCounts y (lines allInput) Map.empty)))))))
>   hPutStr stderr "Done.\n"
>
> -- | Orders inputs by 2nd element (count), descending
> countDescending :: (String,Int) -> (String,Int) -> Ordering
> countDescending (_, countA) (_, countB)
>   | countA < countB  = GT
>   | countA == countB = EQ
>   | otherwise        = LT
>
> -- | Returns map of all counts (summed) for words occurring on or after
> given year.
> wordCounts :: Int               -- ^ Year
>            -> [String]          -- ^ Lines in form
> "word\tyear\tcount\totherStuffWeDontCareAbout"
>            -> Map.Map String Int -- ^ Input map
>            -> Map.Map String Int -- ^ Output map
> wordCounts _ [] aMap = aMap
> wordCounts aYear (aLine:restLines) !aMap =
>   let fields      = splitOn "\t" aLine
>       ngramParts  = splitOn "_" (fields!!0)
>       ngram       = map toLower (ngramParts!!0) -- "ngram" is the same as
> "word", in this case.
>       year        = read (fields!!1) :: Int
>       matchCount  = read (fields!!2) :: Int
>   in if (year < aYear)
>      then (wordCounts aYear restLines aMap)
>      else (wordCounts aYear restLines
>            (Map.insertWith (+) ngram matchCount aMap)
>            -- aMap
>            )
>
>
> _______________________________________________
> Beginners mailing list
> [email protected]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/beginners
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
<http://mail.haskell.org/pipermail/beginners/attachments/20151027/5c889db0/attachment.html>

------------------------------

Subject: Digest Footer

_______________________________________________
Beginners mailing list
[email protected]
http://mail.haskell.org/cgi-bin/mailman/listinfo/beginners


------------------------------

End of Beginners Digest, Vol 88, Issue 25
*****************************************

Beginners Digest, Vol 88, Issue 25

Reply via email to