Beginners Digest, Vol 40, Issue 12

beginners-request Mon, 10 Oct 2011 03:00:28 -0700

Send Beginners mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://www.haskell.org/mailman/listinfo/beginners
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Beginners digest..."

Today's Topics:

   1. Re:  How would you improve this program? (Daniel Fischer)
   2. Re:  How would you improve this program? (Lorenzo Bolla)

----------------------------------------------------------------------

Message: 1
Date: Mon, 10 Oct 2011 00:08:01 +0200
From: Daniel Fischer <[email protected]>
Subject: Re: [Haskell-beginners] How would you improve this program?
To: [email protected]
Message-ID: <[email protected]>
Content-Type: Text/Plain;  charset="utf-8"

On Sunday 09 October 2011, 22:11:35, Lorenzo Bolla wrote:
> Hi all,
> I'm new to Haskell and I'd like you to take a look at one of my programs
> and tell me how you would improve it (in terms of efficiency, style,
> and so on!).
> 
> The source code is here:
> https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs
> The program is an implementation of this problem:
> http://www.scs.stanford.edu/11au-cs240h/labs/lab1.html (basically,
> counting how many times a word appear in a text.)
> (I'm not a Stanford student, so by helping me out you won't help me to
> cheat my exam, don't worry!)

A few problems:
1) your input cleaning is not correct, running it over the source file, I 
get (among others)

--              #################################################
->              ##################################
                #############################
$               ########################

Now, they say they aren't going to be tricky, so most of this isn't 
necessary to take care of, but dashes/hyphens -- e.g. for parenthetical 
remarks -- should probably be dealt with---also the style with one em-dash 
not surrounded by spaces. 
Also quotes,

"Bob said: 'I will go there!'", Alice told the listeners.

2) If you have a looong but rare word in the text, you will get too many 
spaces between the words and the markers, that's not good.

3) you calculate the number of '#'s using the total line length and then 
subtract the length of (word + spaces), skewing your output. You should 
calculate (spaceForHashes = lineWidth - maxLength - 1) and the use that to 
determine the number of hashes.
[Although it should be more complicated really, maxLength should only 
include those words that are actually output.]

Style is okay, although like Michael Xavier, I'm more of a where-guy.

> 
> I've implemented 3 versions of the algorithm:
> 
>    1. a Haskell version using the standard "sort": read all the words
> from stdin, sort them and group them.

Not much you can do there.

>    2. a Haskell version using map: read all the words from stdin, stick
> each word in a Data.Map incrementing a counter if the word is already
> present in the map.

You build the map via fromListWith (+). That's bad if you have input with 
large counts. fromListWith (+) gives you thunks of the form
(...(1+1)+1)...+1) as map-entries. That costs a lot of space and time.
Better build the map using insertWith',

  mp = foldl' ins empty words
    where
      ins m w = insertWith' (+) w 1 m

>    3. a Python version using defaultdict.
> 
> I timed the different versions and the results are here:
> https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png.
> The python version is the quickest (I stripped out the fancy formatting
> before benchmarking, so IO is not responsible for the time difference).
> Any comments on the graph, too?

sort is O(n*log n), map and python are O(n*log d), where n is the number of 
words and d the number of distinct words. In the range of the graph both 
look quite linear.
The number of distinct words is much smaller than the number of words, so 
that's a smaller 'constant' (logarithmic) factor.
Python seems to have a smaller factor than map, I'm not sure whether that's 
because Python can mutate the dictionary in-place while Haskell's immutable 
Map requires some copying on each insert or whether it's due to the thunks 
and the excessive space usage caused by them.

Would be interesting to see the graph for the more efficient map-building.

------------------------------

Message: 2
Date: Mon, 10 Oct 2011 09:13:45 +0100
From: Lorenzo Bolla <[email protected]>
Subject: Re: [Haskell-beginners] How would you improve this program?
To: [email protected]
Message-ID:
        <CADjgTRzLY2APVfE1sHwJdXO4M4E+n918UixXRS7QnMwUwKDv=q...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

On Sun, Oct 9, 2011 at 11:08 PM, Daniel Fischer <
[email protected]> wrote:

>
> A few problems:
> 1) your input cleaning is not correct, running it over the source file, I
> get (among others)
>

Thanks, I'll fix it asap.

> 2) If you have a looong but rare word in the text, you will get too many
> spaces between the words and the markers, that's not good.
>

Thanks.

> 3) you calculate the number of '#'s using the total line length and then
> subtract the length of (word + spaces), skewing your output. You should
> calculate (spaceForHashes = lineWidth - maxLength - 1) and the use that to
> determine the number of hashes.
> [Although it should be more complicated really, maxLength should only
> include those words that are actually output.]
>

Fixed, thanks.

Style is okay, although like Michael Xavier, I'm more of a where-guy.
>

A part from style, is there any reason why one should prefer "where" over
"let..in"?

> You build the map via fromListWith (+). That's bad if you have input with
> large counts. fromListWith (+) gives you thunks of the form
> (...(1+1)+1)...+1) as map-entries. That costs a lot of space and time.
> Better build the map using insertWith',
>
>  mp = foldl' ins empty words
>    where
>      ins m w = insertWith' (+) w 1 m
>

Please find the "strict" version here:
https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs.

Plot of times is now:
https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png

Python is still slightly faster, but I'm getting there.

>    3. a Python version using defaultdict.
> sort is O(n*log n), map and python are O(n*log d), where n is the number of
> words and d the number of distinct words.

I believe Python's defaultdict have (optimistically) O(1) lookup and insert,
because it is implemented as a hashmap.

Thanks everybody for your tips. They are most welcome.

L.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
<http://www.haskell.org/pipermail/beginners/attachments/20111010/73b0cdb3/attachment-0001.htm>

------------------------------

_______________________________________________
Beginners mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/beginners

End of Beginners Digest, Vol 40, Issue 12
*****************************************

Beginners Digest, Vol 40, Issue 12

Reply via email to