Send Beginners mailing list submissions to beginners@haskell.org To subscribe or unsubscribe via the World Wide Web, visit http://www.haskell.org/mailman/listinfo/beginners or, via email, send a message with subject or body 'help' to beginners-requ...@haskell.org
You can reach the person managing the list at beginners-ow...@haskell.org When replying, please edit your Subject line so it is more specific than "Re: Contents of Beginners digest..." Today's Topics: 1. Re: How would you improve this program? (Daniel Fischer) 2. Re: How would you improve this program? (Lorenzo Bolla) ---------------------------------------------------------------------- Message: 1 Date: Mon, 10 Oct 2011 00:08:01 +0200 From: Daniel Fischer <daniel.is.fisc...@googlemail.com> Subject: Re: [Haskell-beginners] How would you improve this program? To: beginners@haskell.org Message-ID: <201110100008.01364.daniel.is.fisc...@googlemail.com> Content-Type: Text/Plain; charset="utf-8" On Sunday 09 October 2011, 22:11:35, Lorenzo Bolla wrote: > Hi all, > I'm new to Haskell and I'd like you to take a look at one of my programs > and tell me how you would improve it (in terms of efficiency, style, > and so on!). > > The source code is here: > https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs > The program is an implementation of this problem: > http://www.scs.stanford.edu/11au-cs240h/labs/lab1.html (basically, > counting how many times a word appear in a text.) > (I'm not a Stanford student, so by helping me out you won't help me to > cheat my exam, don't worry!) A few problems: 1) your input cleaning is not correct, running it over the source file, I get (among others) -- ################################################# -> ################################## ############################# $ ######################## Now, they say they aren't going to be tricky, so most of this isn't necessary to take care of, but dashes/hyphens -- e.g. for parenthetical remarks -- should probably be dealt with---also the style with one em-dash not surrounded by spaces. Also quotes, "Bob said: 'I will go there!'", Alice told the listeners. 2) If you have a looong but rare word in the text, you will get too many spaces between the words and the markers, that's not good. 3) you calculate the number of '#'s using the total line length and then subtract the length of (word + spaces), skewing your output. You should calculate (spaceForHashes = lineWidth - maxLength - 1) and the use that to determine the number of hashes. [Although it should be more complicated really, maxLength should only include those words that are actually output.] Style is okay, although like Michael Xavier, I'm more of a where-guy. > > I've implemented 3 versions of the algorithm: > > 1. a Haskell version using the standard "sort": read all the words > from stdin, sort them and group them. Not much you can do there. > 2. a Haskell version using map: read all the words from stdin, stick > each word in a Data.Map incrementing a counter if the word is already > present in the map. You build the map via fromListWith (+). That's bad if you have input with large counts. fromListWith (+) gives you thunks of the form (...(1+1)+1)...+1) as map-entries. That costs a lot of space and time. Better build the map using insertWith', mp = foldl' ins empty words where ins m w = insertWith' (+) w 1 m > 3. a Python version using defaultdict. > > I timed the different versions and the results are here: > https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png. > The python version is the quickest (I stripped out the fancy formatting > before benchmarking, so IO is not responsible for the time difference). > Any comments on the graph, too? sort is O(n*log n), map and python are O(n*log d), where n is the number of words and d the number of distinct words. In the range of the graph both look quite linear. The number of distinct words is much smaller than the number of words, so that's a smaller 'constant' (logarithmic) factor. Python seems to have a smaller factor than map, I'm not sure whether that's because Python can mutate the dictionary in-place while Haskell's immutable Map requires some copying on each insert or whether it's due to the thunks and the excessive space usage caused by them. Would be interesting to see the graph for the more efficient map-building. ------------------------------ Message: 2 Date: Mon, 10 Oct 2011 09:13:45 +0100 From: Lorenzo Bolla <lbo...@gmail.com> Subject: Re: [Haskell-beginners] How would you improve this program? To: beginners@haskell.org Message-ID: <CADjgTRzLY2APVfE1sHwJdXO4M4E+n918UixXRS7QnMwUwKDv=q...@mail.gmail.com> Content-Type: text/plain; charset="utf-8" Hi, On Sun, Oct 9, 2011 at 11:08 PM, Daniel Fischer < daniel.is.fisc...@googlemail.com> wrote: > > A few problems: > 1) your input cleaning is not correct, running it over the source file, I > get (among others) > Thanks, I'll fix it asap. > 2) If you have a looong but rare word in the text, you will get too many > spaces between the words and the markers, that's not good. > Thanks. > 3) you calculate the number of '#'s using the total line length and then > subtract the length of (word + spaces), skewing your output. You should > calculate (spaceForHashes = lineWidth - maxLength - 1) and the use that to > determine the number of hashes. > [Although it should be more complicated really, maxLength should only > include those words that are actually output.] > Fixed, thanks. Style is okay, although like Michael Xavier, I'm more of a where-guy. > A part from style, is there any reason why one should prefer "where" over "let..in"? > You build the map via fromListWith (+). That's bad if you have input with > large counts. fromListWith (+) gives you thunks of the form > (...(1+1)+1)...+1) as map-entries. That costs a lot of space and time. > Better build the map using insertWith', > > mp = foldl' ins empty words > where > ins m w = insertWith' (+) w 1 m > Please find the "strict" version here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs. Plot of times is now: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png Python is still slightly faster, but I'm getting there. > 3. a Python version using defaultdict. > sort is O(n*log n), map and python are O(n*log d), where n is the number of > words and d the number of distinct words. I believe Python's defaultdict have (optimistically) O(1) lookup and insert, because it is implemented as a hashmap. Thanks everybody for your tips. They are most welcome. L. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://www.haskell.org/pipermail/beginners/attachments/20111010/73b0cdb3/attachment-0001.htm> ------------------------------ _______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners End of Beginners Digest, Vol 40, Issue 12 *****************************************