Jonathan Ellis wrote:
#!/usr/bin/python2.4
import sys
WORDS_FNAME = '/usr/share/dict/words'
words = dict((line.rstrip(), 0) for line in file(WORDS_FNAME))
source = len(sys.argv) > 1 and file(sys.argv[1]) or sys.stdin
for line in source:
for word in line.strip().split():
if not word:
continue
try:
words[word] += 1
except KeyError:
pass
for word, count in words.iteritems():
if count > 0:
print word, count
Here's a version of that code that runs in about 40% less time.
Optimizations:
- Locals are faster than globals, so I moved the code into a function.
- Since we don't report words not used in the document, and a given
document is likely to use only a small fraction of the dictionary, it's
better to build a new dictionary of word frequency rather than filter
the larger dictionary later.
- There are several possible ways to copy words in a file to the keys of
a dictionary; this way ran faster than the other ways.
- There's no need to strip() a line if all you're going to do with it is
split() it.
- Exception handling is more expensive than conditions, so switched to
"if word in words".
I though of another tiny optimization that shaves around .5%, but
reduces readability: pre-fetch the .get attribute of the 'freq'
dictionary. I doubt it's worth the penalty.
Shane
#!/usr/bin/python2.4
import sys
def main():
words_fname = '/usr/share/dict/words'
if len(sys.argv) > 1:
source = open(sys.argv[1])
else:
source = sys.stdin
words = {}
for line in open(words_fname):
words[line.rstrip()] = 0
freq = {}
for line in source:
for word in line.split():
if word in words:
freq[word] = freq.get(word, 0) + 1
for word, count in freq.iteritems():
print word, count
main()
/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/