Re: [Tutor] count words
Coupla nits:

On Tue, 15 Feb 2005 14:39:30 -0500, Kent Johnson <[EMAIL PROTECTED]> wrote:
> from string import punctuation
> from time import time
>
> words = open(r'D:\Personal\Tutor\ArtOfWar.txt').read().split()

Another advantage of the first method (accumulating counts in a dictionary) is that it allows a more elegant word-counting algorithm if you choose not to read the entire file into memory. It's better general practice to consume lines from a file via the "for line in f" idiom.

> words = [ word.strip(punctuation) for word in words ]

And be careful with this: punctuation does not include whitespace characters. That is no problem in this example, because split() with no arguments discards all whitespace from the strings it returns, but people should be aware that strip(punctuation) won't remove punctuation from strings that still have leading or trailing whitespace.

Otherwise though, good stuff.

Peace
Bill Mill
bill.mill at gmail.com
___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
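A rough sketch of the line-at-a-time counting Bill is hinting at, written in Python 3 syntax rather than the 2005-era Python 2 used elsewhere in the thread. The function name count_words_by_line and the sample text are made up for illustration; any iterable of lines (an open file included) should work the same way. The last line shows the whitespace caveat he mentions.

```python
from io import StringIO
from string import punctuation


def count_words_by_line(lines):
    """Accumulate word counts one line at a time, without loading the whole file."""
    counts = {}
    for line in lines:
        # split() discards whitespace, so strip(punctuation) sees bare words
        for word in line.split():
            word = word.strip(punctuation)
            if word:  # skip tokens that were nothing but punctuation
                counts[word] = counts.get(word, 0) + 1
    return counts


# StringIO stands in for an open file here; "for line in f" works the same way.
sample = StringIO("Know the enemy, know yourself.\nKnow the ground -- know the weather.\n")
counts = count_words_by_line(sample)

# The whitespace caveat: a trailing newline shields the comma from strip().
unstripped = "enemy,\n".strip(punctuation)
```

Note that "Know" and "know" are still counted separately here; case is left alone on purpose so the sketch only shows the streaming and stripping steps.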
Re: [Tutor] count words
Ryan Davis wrote:
> Here's one way to iterate over that to get the counts. I'm sure there are dozens.
> ###
> >>> x = 'asdf foo bar foo'
> >>> counts = {}
> >>> for word in x.split():
> ...     counts[word] = x.count(word)
> ...
> >>> counts
> {'foo': 2, 'bar': 1, 'asdf': 1}
> ###
> The dictionary takes care of duplicates. If you are using a really big file, it might pay to eliminate duplicates from the list before running x.count(word)

Be wary of using the count() function for this; it can be very slow. The problem is that every time you call count(), Python has to look at every element of the list to see if it matches the word you passed to count(). So if the list has n words in it, you will make n*n comparisons. In contrast, the method that directly accumulates counts in a dictionary just makes one pass over the list.

For small lists this doesn't matter much, but for a longer list you will definitely see the difference. For example, I downloaded "The Art of War" from Project Gutenberg (http://www.gutenberg.org/dirs/1/3/132/132a.txt) and tried both methods. Here is a program that times how long it takes to do the counts using two different methods:

# WordCountTest.py
''' Count words two different ways '''

from string import punctuation
from time import time

def countWithDict(words):
    ''' Word count by accumulating counts in a dictionary '''
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def countWithCount(words):
    ''' Word count by calling count() for each word '''
    counts = {}
    for word in words:
        counts[word] = words.count(word)
    return counts

def timeOne(f, words):
    ''' Time how long it takes to do f(words) and return its result '''
    startTime = time()
    result = f(words)
    endTime = time()
    print '%s: %f' % (f.__name__, endTime-startTime)
    # Return the counts; without this the assert below compares None to None
    return result

# Get the word list and strip off punctuation
words = open(r'D:\Personal\Tutor\ArtOfWar.txt').read().split()
words = [ word.strip(punctuation) for word in words ]

# How many words is it, anyway?
print len(words), 'words'

# Time the word counts
c1 = timeOne(countWithDict, words)
c2 = timeOne(countWithCount, words)

# Check that both get the same result
assert c1 == c2

The results (times are in seconds):

14253 words
countWithDict: 0.01
countWithCount: 9.183000

It takes about 900 times longer to count() each word individually!

Kent
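For anyone rerunning Kent's comparison today, here is a minimal Python 3 sketch of the same idea using the standard timeit module. The synthetic word list and the function names are my own assumptions, not Kent's data, and timings will vary by machine, so no expected numbers are given.

```python
import timeit


def count_with_dict(words):
    """One pass over the list, accumulating counts in a dictionary."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_with_count(words):
    """Calls list.count() once per element, so roughly n*n comparisons."""
    counts = {}
    for word in words:
        counts[word] = words.count(word)
    return counts


# Synthetic stand-in for a real text: 2000 words drawn from 200 distinct ones.
words = ["word%d" % (i % 200) for i in range(2000)]

dict_result = count_with_dict(words)
count_result = count_with_count(words)
same = dict_result == count_result  # both methods should agree

# Time five runs of each method on the same list.
dict_seconds = timeit.timeit(lambda: count_with_dict(words), number=5)
count_seconds = timeit.timeit(lambda: count_with_count(words), number=5)
print("dict:  %f s" % dict_seconds)
print("count: %f s" % count_seconds)
```

The gap should widen as the list grows, since the count() version does work proportional to the square of the list length.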
Re: [Tutor] count words
On Tue, 15 Feb 2005 18:03:57, Max Noel <[EMAIL PROTECTED]> wrote:
>
> On Feb 15, 2005, at 17:19, Ron Nixon wrote:
>
> > Thanks to everyone who replied to my post. All of your
> > suggestions seem to work. My thanks
> >
> > Ron
>
> Watch out, though, for all of this to work flawlessly you first have
> to remove all punctuation (either with regexes or with multiple
> foo.replace('[symbol]', '')), and to remove the case of each word
> (foo.upper() or foo.lower() will do).

To remove all punctuation from the beginning and end of words, at least in 2.4, you can just use:

word.strip('.!?\n\t ')

plus any other characters that you'd like to strip. In action:

>>> word = "?testing..!.\n\t "
>>> word.strip('?.!\n\t ')
'testing'

Peace
Bill Mill
bill.mill at gmail.com
Re: [Tutor] count words
> Other than using several print statements to look for
> separate words like this, is there a way to do it so
> that I get an individual count of each word:
>
> word1 xxx
> word2 xxx
> words xxx

The classic approach is to create a dictionary. As you come to each word, add it to the dictionary if it is new and increment its value by one. At the end the dictionary contains all the unique words with the count for each one.

Does that work for you?

Alan G.
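In current Python the dictionary bookkeeping Alan describes is also available ready-made as collections.Counter. That class was added to the standard library after this thread was written, so it is not something the posters could have used; the sentence below is made up for illustration.

```python
from collections import Counter

text = "the quick fox jumps over the lazy dog and the cat"

# Counter is a dict subclass that does the "add or increment" step for us.
word_counts = Counter(text.split())

# Print each unique word with its count, most frequent first.
for word, count in word_counts.most_common():
    print(word, count)
```

A Counter returns 0 for words it has never seen, so there is no need for the get(word, 0) dance when reading counts back.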
Re: [Tutor] count words
On Feb 15, 2005, at 17:19, Ron Nixon wrote:

> Thanks to everyone who replied to my post. All of your suggestions seem to work. My thanks
>
> Ron

Watch out, though: for all of this to work flawlessly you first have to remove all punctuation (either with regexes or with multiple foo.replace('[symbol]', '') calls), and to normalize the case of each word (foo.upper() or foo.lower() will do).

-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
"Look at you hacker... A pathetic creature of meat and bone, panting and sweating as you run through my corridors... How can you challenge a perfect, immortal machine?"
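A small sketch of the regex route Max mentions, in Python 3 syntax. The helper name normalized_words and the pattern are my own assumptions: the pattern treats only ASCII letters and apostrophes as word characters, which is probably too narrow for text with accents or digits.

```python
import re


def normalized_words(text):
    """Lower-case the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Punctuation disappears and differently-cased copies of a word collapse together.
words = normalized_words("Look at you, hacker... LOOK at YOU!")
```

Lower-casing before matching means 'Look' and 'LOOK' end up as the same dictionary key when the list is counted.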
RE: [Tutor] count words
Thanks to everyone who replied to my post. All of your suggestions seem to work. My thanks

Ron
Re: [Tutor] count words
On Tue, 15 Feb 2005, Ron Nixon wrote:

> I know that you can do this to get a count of how many times a word
> appears in a file
>
> f = open('text.txt').read()
> print f.count('word')
>
> Other than using several print statements to look for separate words
> like this, is there a way to do it so that I get an individual count of
> each word:

Hi Ron,

Let's modify the problem a bit. Let's say that we have a list of words:

###
words = """one ring to rule them all
one ring to find them
one ring to bring them all
and in the darkness bind them
in the land of mordor where the shadows lie""".split()
###

What happens if we sort() this list?

###
>>> words.sort()
>>> words
['all', 'all', 'and', 'bind', 'bring', 'darkness', 'find', 'in', 'in',
 'land', 'lie', 'mordor', 'of', 'one', 'one', 'one', 'ring', 'ring',
 'ring', 'rule', 'shadows', 'the', 'the', 'the', 'them', 'them', 'them',
 'them', 'to', 'to', 'to', 'where']
###

Would this be easier to process?

If you have more questions, please feel free to ask!
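One possible way to take the sorting hint further, in Python 3 syntax. This is my own guess at where the hint leads, not something posted in the thread: once the list is sorted, equal words sit next to each other, and itertools.groupby can hand back each run of identical words.

```python
from itertools import groupby

words = """one ring to rule them all one ring to find them
one ring to bring them all and in the darkness bind them
in the land of mordor where the shadows lie""".split()

# After sorting, equal words are adjacent, so each run can simply be measured.
counts = {}
for word, run in groupby(sorted(words)):
    counts[word] = len(list(run))

for word in sorted(counts):
    print(word, counts[word])
```

Sorting costs more than a single dictionary pass, so this is mainly useful for seeing why the sorted list is easier to process by hand.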
Re: [Tutor] count words
Ron Nixon wrote:
> I know that you can do this to get a count of how many times a word appears in a file
>
> f = open('text.txt').read()
> print f.count('word')
>
> Other than using several print statements to look for separate words like this, is there a way to do it so that I get an individual count of each word:
>
> word1 xxx
> word2 xxx
> words xxx
>
> etc.

Like this?

A                        14
AND                      1
Abantes                  3
Abarbarea                1
Abas                     1
Abians                   1
Ablerus                  1
About                    2
Abydos                   3
Acamas                   11
Accept                   2
Acessamenus              1
Achaea                   1
Achaean                  34
Achaeans                 540
Achelous                 2
Achilles                 423
Acrisius                 1
Actaea                   1
Actor                    8
Adamas                   5
Admetus                  4
Adrastus                 2
Adresteia                1
Adrestus                 8
Aeacus                   20
Aegae                    2
Aegaeon                  1
Aegeus                   1
Aegialeia                1
Aegialus                 1
Aegilips                 1
Aegina                   1
Aegium                   1
Aeneas                   86
Aenus                    1
Aeolus                   1
Aepea                    2
Aepytus                  1
Aesculapius              7
Aesepus                  2
Aesopus                  4
Aesyetes                 2
Aesyme                   1
Aesymnus                 1
...
wronged                  2
wronging                 1
wrongs                   1
wroth                    1
wrought                  24
wrung                    1
yard                     3
yarded                   1
yards                    2
yawned                   1
ye                       3
yea                      1
year                     13
yearling                 2
yearned                  4
yearning                 2
years                    15
yellow                   5
yesterday                5
yet                      160
yield                    10
yielded                  3
yielding                 3
yieldit                  1
yoke                     24
yoked                    11
yokes                    1
yokestraps               1
yolking                  1
yonder                   3
you                      1712
young                    44
younger                  9
youngest                 6
your                     592
yourelf                  1
yours                    7
yourself                 60
yourselves               17
youselves                1
youth                    17
youths                   18
zeal                     2

I ran the following script on "The Iliad":

#!/usr/bin/env python
import string

text = open('iliad.txt', 'r').read()
for punct in string.punctuation:
    text = text.replace(punct, ' ')
words = text.split()

word_dict = {}
for word in words:
    word_dict[word] = word_dict.get(word, 0) + 1
word_list = word_dict.keys()
word_list.sort()
for word in word_list:
    print "%-25s%d" % (word, word_dict[word])

Jeremy Jones
RE: [Tutor] count words
You could use split() to split the contents of the file into a list of strings.

###
>>> x = 'asdf foo bar foo'
>>> x.split()
['asdf', 'foo', 'bar', 'foo']
###

Here's one way to iterate over that to get the counts. I'm sure there are dozens.

###
>>> x = 'asdf foo bar foo'
>>> counts = {}
>>> for word in x.split():
...     counts[word] = x.count(word)
...
>>> counts
{'foo': 2, 'bar': 1, 'asdf': 1}
###

The dictionary takes care of duplicates. If you are using a really big file, it might pay to eliminate duplicates from the list before running x.count(word)

Thanks,
Ryan

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ron Nixon
Sent: Tuesday, February 15, 2005 11:22 AM
To: tutor@python.org
Subject: [Tutor] count words

I know that you can do this to get a count of how many times a word appears in a file

f = open('text.txt').read()
print f.count('word')

Other than using several print statements to look for separate words like this, is there a way to do it so that I get an individual count of each word:

word1 xxx
word2 xxx
words xxx

etc.
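One thing worth noting about the snippet above: x is a string, so x.count(word) counts substrings, not whole words. It happens to give the right answer for 'asdf foo bar foo', but not in general. A tiny illustration with a made-up string (Python 3 syntax):

```python
x = 'foo food foo'

# str.count() counts substring occurrences, so 'foo' also matches inside 'food'.
substring_count = x.count('foo')

# list.count() compares whole elements, so only exact words match.
word_count = x.split().count('foo')

print(substring_count, word_count)
```

Calling count() on the split list avoids the false matches, though it still rescans the whole list on every call.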
Re: [Tutor] count words
> is there a way to do it so
> that I get an individual count of each word:
>
> word1 xxx
> word2 xxx
> words xxx
>
> etc.

Ron, I'm gonna throw some untested code at you. Let me know if you understand it or not:

# f must be the open file itself, not the string returned by read()
f = open('text.txt')

word_counts = {}
for line in f:
    for word in line.split():
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

for word in word_counts:
    print "%s %d" % (word, word_counts[word])

Peace
Bill Mill
bill.mill at gmail.com
[Tutor] count words
I know that you can do this to get a count of how many times a word appears in a file

f = open('text.txt').read()
print f.count('word')

Other than using several print statements to look for separate words like this, is there a way to do it so that I get an individual count of each word:

word1 xxx
word2 xxx
words xxx

etc.