Switanek, Nick wrote:
> Thanks very much for your help.
>
> I did indeed neglect to put the "print" in the code that I sent to the
> list.
>
> It appears that the step that is taking a long time, and that therefore
> makes me think that the script is somehow broken, is creating a
> dictionary of frequencies from the list of ngrams. To do this, I've
> written, for example:
>
> bigramDict = {}
> bigrams = [' '.join(wordlist[i:i+2]) for i in range(len(wordlist)-2+1)]
> for bigram in bigrams:
>     if bigram in bigramDict.keys(): bigramDict[bigram] += 1
>     else: bigramDict[bigram] = 1
Ouch! bigramDict.keys() creates a *new* *list* of all the keys in
bigramDict. You then search the list - a linear search! - for bigram.
I'm not surprised that this gets slow. If you change that line to

    if bigram in bigramDict: bigramDict[bigram] += 1

you should see a dramatic improvement.

Kent

> With around 500,000 bigrams, this is taking over 25 minutes to run (and
> I haven't sat around to let it finish) on an XP machine at 3.0GHz and
> 1.5GB RAM. I bet I'm trying to reinvent the wheel here, and that there
> are faster algorithms available in some package. I think possibly an
> indexing package like PyLucene would help create frequency dictionaries,
> but I can't figure it out from the online material available. Any
> suggestions?
>
> Thanks,
> Nick

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
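For reference, a minimal sketch of the same frequency count that avoids the membership test entirely, using dict.get with a default value (the small wordlist here is a made-up stand-in for the real corpus, and the counting logic follows the pattern Kent describes rather than any code from the original post):

```python
# Toy word list standing in for the real 500,000-word corpus (illustrative only)
wordlist = "the quick brown fox jumps over the lazy dog the quick fox".split()

# Build the bigram strings the same way as in the original post;
# range(len(wordlist) - 1) is equivalent to range(len(wordlist) - 2 + 1)
bigrams = [' '.join(wordlist[i:i+2]) for i in range(len(wordlist) - 1)]

# dict.get(key, 0) returns 0 for unseen bigrams, so each dictionary
# operation is a constant-time hash lookup - no list of keys, no scan
bigramDict = {}
for bigram in bigrams:
    bigramDict[bigram] = bigramDict.get(bigram, 0) + 1

print(bigramDict['the quick'])  # 'the quick' occurs twice in this toy list
```

With hash lookups instead of a linear scan over a freshly built key list per iteration, the loop over 500,000 bigrams should finish in seconds rather than minutes.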