"possibilitybox" <[EMAIL PROTECTED]> writes: > this code here: > > > def wordcount(lines): > for i in range(len(lines)/8): > words = lines[i].split(" ") > if not locals().has_key("frequency"): > frequency = {} > for word in words: > if frequency.has_key(word): > frequency[word] += 1 > else: > frequency[word] = 1 > return frequency > wordcount(lines) > > is taking over six minutes to run on a two megabyte text file. i > realize that's a big text file, a really big one (it's actually the > full text of don quixote.). i'm trying to figure out how. is there a > better way for me to do a frequency count of all the words in the text?
2MB is not that large. Your method is ok and shouldn't be that slow
unless you're on a pretty slow PC. Could your machine be short of
memory and paging a lot?

You could tweak the code somewhat by moving the initialization of the
frequency dict out of the loop and combining a few other statements.
Also, you should use xrange instead of range, to avoid allocating a big
list in memory:

    def wordcount(lines):
        frequency = {}
        for i in xrange(len(lines)/8):
            for word in lines[i].split():
                frequency[word] = 1 + frequency.get(word, 0)
        return frequency

    wordcount(lines)

> it seems to me like this should scale linearly, but perhaps it isn't?
> i don't know much about algorithmic complexity. if someone could give
> a breakdown of this function's complexity as well i'd be much obliged.

It should be close to linear, or at worst n log n, depending on what
happens when dicts have to be enlarged as the number of elements
increases.

Why are you only processing 1/8th of the lines?
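If the goal is to count every word in the file, you can also iterate
over the file object directly and avoid building a list of lines in
memory at all. Here is a minimal sketch along those lines, assuming
Python 2 as used above; the function name wordcount_file and the
filename "quixote.txt" are made up for illustration:

    def wordcount_file(filename):
        # Count every word in the file, not just the first eighth of the lines.
        frequency = {}
        f = open(filename)
        try:
            for line in f:                 # one line at a time; no big list in memory
                for word in line.split():  # split() with no argument handles tabs and runs of spaces
                    frequency[word] = frequency.get(word, 0) + 1
        finally:
            f.close()
        return frequency

    counts = wordcount_file("quixote.txt")  # hypothetical path to the 2MB text

On a 2MB file this should finish in seconds rather than minutes, since
each word is touched once and dict lookups are roughly constant time.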