On 30Sep2017 22:42, Sri G. <srigal...@gmail.com> wrote:
I’ve written the code below for finding the most common words in a text
file that has about 1.1 million words. It's working fine, but I believe
there is always room for improvement.

When run, the function in the script gets a text file from the command-line
argument sys.argv[1], opens the file in read mode, converts the text to
lowercase, makes a list of words from the text after removing any
whitespaces or empty strings, and stores the list elements as dictionary
keys and values in a collections.Counter object. Finally, it returns a
dictionary of the most common words and their counts. The
words.most_common() method gets its argument from the optional top
parameter.

Thank you, this is a nice clear explaination.

import collections

It is common to import specific names from modules; whether you just import the module or specific names depends on your needs. However, in this case, instead of importing "collections" and creating a counter with:

 collections.Counter()

I would import the name "Counter" and use it like this:

 from collections import Counter
 ...
 ... Counter() ...

def find_most_common_words(textfile, top=10):

It is often good form to externalise the "10" as a constant (Python doesn't have constants per se, but it has a convention for value which would be constants in other languages):

 DEFAULT_TOP_SIZE = 10

 def find_most_common_words(textfile, top=DEFAULT_TOP_SIZE):

In a library of functions you can then put all the "constants" up the top where they can be seen or modified, and reference then lower down where the functions are defined.

   ''' Returns the most common words in the textfile.'''
   textfile = open(textfile)

You shouldn't reuse "textfile" as a file. Keep the filename value distinct. I tend to use names like "textpath" for the filename and "textfile" for the open file, eg:

 textfile = open(textpath)

Of course you change the parameter name to match.

   textfile = open(textfile)
   text = textfile.read().lower()
   textfile.close()

The preferred way of opening files looks like this:

 with open(textpath) as textfile:
   ... work on the data from textfile ...
   ...
   ...

This automatically closes textfile on exiting the "with" suite, even if an exception occurs (which is very handy in situations where the exception might be caught in outer code - you never miss closing the file). It is also shorter and more readable.

   text = textfile.read().lower()

This code reads the entire text into memory. It is more frugal to read the file progressively, which is good because it scales to files of any size. This concern is prominent in my mind because I started programming on systems with a 16 bit address space, so 64K max including the program itself _and_ the OS. But it applies even today when many data files are very large.

So one might process file file as lines of text:

 C = Counter()
 for line in textfile:
   words = line.lower().split()
   C.update(words)

This avoids the need for more than a single line to be stored in memory at any given time.

Cheers,
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Reply via email to