Re: [Tutor] Most common words in a text file
On 09/30/2017 11:12 AM, Sri G. wrote: > I'm learning programming with Python. > > I’ve written the code below for finding the most common words in a text > file that has about 1.1 million words. It's working fine, but I believe > there is always room for improvement. > > When run, the function in the script gets a text file from the command-line > argument sys.argv[1], opens the file in read mode, converts the text to > lowercase, makes a list of words from the text after removing any > whitespaces or empty strings, and stores the list elements as dictionary > keys and values in a collections.Counter object. Finally, it returns a > dictionary of the most common words and their counts. The > words.most_common() method gets its argument from the optional top > parameter. > > import sysimport collections > def find_most_common_words(textfile, top=10): > ''' Returns the most common words in the textfile.''' > > textfile = open(textfile) > text = textfile.read().lower() > textfile.close() > words = collections.Counter(text.split()) # how often each word appears > > return dict(words.most_common(top)) > > filename = sys.argv[1] > top_five_words = find_most_common_words(filename, 5) > > I need your comments please. Others have made some pertinent comments already. How much you spend time to "improve" a bit of code depends on what you're going to do with it. If you've solved your problem, and it's a one-shot: don't worry much about it. Nothing wrong with a bit of throwaway code (although things do sometimes take on a life much longer than you intended, I can say from experience!!!) I'd ask a question or two to think about: first off, if you know the intended use of this function always will be to get a "top 10 list" - then why convert it back to a dict (unsorted) to return after Counter.most_common() has just given you a sorted list? You're most likely going to have to take your top_five_words dictionary and turn it back into something sorted to report back out on the counts. Second, if your function is likely to be called from many places in your code, is it too specialized? Can you design it as a more general API that could be used for other things than this specific purpose? For example, perhaps you just want to return the Counter instance without the further processing of "most_common", and let the client decide what it wants to do with that object. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Most common words in a text file
On 30/09/17 18:12, Sri G. wrote: > import sysimport collections I assume that should be two lines? But you can also import multiple modules on a single line. import sys, collections Although some folks don't like that style. > def find_most_common_words(textfile, top=10): > ''' Returns the most common words in the textfile.''' The comment is slightly inaccurate since you really return a dict of the most common words *with the counts* added. It is good practice to specify the precise return type (list, tuple, dict etc) since that tells the user what they can do with it once they have it. Also by using the parameter textfile it is not clear whether I should pass a file object or a file name. Again it helps users if the comment is as specific as possible. > textfile = open(textfile) > text = textfile.read().lower() potential memory hog, others have already suggested reading line by line > textfile.close() > words = collections.Counter(text.split()) # how often each word appears > > return dict(words.most_common(top)) -- Alan G Author of the Learn to Program web site http://www.alan-g.me.uk/ http://www.amazon.com/author/alan_gauld Follow my photo-blog on Flickr at: http://www.flickr.com/photos/alangauldphotos ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Most common words in a text file
On 30/09/2017 18:12, Sri G. wrote: I'm learning programming with Python. I’ve written the code below for finding the most common words in a text file that has about 1.1 million words. It's working fine, but I believe there is always room for improvement. When run, the function in the script gets a text file from the command-line argument sys.argv[1], opens the file in read mode, converts the text to lowercase, makes a list of words from the text after removing any whitespaces or empty strings, and stores the list elements as dictionary keys and values in a collections.Counter object. Finally, it returns a dictionary of the most common words and their counts. The words.most_common() method gets its argument from the optional top parameter. import sysimport collections def find_most_common_words(textfile, top=10): ''' Returns the most common words in the textfile.''' textfile = open(textfile) text = textfile.read().lower() textfile.close() The modern Pythonic way is:- with open(textfile) as textfile: text = textfile.read().lower() The file close is handled automatically for you. For those who don't know this construct using the "with" keyword is called a context manager, here's an article about them https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/ words = collections.Counter(text.split()) # how often each word appears return dict(words.most_common(top)) filename = sys.argv[1] How about some error handling if the user forgets the filename? The Pythonic way is to use a try/except looking for an IndexError, but there's nothing wrong with checking the length of sys.argv. top_five_words = find_most_common_words(filename, 5) I need your comments please. Sri Pretty good all in all :) -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence --- This email has been checked for viruses by AVG. http://www.avg.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Most common words in a text file
On 30Sep2017 22:42, Sri G. wrote: I’ve written the code below for finding the most common words in a text file that has about 1.1 million words. It's working fine, but I believe there is always room for improvement. When run, the function in the script gets a text file from the command-line argument sys.argv[1], opens the file in read mode, converts the text to lowercase, makes a list of words from the text after removing any whitespaces or empty strings, and stores the list elements as dictionary keys and values in a collections.Counter object. Finally, it returns a dictionary of the most common words and their counts. The words.most_common() method gets its argument from the optional top parameter. Thank you, this is a nice clear explaination. import collections It is common to import specific names from modules; whether you just import the module or specific names depends on your needs. However, in this case, instead of importing "collections" and creating a counter with: collections.Counter() I would import the name "Counter" and use it like this: from collections import Counter ... ... Counter() ... def find_most_common_words(textfile, top=10): It is often good form to externalise the "10" as a constant (Python doesn't have constants per se, but it has a convention for value which would be constants in other languages): DEFAULT_TOP_SIZE = 10 def find_most_common_words(textfile, top=DEFAULT_TOP_SIZE): In a library of functions you can then put all the "constants" up the top where they can be seen or modified, and reference then lower down where the functions are defined. ''' Returns the most common words in the textfile.''' textfile = open(textfile) You shouldn't reuse "textfile" as a file. Keep the filename value distinct. I tend to use names like "textpath" for the filename and "textfile" for the open file, eg: textfile = open(textpath) Of course you change the parameter name to match. textfile = open(textfile) text = textfile.read().lower() textfile.close() The preferred way of opening files looks like this: with open(textpath) as textfile: ... work on the data from textfile ... ... ... This automatically closes textfile on exiting the "with" suite, even if an exception occurs (which is very handy in situations where the exception might be caught in outer code - you never miss closing the file). It is also shorter and more readable. text = textfile.read().lower() This code reads the entire text into memory. It is more frugal to read the file progressively, which is good because it scales to files of any size. This concern is prominent in my mind because I started programming on systems with a 16 bit address space, so 64K max including the program itself _and_ the OS. But it applies even today when many data files are very large. So one might process file file as lines of text: C = Counter() for line in textfile: words = line.lower().split() C.update(words) This avoids the need for more than a single line to be stored in memory at any given time. Cheers, Cameron Simpson (formerly c...@zip.com.au) ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
[Tutor] Most common words in a text file
I'm learning programming with Python. I’ve written the code below for finding the most common words in a text file that has about 1.1 million words. It's working fine, but I believe there is always room for improvement. When run, the function in the script gets a text file from the command-line argument sys.argv[1], opens the file in read mode, converts the text to lowercase, makes a list of words from the text after removing any whitespaces or empty strings, and stores the list elements as dictionary keys and values in a collections.Counter object. Finally, it returns a dictionary of the most common words and their counts. The words.most_common() method gets its argument from the optional top parameter. import sysimport collections def find_most_common_words(textfile, top=10): ''' Returns the most common words in the textfile.''' textfile = open(textfile) text = textfile.read().lower() textfile.close() words = collections.Counter(text.split()) # how often each word appears return dict(words.most_common(top)) filename = sys.argv[1] top_five_words = find_most_common_words(filename, 5) I need your comments please. Sri ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor