I guess this may help you:

--------------------------
import operator
from string import punctuation as punc
from string import whitespace as space


class TextProcessing(object):
    """Count word occurrences in a .txt file and report on them."""

    def __init__(self):
        """Start with no words counted."""
        self.file = None
        self.sorted_list = []           # (word, count) pairs, most common first
        self.words_and_occurence = {}   # word -> number of occurrences

    def __sort_dict_by_value(self):
        """Store the (word, count) pairs sorted by count, highest first."""
        self.sorted_list = sorted(self.words_and_occurence.items(),
                                  key=operator.itemgetter(1), reverse=True)

    def __validate_words(self, word):
        """Add one occurrence of word to the counts."""
        if word in self.words_and_occurence:
            self.words_and_occurence[word] += 1
        else:
            self.words_and_occurence[word] = 1

    def __parse_file(self, file_name):
        """Read the file line by line and count every non-empty word."""
        with open(file_name, 'r') as fp:
            for line in fp:
                for word in line.split():
                    # Strip surrounding punctuation; skip tokens that were
                    # nothing but punctuation.
                    word = word.strip(punc + space)
                    if word:
                        self.__validate_words(word)

    def parse_file(self, file_name=None):
        """Check the file name, then count and sort its words."""
        if file_name is None:
            raise Exception("Please pass the file to be parsed")
        if not file_name.endswith(".txt"):
            raise Exception("*** Error *** Not a valid text file")
        self.__parse_file(file_name)
        self.__sort_dict_by_value()

    def print_top_n(self, n):
        """Print the n most common words (fewer if the file has fewer)."""
        # Slicing avoids an IndexError when n exceeds the number of words.
        top = [word for word, count in self.sorted_list[:n]]
        print("Top {0} words: {1}".format(n, top))

    def print_unique_words(self):
        """Print every distinct word, most common first."""
        words = [word for word, count in self.sorted_list]
        print("Unique words: {0}".format(words))


if __name__ == "__main__":
    obj = TextProcessing()
    obj.parse_file(r'test_input.txt')
    obj.print_top_n(4)
    obj.print_unique_words()

-- Regards --

Siva Cn
Python Developer

+91 9620339598
http://www.cnsiva.com
---------------------

On Thu, Oct 17, 2013 at 7:58 PM, <tutor-requ...@python.org> wrote:

> Today's Topics:
>
>    1. Re: Help please (Alan Gauld)
>    2. Re: Help please (Peter Otten)
>    3. Re: Help please (Dominik George)
>    4. Re: Help please (Kengesbayev, Askar)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 17 Oct 2013 14:13:07 +0100
> From: Alan Gauld <alan.ga...@btinternet.com>
> To: tutor@python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <l3onop$oin$1...@ger.gmane.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 16/10/13 19:49, Pinedo, Ruben A wrote:
> > I was given this code and I need to modify it so that it will:
> >
> > #1. Error handling for the files to ensure reading only .txt file
>
> I'm not sure what is meant here since your code only ever opens
> 'emma.txt', so it is presumably a text file... Or are you
> supposed to make the filename a user provided value maybe
> (using raw_input maybe?)
>
> > #2. Print a range of top words... ex: print top 10-20 words
>
> I assume 'top' here means the most common? Whoever is writing the
> specification for this problem needs to be a bit more specific
> in their definitions.
>
> If so you need to fix the bugs in process_line() and
> process_file(). I don't know if these are deliberate bugs
> or somebody is just sloppy. But neither work as expected
> right now. (Hint: Consider the return values of each)
>
> Once you've done that you can figure out how to extract
> the required number of words from your (unsorted) dictionary,
> and put that in a reporting function and print the output.
> You might be able to use the two common words functions,
> although watch out because they don't do exactly what
> you want and one of them is basically broken...
>
> > #3. Print only the words with > 3 characters
>
> Modify the above to discard words of 3 letters or less.
>
> > #4. Modify the printing function to print top 1 or 2 or 3 ....
>
> I assume this means take a parameter that specifies the
> number of words to print. Or it could be the length of
> word to ignore. Again the specification is woolly.
> In either case it's a small modification to your
> reporting function.
>
> > #5. How many unique words are there in the book of length 1, 2, 3 etc
>
> This is slicing the data slightly differently but
> again not that different to the earlier requirement.
>
> > I am fairly new to python and am completely lost, i looked in my book as
> > to how to do number one but i cannot figure out what to modify and/or
> > delete to add the print selection. This is the code:
>
> You need to modify the two broken functions and add a
> new reporting function. (Despite the reference to a
> printing function I'd suggest keeping the data extraction
> and printing separate.)
>
> > import string
> >
> > def process_file(filename):
> >     hist = dict()
> >     fp = open(filename)
> >     for line in fp:
> >         process_line(line, hist)
> >     return hist
> >
> > def process_line(line, hist):
> >     line = line.replace('-', ' ')
> >     for word in line.split():
> >         word = word.strip(string.punctuation + string.whitespace)
> >         word = word.lower()
> >         hist[word] = hist.get(word, 0) + 1
> >
> > def common_words(hist):
> >     t = []
> >     for key, value in hist.items():
> >         t.append((value, key))
> >     t.sort(reverse=True)
> >     return t
> >
> > def most_common_words(hist, num=100):
> >     t = common_words(hist)
> >     print 'The most common words are:'
> >     for freq, word in t[:num]:
> >         print freq, '\t', word
> >
> > hist = process_file('emma.txt')
> > print 'Total num of Words:', sum(hist.values())
> > print 'Total num of Unique Words:', len(hist)
> > most_common_words(hist, 50)
> >
> > Any help would be greatly appreciated because i am struggling in this
> > class. Thank you in advance
>
> hth
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.flickr.com/photos/alangauldphotos
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 17 Oct 2013 15:37:49 +0200
> From: Peter Otten <__pete...@web.de>
> To: tutor@python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <l3op59$8n6$1...@ger.gmane.org>
> Content-Type: text/plain; charset="ISO-8859-1"
>
> Alan Gauld wrote:
>
> [Ruben Pinedo]
>
> > def process_file(filename):
> >     hist = dict()
> >     fp = open(filename)
> >     for line in fp:
> >         process_line(line, hist)
> >     return hist
> >
> > def process_line(line, hist):
> >     line = line.replace('-', ' ')
> >
> >     for word in line.split():
> >         word = word.strip(string.punctuation + string.whitespace)
> >         word = word.lower()
> >
> >         hist[word] = hist.get(word, 0) + 1
>
> [Alan Gauld]
>
> > If so you need to fix the bugs in process_line() and
> > process_file(). I don't know if these are deliberate bugs
> > or somebody is just sloppy. But neither work as expected
> > right now. (Hint: Consider the return values of each)
>
> I fail to see the bug.
>
> process_line() mutates its `hist` argument, so there's no need to return
> something. Or did you mean something else that escapes me?
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 17 Oct 2013 16:17:27 +0200
> From: Dominik George <n...@naturalnet.de>
> To: Todd Matsumoto <c.t.matsum...@gmail.com>,tutor@python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <f310f0be-858d-48e2-ae88-5ad720518...@email.android.com>
> Content-Type: text/plain; charset=UTF-8
>
> Todd Matsumoto <c.t.matsum...@gmail.com> wrote:
> >> #1. Error handling for the files to ensure reading only .txt file
> >Look up exceptions.
> >Find out what the string method endswith() does.
>
> One should note that the OP probably meant files of the type text/plain
> rather than .txt files. File name extensions are a convenience to identify
> a file on first glance, but they tell absolutely nothing about the contents.
>
> So, look up MIME types as well ;)!
>
> -nik
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 17 Oct 2013 14:21:17 +0000
> From: "Kengesbayev, Askar" <askar.kengesba...@etrade.com>
> To: "Pinedo, Ruben A" <rapin...@miners.utep.edu>, "tutor@python.org"
>         <tutor@python.org>
> Subject: Re: [Tutor] Help please
> Message-ID:
>         <6fad14604b087b438f6ff64d9875a40c68f5a...@atl1ex10mbx4.corp.etradegrp.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Ruben,
>
> #1. You can try something like this:
> try:
>     with open('my_file.txt') as file:
>         pass
> except IOError as e:
>     # Does not exist or you do not have read permission
>     print "Unable to open file"
>
> #2. I would try using a regular expression to push the words into an array,
> and then you can manipulate the array. Not sure if it is an efficient way,
> but it should work.
> #3. An easy way would be to use a regular expression (the re module).
> #4. Once you have the array from #2, you can sort it and print whatever top
> words you need.
> #5. I am not sure of the best way to do this, but you can play with the
> array from #2.
>
> Thanks,
> Askar
>
> From: Pinedo, Ruben A [mailto:rapin...@miners.utep.edu]
> Sent: Wednesday, October 16, 2013 2:49 PM
> To: tutor@python.org
> Subject: [Tutor] Help please
>
> I was given this code and I need to modify it so that it will:
>
> #1. Error handling for the files to ensure reading only .txt file
> #2. Print a range of top words... ex: print top 10-20 words
> #3. Print only the words with > 3 characters
> #4. Modify the printing function to print top 1 or 2 or 3 ....
> #5. How many unique words are there in the book of length 1, 2, 3 etc
>
> I am fairly new to python and am completely lost, i looked in my book as
> to how to do number one but i cannot figure out what to modify and/or
> delete to add the print selection. This is the code:
>
>
> import string
>
> def process_file(filename):
>     hist = dict()
>     fp = open(filename)
>     for line in fp:
>         process_line(line, hist)
>     return hist
>
> def process_line(line, hist):
>     line = line.replace('-', ' ')
>
>     for word in line.split():
>         word = word.strip(string.punctuation + string.whitespace)
>         word = word.lower()
>
>         hist[word] = hist.get(word, 0) + 1
>
> def common_words(hist):
>     t = []
>     for key, value in hist.items():
>         t.append((value, key))
>
>     t.sort(reverse=True)
>     return t
>
> def most_common_words(hist, num=100):
>     t = common_words(hist)
>     print 'The most common words are:'
>     for freq, word in t[:num]:
>         print freq, '\t', word
>
> hist = process_file('emma.txt')
> print 'Total num of Words:', sum(hist.values())
> print 'Total num of Unique Words:', len(hist)
> most_common_words(hist, 50)
>
> Any help would be greatly appreciated because i am struggling in this
> class. Thank you in advance
>
> Respectfully,
>
> Ruben Pinedo
> Computer Information Systems
> College of Business Administration
> University of Texas at El Paso
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> https://mail.python.org/mailman/listinfo/tutor
>
>
> ------------------------------
>
> End of Tutor Digest, Vol 116, Issue 37
> **************************************
>
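
For points #2, #3 and #5 of the quoted assignment, here is a rough sketch rather than a finished answer. It assumes the counts are already in a plain dict mapping each word to its number of occurrences, like the hist returned by process_file() in the quoted code. The names top_range, longer_than and unique_by_length are made up for this illustration, and the sample counts are invented, not taken from emma.txt. It is written so that it should run on both Python 2.7 and Python 3.

```python
from collections import Counter


def top_range(hist, start, stop):
    """Return the words ranked start..stop (1-based, inclusive) by count."""
    # Sort by descending count, then alphabetically so ties have a fixed order.
    ranked = sorted(hist.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, count in ranked[start - 1:stop]]


def longer_than(hist, min_len=3):
    """Return a new dict holding only words with more than min_len characters."""
    return dict((word, count) for word, count in hist.items()
                if len(word) > min_len)


def unique_by_length(hist):
    """Return {word length: number of distinct words of that length}."""
    return dict(Counter(len(word) for word in hist))


if __name__ == "__main__":
    # Hypothetical counts standing in for a parsed book.
    hist = {"the": 5, "emma": 4, "a": 3, "woodhouse": 2, "of": 2, "and": 1}
    print(top_range(hist, 2, 3))                   # 2nd and 3rd most common
    print(sorted(longer_than(hist).items()))       # words with > 3 characters
    print(sorted(unique_by_length(hist).items()))  # distinct words per length
```

Keeping these as functions that return data, with the printing done separately, follows the suggestion in the quoted thread to keep data extraction and printing apart. To print the top 10-20 words of words longer than three characters, one could call top_range(longer_than(hist), 10, 20).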
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor