Message: 1
Date: Fri, 03 May 2013 23:05:32 +0100
From: Alan Gauld <alan.ga...@btinternet.com>
To: tutor@python.org
Subject: Re: [Tutor] creating a corpus from a csv file
Message-ID: <km1cb8$ist$1...@ger.gmane.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 03/05/13 21:48, Treder, Robert wrote:
> I'm very new to python and am trying to figure out how to
> make a corpus from a text file.

Hi, I for one have no idea what a corpus is or looks like,
so you will need to help us out a little before we can help you.

> I have a csv file (actually pipe '|' delimited) where each row
> corresponds to a different text document.
> Each row contains a communication note.
> Other columns correspond to categories of types of communications.
> I am able to read the csv file and print the notes column as follows:
>
> import csv
> with open('notes.txt', 'rb') as infile:
>     reader = csv.reader(infile, delimiter = '|')
>     i = 0
>     for row in reader:
>         if i <= 25: print row[8]
>         i = i+1

You don't need to manually manage 'i'. You could do this instead:

with open('notes.txt', 'rb') as infile:
    reader = csv.reader(infile, delimiter = '|')
    for count, row in enumerate(reader):
        if count <= 25:
            print row[8]   # I assume indented?
        else:
            break          # save time if it's a big file

> I would like to convert this to a categorized corpus with
> some of the other columns corresponding to the categories.

You might be able to use a dictionary, but for now I'm still not clear
what you mean. Can you show us some sample input and output data?

> documentation on how to use csv.reader with PlaintextCorpusReader

Never heard of the latter - is it an external module?

HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/

------------------------------

Message: 7
Date: Sat, 04 May 2013 10:29:57 +0200
From: Peter Otten <__pete...@web.de>
To: tutor@python.org
Subject: Re: [Tutor] creating a corpus from a csv file
Message-ID: <km2gtu$o7a$1...@ger.gmane.org>
Content-Type: text/plain; charset="ISO-8859-1"

Treder, Robert wrote:

> I'm very new to python and am trying to figure out how to make a corpus
> from a text file. I have a csv file (actually pipe '|' delimited) where
> each row corresponds to a different text document. Each row contains a
> communication note.
> Other columns correspond to categories of types of
> communications. I am able to read the csv file and print the notes column
> as follows:
>
> import csv
> with open('notes.txt', 'rb') as infile:
>     reader = csv.reader(infile, delimiter = '|')
>     i = 0
>     for row in reader:
>         if i <= 25: print row[8]
>         i = i+1
>
> I would like to convert this to a categorized corpus with some of the
> other columns corresponding to the categories. All of the columns are text
> (i.e., strings). I have looked for documentation on how to use csv.reader
> with PlaintextCorpusReader but have been unsuccessful in finding a
> example similar to what I want to do. Can someone please help?

This mailing list is for learning Python. For problems with a specific
library you should use the general python list

<http://mail.python.org/mailman/listinfo/python-list>

or a forum dedicated to that library

<http://groups.google.com/group/nltk-users>

If you ask on a general forum you should give some context -- the name of
the library would be the bare minimum.

The following comes with no warranties as I'm not an nltk user:

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

from itertools import islice, chain

LIMIT_SIZE = 25 # set to None if not debugging

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = islice(csv.reader(infile, delimiter="|"), LIMIT_SIZE)
        for row in rows:
            # assume that columns 10 and above contain categories
            yield row[8], row[9:]

if __name__ == "__main__":
    import random

    FILENAME = "notes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))

    files = list(file_to_categories)
    all_categories = set(chain.from_iterable(file_to_categories.itervalues()))

    reader = CategorizedPlaintextCorpusReader(".", files, cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))

------------------------------

Alan, Peter,

Thanks for your responses. Sorry about the lack of context and module
information in my initial post. Peter got the context right - creating
Python object(s) from a collection of text documents (the corpus) in
preparation for text mining and modeling.

The modified script from Peter follows. I dropped the size limitation and
have included some test data below. Problems still exist. The code attempts
to read files with names based on concatenating the first and third columns,
the data that is coming from the yield. Consequently, I'm convinced I will
need to write a custom csvCorpusReader. I've received some tips for that
from an nltk email group. If anyone has additional suggestions or comments
I would love to hear them.

Thanks,
Bob

##### Code below here #####

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

from itertools import islice, chain

#filename = 'L:/gps_pa/DEV/TextMining/EmailTickerInterest/Data/testNotes.txt'
filename = "C:/nltk_data/corpora/notes/testNotes.txt"

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = csv.reader(infile, delimiter="|")
        for row in rows:
            yield row[0], row[2]
            print row[0], row[2]

if __name__ == "__main__":
    import random

    FILENAME = "C:/nltk_data/corpora/notes/testNotes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))

    files = list(file_to_categories)
    all_categories = set(chain.from_iterable(file_to_categories.itervalues()))
    print all_categories

    reader = CategorizedPlaintextCorpusReader(".", files, cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))

Some test data looks like the following, the first row being column headers:

CID|X|MID|note
1|not|101|note 1
2|any|102|note 2
3|thing|103|note 3
4|tbd|104|note 4

Modifying Peter's code to get it to run as far as possible.

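A minimal sketch of one way around the problem Bob describes, written for Python 3 rather than the Python 2 used in the thread. Peter's script hands the corpus reader a list of file names, so the notes would need to exist as files on disk. Here each row's note is written to its own file and a file-name-to-categories dictionary is built alongside. The function name write_corpus, the note_<CID>.txt naming scheme, and the choice of X as the category column are all assumptions made for illustration; the thread does not say which columns hold the categories.

```python
import csv
import io
import os
import tempfile

# Sample data copied from the thread; the first row holds the column headers.
SAMPLE = """\
CID|X|MID|note
1|not|101|note 1
2|any|102|note 2
3|thing|103|note 3
4|tbd|104|note 4
"""


def write_corpus(rows, root, text_col="note", category_cols=("X",), id_col="CID"):
    """Write one text file per row and return a {file name: [categories]} dict.

    The column roles passed as defaults are assumptions for this sketch,
    not something the thread specifies.
    """
    file_to_categories = {}
    for row in rows:
        # Hypothetical naming scheme: one file per row, keyed on the id column.
        name = "note_{}.txt".format(row[id_col])
        with open(os.path.join(root, name), "w", encoding="utf-8") as out:
            out.write(row[text_col])
        file_to_categories[name] = [row[col] for col in category_cols]
    return file_to_categories


if __name__ == "__main__":
    # DictReader consumes the header row and yields one dict per data row.
    reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="|")
    with tempfile.TemporaryDirectory() as root:
        cat_map = write_corpus(reader, root)
        print(sorted(os.listdir(root)))
        print(cat_map)
```

The returned dictionary has the same shape as file_to_categories in Peter's script (file name mapped to a list of categories), so it may be usable as the cat_map argument with root as the corpus directory. That step has not been checked against nltk here and should be treated as untested.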