Re: [Tutor] creating a corpus from a csv file

Treder, Robert Mon, 13 May 2013 07:25:04 -0700

Message: 1
Date: Fri, 03 May 2013 23:05:32 +0100
From: Alan Gauld <[email protected]>
To: [email protected]
Subject: Re: [Tutor] creating a corpus from a csv file
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 03/05/13 21:48, Treder, Robert wrote:

> I'm very new to python and am trying to figure out how to
> make a corpus from a text file.

Hi, I for one have no idea what a corpus is or looks like so you will need to 
help us out a little before we can help you.

> I have a csv file (actually pipe '|' delimited) where each row 
> corresponds to a different text document.

> Each row contains a communication note.
> Other columns correspond to categories of types of communications.

> I am able to read the csv file and print the notes column as follows:
>
> import csv
> with open('notes.txt', 'rb') as infile:
>      reader = csv.reader(infile, delimiter = '|')
>      i = 0
>      for row in reader:
>      if i <= 25: print row[8]
>      i = i+1

You don't need to manually manage 'i'.

You could do this instead:

with open('notes.txt', 'rb') as infile:
      reader = csv.reader(infile, delimiter = '|')
      for count, row in enumerate(reader):
          if count <= 25: print row[8]  # I assume indented?
          else: break                   # save time if it's a big file

> I would like to convert this to a categorized corpus with
> some of the other columns corresponding to the categories.

You might be able to use a dictionary but for now I'm still not clear what you 
mean. Can you show us some sample input and output data?
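
A minimal sketch of the dictionary idea, assuming Python 3 and a hypothetical layout in which column 0 holds a category and column 1 holds the note. The sample rows and the name `notes_by_category` are made up for illustration; they are not taken from the real file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical pipe-delimited sample; the real file has more columns.
SAMPLE = "email|first note\nphone|second note\nemail|third note\n"

def notes_by_category(lines, category_col=0, note_col=1):
    """Map each category to the list of notes filed under it."""
    grouped = defaultdict(list)
    for row in csv.reader(lines, delimiter="|"):
        grouped[row[category_col]].append(row[note_col])
    return dict(grouped)

grouped = notes_by_category(io.StringIO(SAMPLE))
for category, notes in sorted(grouped.items()):
    print(category, notes)
```

For the real file you would pass the open file object instead of the `io.StringIO` wrapper and adjust the two column numbers.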

> documentation on how to use csv.reader with PlaintextCorpusReader

Never heard of the latter. Is it an external module?

HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/

Message: 7
Date: Sat, 04 May 2013 10:29:57 +0200
From: Peter Otten <[email protected]>
To: [email protected]
Subject: Re: [Tutor] creating a corpus from a csv file
Message-ID: <[email protected]>
Content-Type: text/plain; charset="ISO-8859-1"

Treder, Robert wrote:

> I'm very new to python and am trying to figure out how to make a corpus
> from a text file. I have a csv file (actually pipe '|' delimited) where
> each row corresponds to a different text document. Each row contains a
> communication note. Other columns correspond to categories of types of
> communications. I am able to read the csv file and print the notes column
> as follows:
>  
> import csv
> with open('notes.txt', 'rb') as infile:
>     reader = csv.reader(infile, delimiter = '|')
>     i = 0
>     for row in reader:
>     if i <= 25: print row[8]
>     i = i+1
> 
> I would like to convert this to a categorized corpus with some of the
> other columns corresponding to the categories. All of the columns are text
> (i.e., strings). I have looked for documentation on how to use csv.reader
> with PlaintextCorpusReader but have been unsuccessful in finding an
> example similar to what I want to do. Can someone please help?

This mailing list is for learning Python. For problems with a specific 
library you should use the general python list 

<http://mail.python.org/mailman/listinfo/python-list>

or a forum dedicated to that library

<http://groups.google.com/group/nltk-users>

If you ask on a general forum you should give some context -- the name of 
the library would be the bare minimum.

The following comes with no warranties as I'm not an nltk user:

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from itertools import islice, chain

LIMIT_SIZE = 25 # set to None if not debugging

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = islice(csv.reader(infile, delimiter="|"), LIMIT_SIZE)
        for row in rows:
            # assume that columns 10 and above contain categories
            yield row[8], row[9:]

if __name__ == "__main__":
    import random
    FILENAME = "notes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))

    files = list(file_to_categories)

    all_categories = set(
        chain.from_iterable(file_to_categories.itervalues()))

    reader = CategorizedPlaintextCorpusReader(
        ".", files, cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))
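
The reader above treats the first item of each pair as the name of a file on disk, so the notes themselves have to exist as files before it can read them. A rough sketch of one way to get there, assuming Python 3: write each note to its own file and build the `cat_map` at the same time. The names `export_notes`, `build_reader` and `note_NNNN.txt`, and the default column positions, are assumptions for illustration; the nltk call is kept in a separate function and has not been tested here.

```python
import csv
import os

def export_notes(csv_path, out_dir, note_col=8, first_cat_col=9):
    """Write one text file per row; return {file_id: [categories]}."""
    os.makedirs(out_dir, exist_ok=True)
    cat_map = {}
    with open(csv_path, newline="") as infile:
        for n, row in enumerate(csv.reader(infile, delimiter="|")):
            file_id = "note_{:04d}.txt".format(n)  # hypothetical naming scheme
            with open(os.path.join(out_dir, file_id), "w") as out:
                out.write(row[note_col])
            cat_map[file_id] = row[first_cat_col:]
    return cat_map

def build_reader(out_dir, cat_map):
    """Build the nltk reader from the exported files (needs nltk installed)."""
    from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
    return CategorizedPlaintextCorpusReader(
        out_dir, sorted(cat_map), cat_map=cat_map)
```

The design choice is to keep the file ids synthetic, so nothing depends on the note text being usable as a file name.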

------------------------------
Alan, Peter, 

Thanks for your responses. Sorry about the lack of context and module 
information in my initial post. Peter got the context right: creating Python 
objects from a collection of text documents (the corpus) in preparation for 
text mining and modeling. The modified script from Peter follows. I 
dropped the size limitation and have included some test data below. 

Problems still exist. The code attempts to read files with names based on 
concatenating the first and third columns, which is the data coming from the 
yield statement. Consequently, I'm convinced I will need to write a custom 
csvCorpusReader. I've received some tips for that from an nltk email group. 

If anyone has additional suggestions or comments I would love to hear them. 

Thanks, 
Bob

#####  Code below here   #####

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from itertools import islice, chain

#filename = 'L:/gps_pa/DEV/TextMining/EmailTickerInterest/Data/testNotes.txt' # set to None if not debugging
filename = "C:/nltk_data/corpora/notes/testNotes.txt" # set to None if not debugging

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = csv.reader(infile, delimiter="|")
        for row in rows:
            yield row[0], row[2]
            print row[0], row[2]

if __name__ == "__main__":
    import random
    FILENAME = "C:/nltk_data/corpora/notes/testNotes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))

    files = list(file_to_categories)

    all_categories = set(chain.from_iterable(file_to_categories.itervalues()))
    print all_categories

    reader = CategorizedPlaintextCorpusReader(
        ".", files, cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))

Some test data looks like the following, the first row being column headers: 

CID|X|MID|note
1|not|101|note 1
2|any|102|note 2
3|thing|103|note 3
4|tbd|104|note 4

I modified Peter's code only as far as needed to get it to run as far as possible.
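
One thing that seems to go wrong in the modified script: `pairs()` now yields a plain string (`row[2]`) instead of a list, so `chain.from_iterable` splits it into single characters, and the header row is treated as data. A small sketch with the test data above, assuming Python 3 and assuming that MID is meant to be the category; `load_pairs` is a hypothetical helper, not part of csv or nltk.

```python
import csv
import io
from itertools import chain

TEST_DATA = """CID|X|MID|note
1|not|101|note 1
2|any|102|note 2
3|thing|103|note 3
4|tbd|104|note 4
"""

def load_pairs(lines):
    """Yield (CID, [MID]) pairs; DictReader consumes the header row."""
    for row in csv.DictReader(lines, delimiter="|"):
        yield row["CID"], [row["MID"]]  # wrap in a list: one category per note

rows = list(csv.reader(io.StringIO(TEST_DATA), delimiter="|"))

# What the modified script computes: strings are iterated character by character.
as_strings = set(chain.from_iterable(row[2] for row in rows))

# With list-valued categories and the header skipped:
id_to_categories = dict(load_pairs(io.StringIO(TEST_DATA)))
as_lists = set(chain.from_iterable(id_to_categories.values()))

print(sorted(as_strings))
print(sorted(as_lists))
```

This only addresses the category set. The reader would still look for files named after the first column, so the notes would have to be written out as files, or a custom reader written, as mentioned above.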
