Art Kendall wrote:
I am running Windows 7 64-bit Home Premium, with a quad-core CPU and 8 GB of
memory. I am using Python 2.6.2.
I have all the Federalist Papers concatenated into one .txt file.
Which is how big? Currently you (unnecessarily) load the entire thing
into memory with readlines(). And then you do confusing work to split
it apart again, into one list element per paper. And for a while
there, you have three copies of the entire text. You're keeping two
copies, in the form of alltext and papers.
You print out len(papers). What do you see there? Is it 87, as expected? If
it's not, you have to fix the problem here before going any further.
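For example, something like this reads the combined file once as a single
string and keeps only the split result (same path and delimiter as your
script; printing the length is just the sanity check described above):

    import re

    # read once as one string instead of readlines() plus " ".join()
    alltext = open("C:/Users/Art/Desktop/fed/feder16v3.txt").read()
    papers = re.split(r'FEDERALIST No\.', alltext)
    del alltext            # drop the extra copy; only 'papers' is kept
    print len(papers)      # should match the expected number of papers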
I want to prepare a file with a row for each paper and a column for each
term. The cells would contain the count of a term in that paper. In the
original application in the 1950s, 30 single-word terms were used. I can now
use NoteTab to get a list of all 8708 separate words in allWords.txt. I can
then use that data in statistical exploration of the set of texts.
I have the Python program(?) syntax(?) script(?) below that I am using to
learn Python. The comments starting with "later" are things I will try to do
to make this more useful. I am getting one step at a time to work.
It works when the number of terms in the term list is small, e.g., 10: I get
a file with the correct number of rows (87) and count columns (10) in
termcounts.txt. The termcounts.txt file is not correct when I have a larger
number of terms, e.g., 100: I get a file with only 40 rows and the correct
number of columns. With 8700 terms I also get only 40 rows, and I need to be
able to handle about 8700 terms. (If this were FORTRAN I would say that the
subscript indices were getting scrambled.) (As I develop this I would like
to be open-ended with the number of input papers and the number of
words/terms.)
# word counts: Federalist papers
import re, textwrap
# read the combined file and split into individual papers
# later create a new version that deals with all files in a folder
# rather than having papers concatenated
alltext = file("C:/Users/Art/Desktop/fed/feder16v3.txt").readlines()
papers = re.split(r'FEDERALIST No\.', " ".join(alltext))
print len(papers)
countsfile = file("C:/Users/Art/desktop/fed/TermCounts.txt", "w")
syntaxfile = file("C:/Users/Art/desktop/fed/TermCounts.sps", "w")
# later create a python program that extracts all words instead of
# using NoteTab (see the sketch after this script)
termfile = open("C:/Users/Art/Desktop/fed/allWords.txt")
termlist = termfile.readlines()
termlist = [item.rstrip("\n") for item in termlist]
print len(termlist)
# check for SPSS reserved words
varnames = textwrap.wrap(" ".join([v.lower() in ['and', 'or', 'not', 'eq', 'ge',
    'gt', 'le', 'lt', 'ne', 'all', 'by', 'to', 'with'] and (v + "_r") or v
    for v in termlist]))
syntaxfile.write("data list file= 'c:/users/Art/desktop/fed/termcounts.txt' free/docnumber\n")
syntaxfile.writelines([v + "\n" for v in varnames])
syntaxfile.write(".\n")
# before using the syntax, manually replace spaces internal to a string
# with underscores, e.g. replace(ltrim(rtrim(varname)), " ", "_"), and
# replace any special characters in variable names with @
for p in range(len(papers)):

range(len(...)) is un-pythonic. Simply do

    for paper in papers:

and of course use paper below instead of papers[p].

    counts = []
    for t in termlist:
        counts.append(len(re.findall(r"\b" + t + r"\b", papers[p],
                                     re.IGNORECASE)))
    if sum(counts) > 0:
        papernum = re.search("[0-9]+", papers[p]).group(0)
        countsfile.write(str(papernum) + " " +
                         " ".join([str(s) for s in counts]) + "\n")
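A minimal sketch of that "later" word-extraction step (building the word
list in Python instead of with NoteTab), assuming the same feder16v3.txt and
allWords.txt paths used above; the regex and variable names are
illustrative, not tested against the actual files:

    import re

    text = open("C:/Users/Art/Desktop/fed/feder16v3.txt").read()
    # lower-case the text, pull out the alphabetic words, keep each one once
    words = sorted(set(re.findall(r"[a-z]+", text.lower())))
    print len(words)
    wordfile = open("C:/Users/Art/Desktop/fed/allWords.txt", "w")
    wordfile.writelines(w + "\n" for w in words)
    wordfile.close()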
Art
If you're memory limited, you really should sequence through the files, only
loading one at a time, rather than all at once. It's no harder. Use
os.listdir() (or glob.glob()) to make a list of files, then your loop
becomes something like:

for infile in filelist:
    paper = " ".join(open(infile, "r").readlines())
Naturally, to do it right, you should use with... Or at least close
each file when done.
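A minimal sketch of that, assuming the papers have already been split into
one .txt file per paper in a hypothetical folder
C:/Users/Art/Desktop/fed/papers/ (the folder name is illustrative):

    import glob

    filelist = glob.glob("C:/Users/Art/Desktop/fed/papers/*.txt")
    for infile in filelist:
        # 'with' closes the file even if something goes wrong partway through
        with open(infile, "r") as f:
            paper = f.read()
        # ... count the terms in 'paper' and write the counts here ...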
DaveA