On 5/6/2010 8:52 PM, Dave Angel wrote:


I got my own copy of the papers, at http://thomas.loc.gov/home/histdox/fedpaper.txt

I copied your code, and added logic to it to initialize termlist from the actual file. And it does complete the output file at 83 lines, approx 17000 columns per line (because most counts are one digit). It takes quite a while, and perhaps you weren't waiting for it to complete. I'd suggest either adding a print to the loop, showing the count, and/or adding a line that prints "done" after the loop terminates normally.

I watched memory usage, and as expected, it didn't get very high. There are things you need to redesign, however. One is that all the punctuation and digits and such need to be converted to spaces.


DaveA



Thank you for going the extra mile.

I obtained my copy before I retired in 2001 and there are some differences. In the current copy from the LOC, papers 7, 63, and 81 start with "FEDERALIST." (with an extra period). That explains why you have 83. There are also some comments, such as the attributed author. After the weekend, I'll do a file compare and see the differences in more detail.

Please email me your version of the code. I'll try it as is. Then I'll put in a counter, have it print the count and paper number, and a 'done' message.
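
Roughly what I have in mind for the counter and the 'done' message is sketched below. This is not the code under discussion: count_terms, papers, and termlist are placeholder names, and it assumes each paper is already a single string of space-separated words.

```python
def count_terms(papers, termlist):
    """Return one row of counts per paper, reporting progress as it goes."""
    rows = []
    for number, text in enumerate(papers, start=1):
        words = text.split()
        # One count per term, in termlist order.
        rows.append([words.count(term) for term in termlist])
        # Progress line, so a slow run is not mistaken for a hang.
        print("processed paper %d of %d" % (number, len(papers)))
    # Only reached when the loop finishes normally.
    print("done")
    return rows
```

Calling words.count() once per term is slow with a long termlist, which may be part of why the run takes a while; building a dictionary of counts per paper would likely be quicker.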

As a check after reading in the counts, I'll include the counts from NoteTab and see if these counts sum to those from NoteTab.
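
A minimal sketch of that check, assuming the counts are held as one list per paper (in termlist order) and that the NoteTab totals have been keyed into a dictionary by hand. The names column_totals, mismatches, and reference are made up for illustration.

```python
def column_totals(rows):
    """Sum each term's counts over all papers (rows are per-paper lists)."""
    return [sum(column) for column in zip(*rows)]

def mismatches(termlist, rows, reference):
    """List (term, computed total, reference total) wherever they disagree.

    reference is assumed to map each term to its corpus-wide count;
    a term missing from reference shows up with None as its reference total.
    """
    totals = column_totals(rows)
    return [(term, total, reference.get(term))
            for term, total in zip(termlist, totals)
            if reference.get(term) != total]
```

An empty list back from mismatches() would mean every total agrees with the reference.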

I'll use SPSS to create a version of the .txt file with punctuation and numerals changed to spaces and try using that as the corpus. Then I'll try to create a similar file with Python.
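
For the Python version, something along these lines might do. It is only a sketch: letters_only and clean_file are hypothetical names, the file paths are whatever the caller supplies, and it treats every character other than the unaccented letters A-Z and a-z as something to blank out.

```python
import re

# Any run of characters that are not plain ASCII letters.
NON_LETTERS = re.compile(r"[^A-Za-z]+")

def letters_only(text):
    """Replace each run of punctuation, digits, and whitespace with one space."""
    return NON_LETTERS.sub(" ", text)

def clean_file(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path (both paths are placeholders)."""
    with open(src_path) as src:
        cleaned = letters_only(src.read())
    with open(dst_path, "w") as dst:
        dst.write(cleaned)
```

Note that this also joins the lines together and splits words at apostrophes and hyphens ("people's" becomes "people s"), so the counts for such terms would need a second look.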

Art
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
