Hi,

I'm relatively new to programming in general, and totally new to Python,
and I've been told that this language is particularly good for what I
need to do. Let me explain.
I have a large corpus of English text, in the form of several files.

First of all I would like to scan each file. Then, for each word I find,
I'd like to examine its case status and write the lowercased word to
another text file, with a tag appended stating the case it had in the
original file.

An example. Suppose we have three possible "case conditions":
-all lowercase
-all uppercase
-initial uppercase only

Three corresponding tags for each of these might be, respectively:
-nocap
-allcaps
-cap
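
A minimal sketch of how these three conditions might be tested with Python's built-in string methods. The name check_case is hypothetical, and the choice to tag anything that fits none of the three conditions (mixed case such as "McDonald", or a token with no letters) as nocap is an assumption, not part of the description above.

```python
def check_case(word):
    """Return 'allcaps', 'cap' or 'nocap' for a single word (hypothetical helper)."""
    # isupper() is True when every cased character is uppercase,
    # so a one-letter word such as "A" is tagged allcaps here.
    if word.isupper():
        return "allcaps"
    # Initial uppercase only: first character upper, the rest lower.
    if word[:1].isupper() and word[1:].islower():
        return "cap"
    # Assumption: everything else (all lowercase, mixed case, digits) is nocap.
    return "nocap"
```

The whole-word methods avoid an explicit loop over characters; whether a fourth tag for mixed case is needed depends on the corpus.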

Therefore, given the string

"The Chairman of BP was asleep"

I would like to produce

"the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap"

and write this into a file.


I have the following algorithm in mind:

-open input file
-open output file
-for each line of text
        -split line into words
        -for each word
                -tag = checkCase(word)
                -newword = lowercase(word) + "/" + tag
        -rejoin words into line
        -write line into output file
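
The steps above could be sketched roughly as follows. The function names check_case, tag_line and tag_file are hypothetical, files are assumed to be plain text readable with the default encoding, and tokens that match none of the three case conditions are assumed to be tagged nocap.

```python
def check_case(word):
    # Hypothetical helper: map a word to one of the three tags.
    if word.isupper():
        return "allcaps"
    if word[:1].isupper() and word[1:].islower():
        return "cap"
    return "nocap"  # assumption: fallback for everything else


def tag_line(line):
    # split() with no argument splits on any run of whitespace;
    # " ".join(...) then rebuilds the line from the tagged words.
    return " ".join(word.lower() + "/" + check_case(word)
                    for word in line.split())


def tag_file(in_path, out_path):
    # Read one line at a time so a large file never has to fit in memory.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(tag_line(line) + "\n")
```

For instance, tag_line("The Chairman of BP was asleep") gives "the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap". Punctuation stays attached to its word in this sketch, since splitting is on whitespace only.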

Now, I managed to write the following initial code:

    lines = 0
    for s in file:
        lines += 1
        if lines % 1000 == 0:
            print('%d lines' % lines)  # print the running line count
        sent = s.split()  # split the line on whitespace
        #...


But then I don't quite know what the fastest/best way to do this would be. Could I use the join function to re-form the string? And, regarding the checkCase() function, what do you suggest? Should I test each character of each word, or are there faster methods?

Thanks very much,

F.



--
http://mail.python.org/mailman/listinfo/python-list
