[Tutor] Can't transform a list of tokens into a text

Eduardo Vieira Wed, 15 Jul 2009 19:34:19 -0700

Hello, I have a file that resulted from a POS-tagging program. After
some transformations, I wanted to restore it to its normal form.
So, I used sed to remove the POS tags and now have something like this:


--- Example begins
No
,
thank...@+  # I inserted this +...@+ to mark paragraphs, because the
POS-Tagger doesn't keep paragraph marks
He
said
,
"
No
,
thanks
.
"
OK
?
--- Example ends
--- And I want to transform it into this:
No, thanks.
He said, "No, thanks." OK?
--- End example
I tried different approaches, but dealing with the quote marks is what
keeps defeating my attempts.

I originally had this code:
import cStringIO
##import string
myoutput = cStringIO.StringIO()
f = open(r'C:\mytools\supersed\tradunew.txt')
lista = [line.strip() for line in f]
punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
for i, item in enumerate(lista):

    if item == '"' and lista[i + 1] not in punct:
        myoutput.write(item)
        spacer = True


    elif '+...@+' in item:
        donewline = item.replace('+...@+','\n ')
        myoutput.write(donewline)
    elif item not in punct and lista[i + 1] in punct:
        myoutput.write(item)
    elif item in punct and lista[i + 1] in punct:
        myoutput.write(item)
    elif item in punct and lista[i + 1] == '"' and spacer:
        myoutput.write(item)
        spacer = False
    elif item not in punct and lista[i + 1] == '"' and spacer:
        myoutput.write(item)
        spacer = False
    elif item in '([{“':
        myoutput.write(item)

    else:
        myoutput.write(item + " ")

newlist = myoutput.getvalue().splitlines()



myoutput.close()

f = open(r'C:\mytools\supersed\traducerto-k.txt', 'w')

for line in newlist:
    f.write(line.lstrip()+'\n')
f.close()

#===
I made this version to post to this list, but it gives me an error.
I don't know why I don't get an error with the code above, which is
essentially the same:
# -*- coding: cp1252 -*-

result = ''
lista = [
'No', ',', 'thank...@+',
'He', 'said', ',', '"', 'no', ',', 'thanks', '.', '"', 'OK', '?', 'Hi']

punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
for i, item in enumerate(lista):

    if item == '"' and lista[i + 1] not in punct:
        result +=item
        spacer = True
    elif '+...@+' in item:
        donewline = item.replace('+...@+','\n ')
        result += donewline
    elif item not in punct and lista[i + 1] in punct:
        result += item
    elif item in punct and lista[i + 1] in punct:
        result += item
    elif item in punct and lista[i + 1] == '"' and spacer:
        result += item
        spacer = False
    elif item not in punct and lista[i + 1] == '"' and spacer:
        result += item
        spacer = False
    elif item in '([{“':
        result += item
    else:
        result += (item + " ")

print result

#==
The error is this:
Traceback (most recent call last):
  File "<string>", line 244, in run_nodebug
  File "C:\mytools\jointags-v4.py", line 17, in <module>
    elif item not in punct and lista[i + 1] in punct:
IndexError: list index out of range
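
The traceback above points at the lista[i + 1] lookahead: when i reaches
the last token ('Hi' in the inline list), there is no next element to
index. As a minimal sketch of one way around that, the version below
never looks ahead; it only remembers whether the next token should be
glued to the text so far, and whether a straight quote is currently
open. The function name join_tokens, the CLOSERS/OPENERS sets, and the
'<P>' paragraph marker are all hypothetical stand-ins for illustration
(the real marker and token layout in the file may differ), and the
paired-quote logic assumes every opening '"' has a matching close.

```python
# Hypothetical punctuation classes; adjust them to the real data.
CLOSERS = set('!%),.:;?]}')   # attach to the previous token
OPENERS = set('([{')          # attach to the following token


def join_tokens(tokens, marker):
    """Rebuild text from tokens without indexing tokens[i + 1].

    `marker` is assumed to be a non-empty string appended to the last
    token of each paragraph.
    """
    paragraphs = []
    out = []          # pieces of the paragraph being built
    glue = True       # True: next token is written without a leading space
    in_quote = False  # True: a straight double quote is currently open

    for tok in tokens:
        end_par = tok.endswith(marker)
        if end_par:
            tok = tok[:-len(marker)]

        if tok == '"':
            if in_quote:
                # Closing quote hugs the previous token.
                out.append(tok)
                glue = False
            else:
                # Opening quote: space before it, none after it.
                if not glue:
                    out.append(' ')
                out.append(tok)
                glue = True
            in_quote = not in_quote
        elif tok in CLOSERS:
            out.append(tok)
            glue = False
        elif tok:
            # Ordinary words and opening brackets get a space before them;
            # only opening brackets suppress the space after them.
            if not glue:
                out.append(' ')
            out.append(tok)
            glue = tok in OPENERS

        if end_par:
            paragraphs.append(''.join(out))
            out, glue, in_quote = [], True, False

    if out:
        paragraphs.append(''.join(out))
    return '\n'.join(paragraphs)


tokens = ['No', ',', 'thanks', '.<P>',
          'He', 'said', ',', '"', 'No', ',', 'thanks', '.', '"', 'OK', '?']
text = join_tokens(tokens, '<P>')
# text == 'No, thanks.\nHe said, "No, thanks." OK?'
```

Using sets for the punctuation classes also sidesteps a side effect of
testing `item in punct` against a string, where an empty token (from a
blank line) counts as being "in" the string.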

I'm using python 2.6.2 with PyScripter IDE
I have tried so many variations that I'm not sure what I'm doing any more.
I'm just trying to avoid some post-processing with sed again.

Thankful,

Eduardo
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
