Hello,

I have a file that resulted from a POS-tagging program. After some transformations, I wanted to restore it to its normal form, so I used sed to remove the POS tags and now have something like this:

--- Example begins
No , thank...@+   # I inserted this +...@+ to mark paragraphs, because the POS tagger doesn't keep paragraph marks
He said , " No , thanks . "
OK ?
--- Example ends ---

And I want to transform it into this:

No, thanks.
He said, "No, thanks."
OK?
--- End example

I tried different approaches, and dealing with the quote marks is what keeps defeating my attempts. I originally had this code:

    import cStringIO
    ##import string

    myoutput = cStringIO.StringIO()
    f = open(r'C:\mytools\supersed\tradunew.txt')
    lista = [line.strip() for line in f]
    punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
    for i, item in enumerate(lista):
        if item == '"' and lista[i + 1] not in punct:
            myoutput.write(item)
            spacer = True
        elif '+...@+' in item:
            donewline = item.replace('+...@+','\n ')
            myoutput.write(donewline)
        elif item not in punct and lista[i + 1] in punct:
            myoutput.write(item)
        elif item in punct and lista[i + 1] in punct:
            myoutput.write(item)
        elif item in punct and lista[i + 1] == '"' and spacer:
            myoutput.write(item)
            spacer = False
        elif item not in punct and lista[i + 1] == '"' and spacer:
            myoutput.write(item)
            spacer = False
        elif item in '([{“':
            myoutput.write(item)
        else:
            myoutput.write(item + " ")

    newlist = myoutput.getvalue().splitlines()
    myoutput.close()

    f = open(r'C:\mytools\supersed\traducerto-k.txt', 'w')
    for line in newlist:
        f.write(line.lstrip()+'\n')
    f.close()

#===

I tried to make this version to post in this forum, but it gives me an error.
I don't know why I don't get an error with the code above, which is essentially the same:

    # -*- coding: cp1252 -*-

    result = ''
    lista = [
        'No', ',', 'thank...@+',
        'He', 'said', ',', '"', 'no', ',', 'thanks', '.', '"',
        'OK', '?', 'Hi']
    punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
    for i, item in enumerate(lista):
        if item == '"' and lista[i + 1] not in punct:
            result +=item
            spacer = True
        elif '+...@+' in item:
            donewline = item.replace('+...@+','\n ')
            result += donewline
        elif item not in punct and lista[i + 1] in punct:
            result += item
        elif item in punct and lista[i + 1] in punct:
            result += item
        elif item in punct and lista[i + 1] == '"' and spacer:
            result += item
            spacer = False
        elif item not in punct and lista[i + 1] == '"' and spacer:
            result += item
            spacer = False
        elif item in '([{“':
            result += item
        else:
            result += (item + " ")
    print result

#==

The error is this:

    Traceback (most recent call last):
      File "<string>", line 244, in run_nodebug
      File "C:\mytools\jointags-v4.py", line 17, in <module>
        elif item not in punct and lista[i + 1] in punct:
    IndexError: list index out of range

I'm using Python 2.6.2 with the PyScripter IDE.

I have tried so many variations that I'm not sure what I'm doing any more... I'm just trying to avoid some post-processing with sed again.

Thanks,
Eduardo
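
A note on the traceback: the IndexError comes from the lookahead `lista[i + 1]`. On the last item ('Hi'), `i + 1` equals `len(lista)`, so that index is out of range. The file-based script presumably avoids this only because its last line takes a branch that never looks ahead, for example a line containing the paragraph marker (this is a guess, since the file contents aren't shown). Below is a minimal sketch of one way around the problem. It is not the code from the post: it swaps the lookahead for a look-behind flag, so no index past the end is ever touched. The names `join_tokens`, `CLOSING`, and `OPENING` are made up for this illustration, the punctuation sets are a shortened assumption rather than the full `punct` string, and paragraph markers are left out.

```python
# Hypothetical helper names; the punctuation sets are an assumed subset.
CLOSING = set('!%),.:;?]}')   # tokens that attach to the previous token
OPENING = set('([{')          # tokens that attach to the next token


def join_tokens(tokens):
    """Join tokens into text without ever indexing tokens[i + 1]."""
    pieces = []
    in_quote = False   # True between an opening and a closing straight quote
    glue_next = True   # True when the next token must attach without a space
    for tok in tokens:
        if tok == '"':
            if in_quote:
                # Closing quote: attach it to the previous token.
                pieces.append(tok)
                glue_next = False
            else:
                # Opening quote: space before it (unless at the start),
                # no space after it.
                if not glue_next:
                    pieces.append(' ')
                pieces.append(tok)
                glue_next = True
            in_quote = not in_quote
        elif tok in CLOSING:
            # Closing punctuation never gets a space in front of it.
            pieces.append(tok)
            glue_next = False
        else:
            if not glue_next:
                pieces.append(' ')
            pieces.append(tok)
            # An opening bracket glues itself to whatever follows.
            glue_next = tok in OPENING
    return ''.join(pieces)


tokens = ['He', 'said', ',', '"', 'No', ',', 'thanks', '.', '"',
          'OK', '?', 'Hi']
print(join_tokens(tokens))   # He said, "No, thanks." OK? Hi
```

Using sets instead of a string for the punctuation also sidesteps a subtlety of `item in punct` with strings: that is a substring test, so an empty token (from a blank line) counts as "in" the string.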