[Tutor] Unicode trouble
>> Might the problem only be related to Win32com, not Python since Python prints it without trouble?
>
> That's another issue. First you need to know what you are starting with.
>
> You really should read this:
> The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Kent

Thanks a lot for your help. I did actually get it to work. It didn't have to do with the characters, but the flags that I set for Word. But, I did learn a few things about characters in the process as well.

--
This email has been scanned for viruses spam by Decna as - www.decna.no
Denne e-posten er sjekket for virus spam av Decna as - www.decna.no
___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
[Tutor] Unicode trouble
> Michael Lange wrote:
>> I haven't read all of this thread, but maybe you are trying to pass a non-utf8 string to the utf8 codec?
>
> Yes, I guess that much is pretty clear - there is some data in the source file that is not valid utf-8.

I tried errors='replace' as you suggested and the program made it through the list. However, here are some results:

the gjenoppl�et gjenoppl� from the gjenoppløst det gjenoppløste
kan v� konsentrert from kan være konsentrert

I did check the site http://www.columbia.edu/kermit/utf8.html and the letters that are the problem here are part of utf-8. Is there anything else I could try?

Thanks in advance
Re: [Tutor] Unicode trouble
Øyvind wrote:
> I tried errors='replace' as you suggested and the program made it through the list. However, here are some results:
>
> the gjenoppl�et gjenoppl� from the gjenoppløst det gjenoppløste
> kan v� konsentrert from kan være konsentrert

It seems pretty clear that you are using the wrong encoding somewhere.

> I did check the site http://www.columbia.edu/kermit/utf8.html and the letters that are the problem here are part of utf-8.

That doesn't mean anything. Pretty much every letter used in every natural language of the world is part of Unicode; that's the point of it. utf-8 is just a way to encode Unicode, so it includes all Unicode characters. The important question is, what is the actual encoding of your source data?

> Is there anything else I could try?

Understand why the above question is important, then answer it. Until you do you are just thrashing around in the dark.

Do you know what a character encoding is? Do you understand the difference between utf-8 and latin-1?

Kent
--
http://www.kentsjohnson.com
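To make the utf-8 versus latin-1 distinction concrete, here is a minimal sketch. It is written for Python 3 (the thread itself uses Python 2.3, where the same byte values apply), and the helper name `try_decode` is made up for this illustration. It shows that the same word becomes different bytes under each encoding, and that bytes written as latin-1 are not valid utf-8.

```python
# Python 3 sketch; "være" is one of the words quoted in the thread.
word = "være"

latin1_bytes = word.encode("latin-1")  # b'v\xe6re': one byte per character
utf8_bytes = word.encode("utf-8")      # b'v\xc3\xa6re': 'æ' takes two bytes


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if the bytes
    are not valid in the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# latin-1 bytes read as utf-8: 0xE6 announces a multi-byte sequence that
# never arrives, so strict decoding fails.
print(try_decode(latin1_bytes, "utf-8"))

# utf-8 bytes read as latin-1: no error, but the text comes out garbled,
# because latin-1 accepts every byte value.
print(try_decode(utf8_bytes, "latin-1"))
```

The second case is the more dangerous one: decoding with the wrong encoding does not always raise an error, so "it ran without complaint" is not proof that the encoding was right.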
[Tutor] Unicode trouble
> The important question is, what is the actual encoding of your source data?
>
>> Is there anything else I could try?
>
> Understand why the above question is important, then answer it. Until you do you are just thrashing around in the dark.

The source is a text document that as far as I know only contains English and Norwegian letters. It can be opened with Notepad and Excel. I tried to run through it in Python with:

    f = open('c://file.txt')
    for i in f:
        print i

and that doesn't seem to give any problem. It prints all characters without any trouble.

How would I find what encoding the document is in? All I can find is by opening Notepad, selecting Font/Script, and it says 'Western'.

Might the problem only be related to Win32com, not Python, since Python prints it without trouble?

> Do you know what a character encoding is? Do you understand the difference between utf-8 and latin-1?

Earlier characters had values 1-255. (Ascii). Now, you have a wider choice. In our part of the world we can use an extended version which contains a lot more, latin-1. UTF-8 is a part of Unicode and contains a lot more characters than Ascii.

My knowledge about character encoding doesn't go much farther than this. Simply said, I understand that the document that I want to read includes characters beyond Ascii, and therefore I need to use UTF-8 or Latin-1. Why I should use one instead of the other, I have no idea.
Re: [Tutor] Unicode trouble
Øyvind wrote:
>> The important question is, what is the actual encoding of your source data?
>>
>>> Is there anything else I could try?
>>
>> Understand why the above question is important, then answer it. Until you do you are just thrashing around in the dark.
>
> The source is a text document that as far as I know only contains English and Norwegian letters. It can be opened with Notepad and Excel. I tried to run through it in Python with:
>
>     f = open('c://file.txt')
>     for i in f:
>         print i
>
> and that doesn't seem to give any problem. It prints all characters without any trouble.

That doesn't narrow it down much, though it does point towards latin-1 (or cp1252).

> How would I find what encoding the document is in? All I can find is by opening Notepad, selecting Font/Script, and it says 'Western'.

That doesn't really mean anything about the doc. Try opening the file in your browser. Most browsers have an encoding menu (View / Character Encoding in Firefox, View / Encoding in IE). Find the selection in this menu that makes the text display correctly; that's the encoding of the file.

> Might the problem only be related to Win32com, not Python, since Python prints it without trouble?

That's another issue. First you need to know what you are starting with.

>> Do you know what a character encoding is? Do you understand the difference between utf-8 and latin-1?
>
> Earlier characters had values 1-255. (Ascii). Now, you have a wider choice. In our part of the world we can use an extended version which contains a lot more, latin-1. UTF-8 is a part of Unicode and contains a lot more characters than Ascii.
>
> My knowledge about character encoding doesn't go much farther than this. Simply said, I understand that the document that I want to read includes characters beyond Ascii, and therefore I need to use UTF-8 or Latin-1. Why I should use one instead of the other, I have no idea.

You really should read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Kent
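Besides the browser menu, a rough programmatic check is to read the raw bytes and see which candidate encodings decode them without error. The following is only a sketch for Python 3; the function name `guess_encoding` and the candidate list are assumptions for illustration, not something from the thread. Note its limits: latin-1 accepts every byte sequence and cp1252 nearly every one, so a successful decode only rules encodings out; it never proves one.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return the first candidate encoding that decodes
    the bytes without error, or None if none does.

    The order matters: strict encodings (ascii, utf-8) come first because
    they reject most data that was written in a different encoding.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# One line in the word-list layout described in the thread (word, tab, word).
sample = "core\tkjærne\n"

print(guess_encoding(sample.encode("cp1252")))  # not valid ascii or utf-8
print(guess_encoding(sample.encode("utf-8")))   # valid utf-8
```

For a real file, the bytes would come from `open(path, "rb").read()`. A result of "utf-8" is fairly strong evidence, since random non-ASCII latin-1 text rarely happens to be valid utf-8; a result of "cp1252" or "latin-1" still needs a visual check of the decoded text.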
[Tutor] Unicode trouble
Hello.

I am writing a program that reads in a text file, extracts each of the words and replaces a different document with the words. It works great until it encounters a non-English letter.

I have tried the following:

    self.f = codecs.open(ordliste, 'r', 'utf-8')

where I open the first file. And

    en = unicode(en)
    en = en.encode('utf-8')

as well as

    en = en.decode('iso-8859-1')

where each word is entered from the document.

But, still, I get this error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17: ordinal not in range(128)

As well as this:

    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data

if I skip the second part.

What is wrong? How can I fix this? I am using ActiveState Python 2.3 and WinXp.

Thanks in advance...

This is the whole source:

    from win32com.client import Dispatch
    import time
    import codecs

    class oversett:

        def __init__(self, ordliste, dokument):
            objWord = Dispatch('Word.Application')
            self.f = codecs.open(ordliste, 'r', 'utf-8')
            #self.f = open(ordliste)
            objDoc = objWord.Documents.Open(dokument)
            self.objSelection = objWord.Selection

        def kjor(self):
            s = time.clock()
            wdReplaceAll = 2
            wdFindContinue = 1
            t = 1
            for i in self.f.readlines():
                en = i.split('\t')[0]
                #en = str(en).decode('iso-8859-1')
                #en = en.decode('iso-8859-1')
                en = unicode(en)
                en = en.encode('utf-8')
                print en
                to = i.split('\t')[1]
                #to = str(to).decode('iso-8859-1')
                #to = to.decode('iso-8859-1')
                to = unicode(to)
                to = to.encode('utf-8')
                t = t + 1
                if t % 1000 == 0:
                    print t
                try:
                    self.objSelection.Find.Execute(en, False, True, False, False, True, True, wdFindContinue, True, to, wdReplaceAll, False, False, False, False)
                except UnicodeEncodeError:
                    print 'pokker'
                except:
                    pass
            print time.clock() - s

    if __name__ == '__main__':
        n = oversett('c:/ordliste.txt','c:/foo.doc')
        n.kjor()
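One way to avoid the repeated decode and encode steps in the loop is to decode the word list exactly once, when the file is opened, and work with text strings from then on. The sketch below is for Python 3 and deliberately leaves out the Word automation. The function name `read_word_pairs` is made up here, and the default encoding of cp1252 is an assumption: the thread never establishes what the file's encoding actually is.

```python
import io


def read_word_pairs(path, encoding="cp1252"):
    """Hypothetical helper: read tab-separated 'source<TAB>target' lines and
    return a list of (source, target) text pairs.

    encoding is an assumption; pass whatever the file really uses.
    """
    pairs = []
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            # Strip the line ending so it does not end up inside the
            # replacement word.
            line = line.rstrip("\r\n")
            if not line:
                continue
            source, _, target = line.partition("\t")
            pairs.append((source, target))
    return pairs
```

Each pair is then ordinary text with no further `unicode()` or `.encode()` calls needed. In the posted script, `i.split('\t')[1]` keeps the trailing newline from `readlines()`, which is a separate thing worth checking.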
Re: [Tutor] Unicode trouble
Øyvind wrote:
> Hello.
>
> I am writing a program that reads in a text file, extracts each of the words and replaces a different document with the words. It works great until it encounters a non-English letter.
>
> I have tried the following:
>
>     self.f = codecs.open(ordliste, 'r', 'utf-8')
>
> where I open the first file. And
>
>     en = unicode(en)
>     en = en.encode('utf-8')
>
> as well as
>
>     en = en.decode('iso-8859-1')
>
> where each word is entered from the document.
>
> But, still, I get this error:
>
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17: ordinal not in range(128)
>
> As well as this:
>
>     UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data
>
> if I skip the second part.

Where are you getting these errors (what line of the program)? Do you know what kind of strings objSelection.Find.Execute() is expecting?

Kent
[Tutor] Unicode trouble
> Where are you getting these errors (what line of the program)? Do you know what kind of strings objSelection.Find.Execute() is expecting?
>
> Kent

The program stops working and gives me these errors when it encounters a non-English letter.

This is the full error:

    Traceback (most recent call last):
      File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
        exec codeObject in __main__.__dict__
      File "C:\Python\BA\Oversett.py", line 47, in ?
      File "C:\Python\BA\Oversett.py", line 23, in kjor
        en = i.split('\t')[0]
      File "C:\Python23\lib\codecs.py", line 388, in readlines
        return self.reader.readlines(sizehint)
      File "C:\Python23\lib\codecs.py", line 314, in readlines
        return self.decode(data, self.errors)[0].splitlines(1)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data

and

    Traceback (most recent call last):
      File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
        exec codeObject in __main__.__dict__
      File "C:\Python\BA\Oversett.py", line 49, in ?
      File "C:\Python\BA\Oversett.py", line 33, in kjor
        if t % 1000 == 0:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17: ordinal not in range(128)

objSelection.Find.Execute() is supposed to accept any kind of string. (It is the Search and Replace function in MS Word.)
Re: [Tutor] Unicode trouble
Øyvind wrote:
>> Where are you getting these errors (what line of the program)? Do you know what kind of strings objSelection.Find.Execute() is expecting?
>>
>> Kent
>
> The program stops working and gives me these errors when it encounters a non-English letter.
>
> This is the full error:
>
>     Traceback (most recent call last):
>       File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>         exec codeObject in __main__.__dict__
>       File "C:\Python\BA\Oversett.py", line 47, in ?
>       File "C:\Python\BA\Oversett.py", line 23, in kjor
>         en = i.split('\t')[0]
>       File "C:\Python23\lib\codecs.py", line 388, in readlines
>         return self.reader.readlines(sizehint)
>       File "C:\Python23\lib\codecs.py", line 314, in readlines
>         return self.decode(data, self.errors)[0].splitlines(1)
>     UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data

This is fairly strange, as the line

    en = i.split('\t')[0]

should not call any method in codecs. I don't know how you can get such a stack trace. Maybe try deleting all the .pyc files to make sure they are in sync with the source and try again?

The actual error indicates that the input data is not valid utf-8. Are you sure that is the correct encoding for the input file? If the file is utf-8 and has bad characters, you could pass errors='ignore' or errors='replace' as a parameter to codecs.open() to change the error handling style to something more forgiving.

> and
>
>     Traceback (most recent call last):
>       File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>         exec codeObject in __main__.__dict__
>       File "C:\Python\BA\Oversett.py", line 49, in ?
>       File "C:\Python\BA\Oversett.py", line 33, in kjor
>         if t % 1000 == 0:
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17: ordinal not in range(128)

Again this stack trace doesn't make sense; the indicated line doesn't do any string operation. This error message normally occurs when a non-ascii byte string is converted to unicode using the default encoding (which is 'ascii'). Often the conversion is implicit in some other operation, but I don't see any such operation here.

> objSelection.Find.Execute() is supposed to accept any kind of string. (It is the Search and Replace function in MS Word.)

It has to make some assumption about the type of the string. Does it want unicode or encoded bytes? If encoded bytes, what encoding does it expect?

Kent
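The three error-handling styles and the 'ascii' failure can be seen side by side in a short sketch. This is Python 3, where decoding is always explicit; in Python 2, calling `unicode()` on a byte string without naming an encoding performs the same 'ascii' decode implicitly. The byte string is just the thread's example word "kjærne" encoded as latin-1.

```python
raw = "kjærne".encode("latin-1")  # b'kj\xe6rne', not valid utf-8

# Strict decoding (the default) raises on the first invalid byte.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = "failed: %s" % exc.reason

# 'replace' substitutes U+FFFD for the undecodable byte; 'ignore' drops it.
# Either way the original letter is lost, so these only hide the problem.
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

# Decoding as 'ascii' fails the same way the second traceback does:
# 0xE6 is outside range(128).
try:
    raw.decode("ascii")
    ascii_result = "decoded"
except UnicodeDecodeError as exc:
    ascii_result = "failed: %s" % exc.reason

# With the encoding the bytes were actually written in, nothing is lost.
correct = raw.decode("latin-1")

print(strict_result, repr(replaced), repr(ignored), ascii_result, correct)
```

Since 'replace' and 'ignore' both discard data, they are mainly useful for a file that really is utf-8 with a few damaged bytes, not for a file that is in another encoding altogether.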
[Tutor] Unicode trouble
> Øyvind wrote:
>>> Where are you getting these errors (what line of the program)? Do you know what kind of strings objSelection.Find.Execute() is expecting?
>>>
>>> Kent
>>
>> The program stops working and gives me these errors when it encounters a non-English letter.
>>
>> This is the full error:
>>
>>     Traceback (most recent call last):
>>       File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>>         exec codeObject in __main__.__dict__
>>       File "C:\Python\BA\Oversett.py", line 47, in ?
>>       File "C:\Python\BA\Oversett.py", line 23, in kjor
>>         en = i.split('\t')[0]
>>       File "C:\Python23\lib\codecs.py", line 388, in readlines
>>         return self.reader.readlines(sizehint)
>>       File "C:\Python23\lib\codecs.py", line 314, in readlines
>>         return self.decode(data, self.errors)[0].splitlines(1)
>>     UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data
>
> This is fairly strange, as the line
>     en = i.split('\t')[0]
> should not call any method in codecs. I don't know how you can get such a stack trace.

The file f where en comes from does contain lots of lines with one English word followed by a tab and a Norwegian one (approximately 25000 lines). It can look like this:

    core\tkjærne

So en is supposed to be the English word that the program needs to find in MS Word, and to is the replacement word. So wouldn't that be a string that should be handled by codecs?

    for i in self.f.readlines():
        en = i.split('\t')[0]

> Maybe try deleting all the .pyc files to make sure they are in sync with the source and try again?

This didn't seem to help.

> The actual error indicates that the input data is not valid utf-8. Are you sure that is the correct encoding for the input file? If the file is utf-8 and has bad characters, you could pass errors='ignore' or errors='replace' as a parameter to codecs.open() to change the error handling style to something more forgiving.

Is not valid utf-8? I have tried with latin-1 as well, to no avail. The letters that are the problem are æøå. They shouldn't be that exotic?

>>     Traceback (most recent call last):
>>       File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>>         exec codeObject in __main__.__dict__
>>       File "C:\Python\BA\Oversett.py", line 49, in ?
>>       File "C:\Python\BA\Oversett.py", line 33, in kjor
>>         if t % 1000 == 0:
>>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17: ordinal not in range(128)
>
> Again this stack trace doesn't make sense; the indicated line doesn't do any string operation. This error message normally occurs when a non-ascii byte string is converted to unicode using the default encoding (which is 'ascii'). Often the conversion is implicit in some other operation, but I don't see any such operation here.

But regardless, shouldn't 'ascii' be excluded here? Since I tell the program to change to utf-8, not only once but twice?

>> objSelection.Find.Execute() is supposed to accept any kind of string. (It is the Search and Replace function in MS Word.)
>
> It has to make some assumption about the type of the string. Does it want unicode or encoded bytes? If encoded bytes, what encoding does it expect?

I think the letters should be accepted. The Python script here is written to replace about 25000 MS Word macros, so all the letters have been accepted by MS Word when fed by Visual Basic. All I have done now is to extract the words from the macros and put them in a file.
Re: [Tutor] Unicode trouble
On Wed, 30 Nov 2005 13:41:54 -0500
Kent Johnson [EMAIL PROTECTED] wrote:

>>>> This is the full error:
>>>>
>>>>     Traceback (most recent call last):
>>>>       File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>>>>         exec codeObject in __main__.__dict__
>>>>       File "C:\Python\BA\Oversett.py", line 47, in ?
>>>>       File "C:\Python\BA\Oversett.py", line 23, in kjor
>>>>         en = i.split('\t')[0]
>>>>       File "C:\Python23\lib\codecs.py", line 388, in readlines
>>>>         return self.reader.readlines(sizehint)
>>>>       File "C:\Python23\lib\codecs.py", line 314, in readlines
>>>>         return self.decode(data, self.errors)[0].splitlines(1)
>>>>     UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data
>>>
>>> This is fairly strange, as the line
>>>     en = i.split('\t')[0]
>>> should not call any method in codecs. I don't know how you can get such a stack trace.
>>
>> The file f where en comes from does contain lots of lines with one english word followed by a tab and a norwegian one. (Approximately 25000 lines) It can look like this:
>>
>>     core\tkjærne
>
> Yes, I understand that.
>
>> So en is supposed to be the english word that the program need to find in MS Word, and to is the replacement word. So wouldn't that be a string that should be handeled by codecs?
>>
>>     for i in self.f.readlines():
>>         en = i.split('\t')[0]
>
> The thing is, it's the line
>     for i in self.f.readlines():
> that is calling the codecs module, not the line
>     en = i.split('\t')[0]
> but it is the latter line that is in the stack trace.
>
> Can any of the other tutors make any sense of this stack trace?

As far as I see here, isn't the line

    return self.decode(data, self.errors)[0].splitlines(1)

causing the traceback?

I haven't read all of this thread, but maybe you are trying to pass a non-utf8 string to the utf8 codec?

Michael
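The point that decoding happens while the file is being read, not when a line is split afterwards, can be checked with a small sketch. This is Python 3, using `io.open` rather than the `codecs.open` call from the thread, and the function name `failing_stage` is made up for this illustration.

```python
import io


def failing_stage(path, encoding="utf-8"):
    """Hypothetical helper: report where decoding trouble shows up.

    Returns "read" if the file cannot be decoded while it is being read,
    or "ok" if every line was read and split without error.
    """
    try:
        with io.open(path, "r", encoding=encoding) as handle:
            lines = handle.readlines()  # bytes are decoded here
    except UnicodeDecodeError:
        return "read"
    for line in lines:
        # Splitting already-decoded text involves no codec at all.
        line.split("\t")
    return "ok"
```

For a file written as latin-1 and opened as utf-8, this reports "read": the exception comes out of `readlines()`, which matches the `codecs.py` frames at the bottom of the first traceback.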
Re: [Tutor] Unicode trouble
Michael Lange wrote:
> On Wed, 30 Nov 2005 13:41:54 -0500
> Kent Johnson [EMAIL PROTECTED] wrote:
>
>>> This is the full error:
>>>
>>>     Traceback (most recent call last):
>>>       File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>>>         exec codeObject in __main__.__dict__
>>>       File "C:\Python\BA\Oversett.py", line 47, in ?
>>>       File "C:\Python\BA\Oversett.py", line 23, in kjor
>>>         en = i.split('\t')[0]
>>>       File "C:\Python23\lib\codecs.py", line 388, in readlines
>>>         return self.reader.readlines(sizehint)
>>>       File "C:\Python23\lib\codecs.py", line 314, in readlines
>>>         return self.decode(data, self.errors)[0].splitlines(1)
>>>     UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data
>>
>> This is fairly strange, as the line
>>     en = i.split('\t')[0]
>> should not call any method in codecs. I don't know how you can get such a stack trace.
>>
>> Can any of the other tutors make any sense of this stack trace?
>
> As far as I see here, isn't the line
>
>     return self.decode(data, self.errors)[0].splitlines(1)
>
> causing the traceback?
>
> I haven't read all of this thread, but maybe you are trying to pass a non-utf8 string to the utf8 codec?

Yes, I guess that much is pretty clear - there is some data in the source file that is not valid utf-8.

Kent