[Tutor] Unicode trouble

2005-12-04 Thread Øyvind

>> Might the problem only be related to Win32com, not Python since Python
>> prints it without trouble?
>
> That's another issue. First you need to know what you are starting with.
>
> You really should read this:
> The Absolute Minimum Every Software Developer Absolutely, Positively Must
> Know About Unicode and Character Sets (No Excuses!)
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Kent

Thanks a lot for your help. I did actually get it to work. The problem
was not the characters but the flags that I set for Word. Still, I did
learn a few things about characters in the process.


-- 
This email has been scanned for viruses & spam by Decna as - www.decna.no
Denne e-posten er sjekket for virus & spam av Decna as - www.decna.no

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] Unicode trouble

2005-12-01 Thread Øyvind
> Michael Lange wrote:
>
>> I haven't read all of this thread, but maybe you are trying to pass a
>> non-utf8 string to the utf8 codec?
>
> Yes, I guess that much is pretty clear - there is some data in the source
> file that is not valid utf-8.

I tried errors='replace' as you suggested and the program made it through
the list. However, here are some results:

the gjenoppl�et gjenoppl�
from
the gjenoppløst det gjenoppløste

kan v� konsentrert
from
kan være konsentrert

I did check the site http://www.columbia.edu/kermit/utf8.html and the
letters that are the problem here are part of UTF-8.

Is there anything else I could try?

Thanks in advance





Re: [Tutor] Unicode trouble

2005-12-01 Thread Kent Johnson
Øyvind wrote:
> I tried errors='replace' as you suggested and the program made it through
> the list. However, here are some results:
>
> the gjenoppl�et gjenoppl�
> from
> the gjenoppløst   det gjenoppløste
>
> kan v� konsentrert
> from
> kan være konsentrert

It seems pretty clear that you are using the wrong encoding somewhere.
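
The damaged results quoted above are consistent with Latin-1 bytes being decoded as UTF-8. The following is only a minimal sketch of that effect, assuming Python 3 syntax (the thread itself uses Python 2.3); the names raw, wrong and right are illustrative. The Python 2.3 decoder in the thread appears to have swallowed a few following bytes as well, which a current decoder does not do.

```python
# Assumption: Python 3 syntax; the thread itself runs Python 2.3.
# The sample phrase is taken from the results quoted above.
raw = "kan v\u00e6re konsentrert".encode("latin-1")  # bytes as a Latin-1 file stores them

wrong = raw.decode("utf-8", errors="replace")  # wrong codec: U+FFFD marks the damage
right = raw.decode("latin-1")                  # matching codec: the text survives

# ascii() keeps the output printable on any console encoding.
print(ascii(wrong))
print(ascii(right))
```

The only difference between the two results is the codec name, which is why the encoding of the source file is the first thing to establish.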
 
> I did check the site http://www.columbia.edu/kermit/utf8.html and the
> letters that are the problem here are part of UTF-8.

That doesn't mean anything. Pretty much every letter used in every natural
language of the world is part of Unicode; that's the point of it. UTF-8 is
just a way to encode Unicode, so it can represent every Unicode character.
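
To make the distinction between characters and encodings concrete, here is a small sketch (Python 3 syntax is an assumption; the thread uses Python 2.3) that encodes the same three Norwegian letters with both codecs. The variable names are illustrative only.

```python
# Assumption: Python 3 syntax; the thread itself runs Python 2.3.
text = "\u00e6\u00f8\u00e5"  # the letters ae-ligature, o-slash, a-ring

latin1_bytes = text.encode("latin-1")  # one byte per letter
utf8_bytes = text.encode("utf-8")      # two bytes per letter for these characters

print(latin1_bytes)  # b'\xe6\xf8\xe5'
print(utf8_bytes)    # b'\xc3\xa6\xc3\xb8\xc3\xa5'

# Each byte string decodes back to the same text only with its own codec.
assert latin1_bytes.decode("latin-1") == text
assert utf8_bytes.decode("utf-8") == text
```

The same characters give different bytes, so knowing that a letter "is part of UTF-8" says nothing about which bytes are actually in the file.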

The important question is, what is the actual encoding of your source data?

> Is there anything else I could try?

Understand why the above question is important, then answer it. Until you do 
you are just thrashing around in the dark.

Do you know what a character encoding is? Do you understand the difference 
between utf-8 and latin-1?

Kent
-- 
http://www.kentsjohnson.com



[Tutor] Unicode trouble

2005-12-01 Thread Øyvind

> The important question is, what is the actual encoding of your source data?
>
>> Is there anything else I could try?
>
> Understand why the above question is important, then answer it. Until you
> do you are just thrashing around in the dark.

The source is a text document that, as far as I know, only contains English
and Norwegian letters. It can be opened with Notepad and Excel. I tried to
run through it in Python with:

f = open('c:/file.txt')

for i in f:
    print i

and that doesn't seem to give any problem. It prints all characters
without any trouble.

How would I find what encoding the document is in? All I can find is by
opening Notepad, selecting Font/Script and it says 'Western'.

Might the problem only be related to Win32com, not Python since Python
prints it without trouble?

> Do you know what a character encoding is? Do you understand the
> difference between utf-8 and latin-1?

Earlier characters had values 1-255. (Ascii). Now, you have a wider
choice. In our part of the world we can use an extended version which
contains a lot more, latin-1. UTF-8 is a part of Unicode and contains a
lot more characters than Ascii.

My knowledge about character encoding doesn't go much farther than this.
Simply said, I understand that the document that I want to read includes
characters beyond Ascii, and therefore I need to use UTF-8 or Latin-1. Why
I should use one instead of the other, I have no idea.





Re: [Tutor] Unicode trouble

2005-12-01 Thread Kent Johnson
Øyvind wrote:
>> The important question is, what is the actual encoding of your source data?
>>
>>> Is there anything else I could try?
>>
>> Understand why the above question is important, then answer it. Until you
>> do you are just thrashing around in the dark.
>
> The source is a text document that, as far as I know, only contains English
> and Norwegian letters. It can be opened with Notepad and Excel. I tried to
> run through it in Python with:
>
> f = open('c:/file.txt')
>
> for i in f:
>     print i
>
> and that doesn't seem to give any problem. It prints all characters
> without any trouble.

That doesn't narrow it down much though it does point towards latin-1 (or 
cp1252).

> How would I find what encoding the document is in? All I can find is by
> opening Notepad, selecting Font/Script and it says 'Western'.

That doesn't really tell you anything about the document. Try opening the
file in your browser. Most browsers have an encoding menu (View / Character
Encoding in Firefox, View / Encoding in IE). Find the selection in this menu
that makes the text display correctly; that is most likely the encoding of
the file.
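
The same trial-and-error can be approximated in code. This is only a heuristic sketch, assuming Python 3 syntax; guess_encoding and its candidate list are hypothetical names, not part of any library. It cannot tell single-byte encodings apart, since almost any byte string decodes under them, so the candidates are ordered from strictest to most permissive.

```python
# Assumption: Python 3 syntax; guess_encoding is a hypothetical helper.
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate codec that decodes data cleanly, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes, try the next one
        return name
    return None

# A word from the thread, stored the way a Western Windows editor would save it.
sample = "kj\u00e6rne".encode("cp1252")
print(guess_encoding(sample))
```

A clean decode is evidence, not proof: the result still needs a visual check that the letters come out right.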

> Might the problem only be related to Win32com, not Python since Python
> prints it without trouble?

That's another issue. First you need to know what you are starting with.
 
>> Do you know what a character encoding is? Do you understand the
>> difference between utf-8 and latin-1?
>
> Earlier characters had values 1-255. (Ascii). Now, you have a wider
> choice. In our part of the world we can use an extended version which
> contains a lot more, latin-1. UTF-8 is a part of Unicode and contains a
> lot more characters than Ascii.
>
> My knowledge about character encoding doesn't go much farther than this.
> Simply said, I understand that the document that I want to read includes
> characters beyond Ascii, and therefore I need to use UTF-8 or Latin-1. Why
> I should use one instead of the other, I have no idea.

You really should read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know 
About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Kent


[Tutor] Unicode trouble

2005-11-30 Thread Øyvind
Hello.

I am writing a program that reads in a text file, extracts each of the
words, and replaces those words in a different document. It works great
until it encounters a non-English letter.

I have tried the following:

self.f = codecs.open(ordliste, 'r', 'utf-8')
where I open the first file.

And
en = unicode(en)
en = en.encode('utf-8')

as well as
en = en.decode('iso-8859-1')

where
each word is entered from the document.

But, still, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17:
ordinal not in range(128)

As well as this:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170:
invalid data
if I skip the second part.

What is wrong? How can I fix this? I am using ActiveState Python 2.3 and
WinXp.

Thanks in advance...


This is the whole source:

from win32com.client import Dispatch
import time
import codecs

class oversett:
    def __init__(self, ordliste, dokument):
        objWord = Dispatch('Word.Application')
        self.f = codecs.open(ordliste, 'r', 'utf-8')
        #self.f = open(ordliste)
        objDoc = objWord.Documents.Open(dokument)
        self.objSelection = objWord.Selection

    def kjor(self):
        s = time.clock()
        wdReplaceAll = 2
        wdFindContinue = 1
        t = 1
        for i in self.f.readlines():
            en = i.split('\t')[0]
            #en = str(en).decode('iso-8859-1')
            #en = en.decode('iso-8859-1')
            en = unicode(en)
            en = en.encode('utf-8')
            print en
            to = i.split('\t')[1]
            #to = str(to).decode('iso-8859-1')
            #to = to.decode('iso-8859-1')
            to = unicode(to)
            to = to.encode('utf-8')
            t = t + 1
            if t % 1000 == 0:
                print t
            try:
                self.objSelection.Find.Execute(en, False, True, False,
                    False, True, True, wdFindContinue, True, to, wdReplaceAll,
                    False, False, False, False)
            except UnicodeEncodeError:
                print 'pokker'
            except:
                pass

        print time.clock() - s

if __name__ == '__main__':
    n = oversett('c:/ordliste.txt','c:/foo.doc')
    n.kjor()




Re: [Tutor] Unicode trouble

2005-11-30 Thread Kent Johnson
Øyvind wrote:
> Hello.
>
> I am writing a program that reads in a text file, extracts each of the
> words, and replaces those words in a different document. It works great
> until it encounters a non-English letter.
>
> I have tried the following:
>
> self.f = codecs.open(ordliste, 'r', 'utf-8')
> where I open the first file.
>
> And
> en = unicode(en)
> en = en.encode('utf-8')
>
> as well as
> en = en.decode('iso-8859-1')
>
> where
> each word is entered from the document.
>
> But, still, I get this error:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17:
> ordinal not in range(128)
>
> As well as this:
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170:
> invalid data
> if I skip the second part.

Where are you getting these errors (what line of the program)? Do you know what 
kind of strings objSelection.Find.Execute() is expecting?

Kent

 


[Tutor] Unicode trouble

2005-11-30 Thread Øyvind


> Where are you getting these errors (what line of the program)? Do you
> know what kind of strings objSelection.Find.Execute() is expecting?
>
> Kent

The program stops with these errors when it encounters a non-English
letter.

This is the full error:
Traceback (most recent call last):
  File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\Python\BA\Oversett.py", line 47, in ?
  File "C:\Python\BA\Oversett.py", line 23, in kjor
    en = i.split('\t')[0]
  File "C:\Python23\lib\codecs.py", line 388, in readlines
    return self.reader.readlines(sizehint)
  File "C:\Python23\lib\codecs.py", line 314, in readlines
    return self.decode(data, self.errors)[0].splitlines(1)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data

and

Traceback (most recent call last):
  File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\Python\BA\Oversett.py", line 49, in ?
  File "C:\Python\BA\Oversett.py", line 33, in kjor
    if t % 1000 == 0:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17: ordinal not in range(128)

objSelection.Find.Execute() is supposed to accept any kind of string. (It
is the function Search & Replace in MS Word).





Re: [Tutor] Unicode trouble

2005-11-30 Thread Kent Johnson
Øyvind wrote:
 
>> Where are you getting these errors (what line of the program)? Do you
>> know what kind of strings objSelection.Find.Execute() is expecting?
>>
>> Kent
>
> The program stops with these errors when it encounters a non-English
> letter.
>
> This is the full error:
> Traceback (most recent call last):
>   File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>     exec codeObject in __main__.__dict__
>   File "C:\Python\BA\Oversett.py", line 47, in ?
>   File "C:\Python\BA\Oversett.py", line 23, in kjor
>     en = i.split('\t')[0]
>   File "C:\Python23\lib\codecs.py", line 388, in readlines
>     return self.reader.readlines(sizehint)
>   File "C:\Python23\lib\codecs.py", line 314, in readlines
>     return self.decode(data, self.errors)[0].splitlines(1)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data

This is fairly strange as the line
  en = i.split('\t')[0]
should not call any method in codecs. I don't know how you can get such a stack 
trace. Maybe try deleting all the .pyc files to make sure they are in sync with 
the source and try again?

The actual error indicates that the input data is not valid utf-8. Are you sure
that is the correct encoding for the input file? If the file is utf-8 and has
bad characters you could pass errors='ignore' or errors='replace' as a parameter
to codecs.open() to change the error handling style to something more forgiving.
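
As a rough illustration of the two error handlers, the sketch below writes a tiny Latin-1 file and reads it back as utf-8. It assumes Python 3 syntax (where the built-in open() accepts the same encoding and errors arguments); the temporary file and its one-line content are made up for the example and only stand in for the real word list.

```python
# Assumption: Python 3 syntax; the file written here is a made-up stand-in.
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "ordliste.txt")
with open(path, "wb") as handle:
    handle.write("core\tkj\u00e6rne\n".encode("latin-1"))  # not valid utf-8

# The default errors='strict' would raise UnicodeDecodeError on this file.
with codecs.open(path, "r", "utf-8", errors="replace") as handle:
    replaced = handle.read()  # bad byte becomes U+FFFD
with codecs.open(path, "r", "utf-8", errors="ignore") as handle:
    ignored = handle.read()   # bad byte is silently dropped

print(ascii(replaced))
print(ascii(ignored))
```

Both handlers lose the original letter, so they hide a wrong encoding rather than fix it; they are only appropriate when the file really is utf-8 with a few damaged bytes.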
 
> and
>
> Traceback (most recent call last):
>   File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>     exec codeObject in __main__.__dict__
>   File "C:\Python\BA\Oversett.py", line 49, in ?
>   File "C:\Python\BA\Oversett.py", line 33, in kjor
>     if t % 1000 == 0:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17: ordinal not in range(128)

Again this stack trace doesn't make sense, the indicated line doesn't do any 
string operation.

This error message normally occurs when a non-ascii string is converted to 
unicode using the default encoding (which is 'ascii'). Often the conversion is 
implicit in some other operation but I don't see any such operation here.
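
The message itself is easy to reproduce. In Python 2, a call such as unicode(en) on a byte string triggers this decode implicitly with the 'ascii' default; Python 3 has no implicit conversion, so this sketch (Python 3 syntax assumed, sample word made up) spells the decode out explicitly.

```python
# Assumption: Python 3 syntax; Python 2 performed this decode implicitly.
raw = "bl\u00e5".encode("latin-1")  # b'bl\xe5', byte 0xe5 is outside ASCII

message = None
try:
    raw.decode("ascii")  # what the implicit default conversion amounts to
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

Whenever this error names the 'ascii' codec, some byte string is being converted without an explicit encoding, whatever codecs are named elsewhere in the program.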
 
> objSelection.Find.Execute() is supposed to accept any kind of string. (It
> is the function Search & Replace in MS Word).

It has to make some assumption about the type of the string. Does it want 
unicode or encoded bytes? If encoded bytes, what encoding does it expect?

Kent


[Tutor] Unicode trouble

2005-11-30 Thread Øyvind
> Øyvind wrote:
>
>>> Where are you getting these errors (what line of the program)? Do you
>>> know what kind of strings objSelection.Find.Execute() is expecting?
>>>
>>> Kent
>>
>> The program stops with these errors when it encounters a non-English
>> letter.
>>
>> This is the full error:
>> Traceback (most recent call last):
>>   File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>>     exec codeObject in __main__.__dict__
>>   File "C:\Python\BA\Oversett.py", line 47, in ?
>>   File "C:\Python\BA\Oversett.py", line 23, in kjor
>>     en = i.split('\t')[0]
>>   File "C:\Python23\lib\codecs.py", line 388, in readlines
>>     return self.reader.readlines(sizehint)
>>   File "C:\Python23\lib\codecs.py", line 314, in readlines
>>     return self.decode(data, self.errors)[0].splitlines(1)
>> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data
>
> This is fairly strange as the line
>   en = i.split('\t')[0]
> should not call any method in codecs. I don't know how you can get such a
> stack trace.

The file f where en comes from does contain lots of lines with one English
word followed by a tab and a Norwegian one. (Approximately 25000 lines) It
can look like this: core\tkjærne
So en is supposed to be the English word that the program needs to find in
MS Word, and to is the replacement word. So wouldn't that be a string that
should be handled by codecs?

for i in self.f.readlines():
    en = i.split('\t')[0]
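
For reference, a small sketch of reading such tab-separated pairs once the encoding is known. It assumes Python 3 syntax and assumes, without confirmation from the thread, that the word list is cp1252; read_pairs is a hypothetical helper and the second sample line is invented.

```python
# Assumption: Python 3 syntax; cp1252 is a guess at the file's encoding.
def read_pairs(data, encoding="cp1252"):
    """Decode raw bytes once, then split each line into (english, norwegian)."""
    pairs = []
    for line in data.decode(encoding).splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        english, norwegian = line.split("\t", 1)
        pairs.append((english.strip(), norwegian.strip()))
    return pairs

# Two sample lines in the format described above (the second is made up).
sample = "core\tkj\u00e6rne\r\nsolved\tl\u00f8st\r\n".encode("cp1252")
print(ascii(read_pairs(sample)))
```

Decoding once at the point of reading means everything downstream handles text, and no later step needs its own encode or decode call.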

> Maybe try deleting all the .pyc files to make sure they are in sync with
> the source and try again?

This didn't seem to help.

> The actual error indicates that the input data is not valid utf-8. Are
> you sure that is the correct encoding for the input file? If the file is
> utf-8 and has bad characters you could pass errors='ignore' or
> errors='replace' as a parameter to codecs.open() to change the error
> handling style to something more forgiving.

Not valid utf-8? I have tried with latin-1 as well, to no avail. The
letters that are the problem are æøå. They shouldn't be that exotic?

>> Traceback (most recent call last):
>>   File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>>     exec codeObject in __main__.__dict__
>>   File "C:\Python\BA\Oversett.py", line 49, in ?
>>   File "C:\Python\BA\Oversett.py", line 33, in kjor
>>     if t % 1000 == 0:
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17: ordinal not in range(128)
>
> Again this stack trace doesn't make sense, the indicated line doesn't do
> any string operation.
>
> This error message normally occurs when a non-ascii string is converted
> to unicode using the default encoding (which is 'ascii'). Often the
> conversion is implicit in some other operation but I don't see any such
> operation here.

But regardless, shouldn't 'ascii' be excluded here? Since I tell the
program to change to utf-8, not only once but twice?

>> objSelection.Find.Execute() is supposed to accept any kind of string. (It
>> is the function Search & Replace in MS Word).
>
> It has to make some assumption about the type of the string. Does it want
> unicode or encoded bytes? If encoded bytes, what encoding does it
> expect?

I think the letters should be accepted. The Python script here is written
to replace about 25000 MS Word macros, so all the letters have been
accepted by MS Word when fed by Visual Basic. All I have done now is to
extract the words from the macros and put them in a file.






Re: [Tutor] Unicode trouble

2005-11-30 Thread Michael Lange
On Wed, 30 Nov 2005 13:41:54 -0500
Kent Johnson [EMAIL PROTECTED] wrote:


>>>> This is the full error:
>>>> Traceback (most recent call last):
>>>>   File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>>>>     exec codeObject in __main__.__dict__
>>>>   File "C:\Python\BA\Oversett.py", line 47, in ?
>>>>   File "C:\Python\BA\Oversett.py", line 23, in kjor
>>>>     en = i.split('\t')[0]
>>>>   File "C:\Python23\lib\codecs.py", line 388, in readlines
>>>>     return self.reader.readlines(sizehint)
>>>>   File "C:\Python23\lib\codecs.py", line 314, in readlines
>>>>     return self.decode(data, self.errors)[0].splitlines(1)
>>>> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data
>>>
>>> This is fairly strange as the line
>>>   en = i.split('\t')[0]
>>> should not call any method in codecs. I don't know how you can get such a
>>> stack trace.
>>
>> The file f where en comes from does contain lots of lines with one English
>> word followed by a tab and a Norwegian one. (Approximately 25000 lines) It
>> can look like this: core\tkjærne
>
> Yes, I understand that.
>
>> So en is supposed to be the English word that the program needs to find in
>> MS Word, and to is the replacement word. So wouldn't that be a string that
>> should be handled by codecs?
>>
>> for i in self.f.readlines():
>>     en = i.split('\t')[0]
>
> The thing is, it's the line
>   for i in self.f.readlines():
> that is calling the codecs module, not the line
>   en = i.split('\t')[0]
> but it is the latter line that is in the stack trace.
>
> Can any of the other tutors make any sense of this stack trace?

As far as I see here, isn't the line

return self.decode(data, self.errors)[0].splitlines(1)

causing the traceback?

I haven't read all of this thread, but maybe you are trying to pass a
non-utf8 string to the utf8 codec?

Michael






Re: [Tutor] Unicode trouble

2005-11-30 Thread Kent Johnson
Michael Lange wrote:
> On Wed, 30 Nov 2005 13:41:54 -0500
> Kent Johnson [EMAIL PROTECTED] wrote:
>
>>>>> This is the full error:
>>>>> Traceback (most recent call last):
>>>>>   File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>>>>>     exec codeObject in __main__.__dict__
>>>>>   File "C:\Python\BA\Oversett.py", line 47, in ?
>>>>>   File "C:\Python\BA\Oversett.py", line 23, in kjor
>>>>>     en = i.split('\t')[0]
>>>>>   File "C:\Python23\lib\codecs.py", line 388, in readlines
>>>>>     return self.reader.readlines(sizehint)
>>>>>   File "C:\Python23\lib\codecs.py", line 314, in readlines
>>>>>     return self.decode(data, self.errors)[0].splitlines(1)
>>>>> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170: invalid data
>>>>
>>>> This is fairly strange as the line
>>>>   en = i.split('\t')[0]
>>>> should not call any method in codecs. I don't know how you can get such a
>>>> stack trace.
>>
>> Can any of the other tutors make any sense of this stack trace?
>
> As far as I see here, isn't the line
>
> return self.decode(data, self.errors)[0].splitlines(1)
>
> causing the traceback?
>
> I haven't read all of this thread, but maybe you are trying to pass a
> non-utf8 string to the utf8 codec?

Yes, I guess that much is pretty clear - there is some data in the source file 
that is not valid utf-8. 
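
One way to see exactly which bytes are at fault is to let the decoder report them. This is a sketch only, assuming Python 3 syntax; first_invalid_utf8 is a hypothetical helper, and the sample bytes are the example word-list line from this thread encoded as Latin-1.

```python
# Assumption: Python 3 syntax; first_invalid_utf8 is a hypothetical helper.
def first_invalid_utf8(data):
    """Return (offset, offending bytes) for the first utf-8 error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the bytes the decoder rejected.
        return exc.start, data[exc.start:exc.end]
    return None

sample = b"core\tkj\xe6rne"  # the Latin-1 byte 0xe6 sits at offset 7
print(first_invalid_utf8(sample))
```

Seeing a lone byte such as 0xe6 or 0xe5 at the reported offset is a strong hint that the file is in a single-byte encoding like latin-1 or cp1252 rather than utf-8.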

Kent
