Re: base64 and unicode

2007-05-04 Thread Duncan Booth
EuGeNe Van den Bulke [EMAIL PROTECTED] wrote:

 Duncan Booth wrote:
 However, the decoded text looks as though it is utf16 encoded so it
 should be written as binary. i.e.  the output mode should be wb.
 
 Thanks for the wb tip that works (see below). I guess it is 
 experience based but how could you tell that it was utf16 encoded?

I pasted the encoded form into IDLE and base64-decoded it. It ends with
\r\x00\n\x00, and the nulls instantly suggest a 16-bit encoding. Scrolling to
the beginning, it starts with \xff\xfe, which is the BOM for little-endian
UTF-16.
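
The inspection described above can be sketched in code. This is only an illustration in Python 3 syntax (the thread itself is about Python 2.5); `guess_utf16` is a hypothetical helper name, and the sample data is made up to resemble the file under discussion. It checks only the two UTF-16 byte order marks and ignores UTF-32, whose little-endian BOM begins with the same two bytes.

```python
import base64
import codecs

def guess_utf16(raw):
    """Return the UTF-16 variant named by a leading BOM, or None if there is none."""
    if raw.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return "utf-16-be"
    return None

# Made-up sample: little-endian UTF-16 with a BOM, then base64 encoded.
sample = base64.b64encode(codecs.BOM_UTF16_LE + "shalom\r\n".encode("utf-16-le"))

raw = base64.b64decode(sample)
print(repr(raw[:2]))     # the first two bytes (the BOM)
print(repr(raw[-4:]))    # the trailing bytes, with a null after each character
print(guess_utf16(raw))
```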


Re: base64 and unicode

2007-05-04 Thread Duncan Booth
EuGeNe Van den Bulke [EMAIL PROTECTED] wrote:

  import base64
  base64.decode(file("hebrew.b64","r"),file("hebrew.lang","w"))
 
 It runs but the result is not correct: some of the lines in hebrew.lang 
 are correct but not all of them (hebrew.expected.lang is the correct 
 file). I guess it is a unicode problem but can't seem to find out how to 
 fix it.

My guess would be that your problem is that you wrote the file in text
mode, so (assuming you are on Windows) all newline characters in the output
are converted to carriage return/linefeed pairs. However, the decoded text
looks as though it is UTF-16 encoded, so it should be written as binary,
i.e. the output mode should be "wb".
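
To illustrate why text mode damages this kind of data, here is a small sketch in Python 3 syntax. It does not open a file in text mode (that behaviour depends on the platform); instead it simulates the Windows translation with `bytes.replace`, which is an assumption made so the example behaves the same everywhere. The sample text is made up.

```python
import codecs

text = "line one\nline two\n"
# BOM followed by two bytes per character, little-endian.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Simulate Windows text-mode output: every b"\n" byte becomes b"\r\n".
mangled = data.replace(b"\n", b"\r\n")

# Each inserted byte shifts everything after it by one, so the 16-bit
# units no longer line up until the next inserted byte shifts them back.
print(len(data), len(mangled))
print(data.decode("utf-16") == text)
print(mangled.decode("utf-16", errors="replace") == text)
```

This alternating misalignment is consistent with the symptom reported in the thread, where only some lines of the output came out correct.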

Rather than using the base64 module, you can just use the base64 codec,
which is simpler. This decodes a string to a byte sequence, which you can
then decode to get the unicode string:

from __future__ import with_statement  # required for 'with' on Python 2.5

with file("hebrew.b64","r") as f:
    text = f.read().decode('base64').decode('utf16')

You can then write the text to a file through any desired codec or process 
it first.
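
For readers on Python 3, where `str.decode` and the `file` builtin no longer exist, the same two-step decode might look like the sketch below. The helper name `decode_b64_text`, the stand-in data, and the output filename are all hypothetical; the real hebrew.b64 is not available here.

```python
import base64
import os
import tempfile

def decode_b64_text(b64_bytes, encoding="utf-16"):
    """Base64-decode to raw bytes, then decode those bytes to a str."""
    return base64.b64decode(b64_bytes).decode(encoding)

# Stand-in for the contents of hebrew.b64: a short Hebrew word plus CRLF.
original = "\u05e9\u05dc\u05d5\u05dd\r\n"
b64_bytes = base64.b64encode(original.encode("utf-16"))

text = decode_b64_text(b64_bytes)

# Write the text out through another codec. newline="" stops Python from
# translating the line endings that are already present in the text.
path = os.path.join(tempfile.mkdtemp(), "hebrew.utf8.txt")
with open(path, "w", encoding="utf-8", newline="") as out:
    out.write(text)
```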

BTW, you may just have shortened your example too much, but depending on
Python to close files for you is risky behaviour. If an exception is thrown
before the file goes out of scope, it may not get closed when you expect,
and that can lead to some fairly hard-to-track problems. It is much better
either to call the close method explicitly or to use Python 2.5's 'with'
statement.
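
The two options mentioned above can be sketched side by side. This is an illustration in Python 3 syntax using a temporary file; the path and the simulated failure are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.bin")

# Option 1: explicit close. try/finally makes sure close() runs
# even if the write raises.
f = open(path, "wb")
try:
    f.write(b"\xff\xfe")
finally:
    f.close()

# Option 2: a 'with' block. The file is closed when the block exits,
# including when the body raises an exception.
try:
    with open(path, "rb") as g:
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

print(f.closed, g.closed)
```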


Re: base64 and unicode

2007-05-04 Thread EuGeNe Van den Bulke
Duncan Booth wrote:
 However, the decoded text looks as though it is utf16 encoded so it should be
 written as binary. i.e. the output mode should be wb.

Thanks for the wb tip that works (see below). I guess it is 
experience based but how could you tell that it was utf16 encoded?

 Simpler than using the base64 module you can just use the base64 codec. 
 This will decode a string to a byte sequence and you can then decode that 
 to get the unicode string:
 
 with file("hebrew.b64","r") as f:
     text = f.read().decode('base64').decode('utf16')
 
 You can then write the text to a file through any desired codec or process 
 it first.

>>> with file("hebrew.lang","wb") as f:
...     f.write(text.encode('utf16'))

Done ... superb!

 BTW, you may just have shortened your example too much, but depending on 
 python to close files for you is risky behaviour. If you get an exception 
 thrown before the file goes out of scope it may not get closed when you 
 expect and that can lead to some fairly hard to track problems. It is much 
 better to either call the close method explicitly or to use Python 2.5's 
 'with' statement.

Yes I had shortened my example but thanks for the 'with' statement tip 
... I never think about using it and I should ;)

Thanks,

EuGeNe -- http://www.3kwa.com