Re: base64 and unicode
EuGeNe Van den Bulke [EMAIL PROTECTED] wrote:

> Duncan Booth wrote:
>> However, the decoded text looks as though it is utf16 encoded, so it
>> should be written as binary, i.e. the output mode should be wb.
>
> Thanks for the wb tip, that works (see below). I guess it is
> experience-based, but how could you tell that it was utf16 encoded?

I pasted the encoded form into idle and decoded it from base64. It ends with \r\x00\n\x00, and the nulls instantly suggest a 16-bit encoding. Scrolling to the beginning, it starts with \xff\xfe, which is the BOM for little-endian utf16.

--
http://mail.python.org/mailman/listinfo/python-list
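The two hints described above (a BOM at the start, nulls in the line ending) can be turned into a rough check. This is only a sketch in Python 3 (the thread itself is Python 2.5); `guess_utf16` is a hypothetical helper name, and the heuristic will miss or misjudge some inputs, for example UTF-32 little-endian data, whose BOM also begins with \xff\xfe.

```python
import codecs
from typing import Optional


def guess_utf16(data: bytes) -> Optional[str]:
    """Return a codec name if data looks like UTF-16, else None (heuristic only)."""
    # A byte order mark at the very start is the strongest hint.
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    # No BOM: fall back to the "nulls" hint. A '\r\n' line ending is
    # b'\r\x00\n\x00' in little-endian and b'\x00\r\x00\n' in big-endian.
    if data.endswith(b"\r\x00\n\x00"):
        return "utf-16-le"
    if data.endswith(b"\x00\r\x00\n"):
        return "utf-16-be"
    return None


# Stand-in for the decoded file contents (made-up sample text).
sample = codecs.BOM_UTF16_LE + "שלום\r\n".encode("utf-16-le")
print(guess_utf16(sample))
```

For anything beyond a quick look in an interactive session, a proper detection library or knowledge of where the file came from is a safer basis than this kind of guess.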
Re: base64 and unicode
EuGeNe Van den Bulke [EMAIL PROTECTED] wrote:

> import base64
> base64.decode(file('hebrew.b64','r'),file('hebrew.lang','w'))
>
> It runs but the result is not correct: some of the lines in hebrew.lang
> are correct but not all of them (hebrew.expected.lang is the correct
> file). I guess it is a unicode problem but can't seem to find out how to
> fix it.

My guess would be that your problem is that you wrote the file in text mode, so (assuming you are on Windows) all newline characters in the output are converted to carriage return/linefeed pairs. However, the decoded text looks as though it is utf16 encoded, so it should be written as binary, i.e. the output mode should be wb.

Simpler than using the base64 module, you can just use the base64 codec. This will decode a string to a byte sequence, and you can then decode that to get the unicode string:

    with file('hebrew.b64','r') as f:
        text = f.read().decode('base64').decode('utf16')

You can then write the text to a file through any desired codec, or process it first.

BTW, you may just have shortened your example too much, but depending on Python to close files for you is risky behaviour. If an exception is thrown before the file goes out of scope, the file may not get closed when you expect, and that can lead to some fairly hard-to-track problems. It is much better either to call the close method explicitly or to use Python 2.5's 'with' statement.
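For readers on Python 3, where `file()` and the `'base64'` string codec no longer exist, the same two-step decode might look roughly like the sketch below. `decode_b64_utf16` is a hypothetical helper, and the sample text is made up rather than taken from hebrew.b64. The last few lines simulate the newline translation described above, which is one plausible reason only some lines came out wrong: each extra byte shifts the 16-bit alignment, and the next one shifts it back.

```python
import base64
import codecs


def decode_b64_utf16(b64_data: bytes) -> str:
    """Base64 -> raw bytes -> text, assuming the payload is UTF-16 with a BOM."""
    raw = base64.b64decode(b64_data)
    # The BOM tells the utf-16 codec the byte order and is dropped from the result.
    return raw.decode("utf-16")


# Build a small stand-in for the contents of hebrew.b64 (two made-up lines).
original = "שלום\r\nעולם\r\n"
raw = codecs.BOM_UTF16_LE + original.encode("utf-16-le")
payload = base64.b64encode(raw)

text = decode_b64_utf16(payload)
print(text == original)

# Simulate Windows text mode on write: every b"\n" byte becomes b"\r\n".
damaged = raw.replace(b"\n", b"\r\n")
# errors="replace" because the shifted bytes may no longer be valid UTF-16.
garbled = damaged.decode("utf-16", errors="replace")
print(garbled == original)
```

On Python 2.5 itself, the `with` statement additionally needs `from __future__ import with_statement` at the top of the module.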
Re: base64 and unicode
Duncan Booth wrote:

> However, the decoded text looks as though it is utf16 encoded, so it
> should be written as binary, i.e. the output mode should be wb.

Thanks for the wb tip, that works (see below). I guess it is experience-based, but how could you tell that it was utf16 encoded?

> Simpler than using the base64 module, you can just use the base64 codec.
> This will decode a string to a byte sequence, and you can then decode
> that to get the unicode string:
>
>     with file('hebrew.b64','r') as f:
>         text = f.read().decode('base64').decode('utf16')
>
> You can then write the text to a file through any desired codec, or
> process it first.

    with file('hebrew.lang','wb') as f:
    ...     f.write(text.encode('utf16'))

Done ... superb!

> BTW, you may just have shortened your example too much, but depending on
> Python to close files for you is risky behaviour. If an exception is
> thrown before the file goes out of scope, the file may not get closed
> when you expect, and that can lead to some fairly hard-to-track
> problems. It is much better either to call the close method explicitly
> or to use Python 2.5's 'with' statement.

Yes, I had shortened my example, but thanks for the 'with' statement tip ... I never think about using it and I should ;)

Thanks,

EuGeNe -- http://www.3kwa.com
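A Python 3 sketch of the write step, for comparison: the text is encoded explicitly and written in binary mode so that no newline translation can touch the bytes, and `with` closes the file even if an exception is raised. `write_utf16` and `read_utf16` are hypothetical helper names, and the temporary directory and sample text are there only to make the round trip self-contained.

```python
import os
import tempfile


def write_utf16(path: str, text: str) -> None:
    """Encode text as UTF-16 (with BOM) and write the bytes untouched."""
    # "wb" means binary mode: the OS does not translate "\n" to "\r\n".
    with open(path, "wb") as f:
        f.write(text.encode("utf-16"))


def read_utf16(path: str) -> str:
    """Read the bytes back and decode them; the BOM selects the byte order."""
    with open(path, "rb") as f:
        return f.read().decode("utf-16")


# Round trip through a temporary file (sample text is made up).
sample_text = "שלום\r\nעולם\r\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    lang_path = os.path.join(tmp_dir, "hebrew.lang")
    write_utf16(lang_path, sample_text)
    round_tripped = read_utf16(lang_path)

print(round_tripped == sample_text)
```

An alternative with the same effect would be to open the file in text mode with `encoding="utf-16"` and `newline=""`, which lets Python do the encoding while still leaving line endings alone.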