On Thu, Jun 28, 2012 at 3:48 PM, James Chapman <ja...@uplinkzero.com> wrote:
> Informative thanks Jerry, however I'm not out of the woods yet.
>
>
>> Here's a couple of questions that you'll need to answer 'Yes' to
>> before you're going to get this to work reliably:
>>
>> Are you familiar with the differences between byte strings and unicode
>> strings?
>
> I think so, although I'm probably missing key bits of information.

Okay.  I would start with this article as a refresher:
http://www.joelonsoftware.com/articles/Unicode.html

This isn't a bad reference either:
http://docs.python.org/howto/unicode.html

>> Do you understand how to convert from one to the other,
>> using a particular encoding?
>
> No not really. This is something that's still very new to me.

Okay.  The lowest-level way is to work with the strings themselves.
Byte strings (the Python 2.x str type) have a .decode() method that
will transform a byte string in a particular encoding into the
equivalent unicode string.  Unicode strings have an .encode() method
to convert a unicode string into the equivalent byte string in a
particular encoding.

So, for example:

byte_string = "This is a pound sign: \xA3"

My terminal uses cp1252 as the encoding.  I know that because I
checked sys.stdin.encoding as I was working up this example.  In
CP1252, the pound character is 0xA3.  I looked that up on a chart on
Wikipedia.

So, if I print the string, and the encoding of the byte string matches
the encoding on my terminal, everything should look fine:

>>> print byte_string
This is a pound sign: £

That works because the encoding my terminal expects matches the
encoding of the bytes Python put out.  If I want to convert that
string to unicode, I could do this:

unicode_string = byte_string.decode('cp1252')

Now unicode_string holds the unicode version of my original string,
and I can encode that into any encoding I want.  For instance:

>>> print repr(unicode_string.encode('utf8'))
'This is a pound sign: \xc2\xa3'

>>> print repr(unicode_string.encode('utf-16'))
'\xff\xfeT\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00p\x00o\x00u\x00n\x00d\x00 \x00s\x00i\x00g\x00n\x00:\x00 \x00\xa3\x00'

>>> print repr(unicode_string.encode('shift-jis'))
'This is a pound sign: \x81\x92'
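
Going the other direction, .decode() with the matching codec gets you
back to the same unicode object.  A quick sketch of the round trip,
reusing the same example string:

>>> unicode_string.encode('utf8').decode('utf8') == unicode_string
True
>>> unicode_string.encode('utf-16').decode('utf-16') == unicode_string
True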

>> Do you know what encoding your source
>> file is saved in?
>
> The name of the file I'm trying to open comes from a UTF-16-encoded text
> file; I'm then using a regex to extract the string (the filename) I need to
> open.  However, all the examples I've been using here are just typed into the
> Python console, meaning the string source at this stage is largely irrelevant.
>
>> If your string is not coming from a source file,
>> but some other source of bytes, do you know what encoding those bytes
>> are using?
>>
>> Try the following.  Before trying to convert filename to unicode, do a
>> "print repr(filename)".  That will show you the byte string, along
>> with the numeric codes for the non-ascii parts.  Then convert those
>> bytes to a unicode object using the appropriate encoding.  If the
>> bytes are utf-8, then you'd do something like this:
>> unicode_filename = unicode(filename, 'utf-8')
>
>>>> print(repr(filename))
> "This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"
>
>>>> fileName = unicode(filename, 'utf-8')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 35: 
> invalid start byte
>
>>>> fileName = unicode(filename, 'utf-16')
>
>>>> fileName
> u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520\u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72\u2e73\u7874\u2e74\u736a'
>
>
>
> So I now have a UTF-16 encoded string, but I still can't open it.

No.  You have mojibake.  Try printing that unicode string, and you'll
see that you get a huge mess of Asian characters.  You need to figure
out what encoding that byte string is really in.  If you're typing it
on the terminal, do this:

import sys
print sys.stdin.encoding

If the pound sign is being encoded as 0x9c (which it is, based on the
example you showed), then your terminal is using an odd encoding.  At
a guess, cp850, maybe?
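
If cp850 does turn out to be right, the decode step is just this (a
sketch; substitute whatever encoding sys.stdin.encoding actually
reports):

>>> fileName = filename.decode('cp850')   # same as unicode(filename, 'cp850')
>>> print repr(fileName)

and the repr should now show \xa3 (the unicode pound sign) where the
\x9c byte used to be.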

>
>>>> codecs.open(fileName, 'r', 'utf-16')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "C:\Python27\lib\codecs.py", line 881, in open
>    file = __builtin__.open(filename, mode, buffering)
> IOError: [Errno 2] No such file or directory: 
> u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520\u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72\u2e73\u7874\u2e74\u736a'
>
>
> I presume I need to perform some kind of decode operation on it to open the
> file, but then am I not basically going back to my starting point?

Once you have the name properly decoded from a byte string to a
unicode string, just supply the unicode string to the open() function.
Everything should be okay then.
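
Putting it all together, something along these lines should work
(assuming cp850 really is your terminal's encoding, and that the file
itself is UTF-16, as you said):

import codecs

fileName = filename.decode('cp850')       # byte string -> unicode string
f = codecs.open(fileName, 'r', 'utf-16')  # unicode name; contents decoded as utf-16
data = f.read()
f.close()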

-- 
Jerry
