Re: Newbie question about text encoding

Chris Angelico Sat, 07 Mar 2015 09:29:26 -0800

On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> See:
>
>    $ mkdir /tmp/xyz
>    $ touch /tmp/xyz/$'\x80'
>    $ python3
>    Python 3.3.2 (default, Dec  4 2014, 12:49:00)
>    [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
>    Type "help", "copyright", "credits" or "license" for more information.
>    >>> import os
>    >>> os.listdir('/tmp/xyz')
>    ['\udc80']
>    >>> open(os.listdir('/tmp/xyz')[0])
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in <module>
>    FileNotFoundError: [Errno 2] No such file or directory: '\udc80'
>
> File names encoded with Latin-X are quite commonplace even in UTF-8
> locales.


That is not a problem with UTF-8, though. I don't understand how
you're blaming UTF-8 for that. There are two things happening here:

1) The underlying file system is not guaranteed to be UTF-8, so the
decode to Unicode has to have some special handling of bytes that
fail to decode.
2) You forgot to put the directory on the front of the file name, so
it failed to find the file.
Here's my version of your demo:

>>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
<_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>

Looks fine to me.
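
For illustration, here is a small sketch of the second point:
os.listdir returns bare names, so the directory has to be joined back
on before opening. The helper name list_full_paths and the file name
demo_file.txt are made up for this example, and it uses a temporary
directory rather than /tmp/xyz.

```python
import os
import tempfile


def list_full_paths(directory):
    """Return openable paths for the entries of `directory`.

    os.listdir() yields bare names, so each one is joined onto
    the directory it came from.
    """
    return [os.path.join(directory, name) for name in os.listdir(directory)]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, chosen only for the demonstration.
    with open(os.path.join(tmp, "demo_file.txt"), "w"):
        pass

    names = os.listdir(tmp)        # bare names, relative to tmp
    paths = list_full_paths(tmp)   # names with the directory prepended

    # The joined paths exist; a bare name only resolves if the current
    # directory happens to be tmp.
    joined_exist = all(os.path.exists(p) for p in paths)
```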

Alternatively, if you pass a byte string to os.listdir, you get back a
list of byte string file names:

>>> os.listdir(b"/tmp/xyz")
[b'\x80']
>>> open(b"/tmp/xyz/"+os.listdir(b'/tmp/xyz')[0])
<_io.TextIOWrapper name=b'/tmp/xyz/\x80' mode='r' encoding='UTF-8'>
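
A rough sketch of how the two forms relate, assuming a temporary
directory and a made-up file name (demo_file.txt): the type of the
argument to os.listdir decides the type of the names that come back,
and os.fsencode / os.fsdecode convert between them.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, chosen only for the demonstration.
    with open(os.path.join(tmp, "demo_file.txt"), "w"):
        pass

    str_names = os.listdir(tmp)                 # str in, str out
    bytes_names = os.listdir(os.fsencode(tmp))  # bytes in, bytes out

    # os.fsdecode turns the bytes names into the same str names
    # that the text API returned.
    converted = [os.fsdecode(name) for name in bytes_names]
```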

Either way works. You can use bytes or text, and if you use text,
there is a way to smuggle bytes through it (the surrogateescape error
handler from PEP 383, which is where the '\udc80' comes from). None of
this has anything to do with UTF-8 as an encoding. (Note that the
"encoding='UTF-8'" in the repr refers to the presumed encoding of the
file contents, not of the file name. As an empty file, it can be
considered to be a stream of zero Unicode characters, encoded UTF-8,
so that's valid.)
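
A minimal sketch of the smuggling, assuming it is done with the
surrogateescape error handler (PEP 383). It works on an in-memory
byte string, so it does not touch the file system at all:

```python
raw = b"\x80"  # a lone 0x80 byte is not valid UTF-8

# With surrogateescape, the undecodable byte 0x80 becomes the lone
# surrogate U+DC80 instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original byte.
round_tripped = text.encode("utf-8", "surrogateescape")

# A strict encode refuses the lone surrogate, so the smuggled byte
# cannot silently leak out as if it were real text.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

The text form is only a carrier: it round-trips back to the original
bytes, but it is not valid Unicode text in its own right.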

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
