Getting a bit OT, but I *think* this is the story:

I've heard it argued, by folks that want to write Python software that uses
bytes for filenames, that:

A file name on a *nix system can be any string of bytes, except for two
special values: null, which terminates the string, and slash, which
separates path components:

b'\x00'   : null
b'\x2f'   : slash

(consistent with this SO post, among many other sources:
https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-encoding
)
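
As a rough sketch of that rule (the helper name `is_valid_component` is my
own invention, not something from the standard library, and it ignores
length limits and the special names "." and ".."), a single path component
is acceptable as long as it avoids those two byte values:

```python
def is_valid_component(name: bytes) -> bool:
    """Return True if `name` avoids the two forbidden byte values.

    Hypothetical helper for illustration only: it does not check length
    limits or the special names b'.' and b'..'.
    """
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


print(is_valid_component(b"caf\xe9.txt"))   # Latin-1 bytes, not valid UTF-8
print(is_valid_component(b"\xff\xfe\x80"))  # arbitrary high bytes
print(is_valid_component(b"a/b"))           # contains a slash
print(is_valid_component(b"a\x00b"))        # contains a null
```

Note that nothing here asks whether the bytes decode in any particular
encoding; that is the whole point of the argument.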

So any encoding will work, as long as those two values mean the right
thing. Practically, null is always null, so that leaves the slash.

So any encoding that uses b'\x2f' for the slash would work. At first glance
that seems to include, for instance, UTF-16:

In [31]: "/".encode('utf-16')

Out[31]: b'\xff\xfe/\x00'

In [40]: [hex(b) for b in "/".encode('utf-16')]

Out[40]: ['0xff', '0xfe', '0x2f', '0x0']

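However, if one were to actually use that in raw form and, for instance,
split on the \x2f byte, the result would not be useful: in UTF-16 the slash
is really the two bytes b'/\x00', so the split leaves a stray null byte at
the start of the second piece.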

In [53]: first, second = "first_part/second_part".encode('utf-16').split(b'/')

In [54]: first.decode('utf-16')

Out[54]: 'first_part'

In [55]: second.decode('utf-16')

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-55-9eec3a9ebb3d> in <module>
----> 1 second.decode('utf-16')

UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 22: truncated data
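
Looking at the raw bytes shows why. This is only a quick sketch; I use
'utf-16-le' explicitly (my choice, so there is no BOM and the result does
not depend on the machine's byte order):

```python
data = "first_part/second_part".encode("utf-16-le")

# In UTF-16-LE the slash is the two-byte unit b'/\x00', so splitting on the
# single byte b'/' strands its high byte at the start of the second piece.
first, second = data.split(b"/")
print(second[:3])        # begins with the orphaned null byte
print(len(second) % 2)   # odd length, so it cannot be whole 16-bit units

# The byte 0x2f also turns up inside characters that are not a slash at all.
# U+012F (i with ogonek) is one such character.
print("\u012f".encode("utf-16-le"))
```

So UTF-16 fails both conditions: the slash is not a single byte, and the
byte 0x2f is not reserved for the slash.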

In practice, I suspect that every *nix system uses encoding(s) that are
ASCII-compatible for the first 128 values (0-127), or more precisely that
encode the slash character as a single byte of value 0x2f. As long as
that's the case (and that byte value doesn't show up as part of any other
character), software can know nothing about the encoding and still do two
things:

  - pass it around
  - split on the slash

That may be enough for various system tools, so it is a fine argument for
facilitating the use of bytes in things like globbing directories, opening
files, etc.
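
To illustrate the "know nothing about the encoding" case, here is a small
sketch. The names `split_path` and `join_path` are made up for this
example; the same code handles a UTF-8 name and a Latin-1 name without
ever decoding either one:

```python
def split_path(path: bytes) -> list:
    """Split a bytes path on the slash byte without decoding it."""
    return path.split(b"/")


def join_path(parts: list) -> bytes:
    """Inverse of split_path: glue the components back together."""
    return b"/".join(parts)


for encoding in ("utf-8", "latin-1"):
    path = b"/home/user/" + "café".encode(encoding) + b".txt"
    parts = split_path(path)
    # Round-trips exactly, whatever the encoding of the last component is.
    assert join_path(parts) == path
    print(encoding, parts)
```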

But as soon as you have any interaction with humans, filenames need to be
meaningful to humans. And as soon as you manipulate the names beyond
splitting and merging on a slash, you do need to know something about the
encoding.

In practice, maybe knowing that it's ASCII-compatible for byte values 0-127
will get you pretty far, as you can do things like:

if filename_bytes.endswith(b'.txt'):
    root_name = filename_bytes[:-4]
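
The same idea wrapped in a function, as a sketch only. `strip_suffix` is a
name I made up for illustration, written by hand so it does not depend on
any new bytes method:

```python
def strip_suffix(name: bytes, suffix: bytes) -> bytes:
    """Return `name` without `suffix` if it ends with it, else unchanged."""
    if suffix and name.endswith(suffix):
        return name[: -len(suffix)]
    return name


# The ASCII suffix is found the same way in either encoding of the stem.
print(strip_suffix("café.txt".encode("utf-8"), b".txt"))
print(strip_suffix("café.txt".encode("latin-1"), b".txt"))
print(strip_suffix(b"notes.md", b".txt"))
```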

So adding the new stripsuffix or whatever we call it makes sense. However:

As soon as someone wants to do anything even a bit more sophisticated that
may involve non-ASCII characters, it all goes to heck.
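
Two small examples of what "goes to heck" can look like, assuming the
bytes happen to be UTF-8 (that assumption is mine, for the sake of the
demonstration):

```python
name = "café.txt".encode("utf-8")   # b'caf\xc3\xa9.txt'

# bytes.upper() only maps the ASCII letters a-z; the two bytes of 'é'
# are left untouched, so the result is not the uppercase of the text.
shouted = name.upper()
print(shouted.decode("utf-8"))

# Cutting at a byte offset can land in the middle of a multi-byte character.
truncated = name[:4]                # keeps only the first half of 'é'
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode:", exc.reason)
```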

And my understanding is that with the 'surrogateescape' error handler, you
can decode with a "maybe right" encoding, manipulate the resulting str, and
then encode back using the same encoding and error handler, so that any
undecodable bytes come back unchanged.
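
A minimal sketch of that round trip. The file name and the guess of UTF-8
are assumptions chosen for the example; the bytes are really Latin-1:

```python
raw = b"caf\xe9.txt"   # 'café.txt' in Latin-1; the lone 0xe9 is invalid UTF-8

# Guess UTF-8. Bytes that do not decode become lone surrogates
# in the range U+DC80 to U+DCFF instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Manipulate as str: swap the extension.
if text.endswith(".txt"):
    text = text[: -len(".txt")] + ".md"

# Encode with the same codec and handler to get the original bytes back.
result = text.encode("utf-8", errors="surrogateescape")
print(result)
```

The guess was wrong, but because the manipulation only touched the ASCII
part, the undecodable byte passes through untouched.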

Though this still goes to heck if the encoding uses more than one byte for
the slash (or if an escaped surrogate gets caught up in some other
manipulation you do).
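
Both failure modes can be sketched briefly. The inputs are contrived for
the demonstration: UTF-16-LE data that gets treated as UTF-8, and a
Latin-1 byte that ends up as an escaped surrogate:

```python
# Case 1: the slash is two bytes (UTF-16-LE), but we guessed UTF-8.
data = "a/b".encode("utf-16-le")    # b'a\x00/\x00b\x00'
text = data.decode("utf-8", errors="surrogateescape")
pieces = [p.encode("utf-8", errors="surrogateescape") for p in text.split("/")]
print(pieces)   # the second piece starts with an orphaned null byte

# Case 2: an escaped surrogate meets code that encodes strictly,
# for instance when writing the name to a log or a terminal.
escaped = b"caf\xe9".decode("utf-8", errors="surrogateescape")
try:
    escaped.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode fails:", exc.reason)
```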

Anyway -- this is why it seems like a bad idea to give the bytes object any
more "string like" functionality.

But bytes has a pretty full set of "string like" methods now, so I suppose
it makes sense to add a couple new ones that are related to ones that are
already there.

-CHB


-- 
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ADOT6FCMVGZ3SDDIKJHHQP7ZGTYQALAL/