On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.bar...@noaa.gov> wrote:
 but disallowing them in higher level
> explicitly cross platform abstractions like pathlib.

I think the trick here is that posix-using folks claim that filenames are
just bytes, and indeed they can be passed around with a char*, so they seem
to be.

But you can't possibly do anything other than pass them around if you
REALLY think they are just bytes.

So really, people treat them as
"bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
maybe a couple others)-is-ascii-compatible"

As someone who fought long and hard in the surrogate-escape listdir() wars, and was won over once the scheme was thoroughly explained to me, I take issue with these assertions: they are bogus or misleading.

Firstly, POSIX filenames _are_ just byte strings. The only forbidden character is the NUL byte, which terminates a C string, and the only special character is the slash, which separates pathname components.

Second, a bare low level program cannot do _much_ more than pass them around. It certainly can do things like compute their basename, or other path related operations.
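A minimal sketch of the sort of path operation meant here, done purely on bytes with no decoding step. The filename is made up for illustration and is deliberately not valid UTF-8; posixpath is used so the result does not depend on the host platform.

```python
import posixpath

# A hypothetical filename containing bytes that are not valid UTF-8.
raw = b"/srv/data/caf\xe9-\xff.txt"

# Path operations only need to know about the slash byte (code 47).
base = posixpath.basename(raw)
parent = posixpath.dirname(raw)
parts = raw.split(b"/")

print(base)    # the final component, still bytes
print(parent)  # everything before the last slash
print(parts)   # components; the leading empty item comes from the root slash
```

Nothing in this needs to know, or guess, what encoding the user had in mind.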

The "bytes in some arbitrary encoding where at least the slash character (and
maybe a couple others) is ascii compatible" notion is completely bogus. There's only one special byte, the slash (code 47). There's no OS-level need that it or anything else be ASCII compatible. I think characterisations such as the one quoted are actively misleading.

The way you get UTF-8 (or some other encoding, fortunately getting less and less common) is by convention: you decide in your environment to work in some encoding (say utf-8) via the locale variables, and all your user-facing text gets used in UTF-8 encoding form when turned into bytes for the filename calls because your text<->bytes methods say to do so.
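To illustrate the "by convention" point, here is a small sketch, with an invented filename, showing that the same user-facing text becomes different filename bytes depending on which encoding the environment has agreed on. The two encodings chosen are only examples.

```python
# Hypothetical user-facing name; the accented letters are U+00E9.
name = "r\u00e9sum\u00e9.txt"

# Two environments with different locale conventions produce different bytes.
as_utf8 = name.encode("utf-8")
as_latin1 = name.encode("latin-1")

# Bytes written under one convention need not decode under another.
try:
    as_latin1.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(as_utf8, as_latin1, decoded_ok)
```

The kernel accepts either byte string as a filename; only the convention in force says which text it stands for.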

I think we'd all agree it is nice to have a system where filenames are all Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore the reality for such systems. I certainly think the Windows-side Babel of code pages and multiple code systems is far far worse. (Disclaimer: not a Windows programmer, just based on hearing them complain.)

I'm +1000 on systems where the filesystem enforces Unicode (eg Plan 9 or Mac OSX, which forces a specific UTF-8 encoding in the bytes POSIX APIs - the underlying filesystems reject invalid byte sequences).

[...]
Antoine Pitrou wrote:
To elaborate specifically about pathlib, it doesn't handle bytes paths
but allows you to generate them if desired:
https://docs.python.org/3/library/pathlib.html#operators

but that uses
os.fsencode:  Encode filename to the filesystem encoding

As I understand it, the whole problem with some posix systems is that there
is NO filesystem encoding -- i.e. you can't know for sure what encoding a
filename is in. So you need to be able to pass the bytes through as they
are.

Yes and no. I made that argument too.

There's no _external_ "filesystem encoding" in the sense of something recorded in the filesystem that anyone can inspect. But there is the expressed locale settings, available at runtime to any program that cares to pay attention. It is a workable situation.
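A rough sketch of how this plays out in Python 3. The helpers to_text and to_bytes are hypothetical names for illustration, and UTF-8 is hardcoded as an assumption so the example behaves the same everywhere; os.fsdecode and os.fsencode do the equivalent job using whatever encoding the runtime picked up, which on POSIX normally follows the locale settings.

```python
import os
import sys
from pathlib import PurePosixPath

# The encoding this process will use for filenames.
print(sys.getfilesystemencoding())

def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode filename bytes; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")

def to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode back; escaped surrogates turn into the original bytes."""
    return text.encode(encoding, "surrogateescape")

# A hypothetical Latin-1 filename met in a UTF-8 environment.
raw = b"caf\xe9.txt"
text = to_text(raw)
assert to_bytes(text) == raw  # the bytes survive the round trip unchanged

# pathlib produces bytes paths by way of os.fsencode.
as_bytes = bytes(PurePosixPath("/tmp/report.txt"))
assert as_bytes == os.fsencode("/tmp/report.txt")
```

So a name that does not match the locale's encoding can still be listed, carried around as str, and handed back to the OS as the exact bytes it started as.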

Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly internally consistent. It just doesn't match what he wants. (Indeed, what I want, and I'm a long time UNIX fanboy.)

Cheers,
Cameron Simpson <c...@zip.com.au>

God is real, unless declared integer.   - Johan Montald, jo...@ingres.com