Re: [Python-Dev] File system path encoding on Windows

Stephen J. Turnbull Mon, 22 Aug 2016 21:23:50 -0700

Steve Dower writes:

 > The Windows world is Unicode. Mostly represented in UTF-16, but UTF-8 is 
 > entirely equivalent.


Sort of, yes, and not for present purposes.

AFAICS, the Windows world is mostly application/* media that require
substantial developer effort to extract text from; character encoding
is a minor annoyance.  These are not Unicode, even if the embedded
text uses the Unicode coded character set.  When it comes to text/*
media (including file system names), my personal experience is that
non-Unicode encodings are used often, even where they're forbidden
(and, ironically enough, where forbidden only by Windows users[1]).

As far as the UTF in use, I concede your expertise.

UTF-8 is absolutely not equivalent to UTF-16 from the point of view of
developers. Passing it to Windows APIs requires decoding to UTF-16 (or
from a Python developer's point of view, decoding to str and use of
str APIs).  That fact is what got you started on this whole proposal!
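
To make that concrete, here is a minimal Python sketch.  The file name
and the helper utf8_to_utf16 are made up for illustration; the point
is only that the two byte sequences differ, so UTF-8 bytes have to
pass through str before they can be handed to a UTF-16 API.

```python
# Hypothetical file name, for illustration only.
name = "日本語.txt"

utf8_bytes = name.encode("utf-8")
utf16_bytes = name.encode("utf-16-le")  # little-endian UTF-16, as used by the *W APIs


def utf8_to_utf16(raw: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-16-LE by way of str."""
    return raw.decode("utf-8").encode("utf-16-le")


# The two encodings of the same name are different byte strings.
print(utf8_bytes == utf16_bytes)
# Transcoding through str recovers the UTF-16 form.
print(utf8_to_utf16(utf8_bytes) == utf16_bytes)
```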

 > All MSVC users have been pushed towards Unicode for many years.

But that "push" is due to the use of UTF-16-based *W APIs and
deprecation of ACP-based *A APIs, right?  The input to *W APIs must be
decoded from all text/* content "out there", including UTF-8 content.
I don't see evidence that users have been pushed toward *UTF-8* in that
statement; they may be decoding from something else.  Unicode != UTF-8
for our purposes!

In any case, I suspect a lot of people use Python to avoid C, and so
existing Python users may not be affected by MSVC "pressure".

 > The .NET Framework has defaulted to UTF-8

Default != enforce, though.  Do you know that almost nobody changes
the default, and that behavior is fairly uniform across different
classes of organization (specifically by language)?  Or did you mean
"enforce"?

 > its entire existence. The use of code pages has been discouraged
 > for decades. We're not going first :)

The fact that a framework, which by definition provides a world-
within-a-world, can insist on UTF-8 from the start is very different
from a generic programming language, which has deliberately provided
multiscript capability for decades.  People who buy in to .NET do so
because the disadvantages (which may include character encoding
conversion at the boundary, or "purification" of the environment to
use only UTF-8) are outweighed by both the individual features of the
framework and their packaging into a consistent whole.  This is
closely related to my idea about "effective monopoly IT providers".

By contrast, people who use Python may very well have done so to
*avoid* the Unicode strictures of .NET (or at least consider it a
convenience compared to changing user behavior to conform to .NET),
perhaps "localized" to a particular department or use case.  I believe
I've mentioned that my employers' various downloadable database
queries (course catalog, student rosters) are mostly structured as CSV
files, with the option to encode as UTF-8 or Shift-JIS.  I suspect
that is very common in Japanese universities because of the popularity
of Macs among educators, professionals, and students.  I don't know
about business and government, which is very Windows-oriented.  There,
I suspect Shift-JIS is the rule for text/* media, but Excel for data
tables and Word, Powerpoint, and PDF for "rich text" may be used almost
exclusively, so text/* may not be relevant in information interchange.

 > > I don't understand why this argument doesn't cut both ways
 > > equally.  If you believe that, you should also believe that the
 > > same people who won't change code to opt in also won't use a
 > > Python containing fix #1, and may not install it at all.  Doesn't
 > > that matter?
 > 
 > People already do this (e.g. Python 2.7). I don't think it should
 > matter enough to prevent us from making changes in new versions of
 > Python.

Of course it shouldn't, for the generic idea of change.  But the
argument you made is that "if we don't *force* UTF-8, users who won't
change code won't get the benefit of UTF-8".  My rebuttal is that "if
we *do* force UTF-8, those same users lose the benefit of both Python
3.6 and UTF-8."  It matters how many are in that situation, but
unfortunately we'll just have to guess about that.

 > So I guess the question here is: for organisations who have already
 > (incorrectly) assumed that the file system encoding and the active
 > code page are always the same,

Stop bashing the users, please!  This "users are stupid, we know
better" is the attitude that scares me about this proposal.  In the
enterprises I'm talking about, that is an organizational decision, not
an assumption.  (It is likely to be "close enough" to true in some
cases that lack such a policy, too.)  Or are you telling me that
Windows will change the active code page behind the users' backs even
if it's told not to do so?

Now, you can argue that few organizations actually have such policies,
and you may be right.  I don't know, and you don't know.  The damage
to Python's reputation if even *one* such gets screwed by forcing
UTF-8 will be large, though.

 > have built solid infrastructure around this using bytes (including
 > ensuring that their systems never encounter external paths in
 > glob/listdir/etc.), are currently using 3.5 and want to migrate to
 > 3.6 - is an environment variable to change back to mbcs sufficient
 > to meet their needs?

I should hope so!  As you surely know, the amount of technical
knowledge and organizational discipline required to build solid
infrastructure around non-UTF-8 encodings is great.  The set of
applications that use bytes and need the setting should be finite, and
the decision to migrate them to Python 3.6 is unlikely to be
thoughtless.

To be clear: asking users who want backward-compatible behavior to set
an environment variable does not count as a "screw" -- some will
complain, but "the defaults always suck for somebody".  Reasonable
people know that, and we can't do anything about the hysterics.

The questions then are, what are the costs and benefits to various
classes of user, and how big are those classes?  Here's how I see the
costs and benefits playing out:

1.  Organizations which behave like ".NET users" already have pure
    UTF-8 environments.  They win from Python defaulting to UTF-8,
    since Windows won't let them do it for themselves.  Now they can
    plug in bytes-oriented code written for the POSIX environment
    straight from upstream.

    Is that correct?  I.e., without transcoding, they can't use bytes
    today because their environment hands them UTF-8, but when Python
    hands those bytes to Windows, Windows assumes some encoding other
    than UTF-8?

BTW, I wonder how those organizations manage to get pure UTF-8
environments, given that Windows itself won't default to that.  Is it
just that they live in .NET and other applications that default to
producing UTF-8 text (in the rare(?) case that text is generated at
all, vs some application/* medium), and so never get near applications
that produce text in the active code page, and especially not near
applications that embed file system names encoded in a non-UTF-8
encoding in text/* media?

2.  Organizations with a mixed environment will get a different set of
    "random" failures when using bytes-oriented code from before.
    Bytes-oriented code still represents a substantial risk with the
    UTF-8 default.

3.  Organizations with pure "other" encoding environments in the short
    run will have to change Python's defaults (or use older Python
    versions, if UTF-8 is forced) for bytes-oriented code (which they
    may already have installed).
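
The kind of failure I have in mind for groups 2 and 3 can be sketched
in a few lines of Python.  This is an illustration under assumptions,
not a description of what any particular library does: the file name
and the helper decode_path are hypothetical, and cp932 stands in for
whatever the active code page happens to be.

```python
def decode_path(raw: bytes, assumed: str):
    """Decode path bytes under an assumed encoding; None if they don't fit."""
    try:
        return raw.decode(assumed)
    except UnicodeDecodeError:
        return None


# Hypothetical file name, for illustration only.
name = "報告書.txt"
legacy_bytes = name.encode("cp932")  # what a code-page-based tool would write
utf8_bytes = name.encode("utf-8")    # what a UTF-8-based tool would write

# Each byte string round-trips only under the encoding that produced it.
print(decode_path(legacy_bytes, "cp932") == name)
print(decode_path(utf8_bytes, "utf-8") == name)

# Crossing the streams either fails outright or yields a different name.
print(decode_path(legacy_bytes, "utf-8") == name)
print(decode_path(utf8_bytes, "cp932") == name)
```

Whichever default Python picks, bytes produced under the other
assumption end up in one of the last two cases.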

I guess from Nick's (and Victor's) point of view, we would also like
to know if we're going to be able to recruit more Windows-based
developers from group 1.

Overall, I think Nick's hybrid strategy is the way to go.  First, give
users the choice of 'mbcs' or 'utf-8' for the Windows encoding.  I see
no reason not to do this for locale.getpreferredencoding() at the same
time, as long as it's an option.  Then, default them to 'utf-8' for
the betas, document how to change the defaults prominently, reserve
the right to change defaults for the rcs and the release.  Now we see
how many and who screams, and what they do about the pain -- reset
defaults or mandate UTF-8 (or both for a transition period).
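
For anyone trying the betas, a quick way to see which defaults are
actually in effect is something like the following sketch.  It uses
only sys.getfilesystemencoding() and locale.getpreferredencoding();
the function name and the dictionary keys are my own labels, and how
the defaults get switched is deliberately left out, since that is
what is still under discussion.

```python
import locale
import sys


def report_encodings() -> dict:
    """Collect the encodings Python is currently using by default."""
    return {
        # Encoding used to convert between str and bytes file names.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding used for text data when none is given explicitly.
        "preferred": locale.getpreferredencoding(False),
    }


for label, encoding in report_encodings().items():
    print(label, encoding)
```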

It would be a good idea to have a short list of libraries using bytes-
oriented code and their applications that users can easily install to
try out, too.  Our working assumption has to be that few Windows users
do have them installed already, because they haven't worked to date.


Footnotes: 
[1]  Users who *happen* to be Windows users.  Windows didn't make them
do these horrible things, but the software that does is used only on
Windows.

[2]  I wonder how they manage that, given that Windows itself won't
let them set the preferred encoding to UTF-8.  Just how does .NET
manage the non-UTF-8 content that it must occasionally encounter?

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
