Steve Dower writes:

 > The Windows world is Unicode. Mostly represented in UTF-16, but UTF-8 is
 > entirely equivalent.

Sort of, yes, and not for present purposes.

AFAICS, the Windows world is mostly application/* media that require substantial developer effort to extract text from; character encoding is a minor annoyance. These are not Unicode, even if the embedded text uses the Unicode coded character set. When it comes to text/* media (including file system names), my personal experience is that non-Unicode encodings are used often, even where they're forbidden (and, ironically enough, where forbidden only by Windows users[1]).

As far as the UTF in use goes, I concede your expertise.

UTF-8 is absolutely not equivalent to UTF-16 from the point of view of developers. Passing it to Windows APIs requires decoding to UTF-16 (or from a Python developer's point of view, decoding to str and use of str APIs). That fact is what got you started on this whole proposal!

 > All MSVC users have been pushed towards Unicode for many years.

But that "push" is due to the use of UTF-16-based *W APIs and deprecation of ACP-based *A APIs, right? The input to *W APIs must be decoded from all text/* content "out there", including UTF-8 content. I don't see evidence that users have been pushed toward *UTF-8* in that statement; they may be decoding from something else. Unicode != UTF-8 for our purposes!

In any case, I suspect a lot of people use Python to avoid C, and so existing Python users may not be affected by MSVC "pressure".

 > The .NET Framework has defaulted to UTF-8

Default != enforce, though. Do you know that almost nobody changes the default, and that behavior is fairly uniform across different classes of organization (specifically by language)? Or did you mean "enforce"?

 > its entire existence. The use of code pages has been discouraged
 > for decades.
 > We're not going first :)

The fact that a framework, which by definition provides a world-within-a-world, can insist on UTF-8 from the start is very different from a generic programming language, which has deliberately provided multiscript capability for decades. People who buy in to .NET do so because the disadvantages (which may include character encoding conversion at the boundary, or "purification" of the environment to use only UTF-8) are outweighed by both the individual features of the framework and their packaging into a consistent whole. This is closely related to my idea about "effective monopoly IT providers".

In contrast, people who use Python may very well have done so to *avoid* the Unicode strictures of .NET (or at least consider it a convenience compared to changing user behavior to conform to .NET), perhaps "localized" to a particular department or use case.

I believe I've mentioned that my employers' various downloadable database queries (course catalog, student rosters) are mostly structured as CSV files, with the option to encode as UTF-8 or Shift-JIS. I suspect that is very common in Japanese universities because of the popularity of Macs among educators, professionals, and students. I don't know about business and government, which are very Windows-oriented. There, I suspect Shift-JIS is the rule for text/* media, but Excel for data tables and Word, PowerPoint, and PDF for "rich text" may be used almost exclusively, so text/* may not be relevant in information interchange.

 > > I don't understand why this argument doesn't cut both ways
 > > equally. If you believe that, you should also believe that the
 > > same people who won't change code to opt in also won't use a
 > > Python containing fix #1, and may not install it at all. Doesn't
 > > that matter?
 >
 > People already do this (e.g. Python 2.7). I don't think it should
 > matter enough to prevent us from making changes in new versions of
 > Python.

Of course it shouldn't, for the generic idea of change. But the argument you made is that "if we don't *force* UTF-8, users who won't change code won't get the benefit of UTF-8". My rebuttal is that "if we *do* force UTF-8, those same users lose the benefit of both Python 3.6 and UTF-8." It matters how many are in that situation, but unfortunately we'll just have to guess about that.

 > So I guess the question here is: for organisations who have already
 > (incorrectly) assumed that the file system encoding and the active
 > code page are always the same,

Stop bashing the users, please! This "users are stupid, we know better" is the attitude that scares me about this proposal. In the enterprises I'm talking about, that is an organizational decision, not an assumption. (It is likely to be "close enough" to true in some cases that lack such a policy, too.) Or are you telling me that Windows will change the active code page behind the users' backs even if it's told not to do so?

Now, you can argue that few organizations actually have such policies, and you may be right. I don't know, and you don't know. The damage to Python's reputation if even *one* such organization gets screwed by forcing UTF-8 will be large, though.

 > have built solid infrastructure around this using bytes (including
 > ensuring that their systems never encounter external paths in
 > glob/listdir/etc.), are currently using 3.5 and want to migrate to
 > 3.6 - is an environment variable to change back to mbcs sufficient
 > to meet their needs?

I should hope so! As you surely know, the amount of technical knowledge and organizational discipline required to build solid infrastructure around non-UTF-8 encodings is great. The set of applications that use bytes and need the setting should be finite, and the decision to migrate them to Python 3.6 is unlikely to be thoughtless.

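To make the opt-out concrete, here is a minimal sketch of the kind of switch being discussed. The variable name `EXAMPLE_LEGACY_FS_ENCODING` and the function `choose_fs_encoding` are invented for illustration; this is not the interface anyone has actually proposed, just the shape of "default to UTF-8, let a site say otherwise".

```python
import os

# Hypothetical variable name, invented for this sketch only.
LEGACY_SWITCH = "EXAMPLE_LEGACY_FS_ENCODING"


def choose_fs_encoding(environ=None):
    """Return 'utf-8' unless the (hypothetical) legacy switch is set."""
    if environ is None:
        environ = os.environ
    # Any non-empty value opts back in to the code-page-based 'mbcs' codec.
    return "mbcs" if environ.get(LEGACY_SWITCH) else "utf-8"


print(choose_fs_encoding({}))                    # nothing set: new default
print(choose_fs_encoding({LEGACY_SWITCH: "1"}))  # site has opted out
```

The point of the sketch is only that the decision is made once, per environment, by whoever administers it, and not by each piece of bytes-oriented code.
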
To be clear: asking users who want backward-compatible behavior to set an environment variable does not count as a "screw" -- some will complain, but "the defaults always suck for somebody". Reasonable people know that, and we can't do anything about the hysterics.

The questions then are, what are the costs and benefits to various classes of user, and how big are those classes? Here's how I see the costs and benefits playing out:

1.  Organizations which behave like ".NET users" already have pure UTF-8 environments. They win from Python defaulting to UTF-8, since Windows won't let them do it for themselves. Now they can plug in bytes-oriented code written for the POSIX environment straight from upstream.

    Is that correct? I.e., without transcoding, they can't now use bytes because their environment hands them UTF-8, but when Python hands those bytes to Windows, it assumes anything but UTF-8?

    BTW, I wonder how those organizations manage to get pure UTF-8 environments, given that Windows itself won't default to that. Is it just that they live in .NET and other applications that default to producing UTF-8 text (in the rare(?) case that text is generated at all, vs some application/* medium), and so never get near applications that produce text in the active code page, and especially not near applications that embed file system names encoded in a non-UTF-8 encoding in text/* media?

2.  Organizations with a mixed environment will get a different set of "random" failures than before when using bytes-oriented code. Bytes-oriented code still represents a substantial risk with the UTF-8 default.

3.  Organizations with pure "other" encoding environments will, in the short run, have to change Python's defaults (or use older Python versions, if UTF-8 is forced) for bytes-oriented code (which they may already have installed).

I guess from Nick's (and Victor's) point of view, we would also like to know if we're going to be able to recruit more Windows-based developers from group 1.

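The mismatch behind these three cases can be seen without touching a file system at all. A rough sketch, assuming a Japanese file name (made up here) and using Python's 'cp932' codec, Microsoft's Shift-JIS variant, as a stand-in for the active code page:

```python
name = "日本語.csv"  # made-up file name

as_utf8 = name.encode("utf-8")       # what a "pure UTF-8" environment holds
as_cp932 = name.encode("cp932")      # what a Shift-JIS code page environment holds
as_utf16 = name.encode("utf-16-le")  # the UTF-16 form used by the *W APIs

# All three byte strings differ, so bytes can't be passed along blindly.
assert len({as_utf8, as_cp932, as_utf16}) == 3

# Code-page bytes read as UTF-8 fail outright ...
try:
    as_cp932.decode("utf-8")
    cp932_is_valid_utf8 = True
except UnicodeDecodeError:
    cp932_is_valid_utf8 = False

# ... while UTF-8 bytes read with the code page come out as something else.
misread = as_utf8.decode("cp932", errors="replace")

# Transcoding between the two always goes through str.
roundtrip = as_utf8.decode("utf-8").encode("utf-16-le")
```

Whichever encoding Python assumes for bytes, an environment holding the other one gets either an exception or a silently wrong name, which is why the size of each group matters.
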
Overall, I think Nick's hybrid strategy is the way to go. First, give users the choice of 'mbcs' or 'utf-8' for the Windows encoding. I see no reason not to do this for locale.getpreferredencoding() at the same time, as long as it's an option. Then, default them to 'utf-8' for the betas, document how to change the defaults prominently, and reserve the right to change defaults for the rcs and the release. Now we see how many scream and who, and what they do about the pain -- reset defaults or mandate UTF-8 (or both for a transition period).

It would be a good idea to have a short list of libraries using bytes-oriented code and their applications that users can easily install to try out, too. Our working assumption has to be that few Windows users have them installed already, because they haven't worked to date.

Footnotes:
[1]  Users who *happen* to be Windows users. Windows didn't make them do these horrible things, but the software that does is used only on Windows.

[2]  I wonder how they manage that, given that Windows itself won't let them set the preferred encoding to UTF-8. Just how does .NET manage the non-UTF-8 content that it must occasionally encounter?