On Sat, Jan 23, 2021 at 11:34 PM Stephen J. Turnbull
<turnbull.stephen...@u.tsukuba.ac.jp> wrote:
>  > I'd rather focus on just moving to UTF-8 as the default, rather
>  > than bringing in a new function - especially with such a confusing
>  > name.
>
> I expect there are several bodies of users who will experience that as
> quite obnoxious for a long time to come.  I *still* see a ton of stuff
> that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China
> gb18030 isn't just a good idea, it's the law.  (OK, the precise
> statement of the law is "must support", not "must use", but my Chinese
> students all default to GB.)

But "UTF-8 as the default if you don't specify an encoding" doesn't
stop you from using all those other encodings. The only change is
that, if you don't specify an encoding, you get a cross-platform
consistent default that can be easily described, rather than one which
depends on system settings.
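
To make that concrete, here's a small sketch (the file name and sample
text are made up for illustration). Passing encoding= explicitly
behaves the same whatever the default happens to be:

```python
import os
import tempfile

# Sample text for illustration only; any Shift JIS-encodable text works.
text = "日本語のテキスト"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sjis.txt")

    # Write with an explicitly chosen national encoding.
    with open(path, "w", encoding="shift_jis") as f:
        f.write(text)

    # Read it back with the same explicit encoding; the default
    # encoding is never consulted on either side.
    with open(path, encoding="shift_jis") as f:
        round_tripped = f.read()

    # The bytes on disk are Shift JIS, not UTF-8.
    with open(path, "rb") as f:
        raw = f.read()
```

Only the calls that omit encoding= would see any difference from a
changed default.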

> The problem is that these users use some software that will create
> text in a national language encoding by default and other software
> that uses UTF-8 by default.  So I guess Naoki's hope is that "when I'm
> processing Microsoft/Oracle-generated data, I use 'open_text', when
> it's local software I use 'open'" becomes an easy and natural response
> in such environments.

Exactly, so no single default will work.

Is there an easy way to say open("filename", encoding="use my system
default")? Currently encoding=None does that, and maybe that can be
retained (just with the parameter's default changing from None to
"utf-8"), or maybe some other keyword can be used. But that should
cover the situations where you specifically *want* a
platform-dependent selection.
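
For what it's worth, here is roughly how "use my system default" can
be spelled explicitly today. This is only a sketch: open_system_default
is a made-up name, not a proposed API, and I'm assuming
locale.getpreferredencoding(False) is a fair stand-in for "the system
default".

```python
import locale


def open_system_default(path, mode="r", **kwargs):
    """Hypothetical helper: open a text file using the platform's
    preferred encoding, chosen explicitly rather than implicitly."""
    # getpreferredencoding(False) reports the locale's preferred
    # encoding without calling setlocale() as a side effect.
    encoding = locale.getpreferredencoding(False)
    return open(path, mode, encoding=encoding, **kwargs)
```

The point is that the platform-dependent choice is visible at the call
site instead of hiding behind an omitted argument.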

>  > What exactly are the blockers on making open(fn) use UTF-8 by
>  > default?
>
> Backward incompatibility with just about every script in existence?

Or for a large number of them, sudden cross-platform compatibility
that they didn't previously have. This is *fixing a bug* for many
scripts.

>  > Can the proposals be written with that as the ultimate goal (even if
>  > it's going to take X versions and multiple deprecation phases), rather
>  > than aiming for a messy goal where people aren't sure which function
>  > to use?
>
> The problem is that on Windows there are a lot of installations that
> continue to use non-UTF-8 encodings enough that users set their
> preferred encoding that way.  I guess that folks where the majority of
> their native-language alphabet is drawn from ASCII are by now almost
> all using UTF-8 by default, but this is not so for East Asians (who
> almost all still use a mixture of several encodings every day because
> email still often defaults to national standard encodings).  I can't
> speak to Cyrillic, Hebrew, Arabic, Indic languages, but I wouldn't be
> surprised if they're somewhere in the middle.

So Windows is being a pain in the behind, once again, because it
doesn't move forward. File names on Mac OS and most Linux systems will
be in UTF-8, regardless of your chosen language. Why stick to other
encodings as the default?

(I repeat: I am NOT advocating abolishing support for all other
encodings. The ONLY thing I want to see is that UTF-8 becomes the
default.)

> Naoki can document that "open(..., encoding='...')" is strongly
> preferred to 'open_text'.  Maybe a better name is "open_utf8", to
> discourage people who want to use non-default encodings, or
> programmatically chosen encodings, in that function.

TBH I don't think a separate built-in is of value here, but perhaps
it'd be beneficial as an alternative to the wall-of-text help info
that open() has. But I do rather like Random's and Steve's suggestion
that the alternate function be specifically documented as magic. It'd
actually tie in very nicely with a change of default: open() does what
it's explicitly told, and has cross-platform defaults, but
open_sesame() probes the file to try to guess at its encoding,
attempting to use a platform-specific eight-bit encoding if
applicable. It'd "just work" for reading most text files, regardless
of their source, as long as they came from this current computer. (All
bets are off anyway if they came from some other system and are in an
eight-bit encoding.)
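
Something like this rough sketch is what I have in mind. Everything
here is hypothetical: guess_encoding and this open_sesame body are my
own illustration, not an existing or proposed implementation, and the
heuristic (BOM, then strict UTF-8, then the locale's preferred
encoding) is just one plausible choice.

```python
import codecs
import locale


def guess_encoding(raw):
    """Hypothetical heuristic: pick an encoding for a bytes sample."""
    # A UTF-8 byte order mark is a strong hint; utf-8-sig strips it.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # Strict decoding: text in an eight-bit or national encoding
        # rarely happens to be valid UTF-8.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to whatever this machine's locale prefers.
        return locale.getpreferredencoding(False)
    return "utf-8"


def open_sesame(path):
    """Hypothetical 'magic' opener for reading text files."""
    # Reads the whole file once to probe it; a real version would
    # probably sample only a prefix.
    with open(path, "rb") as f:
        raw = f.read()
    return open(path, "r", encoding=guess_encoding(raw))
```

If the fallback guess is wrong, the read can still fail or produce
mojibake, which is the "all bets are off" case.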

ChrisA
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/PKUN6TDU6R3CDX2LCI34DF5CCLGHMVIX/
Code of Conduct: http://python.org/psf/codeofconduct/
