subject:"\[Python\-ideas\] Re\: Adding `open_text\(\)` builtin function. \(relating to PEP 597\)"

Chris Angelico writes:

 > Can anyone give an example of a current system encoding (ie one that
 > is likely to be the default currently used by open()) that can have
 > byte values below 128 which do NOT mean what they would mean in ASCII?
 > In other words, is it possible to read in a section of a file, think
 > that it's ASCII, and then find that you decoded it wrongly?

Japanese Shift JIS, as mentioned by Richard.  The Japanese just
redefine the glyph used for Windows paths and character escapes to be
the yen sign.  So it's a total muddle, because they also use that for
the yen sign.  They also use a broken vertical bar for the pipe
symbol, but the visual similarity there is so strong that you have to
know a *lot* of computational Japanese to realize that they're
different characters (they are, in JIS, but nobody cares -- there's
almost never a reason to use both).

 > I'm assuming here that there is a *single* default system encoding,
 > meaning that the automatic handler has only three cases to worry
 > about: UTF-16 (with BOM), UTF-8 (including pure ASCII), and the system
 > encoding.

Sure that handles a lot of cases ... but the vast majority are already
handled with just the system encoding and UTF-8.  In my experience the
UTF-16 cases are not going to be the majority of what's left.  YMMV.

 > Right, but as long as there's only one system encoding, that's not
 > our problem. If you're on a Greek system and you want to decode
 > ISO-8859-9 text, you have to state that explicitly. For the
 > situations where you want heuristics based on byte distributions,
 > there's always chardet.

But that's the big question.  If you're just going to fall back to
chardet, you might as well start there.  No?  Consider: if 'open'
detects the encoding for you, *you can't find out what it is*.  'open'
has no facility to tell you!

As somebody else pointed out, if you're writing a text editor,
autodetection makes a lot of sense.  You just provide a facility for
the user to chose something different and reread the file.  But if
you're running non-interactive, it's much harder to recover -- and
'open' can't do it for you.

 > > Program source code where the higher-level functions (likely to
 > > contain literal strings) come late in the file are frequently
 > > misdetected based on the earlier bytes.
 > 
 > Yup; and the real question is whether anything would have been decoded
 > incorrectly.

If I recall correctly there are several Latin-1 characters in UTF-8
which are plausible Windows 125x digraphs.  So, yes, it's quite possible.

 > If you read in a bunch of ASCII-only text and yield it to
 > the app, and then come across something that proves that the file is
 > not UTF-8, then as far as I am aware, you won't have to un-yield any
 > of the previous text - it'll all have been correctly decoded.

Not if it's UTF-16.  And again, if you put the detection logic in
'open', once you've yielded anything to the main logic *it's too late
to change your mind*.

 > In theory, UTF-16 without a BOM can consist entirely of byte values
 > below 128,

It's not just theory, it's my life.  62/80 of the Japanese "hiragana"
syllabary is composed of 2 printing ASCII characters (including SPC).
A large fraction of the Han ideographs satisfy that condition, and I
wouldn't be surprised if a majority of the 1000 most common ones do.
(Not a good bet because half of the ideographs have a low byte > 127,
but the order of characters isn't random, so if you get a couple of
popular radicals that have 50 or so characters in a group in that
range, you'd be much of the way there.)

 > But there's no solution to that,

Well, yes, but that's my line. ;-)
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/CIBX3EFFW2OMFUXQ4KPUJ4OZIYMQK5PH/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Richard Damon

On 1/24/21 6:00 AM, Chris Angelico wrote:
> Sorry, let me clarify.
>
> Can anyone give an example of a current system encoding (ie one that
> is likely to be the default currently used by open()) that can have
> byte values below 128 which do NOT mean what they would mean in ASCII?
> In other words, is it possible to read in a section of a file, think
> that it's ASCII, and then find that you decoded it wrongly?

EBCDIC is one big option.
There are also some National Character sets which change a couple of the lower 
128 characters for use with characters that language needed. (This was the 
cause of adding Trigraphs to C, to provide a way enter those characters on 
systems that didn't have those characters.

One common example was a Japanese character set that replaced \ with the Yen 
sign (and a few others) and then used some above 128 codes for multi-byte 
sequences. Users of such systems just got used to use the Yen sign as the path 
separator. 

The EBCDIC cases would likely be well know on those systems, and planned for. 
Having a system with a few of the lower 128 being substituted for could be a 
bigger surprise for a system.

-- 
Richard Damon
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/24PLMY635JZAS32BY2G5YVHBXTQPEFE5/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread 2QdxY4RzWzUUiLuE

On 2021-01-25 at 00:29:41 +1100,
Steven D'Aprano  wrote:

> On Sat, Jan 23, 2021 at 03:24:12PM +, Barry Scott wrote:
> 
> > I think that you are going to create a bug magnet if you attempt to auto
> > detect the encoding.
> > 
> > First problem I see is that the file may be a pipe and then you will block
> > until you have enough data to do the auto detect.
> 
> Can you use `open('filename')` to read a pipe?

Yes.  Named pipes are files, at least on POSIX.

And no.  Unnamed pipes are identified by OS-level file descriptors, so
you can't open them with open('filename'), but you can open them with
os.fdopen.  Once opened, such data sources "should be" interchangeable.

> Is blocking a problem in practice? If you try to open a network file,
> that could block too, if there are network issues. And since you're
> likely to follow the open with a read, the read is likely to block. So
> over all I don't think that blocking is an issue.

If open blocks too many bytes, then my application never gets to respond
unless enough data comes through the pipe.  Consider protocols like FTP
and SMTP, where commands and responses are often only handfuls of bytes
long.  OTOH, if I'm opening a file (or a pipe) for such a protocol, then
both ends should know the encoding ahead of time and there's no need to
guess.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/55ZMKKQES3EYMXZFYPHOT3WYOKXMUG3Q/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Chris Angelico

On Mon, Jan 25, 2021 at 12:33 AM Steven D'Aprano  wrote:
>
> On Sat, Jan 23, 2021 at 03:24:12PM +, Barry Scott wrote:
>
> > I think that you are going to create a bug magnet if you attempt to auto
> > detect the encoding.
> >
> > First problem I see is that the file may be a pipe and then you will block
> > until you have enough data to do the auto detect.
>
> Can you use `open('filename')` to read a pipe?

Yes. You can even use it with stdin:

>>> open("/proc/self/fd/0").read(1)
a
'a'

The second line was me typing something, even though I was otherwise
at the REPL.

> Is blocking a problem in practice? If you try to open a network file,
> that could block too, if there are network issues. And since you're
> likely to follow the open with a read, the read is likely to block. So
> over all I don't think that blocking is an issue.

Definitely could be a problem if you read too much just for the sake
of autodetection. It needs to be possible to do everything with an
absolute minimum of reading.

> > Second problem is that the first N bytes are all in ASCII and only later
> > do you see Windows code page signature (odd lack of utf-8 signature).
>
> UTF-8 is a strict superset of ASCII, so if the file is actually
> ASCII, there is no harm in using UTF-8.
>
> The bigger issue is if you have N bytes of pure ASCII followed by some
> non-UTF superset, such as one of the ISO-8859-* encodings. So you end up
> detecting what you think is ASCII/UTF-8 but is actually some legacy
> encoding. But if N is large, say 512 bytes, that's unlikely in practice.

There's no problem if you think it's ASCII, so the only problem would
be if you start thinking that it's UTF-8 and then discover that it
isn't. The scheme used by UTF-8 is designed such that this is highly
unlikely with random data or actual text in an eight-bit encoding, so
it's most likely to be broken UTF-8 than legit ISO-8859-X.

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/MBBCCHLFHFHYPCS54AKOVOCA4ELBFNPD/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Steven D'Aprano

On Sat, Jan 23, 2021 at 03:24:12PM +, Barry Scott wrote:

> I think that you are going to create a bug magnet if you attempt to auto
> detect the encoding.
> 
> First problem I see is that the file may be a pipe and then you will block
> until you have enough data to do the auto detect.

Can you use `open('filename')` to read a pipe?

Is blocking a problem in practice? If you try to open a network file, 
that could block too, if there are network issues. And since you're 
likely to follow the open with a read, the read is likely to block. So 
over all I don't think that blocking is an issue.

> Second problem is that the first N bytes are all in ASCII and only later
> do you see Windows code page signature (odd lack of utf-8 signature).

UTF-8 is a strict superset of ASCII, so if the file is actually 
ASCII, there is no harm in using UTF-8.

The bigger issue is if you have N bytes of pure ASCII followed by some 
non-UTF superset, such as one of the ISO-8859-* encodings. So you end up 
detecting what you think is ASCII/UTF-8 but is actually some legacy 
encoding. But if N is large, say 512 bytes, that's unlikely in practice.

> > That auto-detection behaviour could be enough to differentiate it from 
> > the regular open(), thus solving the "but in ten years time it will be 
> > redundant and will need to be deprecated" objection.
> > 
> > Having said that, I can't say I'm very keen on the name "open_text", but 
> > I can't think of any other bikeshed colour I prefer.
> 
> Given the the functions purpose is to open unicode text use a name that
> reflects that it is the encoding that is set not the mode (binary vs. text).
> 
> open_unicode maybe?

I guess that depends on whether the auto-detection is intended to 
support non-Unicode legacy encodings or not.

> If you are teaching open_text then do you also need to have open_binary?

No. There are no frustrating, difficult, platform-specific encoding 
issues when reading binary files. Bytes are bytes.

-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/MVX5PNZM7W4I42XDSACOQTW3YRJPRQHI/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Random832

On Sat, Jan 23, 2021, at 22:43, Matt Wozniski wrote:
> 1. Deprecate calling `open` for text mode (the default) unless an 
> `encoding=` is specified,

I have a suggestion, if this is going to be done:

If the third positional argument to open is a string, accept it as encoding 
instead of buffering. Maybe even allow the fourth to be errors.

It might be worthwhile to consider making the other arguments keyword-only - 
are they ever used positionally in real-world code?
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HX37WBOP5PSMTVVNK7FVHLMEEGW4B2VX/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Steven D'Aprano

On Sun, Jan 24, 2021 at 10:00:47PM +1100, Chris Angelico wrote:
> On Sun, Jan 24, 2021 at 9:13 PM Stephen J. Turnbull
>  wrote:
> >
> > Chris Angelico writes:
> >
> >  > Can anyone give an example of a current in-use system encoding that
> >  > would have [ASCII bytes in non-ASCII text]?
> >
> > Shift JIS, Big5.  (Both can have bytes < 128 inside multibyte
> > characters.)  I don't know if Big5 is still in use as the default
> > encoding anywhere, but Shift JIS is, although it's decreasing.
> 
> Sorry, let me clarify.
> 
> Can anyone give an example of a current system encoding (ie one that
> is likely to be the default currently used by open()) that can have
> byte values below 128 which do NOT mean what they would mean in ASCII?
> In other words, is it possible to read in a section of a file, think
> that it's ASCII, and then find that you decoded it wrongly?

I believe that IBM mainframes such as the Z series still use 
EBCDIC. Python for z/OS has EBCDIC/UTF interoperability as a selling 
point. I think that just means the codec module :-)

https://www.ibm.com/products/open-enterprise-python-zos


-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/COR53MJK4URT77P77SRYMQYS6ZLHYMEU/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Chris Angelico

On Sun, Jan 24, 2021 at 9:13 PM Stephen J. Turnbull
 wrote:
>
> Chris Angelico writes:
>
>  > Can anyone give an example of a current in-use system encoding that
>  > would have [ASCII bytes in non-ASCII text]?
>
> Shift JIS, Big5.  (Both can have bytes < 128 inside multibyte
> characters.)  I don't know if Big5 is still in use as the default
> encoding anywhere, but Shift JIS is, although it's decreasing.

Sorry, let me clarify.

Can anyone give an example of a current system encoding (ie one that
is likely to be the default currently used by open()) that can have
byte values below 128 which do NOT mean what they would mean in ASCII?
In other words, is it possible to read in a section of a file, think
that it's ASCII, and then find that you decoded it wrongly?

> For both of those once you encounter a non-ASCII byte you can just
> switch over, and none of the previous text was mis-decoded.

Good to know, so these two won't be a problem.

I'm assuming here that there is a *single* default system encoding,
meaning that the automatic handler has only three cases to worry
about: UTF-16 (with BOM), UTF-8 (including pure ASCII), and the system
encoding.

> But
> that's only if you *know* the language was Japanese (respectively
> Chinese).  Remember, there is no encoding that can be distinguished
> from ISO 8859-1 (and several other Latin encodings) simply based on
> the bytes found, since it uses all 256 bytes.

Right, but as long as there's only one system encoding, that's not our
problem. If you're on a Greek system and you want to decode ISO-8859-9
text, you have to state that explicitly. For the situations where you
want heuristics based on byte distributions, there's always chardet.

>  > How likely is it that you'd get even one line of text that purports
>  > to be ASCII?
>
> Program source code where the higher-level functions (likely to
> contain literal strings) come late in the file are frequently
> misdetected based on the earlier bytes.

Yup; and the real question is whether anything would have been decoded
incorrectly. If you read in a bunch of ASCII-only text and yield it to
the app, and then come across something that proves that the file is
not UTF-8, then as far as I am aware, you won't have to un-yield any
of the previous text - it'll all have been correctly decoded.

In theory, UTF-16 without a BOM can consist entirely of byte values
below 128, and that's an absolute pain. But there's no solution to
that, other than demanding a BOM (or hoping that the first few
characters are all ASCII, so you can see "H\0e\0l\0l\0o\0", which I
wouldn't call reliable, although your odds probably aren't that bad in
real-world cases).

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GZPXWOYPSAE733ZMTKFBK26C2LVCNOQU/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

Chris Angelico writes:

 > Can anyone give an example of a current in-use system encoding that
 > would have [ASCII bytes in non-ASCII text]?

Shift JIS, Big5.  (Both can have bytes < 128 inside multibyte
characters.)  I don't know if Big5 is still in use as the default
encoding anywhere, but Shift JIS is, although it's decreasing.

For both of those once you encounter a non-ASCII byte you can just
switch over, and none of the previous text was mis-decoded.  But
that's only if you *know* the language was Japanese (respectively
Chinese).  Remember, there is no encoding that can be distinguished
from ISO 8859-1 (and several other Latin encodings) simply based on
the bytes found, since it uses all 256 bytes.

 > How likely is it that you'd get even one line of text that purports
 > to be ASCII?

Program source code where the higher-level functions (likely to
contain literal strings) come late in the file are frequently
misdetected based on the earlier bytes.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZB2LM3KYLQ34DHA276SPZA73BHJBRQMF/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

Matt Wozniski writes:

 > Rather than introducing a new `open_utf8` function, I'd suggest the
 > following:
 > 
 > 1. Deprecate calling `open` for text mode (the default) unless an
 > `encoding=` is specified,

For that, we should have a sentinel for "system default encoding" (as
you acknowledge, but I want to foot-stomp it).  The current dance to
get that is quite annoying.

 > I think a __future__ import [of 'open_text' by some name] solves
 > the problem better than introducing a new function would.

Only if you redefine the problem.  If the problem is casual coders who
want a quick-and-dirty ready-to-bake function to read UTF-8 when their
default encodings are something else, then it's builtin or Just Don't
-- teach them to copy-paste "encoding='utf-8'" FTW.  I'm perfectly
happy with "Just Don't" followed by "It's Time to Work on UTF-8 by
Default".  You'll have to ask Naoki how he feels about that.

Your proposal (1. above) is an interesting one for that.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/QQKQ5HYTR2RLVGUPH44I3QVOZGOD7QEK/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

Cameron Simpson writes:

 > I thought I'd seen [UTF-16 BOM] on Windows text files within the
 > last year or so (I don't use Windows often, so this is happenstance
 > from receiving some data, not an observation of the Windows
 > ecosystem; my recollection is that it was a UTF16 CSV file.)

OK; my experience is limited.

 > But BOMs may be commonplace. This isn't a text file example,

I don't care at all about BOMs in specialized protocols in this
thread.  This thread is about 'open'.

 > I do not consider the BOM dead, and it is so cheap to recognise
 > that not bothering to do so seems almost mean sprited.

Not if you view it from the point of view of cognitive burden on
casual coders.  See my reply to Guido.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GOYUG5WDUQDQTKUZN6V4EDFH6U23656R/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

Guido van Rossum writes:

> I have definitely seen BOMs written by Notepad on Windows 10.

I'm not clear on what circumstances we care if a UTF-8 file has or
doesn't have a UTF-8 signature. Most software doesn't care, it just
reads it and spits it back out if it's there and hasn't been edited
out.

If people are seeing UTF-16 BOMs, that may be worth detecting,
depending on how often and how much trouble it is to deal with them.
I'm just saying that I never see them. I was pretty careful about
saying that my sample is quite restricted.

However ...

> Why can’t the future be that open() in text mode guesses the
> encoding?

The medium-term future is UTF-8 in all UIs and public APIs, except for
archivists. I think we all agree on that.

There are two issues with encoding guessing. The statistically
unimportant one (at least for UTFs) is that guessing is guessing. It
will get it wrong. The people who want guessing are mostly people who
will be hurt most by wrong guesses.

Second, and a real issue for design AFAICS: if you introduce detection
of other encodings to 'open', the programmer may need to (1) discover
that encoding in order to match it on output (open does not return
that), or (2) choose the correct encoding on output, which may or may
not be the detected one depending on what the next software in the
pipeline expects. At that point "in the face of ambiguity" really
does bind, "although practicality" notwithstanding. I'm not sure that
putting detection into 'open' solves any problems, it just pushes them
into other parts of the code.

Remark: As I understand it, Naoki's proposal is about the casual coder
in a monolingual environment where either defaulting to
getpreferredencoding DTRTs or they need UTF-8 because some engineer
decided "UTF-8 is the future, and in my project the future is now!"
I don't think it's intended to be more general than that, but you'll
have to ask him about that.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/ZRUF34M5QWQKCDCMEMJOAIIONISCMZIJ/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Matt Wozniski

On Sat, Jan 23, 2021 at 10:51 PM Chris Angelico  wrote:

> On Sun, Jan 24, 2021 at 2:46 PM Matt Wozniski  wrote:
> > 2. At the same time as the deprecation is announced, introduce a new
> __future__ import named "utf8_open" or something like that, to opt into the
> future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a
> file in text mode and no explicit encoding is specified.
> >
> > I think a __future__ import solves the problem better than introducing a
> new function would.
>
> Note that, since this doesn't involve any language or syntax changes,
> a regular module import would work here - something like "from
> utf8mode import open", which would then shadow the builtin. Otherwise
> no change to your proposal - everything else works exactly the same
> way.
>

True - that's an even better idea. That even allows it to be wrapped in a
try/except ImportError, allowing someone to write code that's backwards
compatible to versions before the new function is introduced. Though it
does mean that the new function will need to stick around, even though it
will eventually be identical to the builtin open() function.

That would also allow the option of introducing a locale_open as well,
which would behave as though encoding=locale.getpreferredencoding(False) is
the default encoding for files opened in text mode. I can imagine putting
both functions in io, and allowing the user to silence the deprecation
warning by either opting into the new behavior:

from io import utf8_open as open

or explicitly declaring their desire for the legacy behavior:

from io import locale_open as open

~Matt
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ETJ6BADTVM5IICDLICGFIWQDMRDD34XS/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sun, Jan 24, 2021 at 2:46 PM Matt Wozniski  wrote:
> 2. At the same time as the deprecation is announced, introduce a new 
> __future__ import named "utf8_open" or something like that, to opt into the 
> future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a 
> file in text mode and no explicit encoding is specified.
>
> I think a __future__ import solves the problem better than introducing a new 
> function would.

Note that, since this doesn't involve any language or syntax changes,
a regular module import would work here - something like "from
utf8mode import open", which would then shadow the builtin. Otherwise
no change to your proposal - everything else works exactly the same
way.

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HFIMUG2JVQ2QULCWEHSXAEALSQOAY2TL/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Matt Wozniski

On Sat, Jan 23, 2021 at 9:22 PM Inada Naoki  wrote:

> On Sun, Jan 24, 2021 at 10:17 AM Guido van Rossum 
> wrote:
> >
> > I have definitely seen BOMs written by Notepad on Windows 10.
> >
> > Why can’t the future be that open() in text mode guesses the encoding?
>
> I don't like guessing. As a Japanese, I have seen many mojibake caused
> by the wrong guess.
> I don't think guessing encoding is not a good part of reliable software.
>

I agree that guessing encodings in general is a bad idea and is an avenue
for subtle localization issues - bad things will happen when it guesses
wrong, and it will lead to code that works properly on the developer's
machine and fails for end users. It makes sense for a text editor to try to
guess, because showing the user something is better than nothing (and if it
guesses wrong the user can easily see that, and perhaps take some manual
action to correct it). It does not make sense for a programming language to
guess, because the user cannot easily detect or correct an incorrect guess,
and mistakes will tend to be propagated rather than caught.

On the other hand, if we add `open_utf8()`, it's easy to ignore BOM:
>

Rather than introducing a new `open_utf8` function, I'd suggest the
following:

1. Deprecate calling `open` for text mode (the default) unless an
`encoding=` is specified, and 3 years after deprecation change the default
encoding for `open` to "utf-8-sig" for reading and "utf-8" for writing (to
ignore a BOM if one exists when reading, but to not create a BOM when
writing).
2. At the same time as the deprecation is announced, introduce a new
__future__ import named "utf8_open" or something like that, to opt into the
future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a
file in text mode and no explicit encoding is specified.

I think a __future__ import solves the problem better than introducing a
new function would. Users who already have a UTF-8 locale (the majority of
users on the majority of platforms) could simply turn on the new __future__
import in any files where they're calling open() with no change in
behavior, suppressing the deprecation warning. Users who have a non-UTF-8
locale and want to keep opening text files in that non-UTF-8 locale by
default can add encoding=locale.getpreferredencoding(False) to retain the
old behavior, suppressing the deprecation warning. And perhaps we could
make a shortcut for that, like encoding="locale".

~Matt
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/UACU527OLD6DLI5URTMALWVOSPEKKADA/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sun, Jan 24, 2021 at 01:32:28AM +, MRAB wrote:
> On 2021-01-24 01:14, Guido van Rossum wrote:
> >I have definitely seen BOMs written by Notepad on Windows 10.
> >
> >Why can’t the future be that open() in text mode guesses the encoding?
> >
> "In the face of ambiguity, refuse the temptation to guess."

"Although practicality beats purity."

The Zen is like scripture: there's a koan for any position you wish to 
take :-)

If you want to be pedantic, and I certainly do *wink*, providing any 
default for the encoding parameter is a guess. The encoding of all text 
files is ambiguous (the intended encoding is metadata which is not 
recorded in the file format). Most text files on Linux and Mac OS use 
UTF-8, and many on Windows too, but not *all* so setting the default to 
UTF-8 is just a guess.

I understand that there are good heuristics for auto-detection of 
encodings which are reliable and used in many other software. If 
auto-detection is a "guess", its an *educated* guess and not much 
different from the status quo, which usually guesses correctly on Linux 
and Mac but often guesses wrongly on Windows. This proposal is to 
improve the quality of the guess by inspecting the file's contents.

For example, a file opened in text mode where every second character is 
a NULL is *almost certainly* UTF-16. The chances that somebody actually 
intended to write:

H\0e\0l\0l\0o\O \OW\0o\0r\0l\0d\0

rather than "Hello World" is negligible.

Before we consider changing the default encoding to "auto-detect", I 
would like to see some estimate of how many UTF-8 encoded files will be 
misclassified as something else. That is, if we make this change, how 
much software that currently guesses UTF-8 correctly (the default 
encoding is the actual intended encoding) will break because it guesses 
something else? That surely won't happen with mostly-ASCII files, but I 
suppose it could happen with some non-English languages?

-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/U2T4JSKOUGSEXVVW3Y7LTXR7HQ5UJUKI/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sun, Jan 24, 2021 at 10:17 AM Guido van Rossum  wrote:
>
> I have definitely seen BOMs written by Notepad on Windows 10.
>
> Why can’t the future be that open() in text mode guesses the encoding?

I don't like guessing. As a Japanese, I have seen many mojibake caused
by the wrong guess.
I don't think guessing encoding is not a good part of reliable software.

On the other hand, if we add `open_utf8()`, it's easy to ignore BOM:

* When reading, use "utf-8-sig". (it can read UTF-8 without bom)
* When writing, use "utf-8".

Although UTF-8 with BOM is not recommended, and Notepad uses UTF-8
without BOM as default encoding from 1903, UTF-8 with BOM is still
used in some cases.
For example, Excel reads CSV file with UTF-8 with BOM or legacy
encoding. So some CSV files is written with BOM.

Regards,
-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/BJC6LCYNO2HHRLHF4TFHWTG53M4YL6LL/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread MRAB


On 2021-01-24 01:14, Guido van Rossum wrote:

I have definitely seen BOMs written by Notepad on Windows 10.

Why can’t the future be that open() in text mode guesses the encoding?


"In the face of ambiguity, refuse the temptation to guess."
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KVLLWSHHVZPLC3OLPAIT7BOXJJK2VPNU/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Guido van Rossum

I have definitely seen BOMs written by Notepad on Windows 10.

Why can’t the future be that open() in text mode guesses the encoding?
-- 
--Guido (mobile)
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/FCIMN3PSTAZT4ST3FH3QALGBH5H5IA6P/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sat, Jan 23, 2021 at 11:59:12PM +1100, Chris Angelico wrote:

> So Windows is being a pain in the behind, once again, because it
> doesn't move forward. 

*cough*

That would be called "backwards compatibility" :-)

Microsoft's attitude towards backwards compatibility is probably even 
stricter than ours.

> File names on Mac OS and most Linux systems will
> be in UTF-8, regardless of your chosen language. Why stick to other
> encodings as the default?

Aren't we talking about the file *contents*, not the file names?

The file name depends on the file system, not the OS. On Mac OS, the 
file system used until High Sierra was HFS+, where file names are 
UTF-16. I expect that there will still be many Mac systems with HFS+ 
file systems.

After High Sierra, the default file system shifted to APFS which does 
use UTF-8.

Linux file systems such as ext4 are bytes. Any UTF-8 support is enforced 
by the desktop manager or shell, not the file system, and so can be 
subverted, either deliberately or accidently (mojibake).

-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/F3IH5PQJ7F4WQZCIODK3QSKBX6V3RWVK/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Random832

On Sat, Jan 23, 2021, at 08:00, Stephen J. Turnbull wrote:
> I see very little use in detecting the BOMs.  I haven't seen a UTF-16
> BOM in the wild in a decade (as usual for me, that's Japan-specific,
> and may be limited to the academic community as well), and the UTF-8
> BOM is a no-op if the default is UTF-8 anyway.

It's not *entirely* a no-op, you'd want the decoder to consume the leading BOM 
rather than returning '\ufeff' on the first read. And AIUI they're much more 
common on Windows (being able to detect UTF-16 *without* BOMs might be useful 
as well, but has historically been a source of problems on Windows) - until 
recently all UTF-8 or UTF-16 files saved with notepad would have them.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GNV2JJVRUI5QGXRAA6VTZYNPCD7OGVNA/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Random832

On Sat, Jan 23, 2021, at 05:06, Inada Naoki wrote:
> On Sat, Jan 23, 2021 at 2:43 PM Random832  wrote:
> >
> > On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
> > > * Default encoding is "utf-8".
> >
> > it might be worthwhile to be a little more sophisticated than this.
> >
> > Notepad itself uses character set detection [it might not be reasonable to 
> > do this on the whole file as notepad does, but maybe the first 512 bytes, 
> > or the result of read1(512)?] when opening a file of unknown encoding, and 
> > msvcrt's "ccs=UTF-8" option to fopen will at least detect at the presence 
> > of UTF-8 and UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
> 
> I meant Notepad (and VS code) use UTF-8 without BOM when creating new text 
> file.
> Students learning Python can not read it with `open()`.

Right, I was simply suggesting it might be worthwhile to target "be able to 
open all files that notepad can open" as the goal rather than simply defaulting 
to UTF8-no-BOM only, which requires a little more sophistication than just a 
default encoding.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/VJ67ZCY7HG6JTWM4K2JDZDQAJIXEMF4T/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Cameron Simpson

On 23Jan2021 22:00, Stephen J. Turnbull  
wrote:
>I see very little use in detecting the BOMs.  I haven't seen a UTF-16
>BOM in the wild in a decade (as usual for me, that's Japan-specific,
>and may be limited to the academic community as well), and the UTF-8
>BOM is a no-op if the default is UTF-8 anyway.

I thought I'd seen them on Windows text files within the last year or so 
(I don't use Windows often, so this is happenstance from receiving some 
data, not an observation of the Windows ecosystem; my recollection is 
that it was a UTF16 CSV file.)

But BOMs may be commonplace. This isn't a text file example, ut the 
ISO14496 standard (the basis for all MOV and MP4 files) has a text field 
type which may be UTF-16LE, UTF16BE or UTF-8, detected by a BOM of the 
right flavour for UTF16 and not BOM implying UTF8. I'm sure this is to 
accomodate easy writing by various systems.

I do not consider the BOM dead, and it is so cheap to recognise that not 
bothering to do so seems almost mean sprited.

Cheers,
Cameron Simpson 
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KWBRCLYQHZK5ETJOT6KFRN7MJMGXX5H6/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread MRAB


On 2021-01-23 10:11, Chris Angelico wrote:
[snip]


Okay. If the goal is to make UTF-8 the default, may I request that PEP
597 say so, please? With a heading of "deprecation", it's not really
clear what its actual goal is.

  From the sound of things - and it's still possible I'm misreading PEP
597, my apologies if so - this open_text function wouldn't really
solve anything much, and the original goal of "change the default
encoding to UTF-8" is better served by 597.

I use Windows and I switched to UTF-8 years ago. However, the standard 
on Windows is 'utf-8-sig', so I'd probably prefer it if the default when 
_reading_ was 'utf-8-sig'. (I'm not bothered about writing; I can still 
be explicit if I want 'utf-8-sig' for Windows-specific UTF-8 files.)

___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/SVDIUALZVHPQLBZPFRETXFKN2GIJNQCD/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sun, Jan 24, 2021 at 2:31 AM Barry Scott  wrote:
> I think that you are going to create a bug magnet if you attempt to auto
> detect the encoding.
>
> First problem I see is that the file may be a pipe and then you will block
> until you have enough data to do the auto detect.
>
> Second problem is that the first N bytes are all in ASCII and only later
> do you see Windows code page signature (odd lack of utf-8 signature).

Both can be handled, just as universal newlines can, by remaining in
an "uncertain" state.

When the file is first opened, we know nothing about its encoding.
Once you request that anything be read (eg by pumping the iterator or
anything), it reads, as per current status. Then:

1) If it looks like UTF-16, assume UTF-16. Rather than falling for the
"Bush hid the facts" issue, this might be restricted to files that
start with a BOM.

2) If it's entirely ASCII, decode it as ASCII and stay uncertain.

3) If it can be decoded UTF-8, remember that this is a UTF-8 file, and
from there on, error out if anything isn't UTF-8.

4) Otherwise, use the system encoding.

On subsequent reads, if we're in ASCII mode, repeat steps 2-4. Until
it finds a non-ASCII byte value, it doesn't really matter how it
decodes it.

Unlike chardet, this can be done completely dependably. I'm not sure
what would happen if the system encoding isn't an eight-bit
ASCII-compatible one, though. The algorithm might produce some odd
results if the file looks like ASCII, but then switches to some
incompatible encoding. Can anyone give an example of a current in-use
system encoding that would have this issue? How likely is it that
you'd get even one line of text that purports to be ASCII?

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZEXFMCCD5L647HSAMB3U6W6CDQKVN5JA/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Barry Scott




> On 23 Jan 2021, at 11:00, Steven D'Aprano  wrote:
> 
> On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
>> On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
>>> * Default encoding is "utf-8".
>> 
>> it might be worthwhile to be a little more sophisticated than this.
>> 
>> Notepad itself uses character set detection [it might not be 
>> reasonable to do this on the whole file as notepad does, but maybe the 
>> first 512 bytes, or the result of read1(512)?] when opening a file of 
>> unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at 
>> least detect at the presence of UTF-8 and UTF-16 BOMs [and treat the 
>> file as UTF-16 in the latter case].
> 
> 
> I like Random's idea. If we add a new "open text file" builtin function, 
> we should seriously consider having it attempt to auto-detect the 
> encoding. It need not be as sophisticated as `chardet`.

I think that you are going to create a bug magnet if you attempt to auto
detect the encoding.

First problem I see is that the file may be a pipe and then you will block
until you have enough data to do the auto detect.

Second problem is that the first N bytes are all in ASCII and only later
do you see Windows code page signature (odd lack of utf-8 signature).

> That auto-detection behaviour could be enough to differentiate it from 
> the regular open(), thus solving the "but in ten years time it will be 
> redundant and will need to be deprecated" objection.
> 
> Having said that, I can't say I'm very keen on the name "open_text", but 
> I can't think of any other bikeshed colour I prefer.

Given the the functions purpose is to open unicode text use a name that
reflects that it is the encoding that is set not the mode (binary vs. text).

open_unicode maybe?

If you are teaching open_text then do you also need to have open_binary?

Barry

> 
> 
> -- 
> Steve
> ___
> Python-ideas mailing list -- python-ideas@python.org
> To unsubscribe send an email to python-ideas-le...@python.org
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at 
> https://mail.python.org/archives/list/python-ideas@python.org/message/VAWFPIAA4WIVLIF4LFJ4OATJK6JDJS2N/
> Code of Conduct: http://python.org/psf/codeofconduct/
> 
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/4LHLZ5QIBOCLIZUVYQ2UXAU6MEX6VMJH/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Stephen J. Turnbull

Steven D'Aprano writes:
 > On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
 > > On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
 > > > * Default encoding is "utf-8".
 > > 
 > > it might be worthwhile to be a little more sophisticated than this.
 > > 
 > > Notepad itself uses character set detection [it might not be 
 > > reasonable to do this on the whole file as notepad does, but maybe the 
 > > first 512 bytes, or the result of read1(512)?] when opening a file of 
 > > unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at 
 > > least detect at the presence of UTF-8 and UTF-16 BOMs [and treat the 
 > > file as UTF-16 in the latter case].
 > 
 > 
 > I like Random's idea. If we add a new "open text file" builtin
 > function, we should seriously consider having it attempt to
 > auto-detect the encoding. It need not be as sophisticated as
 > `chardet`.

It definitely should not be as sophisticated as chardet.  Detection of
ISO 8859, ISO 2022, and EUC family encodings is reliable as long as
you know that only one of each family is going to be used.  But you
cannot easily tell which of the many ISO 8859 (also Windows-12xx)
family are present, and similarly for the other families.

I see very little use in detecting the BOMs.  I haven't seen a UTF-16
BOM in the wild in a decade (as usual for me, that's Japan-specific,
and may be limited to the academic community as well), and the UTF-8
BOM is a no-op if the default is UTF-8 anyway.

I'm definitely leaning to the suggestion I made elsewhere (if it's
adopted at all): force UTF-8, and name it 'open_utf8'.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/LPUM3JPQD3RJCYFZ42GWTISCAHKF462C/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sat, Jan 23, 2021 at 11:34 PM Stephen J. Turnbull
 wrote:
>  > I'd rather focus on just moving to UTF-8 as the default, rather
>  > than bringing in a new function - especially with such a confusing
>  > name.
>
> I expect there are several bodies of users who will experience that as
> quite obnoxious for a long time to come.  I *still* see a ton of stuff
> that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China
> gb18030 isn't just a good idea, it's the law.  (OK, the precise
> statement of the law is "must support", not "must use", but my Chinese
> students all default to GB.)

But "UTF-8 as the default if you don't specify an encoding" doesn't
stop you from using all those other encodings. The only change is
that, if you don't specify an encoding, you get a cross-platform
consistent default that can be easily described, rather than one which
depends on system settings.

> The problem is that these users use some software that will create
> text in a national language encoding by default and other that use
> UTF-8 by default.  So I guess Naoki's hope is that "when I'm
> processing Microsoft/Oracle-generated data, I use 'open_text', when
> it's local software I use 'open'" becomes an easy and natural reponse
> in such environments.

Exactly, so no single default will work.

Is there an easy way to say open("filename", encoding="use my system
default") ? Currently encoding=None does that, and maybe that can be
retained (just with the default becoming "utf-8"), or maybe some other
keyword can be used. But that should cover the situations where you
specifically *want* a platform-dependent selection.

>  > What exactly are the blockers on making open(fn) use UTF-8 by
>  > default?
>
> Backward incompatibility with just about every script in existence?

Or for a large number of them, sudden cross-platform compatibility
that they didn't previously have. This is *fixing a bug* for many
scripts.

>  > Can the proposals be written with that as the ultimate goal (even if
>  > it's going to take X versions and multiple deprecation phases), rather
>  > than aiming for a messy goal where people aren't sure which function
>  > to use?
>
> The problem is that on Windows there are a lot of installations that
> continue to use non-UTF-8 encodings enough that users set their
> preferred encoding that way.  I guess that folks where the majority of
> their native-language alphabet is drawn from ASCII are by now almost
> all using UTF-8 by default, but this is not so for East Asians (who
> almost all still use a mixture of several encodings every day because
> email still often defaults to national standard encodings).  I can't
> speak to Cyrillic, Hebrew, Arabic, Indic languages, but I wouldn't be
> surprised if they're somewhere in the middle.

So Windows is being a pain in the behind, once again, because it
doesn't move forward. File names on Mac OS and most Linux systems will
be in UTF-8, regardless of your chosen language. Why stick to other
encodings as the default?

(I repeat: I am NOT advocating abolishing support for all other
encodings. The ONLY thing I want to see is that UTF-8 becomes the
default.)

> Naoki can document that "open(..., encoding='...')" is strongly
> preferred to 'open_text'.  Maybe a better name is "open_utf8", to
> discourage people who want to use non-default encodings, or
> programmatically chosen encodings, in that function.

TBH I don't think a separate built-in is of value here, but perhaps
it'd be beneficial as an alternative to the wall-of-text help info
that open() has. But I do rather like Random's and Steve's suggestion
that the alternate function be specifically documented as magic. It'd
actually tie in very nicely with a change of default: open() does what
it's explicitly told, and has cross-platform defaults, but
open_sesame() probes the file to try to guess at its encoding,
attempting to use a platform-specific eight bit encoding if
applicable. It'd "just work" for reading most text files, regardless
of their source, as long as they came from this current computer. (All
bets are off anyway if they came from some other system and are in an
eight-bit encoding.)

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/PKUN6TDU6R3CDX2LCI34DF5CCLGHMVIX/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Stephen J. Turnbull

Chris Angelico writes:
 > On Sat, Jan 23, 2021 at 12:37 PM Inada Naoki  wrote:

 > > ## 1. Add `io.open_text()`, builtin `open_text()`, and
 > > `pathlib.Path.open_text()`.
 > >
 > > All functions are same to `io.open()` or `Path.open()`, except:
 > >
 > > * Default encoding is "utf-8".

I wonder if it might not be better to remove the encoding parameter
for this version.  Further comments below.

 > > * "b" is not allowed in the mode option.
 > 
 > I *really* don't like this, because it implies that open() will open
 > in binary mode.

I doubt that will be a common misunderstanding, as long as 'open_text'
is documented as a convenience wrapper for 'open' aimed primarily at
Windows programmers.

 > > How do you think about this idea? Is this worth enough to add a new
 > > built-in function?
 > 
 > Highly dubious.

I won't go so far as "highly", but yeah, dubious to me.  In my own
environment, while I still see Shift JIS data quite a bit, the rule is
that this or that correspondent sends it to me.  While a lot of the
University infrastructure used to default to Shift JIS, it now
defaults to UTF-8.  So I don't have a consistent rule by "kind of
data", ie, which scripts use 'open_text' and which 'open'.  If the
script processes data from "JIS users", it needs to accept a
command-line flag because other users *will* be sending that kind of
data in UTF-8.  Naoki's mileage may vary.

See below for additional comments.

 > I'd rather focus on just moving to UTF-8 as the default, rather
 > than bringing in a new function - especially with such a confusing
 > name.

I expect there are several bodies of users who will experience that as
quite obnoxious for a long time to come.  I *still* see a ton of stuff
that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China
gb18030 isn't just a good idea, it's the law.  (OK, the precise
statement of the law is "must support", not "must use", but my Chinese
students all default to GB.)

The problem is that these users use some software that will create
text in a national language encoding by default and other that use
UTF-8 by default.  So I guess Naoki's hope is that "when I'm
processing Microsoft/Oracle-generated data, I use 'open_text', when
it's local software I use 'open'" becomes an easy and natural reponse
in such environments.

We don't see very many Asian language users on the python-* lists.  We
see a few more Russian users, I suspect quite a few Hebrew and Indic
users, maybe a few Arabic users.  So we should listen very carefully
to the few we do have where they come from tiny minorities of python-*
subscribers.

 > What exactly are the blockers on making open(fn) use UTF-8 by
 > default?

Backward incompatibility with just about every script in existence?

 > Can the proposals be written with that as the ultimate goal (even if
 > it's going to take X versions and multiple deprecation phases), rather
 > than aiming for a messy goal where people aren't sure which function
 > to use?

The problem is that on Windows there are a lot of installations that
continue to use non-UTF-8 encodings enough that users set their
preferred encoding that way.  I guess that folks where the majority of
their native-language alphabet is drawn from ASCII are by now almost
all using UTF-8 by default, but this is not so for East Asians (who
almost all still use a mixture of several encodings every day because
email still often defaults to national standard encodings).  I can't
speak to Cyrillic, Hebrew, Arabic, Indic languages, but I wouldn't be
surprised if they're somewhere in the middle.

Naoki can document that "open(..., encoding='...')" is strongly
preferred to 'open_text'.  Maybe a better name is "open_utf8", to
discourage people who want to use non-default encodings, or
programmatically chosen encodings, in that function.

As someone who avoids Windows like the plague, I have no real sense of
how important this is, and I like your argument from first
principles.  So on net, I guess I'm +/- 0 only because Naoki thinks it
important enough to spend quite a bit of skull sweat and effort on
this.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/E2X4QYTOW47BVYVRWACOIBQA3H5BVZMQ/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
> On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
> > * Default encoding is "utf-8".
> 
> it might be worthwhile to be a little more sophisticated than this.
> 
> Notepad itself uses character set detection [it might not be 
> reasonable to do this on the whole file as notepad does, but maybe the 
> first 512 bytes, or the result of read1(512)?] when opening a file of 
> unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at 
> least detect at the presence of UTF-8 and UTF-16 BOMs [and treat the 
> file as UTF-16 in the latter case].

I like Random's idea. If we add a new "open text file" builtin function, 
we should seriously consider having it attempt to auto-detect the 
encoding. It need not be as sophisticated as `chardet`.

That auto-detection behaviour could be enough to differentiate it from 
the regular open(), thus solving the "but in ten years time it will be 
redundant and will need to be deprecated" objection.

Having said that, I can't say I'm very keen on the name "open_text", but 
I can't think of any other bikeshed colour I prefer.

-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/VAWFPIAA4WIVLIF4LFJ4OATJK6JDJS2N/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sat, Jan 23, 2021 at 01:31:28PM +0300, Paul Sokolovsky wrote:

> > * Teachers can teach to use `open_text` to open text files. Students
> > can use "utf-8" by default without knowing about what encoding is.
> 
> Let's also add max_int(), min_int(), max_float(), min_float() builtins.
> Teachers can teach that if you need to min ints, then to use min_int(),
> if you need to min floats, then to use min_float(), and otherwise, use
> min(). Bonus point: max_int(), min_int(), max_float(), min_float() are
> all easier to annotate.

Why would we need to do that? The proposed `open_text()` builtin solves 
an actual problem with opening files on one platform. Is there an 
equivalent issue with some platform where min() and max() misbehave by 
default with ints and floats?

If not, then your analogy is invalid.

If so, please raise a bug on the tracker.

Adding this proposed `open_text` function does not require us to add 
multiple redundant functions that solve no problems.

> > So `open_text()` can provide better developer experience, without
> > waiting 10 years.
> 
> Except that in 10 years, when the default encoding is finally changed,
> open_text() is a useless function, which now needs to be deprecated and
> all the fun process repeated again.

It won't be useless. It will still work as well as it ever did, so 
useful. It might be redundant, in which case we could deprecate it in 
documentation and take no further action until Python 5000.

-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZM66MNQT32WFABXM6CVEMCTBXDVB5GA4/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sat, Jan 23, 2021 at 7:31 PM Paul Sokolovsky  wrote:
> >
> > * Replacing open with open_text is easier than adding `,
> > encoding="utf-8"`.
>
> How is it easier, if "open_text" exists only in imagination, while
> encoding="utf-8" has been there all this time?
>

Note that the warning is not enabled by default anytime soon.
If we decide to change the default encoding and enable the
EncodingWarning by default in Python 3.15, user can use `open_text()`
for 3.10~3.15.
It will be enough backward compatibility for most users.

>
> > * Teachers can teach to use `open_text` to open text files. Students
> > can use "utf-8" by default without knowing about what encoding is.
>
> Let's also add max_int(), min_int(), max_float(), min_float() builtins.

It is off-topic. Please don't compare apple and orange.

>
> > So `open_text()` can provide better developer experience, without
> > waiting 10 years.
>
> Except that in 10 years, when the default encoding is finally changed,
> open_text() is a useless function, which now needs to be deprecated and
> all the fun process repeated again.

Yes, if we can change the default encoding in 2030, two open functions
will become messy.
But there is no promise for the change. Without mitigating the pain,
we can not change the default encoding forever.

Anyway, thank you for your feedback.
Two people prefer `encoding="utf-8"` to `open_text()`.

I still wait for feedbacks from more people before updating the PEP 597.

Regards,
-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6UKLKB6JRAJZOCSYPTZTS6XA6VJPQYR3/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sat, Jan 23, 2021 at 7:13 PM Chris Angelico  wrote:
>
> > On the other hand, if we add `open_text()`:
> >
> > * Replacing open with open_text is easier than adding `, encoding="utf-8"`.
> > * Teachers can teach to use `open_text` to open text files. Students
> > can use "utf-8" by default without knowing about what encoding is.
> >
> > So `open_text()` can provide better developer experience, without
> > waiting 10 years.
>
> But this has a far worse end goal - two open functions with subtly
> incompatible defaults, and a big question of "why should I choose this
> over that". And if you start using open_text, suddenly your code won't
> work on older Pythons.
>

Yes. There is cons too.
That's why I posted this thread before including the idea in the PEP.
Thank you for your feedback.


> >
> > Ultimate goal is make the "utf-8" default. But I don't know when we
> > can change it.
> > So I focus on what we can do in near future (< 5 years, I hope).
> >
>
> Okay. If the goal is to make UTF-8 the default, may I request that PEP
> 597 say so, please? With a heading of "deprecation", it's not really
> clear what its actual goal is.

No. I avoid it intentionally.  I am making the PEP useful even if we
can not change the default encoding.
The PEP can be discussed without discussing we can change the default
encoding or not.

Please read the first motivation section in the PEP.
https://www.python.org/dev/peps/pep-0597/#using-the-default-encoding-is-a-common-mistake

Regards,
-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6OIKAWIQ6OPVDJ5ZUJECZPAY4FDUOZVD/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Paul Sokolovsky

Hello,

On Sat, 23 Jan 2021 19:04:08 +0900
Inada Naoki  wrote:

> On Sat, Jan 23, 2021 at 10:47 AM Chris Angelico 
> wrote:
> >
> >
> > Highly dubious. I'd rather focus on just moving to UTF-8 as the
> > default, rather than bringing in a new function - especially with
> > such a confusing name.
> >
> > What exactly are the blockers on making open(fn) use UTF-8 by
> > default?  
> 
> Backward compatibility. That's what PEP 597 tries to solve.
> 
> 1. Add optional warning for `open()` call without specifying
> `encoding` option. (PEP 597)
> 2. (Several years later) Make the warning default.
> 3. (Several years later) Change the default encoding.
> 
> When (2) happens, users are forced to write `encoding="utf-8"` to
> suppress the warning.
> 
> But note that the default encoding is "utf-8" already in (most) Linux
> including WSL, macOS, iOS, and Android.
> And Windows user can read ASCII text files without specifying
> `encoding` regardless default encoding is legacy codec or "utf-8".
> So adding `, encoding="utf-8"` everywhere `open()` is used might be
> tedious job.
> 
> On the other hand, if we add `open_text()`:
> 
> * Replacing open with open_text is easier than adding `,
> encoding="utf-8"`.

How is it easier, if "open_text" exists only in imagination, while
encoding="utf-8" has been there all this time?

The only easier thing than adding 'encoding="utf-8"' would be:

1. Just go ahead and switch the default encoding to utf-8 right away.
2. For backward compatibility, add "python3 --backward-compatibility"
switch. Perhaps even tell users to use it straight in
the UnicodeDecodeError backtrace.

> * Teachers can teach to use `open_text` to open text files. Students
> can use "utf-8" by default without knowing about what encoding is.

Let's also add max_int(), min_int(), max_float(), min_float() builtins.
Teachers can teach that if you need to min ints, then to use min_int(),
if you need to min floats, then to use min_float(), and otherwise, use
min(). Bonus point: max_int(), min_int(), max_float(), min_float() are
all easier to annotate.

> So `open_text()` can provide better developer experience, without
> waiting 10 years.

Except that in 10 years, when the default encoding is finally changed,
open_text() is a useless function, which now needs to be deprecated and
all the fun process repeated again.

[]

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/23HBISVYGAJ5G25ZPXDNLD4YZX2XXZAQ/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sat, Jan 23, 2021 at 9:04 PM Inada Naoki  wrote:
>
> On Sat, Jan 23, 2021 at 10:47 AM Chris Angelico  wrote:
> >
> >
> > Highly dubious. I'd rather focus on just moving to UTF-8 as the
> > default, rather than bringing in a new function - especially with such
> > a confusing name.
> >
> > What exactly are the blockers on making open(fn) use UTF-8 by default?
>
> Backward compatibility. That's what PEP 597 tries to solve.
>
> 1. Add optional warning for `open()` call without specifying
> `encoding` option. (PEP 597)
> 2. (Several years later) Make the warning default.
> 3. (Several years later) Change the default encoding.
>
> When (2) happens, users are forced to write `encoding="utf-8"` to
> suppress the warning.
>
> But note that the default encoding is "utf-8" already in (most) Linux
> including WSL, macOS, iOS, and Android.
> And Windows user can read ASCII text files without specifying
> `encoding` regardless default encoding is legacy codec or "utf-8".
> So adding `, encoding="utf-8"` everywhere `open()` is used might be tedious 
> job.

Okay, but this (a) has a good end goal, and (b) is only
backward-incompatible with its default - adding the encoding parameter
makes your code compatible with all versions of Python.

> On the other hand, if we add `open_text()`:
>
> * Replacing open with open_text is easier than adding `, encoding="utf-8"`.
> * Teachers can teach to use `open_text` to open text files. Students
> can use "utf-8" by default without knowing about what encoding is.
>
> So `open_text()` can provide better developer experience, without
> waiting 10 years.

But this has a far worse end goal - two open functions with subtly
incompatible defaults, and a big question of "why should I choose this
over that". And if you start using open_text, suddenly your code won't
work on older Pythons.

> > Can the proposals be written with that as the ultimate goal (even if
> > it's going to take X versions and multiple deprecation phases), rather
> > than aiming for a messy goal where people aren't sure which function
> > to use?
> >
>
> Ultimate goal is make the "utf-8" default. But I don't know when we
> can change it.
> So I focus on what we can do in near future (< 5 years, I hope).
>

Okay. If the goal is to make UTF-8 the default, may I request that PEP
597 say so, please? With a heading of "deprecation", it's not really
clear what its actual goal is.

>From the sound of things - and it's still possible I'm misreading PEP
597, my apologies if so - this open_text function wouldn't really
solve anything much, and the original goal of "change the default
encoding to UTF-8" is better served by 597.

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/U6BL5RWB4OPDZNM3NEFO3UPPZEIVYKYZ/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

On Sat, Jan 23, 2021 at 2:43 PM Random832  wrote:
>
> On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
> > * Default encoding is "utf-8".
>
> it might be worthwhile to be a little more sophisticated than this.
>
> Notepad itself uses character set detection [it might not be reasonable to do 
> this on the whole file as notepad does, but maybe the first 512 bytes, or the 
> result of read1(512)?] when opening a file of unknown encoding, and msvcrt's 
> "ccs=UTF-8" option to fopen will at least detect at the presence of UTF-8 and 
> UTF-16 BOMs [and treat the file as UTF-16 in the latter case].

I meant Notepad (and VS code) use UTF-8 without BOM when creating new text file.
Students learning Python can not read it with `open()`.

-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/5WYWXLCHL6MORJDU4V7JFRI2XD7E3G5Z/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)