[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-26 Thread Stephen J. Turnbull
Chris Angelico writes:

 > Isn't that what file objects have attributes for?

You're absolutely right.  Not sure what I was thinking.  (Note: not an
excuse for my brain bubble, but Path.read_text and Path.read_bytes do
have this problem because they return str and bytes respectively.)

 > Do you get files that lack the BOM?

As I wrote earlier, I don't get UTF-16 text files at all.  You'll have
to ask somebody else.  I'm just pointing out that, if they exist, there
are probably languages in which some files can't be distinguished as
ASCII or UTF-16 without a (fragile) statistical analysis of byte
frequencies.

Do you actually face the problem of receiving data that should be
decoded one way but Python does something different by default?  Or
are you just tired of hearing about the problems of people who can't
"just assume UTF-8 and wish Python would, too"?

 > so IMO it's not unreasonable to assert that all files that don't
 > start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using the
 > ASCII-compatible detection method.

As I've said before, I think Naoki's suggestion is aimed at something
different: the user for whom getpreferredencoding normally DTRTs but
has streams that they know are UTF-8 and want a simple obvious way to
read and write them.  That is the usual case in my experience.  As of
now, Guido and Naoki have agreed to document "encoding='utf-8'" and
drop 'open_text', so I think the discussion is moot, unless somebody
really wants to push autodetection of encodings.

If somebody has a different experience, I'd like to hear about it.
But note that my experience (and Naoki's) is special: in Japan we
encounter at least three different encodings of Japanese daily in
plain text (ISO-2022-JP in mail, UTF-8 and Shift-JIS in local files).
So if anybody is likely to experience the need, I believe we are.

Steve

___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/FIHZYB3W5ZXYFMOQSNPYB3SAE7DHD44I/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Inada Naoki
On Tue, Jan 26, 2021 at 3:07 PM Guido van Rossum  wrote:
>
>>
>> I agree with that. But until we switch the default encoding of open(),
>> we must recommend avoiding `open(filename)` anyway.
>> The default encoding of VS Code, Atom, and Notepad is already UTF-8.
>>
>> Maybe we need to update the tutorial (*) to use `encoding="utf-8"`.
>
>
> Telling people to always add `encoding='utf8'` makes much more sense to me 
> than introducing a new function and telling them to do that.
>

Ok, I will not add open_utf8() to PEP 597, and update the tutorial to
recommend `encoding="utf-8"`.

-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HXJKDIZUF6TMMHHPDZWQ3PYPFLXX6C66/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Guido van Rossum
On Mon, Jan 25, 2021 at 5:49 PM Inada Naoki  wrote:

> On Tue, Jan 26, 2021 at 10:22 AM Guido van Rossum 
> wrote:
> >
> >
> > Older Pythons may be easy to drop, but I'm not so sure about older
> unofficial docs. The open() function is very popular and there must be
> millions of blog posts with examples using it, most of them reading text
> files (written by bloggers naive in Python but good at SEO).
> >
> > I would be very sad if the official recommendation had to become "[for
> the most common case] avoid open(filename), use open_text(filename)".
> >
>
> I agree with that. But until we switch the default encoding of open(),
> we must recommend avoiding `open(filename)` anyway.
> The default encoding of VS Code, Atom, and Notepad is already UTF-8.
>
> Maybe we need to update the tutorial (*) to use `encoding="utf-8"`.
>

Telling people to always add `encoding='utf8'` makes much more sense to me
than introducing a new function and telling them to do that.

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*

___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZT66Q2UMDYJBOKM7GAMTLTPIXFVXZMBG/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Matt Wozniski
On Mon, Jan 25, 2021 at 8:51 PM Inada Naoki  wrote:

> On Tue, Jan 26, 2021 at 10:22 AM Guido van Rossum 
> wrote:
> > Older Pythons may be easy to drop, but I'm not so sure about older
> unofficial docs. The open() function is very popular and there must be
> millions of blog posts with examples using it, most of them reading text
> files (written by bloggers naive in Python but good at SEO).
> >
> > I would be very sad if the official recommendation had to become "[for
> the most common case] avoid open(filename), use open_text(filename)".
>
> I agree with that. But until we switch the default encoding of open(),
> we must recommend avoiding `open(filename)` anyway.
> The default encoding of VS Code, Atom, and Notepad is already UTF-8.


Maybe we're overthinking this - do we really need to recommend avoiding
`open(filename)` in all cases? Isn't it just fine to use if
`locale.getpreferredencoding(False)` is UTF-8, since in that case there
won't be any change in behavior when `open` switches from the old,
locale-specific default to the new, always UTF-8 default?

If that's the case, then it would be less of a backwards incompatibility
issue, since most production environments will already be using UTF-8 as
the locale (by virtue of it being the norm on Unix systems and servers).

And if that's the case, all we need is a warning that is raised
conditionally when open() is called for text mode without an explicit
encoding when the system locale is not UTF-8, and that warning can say
something like:

Your system is currently configured to use shift_jis for text files.
Beginning in Python 3.13, open() will always use utf-8 for text files
instead.
For compatibility with future Python versions, pass open() the extra
argument:
encoding="shift_jis"
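A rough sketch of how such a conditional warning could be emitted (the
wrapper name and exact message wording here are illustrative only, not
part of PEP 597):

    import locale
    import warnings

    def open_with_warning(file, mode="r", encoding=None, **kwargs):
        # Warn only when text mode is used without an explicit encoding
        # *and* the locale default is not already UTF-8.
        if "b" not in mode and encoding is None:
            preferred = locale.getpreferredencoding(False)
            if preferred.lower().replace("-", "").replace("_", "") != "utf8":
                warnings.warn(
                    f"Your system is currently configured to use {preferred} "
                    f"for text files; pass encoding={preferred!r} (or 'utf-8') "
                    "explicitly for forward compatibility.",
                    stacklevel=2,
                )
        return open(file, mode, encoding=encoding, **kwargs)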

~Matt
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6C2Y3RELB7PQYNNV5GS2D3H65SOXVD3N/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Inada Naoki
On Tue, Jan 26, 2021 at 10:22 AM Guido van Rossum  wrote:
>
>
> Older Pythons may be easy to drop, but I'm not so sure about older unofficial 
> docs. The open() function is very popular and there must be millions of blog 
> posts with examples using it, most of them reading text files (written by 
> bloggers naive in Python but good at SEO).
>
> I would be very sad if the official recommendation had to become "[for the 
> most common case] avoid open(filename), use open_text(filename)".
>

I agree with that. But until we switch the default encoding of open(),
we must recommend avoiding `open(filename)` anyway.
The default encoding of VS Code, Atom, and Notepad is already UTF-8.

Maybe we need to update the tutorial (*) to use `encoding="utf-8"`.

(*)  
https://docs.python.org/3.10/tutorial/inputoutput.html#reading-and-writing-files


> BTW remind me what open_text() would do? How would it differ from open() with 
> the same arguments? That's too many messages back.
>

The current proposal is "open_utf8()". The differences from open() are:

* There is no encoding parameter. It uses "utf-8" always. (*)
* "b" is not allowed for mode.

(*) Another option is to use "utf-8-sig" for reading and "utf-8" for
writing. But it has some drawbacks: utf-8-sig has overhead because it
is implemented as a wrapper in Python, and TextIOWrapper has fast paths
for utf-8 but not for utf-8-sig. "utf-8-sig" may also be less well
tested than "utf-8".
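For concreteness, a minimal sketch of what such a helper could look
like, using the utf-8-sig-for-reading variant mentioned above (this is
only an illustration, not the PEP text):

    def open_utf8(file, mode="r", *, errors=None, newline=None):
        # Text mode only; the encoding is not a parameter.
        if "b" in mode:
            raise ValueError("binary mode is not supported by open_utf8()")
        # Accept a BOM when only reading; never write one.
        encoding = "utf-8-sig" if "r" in mode and "+" not in mode else "utf-8"
        return open(file, mode, encoding=encoding, errors=errors,
                    newline=newline)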

Regards,
-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/BCMUOSHJOA36AKOWKQINNJZYAC2WIBUF/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Guido van Rossum
On Mon, Jan 25, 2021 at 4:42 PM Steven D'Aprano  wrote:

> On Sat, Jan 23, 2021 at 09:11:27PM +1100, Chris Angelico wrote:
>
> > > On the other hand, if we add `open_text()`:
> > >
> > > * Replacing open with open_text is easier than adding `,
> encoding="utf-8"`.
> > > * Teachers can teach to use `open_text` to open text files. Students
> > > can use "utf-8" by default without knowing about what encoding is.
> > >
> > > So `open_text()` can provide better developer experience, without
> > > waiting 10 years.
> >
> > But this has a far worse end goal - two open functions with subtly
> > incompatible defaults, and a big question of "why should I choose this
> > over that".
>
> It has an easy answer:
>
> - Are you opening a text file and you don't know about or want to deal
>   with encodings? Use `open_text`.
>
> - Otherwise, use `open`.
>
> I think that if we moved to an open_text() builtin, it should have the
> simplest possible signature:
>
> open_text(filename, mode='r')
>
> If you care about anything beyond that, use `open`.
>
>
> > And if you start using open_text, suddenly your code won't
> > work on older Pythons.
>
> "Using older Pythons" is mostly a concern for library maintainers, not
> beginners. A few years from now, Python 3.10 will be the oldest version
> the great majority of beginners will care about, and 3.9 will be as
> irrelevant to them as 3.4 is to us today.
>
> Library maintainers always have to deal with the issue of not being able
> to use the newest functionality, it doesn't prevent us from adding new
> functionality.
>

Older Pythons may be easy to drop, but I'm not so sure about older
unofficial docs. The open() function is very popular and there must be
millions of blog posts with examples using it, most of them reading text
files (written by bloggers naive in Python but good at SEO).

I would be very sad if the official recommendation had to become "[for the
most common case] avoid open(filename), use open_text(filename)".

BTW remind me what open_text() would do? How would it differ from open()
with the same arguments? That's too many messages back.

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*

___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/QPKA3SOCHMFMGZXW7YBCTSDMVQ6B6BHW/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Steven D'Aprano
On Sat, Jan 23, 2021 at 09:11:27PM +1100, Chris Angelico wrote:

> > On the other hand, if we add `open_text()`:
> >
> > * Replacing open with open_text is easier than adding `, encoding="utf-8"`.
> > * Teachers can teach to use `open_text` to open text files. Students
> > can use "utf-8" by default without knowing about what encoding is.
> >
> > So `open_text()` can provide better developer experience, without
> > waiting 10 years.
> 
> But this has a far worse end goal - two open functions with subtly
> incompatible defaults, and a big question of "why should I choose this
> over that".

It has an easy answer:

- Are you opening a text file and you don't know about or want to deal 
  with encodings? Use `open_text`.

- Otherwise, use `open`.

I think that if we moved to an open_text() builtin, it should have the 
simplest possible signature:

open_text(filename, mode='r')

If you care about anything beyond that, use `open`.


> And if you start using open_text, suddenly your code won't
> work on older Pythons.

"Using older Pythons" is mostly a concern for library maintainers, not 
beginners. A few years from now, Python 3.10 will be the oldest version 
the great majority of beginners will care about, and 3.9 will be as 
irrelevant to them as 3.4 is to us today.

Library maintainers always have to deal with the issue of not being able 
to use the newest functionality, it doesn't prevent us from adding new 
functionality.



-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/4K7U5KEXEIURFB36ML2GSMJD4HEQ7ZZL/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Steven D'Aprano
Thanks, Matt, for the detailed explanation of why we cannot change `open` 
to do encoding detection by default. I think that should answer Guido's 
question.

It still leaves open the possibility of:

- a new mode to open() that opts-in to encoding detection;

- a new built-in function that is only used for opening text files (not 
  pipes) with encoding detection by default;

- or a new function that attempts the detection:

enc = io.guess_encoding(FILENAME) or 'UTF-8'
with open(FILENAME, encoding=enc) as f:
    ...
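A minimal sketch of what such a (currently nonexistent) detection helper
might do if it restricted itself to BOM sniffing:

    import codecs

    def guess_encoding(filename, default=None):
        # Hypothetical helper: look only at a leading byte-order mark.
        with open(filename, "rb") as f:
            head = f.read(4)
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        # Check UTF-32 before UTF-16: the UTF-32-LE BOM starts with the
        # UTF-16-LE BOM.
        if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"
        if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        return default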


These may be useful, but I don't think that they are very helpful for 
solving the problem of naive programmers who don't know anything about 
encodings trying to open files which are encoded differently from the 
system encoding. Such users aren't knowledgeable enough to know that they 
should opt in to encoding detection. If they were, they would probably 
just set the encoding to "utf-8" in the first place.

-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/IUNLC2JQYSAQ3IC6DWPGMWKQS5FWQDEK/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Matt Wozniski
On Mon, Jan 25, 2021, 4:25 AM Steven D'Aprano  wrote:

> On Sun, Jan 24, 2021 at 10:43:54PM -0500, Matt Wozniski wrote:
> > And
> > `f.read(1)` needs to pick one of those and return it immediately. It
> can't
> > wait for more information. The contract of `read` is "Read from
> underlying
> > buffer until we have n characters or we hit EOF."
>
> In text mode, reads are always buffered:
>
> https://docs.python.org/3/library/functions.html#open
>
> so `f.read(1)` will read as much as needed, so long as it only returns a
> single character.
>

Text mode files are always backed by a buffer, yes, but that's not
relevant. My point is that `f.read(1)` must immediately return a character
if one exists in the buffer. It can't wait for more data to get buffered if
there is already a buffered character, as that would be a backwards
incompatible change that would badly break line based protocols like FTP,
SMTP, and POP.

Up until now, `f.read(1)` has always read bytes from the underlying file
descriptor into the buffer until it has one full character, and immediately
returned it. And this is user facing behavior. Imagine an echo server that
reads 1 character at a time and echoes it back, forever. The client will
only ever send 1 character at a time, so if an eight bit locale encoding is
in use the client will only send one byte before waiting for a response. As
things stand today this works. If encoding detection were added and the
server's call to `f.read(1)` could decide it doesn't know how to decode the
first byte it gets and to block until more data comes in, that would be a
deadlock, since the client isn't sending more.

> A typical buffer size is 4096 bytes, or more.


Sure, but that doesn't mean that much data is always available. If
something has written less than that, it's not reasonable to block until
more data can be buffered in places where up until now no blocking would
have occurred. Not least because no more data will necessarily ever come.

And if it were to instead make its decisions based on what has been
buffered already, without ever blocking, then the behavior becomes
nondeterministic: it could return a different character based on how much
data the OS returned in the first read syscall.

> In any case, I believe the intention of this proposal is for *open*, not
> read, to perform the detection.


If that's the case, named pipes are a perfect example of why that's
impossible. It's perfectly normal to open a named pipe that contains no
data, and that won't until you trigger some action (say, spawning a child
process that will write to it). You can't auto detect the encoding of an
empty pipe, and you can't make open block until data arrives because it's
entirely possible data will never arrive if open blocks.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GUL5VOYGDEE3MSC2KDWZ7RNDP2ZMJGAS/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-25 Thread Steven D'Aprano
On Sun, Jan 24, 2021 at 10:43:54PM -0500, Matt Wozniski wrote:
> On Sun, Jan 24, 2021 at 9:53 AM <2qdxy4rzwzuui...@potatochowder.com> wrote:
> 
> > On 2021-01-25 at 00:29:41 +1100,
> > Steven D'Aprano  wrote:
> >
> > > On Sat, Jan 23, 2021 at 03:24:12PM +, Barry Scott wrote:
> > > > First problem I see is that the file may be a pipe and then you will
> > block
> > > > until you have enough data to do the auto detect.
> > >
> > > Can you use `open('filename')` to read a pipe?
> >
> > Yes.  Named pipes are files, at least on POSIX.
> >
> > And no.  Unnamed pipes are identified by OS-level file descriptors, so
> > you can't open them with open('filename'),
> >
> 
> The `open` function takes either a file path as a string, or a file
> descriptor as an integer. So you can use `open` to read an unnamed pipe or
> a socket.

Okay, but I was asking about using open with a filename string. In any 
case, the existence of named pipes answers my question.


[...]
> It's possible to do a `f.read(1)` on a file opened in text mode. If the
> first two bytes of the file are 0xC2 0x99, that's either U+0099 (a C1
> control character) if the file is UTF-8, or 슙 if the file is UTF-16BE, or
> 駂 if the file is UTF-16LE.

Or Â followed by the SGC control code in Latin-1. Or ™ in Windows-1252, 
or ¬ô in MacRoman. Etc.


> And
> `f.read(1)` needs to pick one of those and return it immediately. It can't
> wait for more information. The contract of `read` is "Read from underlying
> buffer until we have n characters or we hit EOF."

In text mode, reads are always buffered:

https://docs.python.org/3/library/functions.html#open

so `f.read(1)` will read as much as needed, so long as it only returns a 
single character.

A typical buffer size is 4096 bytes, or more.

In any case, I believe the intention of this proposal is for *open*, not 
read, to perform the detection.



-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/OCMXGX7RY3EMKBNM6HMF72INK7K7FNVJ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Matt Wozniski
On Sun, Jan 24, 2021 at 9:53 AM <2qdxy4rzwzuui...@potatochowder.com> wrote:

> On 2021-01-25 at 00:29:41 +1100,
> Steven D'Aprano  wrote:
>
> > On Sat, Jan 23, 2021 at 03:24:12PM +, Barry Scott wrote:
> > > First problem I see is that the file may be a pipe and then you will
> block
> > > until you have enough data to do the auto detect.
> >
> > Can you use `open('filename')` to read a pipe?
>
> Yes.  Named pipes are files, at least on POSIX.
>
> And no.  Unnamed pipes are identified by OS-level file descriptors, so
> you can't open them with open('filename'),
>

The `open` function takes either a file path as a string, or a file
descriptor as an integer. So you can use `open` to read an unnamed pipe or
a socket.

> > Is blocking a problem in practice? If you try to open a network file,
> > that could block too, if there are network issues. And since you're
> > likely to follow the open with a read, the read is likely to block. So
> > over all I don't think that blocking is an issue.
>
> If open blocks waiting for too many bytes, then my application never gets
> to respond unless enough data comes through the pipe.


It's possible to do a `f.read(1)` on a file opened in text mode. If the
first two bytes of the file are 0xC2 0x99, that's either U+0099 (a C1
control character) if the file is UTF-8, or 슙 if the file is UTF-16BE,
or 駂 if the file is UTF-16LE. And
`f.read(1)` needs to pick one of those and return it immediately. It can't
wait for more information. The contract of `read` is "Read from underlying
buffer until we have n characters or we hit EOF." A call to `read(1)`
cannot keep blocking after the first character was received to decide what
encoding to decode it as; that would be backwards incompatible, and it
might block forever if the sender only sends one character before waiting
for a response.

~Matt
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/BAUQXIMQP4F6DRFQCLJCDV3NUPCDCWSQ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Random832
On Sun, Jan 24, 2021, at 13:18, MRAB wrote:
> Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's 
> probably UTF16-BE and if you see patterns like 
> b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
> 
> You could also look for, say, sequences of Latin characters and 
> sequences of Han characters.

This is dangerous, as Microsoft discovered: a sequence of ASCII latin 
characters can look a lot like a sequence of UTF-16 Han characters.

On Windows, Notepad always writes UTF-16 with BOM, even though it now writes 
UTF-8 without it by default.

Probably the winning combination is "if there is a UTF-16 BOM, it's UTF-16;
else if the first few non-ASCII bytes encountered are valid UTF-8, it's
UTF-8; otherwise it's the system default 'ANSI' locale".
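A rough sketch of that heuristic for ordinary files (the probe size and
function name are arbitrary, and it ignores the partial-character
problem discussed next):

    import locale

    def sniff_text_encoding(path, probe_size=64 * 1024):
        with open(path, "rb") as f:
            data = f.read(probe_size)
        if data.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "utf-16"        # UTF-16 BOM present
        try:
            data.decode("utf-8")   # pure ASCII passes here too
            return "utf-8"
        except UnicodeDecodeError:
            return locale.getpreferredencoding(False)   # 'ANSI' fallback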

The one problem with that is what to do if something like a pipe or a socket 
gets a sequence of bytes that are a valid *partial* UTF-8 character, then 
doesn't get any more data for a while. It's unacceptable to have to wait for 
more data before interpreting data that has been read.

Notepad has the luxury of only working on ordinary files, and being able to 
scan the whole file before making a decision about the character set [I believe 
it mmaps the file rather than using ordinary open/read calls].
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/DR4GEIPOWNQFWHETWM6L5Y2GGRZL2YRH/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Richard Damon
On 1/24/21 1:18 PM, MRAB wrote:
> On 2021-01-24 17:04, Chris Angelico wrote:
>> On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull
>>  wrote:
>>>
>>> Chris Angelico writes:
>>>  > Right, but as long as there's only one system encoding, that's not
>>>  > our problem. If you're on a Greek system and you want to decode
>>>  > ISO-8859-9 text, you have to state that explicitly. For the
>>>  > situations where you want heuristics based on byte distributions,
>>>  > there's always chardet.
>>>
>>> But that's the big question.  If you're just going to fall back to
>>> chardet, you might as well start there.  No?  Consider: if 'open'
>>> detects the encoding for you, *you can't find out what it is*.  'open'
>>> has no facility to tell you!
>>
>> Isn't that what file objects have attributes for? You can find out,
>> for instance, what newlines a file uses, even if it's being
>> autodetected.
>>
>>>  > In theory, UTF-16 without a BOM can consist entirely of byte values
>>>  > below 128,
>>>
>>> It's not just theory, it's my life.  62/80 of the Japanese "hiragana"
>>> syllabary is composed of 2 printing ASCII characters (including SPC).
>>> A large fraction of the Han ideographs satisfy that condition, and I
>>> wouldn't be surprised if a majority of the 1000 most common ones do.
>>> (Not a good bet because half of the ideographs have a low byte > 127,
>>> but the order of characters isn't random, so if you get a couple of
>>> popular radicals that have 50 or so characters in a group in that
>>> range, you'd be much of the way there.)
>>>
>>>  > But there's no solution to that,
>>>
>>> Well, yes, but that's my line. ;-)
>>>
>>
>> Do you get files that lack the BOM? If so, there's fundamentally no
>> way for the autodetection to recognize them. That's why, in my
>> quickly-whipped-up algorithm above, I basically had it assume that no
>> BOM means not UTF-16. After all, there's no way to know whether it's
>> UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point
>> of it), so IMO it's not unreasonable to assert that all files that
>> don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using
>> the ASCII-compatible detection method.
>>
>> (Of course, this is *ONLY* if you don't specify an encoding. That part
>> won't be going away.)
>>
> Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's
> probably UTF16-BE and if you see patterns like
> b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
>
> You could also look for, say, sequences of Latin characters and
> sequences of Han characters.
>
Yes, if you happen to see that sort of pattern, you could perhaps make a
guess, but since part of the goal is to not need to read far ahead in
the file, it isn't a very reliable test for confirming a UTF-16 file
when it doesn't begin with Latin-1 characters.

-- 
Richard Damon
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KU7YLC3MZP3SVOAP2YPBQO5H4DIRUBWQ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread MRAB

On 2021-01-24 17:04, Chris Angelico wrote:
> On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull
>  wrote:
>>
>> Chris Angelico writes:
>>  > Right, but as long as there's only one system encoding, that's not
>>  > our problem. If you're on a Greek system and you want to decode
>>  > ISO-8859-9 text, you have to state that explicitly. For the
>>  > situations where you want heuristics based on byte distributions,
>>  > there's always chardet.
>>
>> But that's the big question.  If you're just going to fall back to
>> chardet, you might as well start there.  No?  Consider: if 'open'
>> detects the encoding for you, *you can't find out what it is*.  'open'
>> has no facility to tell you!
>
> Isn't that what file objects have attributes for? You can find out,
> for instance, what newlines a file uses, even if it's being
> autodetected.
>
>>  > In theory, UTF-16 without a BOM can consist entirely of byte values
>>  > below 128,
>>
>> It's not just theory, it's my life.  62/80 of the Japanese "hiragana"
>> syllabary is composed of 2 printing ASCII characters (including SPC).
>> A large fraction of the Han ideographs satisfy that condition, and I
>> wouldn't be surprised if a majority of the 1000 most common ones do.
>> (Not a good bet because half of the ideographs have a low byte > 127,
>> but the order of characters isn't random, so if you get a couple of
>> popular radicals that have 50 or so characters in a group in that
>> range, you'd be much of the way there.)
>>
>>  > But there's no solution to that,
>>
>> Well, yes, but that's my line. ;-)
>>
> Do you get files that lack the BOM? If so, there's fundamentally no
> way for the autodetection to recognize them. That's why, in my
> quickly-whipped-up algorithm above, I basically had it assume that no
> BOM means not UTF-16. After all, there's no way to know whether it's
> UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point
> of it), so IMO it's not unreasonable to assert that all files that
> don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using
> the ASCII-compatible detection method.
>
> (Of course, this is *ONLY* if you don't specify an encoding. That part
> won't be going away.)
>
Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's 
probably UTF16-BE and if you see patterns like 
b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.

You could also look for, say, sequences of Latin characters and 
sequences of Han characters.

___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/TPJYIC6ECIDYKQV3R4NZ36PTQJPY3CDN/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Chris Angelico
On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull
 wrote:
>
> Chris Angelico writes:
>  > Right, but as long as there's only one system encoding, that's not
>  > our problem. If you're on a Greek system and you want to decode
>  > ISO-8859-9 text, you have to state that explicitly. For the
>  > situations where you want heuristics based on byte distributions,
>  > there's always chardet.
>
> But that's the big question.  If you're just going to fall back to
> chardet, you might as well start there.  No?  Consider: if 'open'
> detects the encoding for you, *you can't find out what it is*.  'open'
> has no facility to tell you!

Isn't that what file objects have attributes for? You can find out,
for instance, what newlines a file uses, even if it's being
autodetected.
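(For example -- illustrative output only, the value depends on the
platform default:

    >>> f = open("example.txt")   # no encoding specified
    >>> f.encoding
    'UTF-8'

An autodetected encoding could be reported through the same attribute.)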

>  > In theory, UTF-16 without a BOM can consist entirely of byte values
>  > below 128,
>
> It's not just theory, it's my life.  62/80 of the Japanese "hiragana"
> syllabary is composed of 2 printing ASCII characters (including SPC).
> A large fraction of the Han ideographs satisfy that condition, and I
> wouldn't be surprised if a majority of the 1000 most common ones do.
> (Not a good bet because half of the ideographs have a low byte > 127,
> but the order of characters isn't random, so if you get a couple of
> popular radicals that have 50 or so characters in a group in that
> range, you'd be much of the way there.)
>
>  > But there's no solution to that,
>
> Well, yes, but that's my line. ;-)
>

Do you get files that lack the BOM? If so, there's fundamentally no
way for the autodetection to recognize them. That's why, in my
quickly-whipped-up algorithm above, I basically had it assume that no
BOM means not UTF-16. After all, there's no way to know whether it's
UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point
of it), so IMO it's not unreasonable to assert that all files that
don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using
the ASCII-compatible detection method.

(Of course, this is *ONLY* if you don't specify an encoding. That part
won't be going away.)

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/JG2QBXB7GRFAETYXRDHYCM6YND5E26ZH/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Stephen J. Turnbull
Chris Angelico writes:

 > Can anyone give an example of a current system encoding (ie one that
 > is likely to be the default currently used by open()) that can have
 > byte values below 128 which do NOT mean what they would mean in ASCII?
 > In other words, is it possible to read in a section of a file, think
 > that it's ASCII, and then find that you decoded it wrongly?

Japanese Shift JIS, as mentioned by Richard.  The Japanese just
redefine the glyph used for Windows paths and character escapes to be
the yen sign.  So it's a total muddle, because they also use that for
the yen sign.  They also use a broken vertical bar for the pipe
symbol, but the visual similarity there is so strong that you have to
know a *lot* of computational Japanese to realize that they're
different characters (they are, in JIS, but nobody cares -- there's
almost never a reason to use both).

 > I'm assuming here that there is a *single* default system encoding,
 > meaning that the automatic handler has only three cases to worry
 > about: UTF-16 (with BOM), UTF-8 (including pure ASCII), and the system
 > encoding.

Sure that handles a lot of cases ... but the vast majority are already
handled with just the system encoding and UTF-8.  In my experience the
UTF-16 cases are not going to be the majority of what's left.  YMMV.

 > Right, but as long as there's only one system encoding, that's not
 > our problem. If you're on a Greek system and you want to decode
 > ISO-8859-9 text, you have to state that explicitly. For the
 > situations where you want heuristics based on byte distributions,
 > there's always chardet.

But that's the big question.  If you're just going to fall back to
chardet, you might as well start there.  No?  Consider: if 'open'
detects the encoding for you, *you can't find out what it is*.  'open'
has no facility to tell you!

As somebody else pointed out, if you're writing a text editor,
autodetection makes a lot of sense.  You just provide a facility for
the user to chose something different and reread the file.  But if
you're running non-interactive, it's much harder to recover -- and
'open' can't do it for you.

 > > Program source code where the higher-level functions (likely to
 > > contain literal strings) come late in the file are frequently
 > > misdetected based on the earlier bytes.
 > 
 > Yup; and the real question is whether anything would have been decoded
 > incorrectly.

If I recall correctly there are several Latin-1 characters in UTF-8
which are plausible Windows 125x digraphs.  So, yes, it's quite possible.

 > If you read in a bunch of ASCII-only text and yield it to
 > the app, and then come across something that proves that the file is
 > not UTF-8, then as far as I am aware, you won't have to un-yield any
 > of the previous text - it'll all have been correctly decoded.

Not if it's UTF-16.  And again, if you put the detection logic in
'open', once you've yielded anything to the main logic *it's too late
to change your mind*.

 > In theory, UTF-16 without a BOM can consist entirely of byte values
 > below 128,

It's not just theory, it's my life.  62/80 of the Japanese "hiragana"
syllabary is composed of 2 printing ASCII characters (including SPC).
A large fraction of the Han ideographs satisfy that condition, and I
wouldn't be surprised if a majority of the 1000 most common ones do.
(Not a good bet because half of the ideographs have a low byte > 127,
but the order of characters isn't random, so if you get a couple of
popular radicals that have 50 or so characters in a group in that
range, you'd be much of the way there.)

 > But there's no solution to that,

Well, yes, but that's my line. ;-)
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/CIBX3EFFW2OMFUXQ4KPUJ4OZIYMQK5PH/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Richard Damon
On 1/24/21 6:00 AM, Chris Angelico wrote:
> Sorry, let me clarify.
>
> Can anyone give an example of a current system encoding (ie one that
> is likely to be the default currently used by open()) that can have
> byte values below 128 which do NOT mean what they would mean in ASCII?
> In other words, is it possible to read in a section of a file, think
> that it's ASCII, and then find that you decoded it wrongly?

EBCDIC is one big option.
There are also some national character sets which change a couple of the lower 
128 characters into characters that language needed. (This was the cause of 
adding trigraphs to C: to provide a way to enter the replaced characters on 
systems that didn't have them.)

One common example was a Japanese character set that replaced \ with the Yen 
sign (and a few others) and then used some codes above 128 for multi-byte 
sequences. Users of such systems just got used to using the Yen sign as the 
path separator. 

The EBCDIC cases would likely be well known on those systems, and planned for. 
A system where a few of the lower 128 characters have been substituted could 
be a bigger surprise.

-- 
Richard Damon
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/24PLMY635JZAS32BY2G5YVHBXTQPEFE5/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread 2QdxY4RzWzUUiLuE
On 2021-01-25 at 00:29:41 +1100,
Steven D'Aprano  wrote:

> On Sat, Jan 23, 2021 at 03:24:12PM +, Barry Scott wrote:
> 
> > I think that you are going to create a bug magnet if you attempt to auto
> > detect the encoding.
> > 
> > First problem I see is that the file may be a pipe and then you will block
> > until you have enough data to do the auto detect.
> 
> Can you use `open('filename')` to read a pipe?

Yes.  Named pipes are files, at least on POSIX.

And no.  Unnamed pipes are identified by OS-level file descriptors, so
you can't open them with open('filename'), but you can open them with
os.fdopen.  Once opened, such data sources "should be" interchangeable.

> Is blocking a problem in practice? If you try to open a network file,
> that could block too, if there are network issues. And since you're
> likely to follow the open with a read, the read is likely to block. So
> over all I don't think that blocking is an issue.

If open blocks waiting for too many bytes, then my application never gets to respond
unless enough data comes through the pipe.  Consider protocols like FTP
and SMTP, where commands and responses are often only handfuls of bytes
long.  OTOH, if I'm opening a file (or a pipe) for such a protocol, then
both ends should know the encoding ahead of time and there's no need to
guess.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/55ZMKKQES3EYMXZFYPHOT3WYOKXMUG3Q/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Chris Angelico
On Mon, Jan 25, 2021 at 12:33 AM Steven D'Aprano  wrote:
>
> On Sat, Jan 23, 2021 at 03:24:12PM +, Barry Scott wrote:
>
> > I think that you are going to create a bug magnet if you attempt to auto
> > detect the encoding.
> >
> > First problem I see is that the file may be a pipe and then you will block
> > until you have enough data to do the auto detect.
>
> Can you use `open('filename')` to read a pipe?

Yes. You can even use it with stdin:

>>> open("/proc/self/fd/0").read(1)
a
'a'

The second line was me typing something, even though I was otherwise
at the REPL.

> Is blocking a problem in practice? If you try to open a network file,
> that could block too, if there are network issues. And since you're
> likely to follow the open with a read, the read is likely to block. So
> over all I don't think that blocking is an issue.

Definitely could be a problem if you read too much just for the sake
of autodetection. It needs to be possible to do everything with an
absolute minimum of reading.

> > Second problem is that the first N bytes are all in ASCII and only later
> > do you see Windows code page signature (odd lack of utf-8 signature).
>
> UTF-8 is a strict superset of ASCII, so if the file is actually
> ASCII, there is no harm in using UTF-8.
>
> The bigger issue is if you have N bytes of pure ASCII followed by some
> non-UTF-8 superset of ASCII, such as one of the ISO-8859-* encodings. So you
> end up
> detecting what you think is ASCII/UTF-8 but is actually some legacy
> encoding. But if N is large, say 512 bytes, that's unlikely in practice.

There's no problem if you think it's ASCII, so the only problem would
be if you start thinking that it's UTF-8 and then discover that it
isn't. The scheme used by UTF-8 is designed such that this is highly
unlikely with random data or actual text in an eight-bit encoding, so
it's more likely to be broken UTF-8 than legit ISO-8859-X.

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/MBBCCHLFHFHYPCS54AKOVOCA4ELBFNPD/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Steven D'Aprano
On Sat, Jan 23, 2021 at 03:24:12PM +, Barry Scott wrote:

> I think that you are going to create a bug magnet if you attempt to auto
> detect the encoding.
> 
> First problem I see is that the file may be a pipe and then you will block
> until you have enough data to do the auto detect.

Can you use `open('filename')` to read a pipe?

Is blocking a problem in practice? If you try to open a network file, 
that could block too, if there are network issues. And since you're 
likely to follow the open with a read, the read is likely to block. So 
over all I don't think that blocking is an issue.


> Second problem is that the first N bytes are all in ASCII and only later
> do you see Windows code page signature (odd lack of utf-8 signature).

UTF-8 is a strict superset of ASCII, so if the file is actually 
ASCII, there is no harm in using UTF-8.

The bigger issue is if you have N bytes of pure ASCII followed by some 
non-UTF-8 superset of ASCII, such as one of the ISO-8859-* encodings. So you 
end up 
detecting what you think is ASCII/UTF-8 but is actually some legacy 
encoding. But if N is large, say 512 bytes, that's unlikely in practice.


> > That auto-detection behaviour could be enough to differentiate it from 
> > the regular open(), thus solving the "but in ten years time it will be 
> > redundant and will need to be deprecated" objection.
> > 
> > Having said that, I can't say I'm very keen on the name "open_text", but 
> > I can't think of any other bikeshed colour I prefer.
> 
> Given that the function's purpose is to open Unicode text, use a name that
> reflects that it is the encoding that is set, not the mode (binary vs. text).
> 
> open_unicode maybe?

I guess that depends on whether the auto-detection is intended to 
support non-Unicode legacy encodings or not.

> If you are teaching open_text then do you also need to have open_binary?

No. There are no frustrating, difficult, platform-specific encoding 
issues when reading binary files. Bytes are bytes.


-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/MVX5PNZM7W4I42XDSACOQTW3YRJPRQHI/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Random832
On Sat, Jan 23, 2021, at 22:43, Matt Wozniski wrote:
> 1. Deprecate calling `open` for text mode (the default) unless an 
> `encoding=` is specified,

I have a suggestion, if this is going to be done:

If the third positional argument to open is a string, accept it as encoding 
instead of buffering. Maybe even allow the fourth to be errors.

It might be worthwhile to consider making the other arguments keyword-only - 
are they ever used positionally in real-world code?
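A rough sketch of the idea as a wrapper (the shim name is made up, and
it only handles the string-third-argument case):

    import builtins

    def open_compat(file, mode="r", third=-1, fourth=None, **kwargs):
        # A str third positional argument is taken as the encoding,
        # and a str fourth positional argument as the error handler.
        if isinstance(third, str):
            return builtins.open(file, mode, encoding=third, errors=fourth,
                                 **kwargs)
        return builtins.open(file, mode, third, **kwargs)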
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HX37WBOP5PSMTVVNK7FVHLMEEGW4B2VX/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Steven D'Aprano
On Sun, Jan 24, 2021 at 10:00:47PM +1100, Chris Angelico wrote:
> On Sun, Jan 24, 2021 at 9:13 PM Stephen J. Turnbull
>  wrote:
> >
> > Chris Angelico writes:
> >
> >  > Can anyone give an example of a current in-use system encoding that
> >  > would have [ASCII bytes in non-ASCII text]?
> >
> > Shift JIS, Big5.  (Both can have bytes < 128 inside multibyte
> > characters.)  I don't know if Big5 is still in use as the default
> > encoding anywhere, but Shift JIS is, although it's decreasing.
> 
> Sorry, let me clarify.
> 
> Can anyone give an example of a current system encoding (ie one that
> is likely to be the default currently used by open()) that can have
> byte values below 128 which do NOT mean what they would mean in ASCII?
> In other words, is it possible to read in a section of a file, think
> that it's ASCII, and then find that you decoded it wrongly?

I believe that IBM mainframes such as the Z series still use 
EBCDIC. Python for z/OS has EBCDIC/UTF interoperability as a selling 
point. I think that just means the codecs module :-)

https://www.ibm.com/products/open-enterprise-python-zos


-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/COR53MJK4URT77P77SRYMQYS6ZLHYMEU/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Chris Angelico
On Sun, Jan 24, 2021 at 9:13 PM Stephen J. Turnbull
 wrote:
>
> Chris Angelico writes:
>
>  > Can anyone give an example of a current in-use system encoding that
>  > would have [ASCII bytes in non-ASCII text]?
>
> Shift JIS, Big5.  (Both can have bytes < 128 inside multibyte
> characters.)  I don't know if Big5 is still in use as the default
> encoding anywhere, but Shift JIS is, although it's decreasing.

Sorry, let me clarify.

Can anyone give an example of a current system encoding (ie one that
is likely to be the default currently used by open()) that can have
byte values below 128 which do NOT mean what they would mean in ASCII?
In other words, is it possible to read in a section of a file, think
that it's ASCII, and then find that you decoded it wrongly?

> For both of those once you encounter a non-ASCII byte you can just
> switch over, and none of the previous text was mis-decoded.

Good to know, so these two won't be a problem.

I'm assuming here that there is a *single* default system encoding,
meaning that the automatic handler has only three cases to worry
about: UTF-16 (with BOM), UTF-8 (including pure ASCII), and the system
encoding.

> But
> that's only if you *know* the language was Japanese (respectively
> Chinese).  Remember, there is no encoding that can be distinguished
> from ISO 8859-1 (and several other Latin encodings) simply based on
> the bytes found, since it uses all 256 bytes.

Right, but as long as there's only one system encoding, that's not our
problem. If you're on a Greek system and you want to decode ISO-8859-9
text, you have to state that explicitly. For the situations where you
want heuristics based on byte distributions, there's always chardet.

>  > How likely is it that you'd get even one line of text that purports
>  > to be ASCII?
>
> Program source code where the higher-level functions (likely to
> contain literal strings) come late in the file are frequently
> misdetected based on the earlier bytes.

Yup; and the real question is whether anything would have been decoded
incorrectly. If you read in a bunch of ASCII-only text and yield it to
the app, and then come across something that proves that the file is
not UTF-8, then as far as I am aware, you won't have to un-yield any
of the previous text - it'll all have been correctly decoded.

In theory, UTF-16 without a BOM can consist entirely of byte values
below 128, and that's an absolute pain. But there's no solution to
that, other than demanding a BOM (or hoping that the first few
characters are all ASCII, so you can see "H\0e\0l\0l\0o\0", which I
wouldn't call reliable, although your odds probably aren't that bad in
real-world cases).
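As a rough illustration of that kind of guess (entirely hypothetical,
and only plausible for ASCII-heavy text):

    def looks_like_bomless_utf16(head):
        # Guess from the null-byte pattern of bytes already read, e.g.
        # b"H\x00e\x00l\x00l\x00o\x00" vs b"\x00H\x00e\x00l\x00l\x00o".
        if len(head) < 4 or b"\x00" not in head:
            return None
        half = len(head) // 2
        even_nulls = head[0::2].count(0)
        odd_nulls = head[1::2].count(0)
        if odd_nulls > 0.8 * half and even_nulls == 0:
            return "utf-16-le"   # ASCII in low bytes, NULs in high bytes
        if even_nulls > 0.8 * half and odd_nulls == 0:
            return "utf-16-be"
        return None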

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GZPXWOYPSAE733ZMTKFBK26C2LVCNOQU/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Stephen J. Turnbull
Chris Angelico writes:

 > Can anyone give an example of a current in-use system encoding that
 > would have [ASCII bytes in non-ASCII text]?

Shift JIS, Big5.  (Both can have bytes < 128 inside multibyte
characters.)  I don't know if Big5 is still in use as the default
encoding anywhere, but Shift JIS is, although it's decreasing.

For both of those once you encounter a non-ASCII byte you can just
switch over, and none of the previous text was mis-decoded.  But
that's only if you *know* the language was Japanese (respectively
Chinese).  Remember, there is no encoding that can be distinguished
from ISO 8859-1 (and several other Latin encodings) simply based on
the bytes found, since it uses all 256 bytes.

 > How likely is it that you'd get even one line of text that purports
 > to be ASCII?

Program source code where the higher-level functions (likely to
contain literal strings) come late in the file are frequently
misdetected based on the earlier bytes.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZB2LM3KYLQ34DHA276SPZA73BHJBRQMF/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Stephen J. Turnbull
Matt Wozniski writes:

 > Rather than introducing a new `open_utf8` function, I'd suggest the
 > following:
 > 
 > 1. Deprecate calling `open` for text mode (the default) unless an
 > `encoding=` is specified,

For that, we should have a sentinel for "system default encoding" (as
you acknowledge, but I want to foot-stomp it).  The current dance to
get that is quite annoying.
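(The dance in question, for reference -- the filename is illustrative:

    import locale

    enc = locale.getpreferredencoding(False)
    f = open("some_file.txt", encoding=enc)

With a named sentinel, that could be spelled directly in the open() call.)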

 > I think a __future__ import [of 'open_text' by some name] solves
 > the problem better than introducing a new function would.

Only if you redefine the problem.  If the problem is casual coders who
want a quick-and-dirty ready-to-bake function to read UTF-8 when their
default encodings are something else, then it's builtin or Just Don't
-- teach them to copy-paste "encoding='utf-8'" FTW.  I'm perfectly
happy with "Just Don't" followed by "It's Time to Work on UTF-8 by
Default".  You'll have to ask Naoki how he feels about that.

Your proposal (1. above) is an interesting one for that.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/QQKQ5HYTR2RLVGUPH44I3QVOZGOD7QEK/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Stephen J. Turnbull
Cameron Simpson writes:

 > I thought I'd seen [UTF-16 BOM] on Windows text files within the
 > last year or so (I don't use Windows often, so this is happenstance
 > from receiving some data, not an observation of the Windows
 > ecosystem; my recollection is that it was a UTF16 CSV file.)

OK; my experience is limited.

 > But BOMs may be commonplace. This isn't a text file example,

I don't care at all about BOMs in specialized protocols in this
thread.  This thread is about 'open'.

 > I do not consider the BOM dead, and it is so cheap to recognise
 > that not bothering to do so seems almost mean sprited.

Not if you view it from the point of view of cognitive burden on
casual coders.  See my reply to Guido.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GOYUG5WDUQDQTKUZN6V4EDFH6U23656R/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-24 Thread Stephen J. Turnbull
Guido van Rossum writes:

 > I have definitely seen BOMs written by Notepad on Windows 10.

I'm not clear on what circumstances we care if a UTF-8 file has or
doesn't have a UTF-8 signature.  Most software doesn't care, it just
reads it and spits it back out if it's there and hasn't been edited
out.

If people are seeing UTF-16 BOMs, that may be worth detecting,
depending on how often and how much trouble it is to deal with them.
I'm just saying that I never see them.  I was pretty careful about
saying that my sample is quite restricted.

However ...

 > Why can’t the future be that open() in text mode guesses the
 > encoding?

The medium-term future is UTF-8 in all UIs and public APIs, except for
archivists.  I think we all agree on that.

There are two issues with encoding guessing.  The statistically
unimportant one (at least for UTFs) is that guessing is guessing.  It
will get it wrong.  The people who want guessing are mostly people who
will be hurt most by wrong guesses.

Second, and a real issue for design AFAICS: if you introduce detection
of other encodings to 'open', the programmer may need to (1) discover
that encoding in order to match it on output (open does not return
that), or (2) choose the correct encoding on output, which may or may
not be the detected one depending on what the next software in the
pipeline expects.  At that point "in the face of ambiguity" really
does bind, "although practicality" notwithstanding.  I'm not sure that
putting detection into 'open' solves any problems, it just pushes them
into other parts of the code.

Remark: As I understand it, Naoki's proposal is about the casual coder
in a monolingual environment where either defaulting to
getpreferredencoding DTRTs or they need UTF-8 because some engineer
decided "UTF-8 is the future, and in my project the future is now!"
I don't think it's intended to be more general than that, but you'll
have to ask him about that.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZRUF34M5QWQKCDCMEMJOAIIONISCMZIJ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Matt Wozniski
On Sat, Jan 23, 2021 at 10:51 PM Chris Angelico  wrote:

> On Sun, Jan 24, 2021 at 2:46 PM Matt Wozniski  wrote:
> > 2. At the same time as the deprecation is announced, introduce a new
> __future__ import named "utf8_open" or something like that, to opt into the
> future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a
> file in text mode and no explicit encoding is specified.
> >
> > I think a __future__ import solves the problem better than introducing a
> new function would.
>
> Note that, since this doesn't involve any language or syntax changes,
> a regular module import would work here - something like "from
> utf8mode import open", which would then shadow the builtin. Otherwise
> no change to your proposal - everything else works exactly the same
> way.
>

True - that's an even better idea. That even allows it to be wrapped in a
try/except ImportError, allowing someone to write code that's backwards
compatible with versions before the new function is introduced. Though it
does mean that the new function will need to stick around, even though it
will eventually be identical to the builtin open() function.
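
A sketch of that pattern ('utf8_open' is the hypothetical name floated in
this thread; it doesn't exist in any released Python):

    try:
        from io import utf8_open  # hypothetical future function
    except ImportError:
        def utf8_open(file, mode="r", **kwargs):
            # Fallback for current Pythons: same call, explicit UTF-8.
            kwargs.setdefault("encoding", "utf-8")
            return open(file, mode, **kwargs)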

That would also allow the option of introducing a locale_open as well,
which would behave as though encoding=locale.getpreferredencoding(False) is
the default encoding for files opened in text mode. I can imagine putting
both functions in io, and allowing the user to silence the deprecation
warning by either opting into the new behavior:

from io import utf8_open as open

or explicitly declaring their desire for the legacy behavior:

from io import locale_open as open

~Matt
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ETJ6BADTVM5IICDLICGFIWQDMRDD34XS/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Chris Angelico
On Sun, Jan 24, 2021 at 2:46 PM Matt Wozniski  wrote:
> 2. At the same time as the deprecation is announced, introduce a new 
> __future__ import named "utf8_open" or something like that, to opt into the 
> future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a 
> file in text mode and no explicit encoding is specified.
>
> I think a __future__ import solves the problem better than introducing a new 
> function would.

Note that, since this doesn't involve any language or syntax changes,
a regular module import would work here - something like "from
utf8mode import open", which would then shadow the builtin. Otherwise
no change to your proposal - everything else works exactly the same
way.

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HFIMUG2JVQ2QULCWEHSXAEALSQOAY2TL/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Matt Wozniski
On Sat, Jan 23, 2021 at 9:22 PM Inada Naoki  wrote:

> On Sun, Jan 24, 2021 at 10:17 AM Guido van Rossum 
> wrote:
> >
> > I have definitely seen BOMs written by Notepad on Windows 10.
> >
> > Why can’t the future be that open() in text mode guesses the encoding?
>
> I don't like guessing. As a Japanese, I have seen many mojibake caused
> by the wrong guess.
> I don't think guessing the encoding is a good part of reliable software.
>

I agree that guessing encodings in general is a bad idea and is an avenue
for subtle localization issues - bad things will happen when it guesses
wrong, and it will lead to code that works properly on the developer's
machine and fails for end users. It makes sense for a text editor to try to
guess, because showing the user something is better than nothing (and if it
guesses wrong the user can easily see that, and perhaps take some manual
action to correct it). It does not make sense for a programming language to
guess, because the user cannot easily detect or correct an incorrect guess,
and mistakes will tend to be propagated rather than caught.

On the other hand, if we add `open_utf8()`, it's easy to ignore BOM:
>

Rather than introducing a new `open_utf8` function, I'd suggest the
following:

1. Deprecate calling `open` for text mode (the default) unless an
`encoding=` is specified, and 3 years after deprecation change the default
encoding for `open` to "utf-8-sig" for reading and "utf-8" for writing (to
ignore a BOM if one exists when reading, but to not create a BOM when
writing).
2. At the same time as the deprecation is announced, introduce a new
__future__ import named "utf8_open" or something like that, to opt into the
future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a
file in text mode and no explicit encoding is specified.

I think a __future__ import solves the problem better than introducing a
new function would. Users who already have a UTF-8 locale (the majority of
users on the majority of platforms) could simply turn on the new __future__
import in any files where they're calling open() with no change in
behavior, suppressing the deprecation warning. Users who have a non-UTF-8
locale and want to keep opening text files in that non-UTF-8 locale by
default can add encoding=locale.getpreferredencoding(False) to retain the
old behavior, suppressing the deprecation warning. And perhaps we could
make a shortcut for that, like encoding="locale".

~Matt
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/UACU527OLD6DLI5URTMALWVOSPEKKADA/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Steven D'Aprano
On Sun, Jan 24, 2021 at 01:32:28AM +, MRAB wrote:
> On 2021-01-24 01:14, Guido van Rossum wrote:
> >I have definitely seen BOMs written by Notepad on Windows 10.
> >
> >Why can’t the future be that open() in text mode guesses the encoding?
> >
> "In the face of ambiguity, refuse the temptation to guess."

"Although practicality beats purity."


The Zen is like scripture: there's a koan for any position you wish to 
take :-)

If you want to be pedantic, and I certainly do *wink*, providing any 
default for the encoding parameter is a guess. The encoding of all text 
files is ambiguous (the intended encoding is metadata which is not 
recorded in the file format). Most text files on Linux and Mac OS use 
UTF-8, and many on Windows too, but not *all*, so setting the default to 
UTF-8 is just a guess.

I understand that there are good heuristics for auto-detection of 
encodings which are reliable and used in a lot of other software. If 
auto-detection is a "guess", it's an *educated* guess and not much 
different from the status quo, which usually guesses correctly on Linux 
and Mac but often guesses wrongly on Windows. This proposal is to 
improve the quality of the guess by inspecting the file's contents.

For example, a file opened in text mode where every second character is 
a NULL is *almost certainly* UTF-16. The chances that somebody actually 
intended to write:

H\0e\0l\0l\0o\0 \0W\0o\0r\0l\0d\0

rather than "Hello World" is negligible.

Before we consider changing the default encoding to "auto-detect", I 
would like to see some estimate of how many UTF-8 encoded files will be 
misclassified as something else. That is, if we make this change, how 
much software that currently guesses UTF-8 correctly (the default 
encoding is the actual intended encoding) will break because it guesses 
something else? That surely won't happen with mostly-ASCII files, but I 
suppose it could happen with some non-English languages?

-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/U2T4JSKOUGSEXVVW3Y7LTXR7HQ5UJUKI/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Inada Naoki
On Sun, Jan 24, 2021 at 10:17 AM Guido van Rossum  wrote:
>
> I have definitely seen BOMs written by Notepad on Windows 10.
>
> Why can’t the future be that open() in text mode guesses the encoding?

I don't like guessing. As a Japanese, I have seen many mojibake caused
by the wrong guess.
I don't think guessing the encoding is a good part of reliable software.

On the other hand, if we add `open_utf8()`, it's easy to ignore BOM:

* When reading, use "utf-8-sig". (it can read UTF-8 without bom)
* When writing, use "utf-8".
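
A minimal sketch of that behaviour (using the hypothetical `open_utf8`
name; this is just an illustration, not part of PEP 597):

    def open_utf8(file, mode="r", **kwargs):
        # Text only: consume a BOM when reading, never write one.
        # (Mixed read/write modes like "r+" are glossed over here.)
        if "b" in mode:
            raise ValueError("open_utf8() only opens text files")
        encoding = "utf-8-sig" if "r" in mode else "utf-8"
        return open(file, mode, encoding=encoding, **kwargs)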

Although UTF-8 with a BOM is not recommended, and Notepad has used UTF-8
without a BOM as its default encoding since version 1903, UTF-8 with a
BOM is still used in some cases.
For example, Excel reads CSV files either as UTF-8 with a BOM or as a
legacy encoding. So some CSV files are written with a BOM.

Regards,
-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/BJC6LCYNO2HHRLHF4TFHWTG53M4YL6LL/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread MRAB

On 2021-01-24 01:14, Guido van Rossum wrote:

I have definitely seen BOMs written by Notepad on Windows 10.

Why can’t the future be that open() in text mode guesses the encoding?


"In the face of ambiguity, refuse the temptation to guess."
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KVLLWSHHVZPLC3OLPAIT7BOXJJK2VPNU/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Guido van Rossum
I have definitely seen BOMs written by Notepad on Windows 10.

Why can’t the future be that open() in text mode guesses the encoding?
-- 
--Guido (mobile)
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/FCIMN3PSTAZT4ST3FH3QALGBH5H5IA6P/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Steven D'Aprano
On Sat, Jan 23, 2021 at 11:59:12PM +1100, Chris Angelico wrote:

> So Windows is being a pain in the behind, once again, because it
> doesn't move forward. 

*cough*

That would be called "backwards compatibility" :-)

Microsoft's attitude towards backwards compatibility is probably even 
stricter than ours.


> File names on Mac OS and most Linux systems will
> be in UTF-8, regardless of your chosen language. Why stick to other
> encodings as the default?

Aren't we talking about the file *contents*, not the file names?

The file name depends on the file system, not the OS. On Mac OS, the 
file system used until High Sierra was HFS+, where file names are 
UTF-16. I expect that there will still be many Mac systems with HFS+ 
file systems.

After High Sierra, the default file system shifted to APFS which does 
use UTF-8.

Linux file systems such as ext4 are bytes. Any UTF-8 support is enforced 
by the desktop manager or shell, not the file system, and so can be 
subverted, either deliberately or accidentally (mojibake).


-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/F3IH5PQJ7F4WQZCIODK3QSKBX6V3RWVK/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Random832
On Sat, Jan 23, 2021, at 08:00, Stephen J. Turnbull wrote:
> I see very little use in detecting the BOMs.  I haven't seen a UTF-16
> BOM in the wild in a decade (as usual for me, that's Japan-specific,
> and may be limited to the academic community as well), and the UTF-8
> BOM is a no-op if the default is UTF-8 anyway.

It's not *entirely* a no-op: you'd want the decoder to consume the leading BOM 
rather than returning '\ufeff' on the first read. And AIUI they're much more 
common on Windows (being able to detect UTF-16 *without* BOMs might be useful 
as well, but has historically been a source of problems on Windows) - until 
recently all UTF-8 or UTF-16 files saved with Notepad would have them.
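
Concretely, the difference the codec choice makes on a BOM-prefixed file:

    data = b"\xef\xbb\xbfhello"   # UTF-8 BOM followed by "hello"
    data.decode("utf-8")          # '\ufeffhello' -- the BOM leaks through
    data.decode("utf-8-sig")      # 'hello'       -- the BOM is consumed
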
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GNV2JJVRUI5QGXRAA6VTZYNPCD7OGVNA/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Random832
On Sat, Jan 23, 2021, at 05:06, Inada Naoki wrote:
> On Sat, Jan 23, 2021 at 2:43 PM Random832  wrote:
> >
> > On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
> > > * Default encoding is "utf-8".
> >
> > it might be worthwhile to be a little more sophisticated than this.
> >
> > Notepad itself uses character set detection [it might not be reasonable to 
> > do this on the whole file as notepad does, but maybe the first 512 bytes, 
> > or the result of read1(512)?] when opening a file of unknown encoding, and 
> > msvcrt's "ccs=UTF-8" option to fopen will at least detect at the presence 
> > of UTF-8 and UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
> 
> I meant Notepad (and VS code) use UTF-8 without BOM when creating new text 
> file.
> Students learning Python can not read it with `open()`.

Right, I was simply suggesting it might be worthwhile to target "be able to 
open all files that notepad can open" as the goal rather than simply defaulting 
to UTF8-no-BOM only, which requires a little more sophistication than just a 
default encoding.
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/VJ67ZCY7HG6JTWM4K2JDZDQAJIXEMF4T/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Cameron Simpson
On 23Jan2021 22:00, Stephen J. Turnbull  
wrote:
>I see very little use in detecting the BOMs.  I haven't seen a UTF-16
>BOM in the wild in a decade (as usual for me, that's Japan-specific,
>and may be limited to the academic community as well), and the UTF-8
>BOM is a no-op if the default is UTF-8 anyway.

I thought I'd seen them on Windows text files within the last year or so 
(I don't use Windows often, so this is happenstance from receiving some 
data, not an observation of the Windows ecosystem; my recollection is 
that it was a UTF16 CSV file.)

But BOMs may be commonplace. This isn't a text file example, but the 
ISO 14496 standard (the basis for all MOV and MP4 files) has a text field 
type which may be UTF-16LE, UTF-16BE or UTF-8, detected by a BOM of the 
right flavour for UTF-16, with no BOM implying UTF-8. I'm sure this is to 
accommodate easy writing by various systems.

I do not consider the BOM dead, and it is so cheap to recognise that not 
bothering to do so seems almost mean spirited.

Cheers,
Cameron Simpson 
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KWBRCLYQHZK5ETJOT6KFRN7MJMGXX5H6/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread MRAB

On 2021-01-23 10:11, Chris Angelico wrote:
[snip]


Okay. If the goal is to make UTF-8 the default, may I request that PEP
597 say so, please? With a heading of "deprecation", it's not really
clear what its actual goal is.

From the sound of things - and it's still possible I'm misreading PEP
597, my apologies if so - this open_text function wouldn't really
solve anything much, and the original goal of "change the default
encoding to UTF-8" is better served by 597.

I use Windows and I switched to UTF-8 years ago. However, the standard 
on Windows is 'utf-8-sig', so I'd probably prefer it if the default when 
_reading_ was 'utf-8-sig'. (I'm not bothered about writing; I can still 
be explicit if I want 'utf-8-sig' for Windows-specific UTF-8 files.)

___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/SVDIUALZVHPQLBZPFRETXFKN2GIJNQCD/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Chris Angelico
On Sun, Jan 24, 2021 at 2:31 AM Barry Scott  wrote:
> I think that you are going to create a bug magnet if you attempt to auto
> detect the encoding.
>
> First problem I see is that the file may be a pipe and then you will block
> until you have enough data to do the auto detect.
>
> Second problem is that the first N bytes are all in ASCII and only later
> do you see Windows code page signature (odd lack of utf-8 signature).

Both can be handled, just as universal newlines can, by remaining in
an "uncertain" state.

When the file is first opened, we know nothing about its encoding.
Once you request that anything be read (eg by pumping the iterator or
anything), it reads, as per current status. Then:

1) If it looks like UTF-16, assume UTF-16. Rather than falling for the
"Bush hid the facts" issue, this might be restricted to files that
start with a BOM.

2) If it's entirely ASCII, decode it as ASCII and stay uncertain.

3) If it can be decoded UTF-8, remember that this is a UTF-8 file, and
from there on, error out if anything isn't UTF-8.

4) Otherwise, use the system encoding.

On subsequent reads, if we're in ASCII mode, repeat steps 2-4. Until
it finds a non-ASCII byte value, it doesn't really matter how it
decodes it.
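
A rough sketch of those steps as a one-shot sniffing helper (illustration
only; a real version would work incrementally, as described above):

    import codecs
    import locale

    def sniff(sample: bytes):
        # Returns (encoding, certain) for an initial chunk of a file.
        if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16", True                 # 1) UTF-16 BOM
        if all(b < 0x80 for b in sample):
            return "ascii", False                 # 2) pure ASCII: stay uncertain
        try:
            sample.decode("utf-8")
            return "utf-8", True                  # 3) valid UTF-8: lock it in
        except UnicodeDecodeError:
            # 4) fall back to the system encoding
            return locale.getpreferredencoding(False), True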

Unlike chardet, this can be done completely dependably. I'm not sure
what would happen if the system encoding isn't an eight-bit
ASCII-compatible one, though. The algorithm might produce some odd
results if the file looks like ASCII, but then switches to some
incompatible encoding. Can anyone give an example of a current in-use
system encoding that would have this issue? How likely is it that
you'd get even one line of text that purports to be ASCII?

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZEXFMCCD5L647HSAMB3U6W6CDQKVN5JA/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Barry Scott



> On 23 Jan 2021, at 11:00, Steven D'Aprano  wrote:
> 
> On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
>> On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
>>> * Default encoding is "utf-8".
>> 
>> it might be worthwhile to be a little more sophisticated than this.
>> 
>> Notepad itself uses character set detection [it might not be 
>> reasonable to do this on the whole file as notepad does, but maybe the 
>> first 512 bytes, or the result of read1(512)?] when opening a file of 
>> unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at 
>> least detect at the presence of UTF-8 and UTF-16 BOMs [and treat the 
>> file as UTF-16 in the latter case].
> 
> 
> I like Random's idea. If we add a new "open text file" builtin function, 
> we should seriously consider having it attempt to auto-detect the 
> encoding. It need not be as sophisticated as `chardet`.

I think that you are going to create a bug magnet if you attempt to
auto-detect the encoding.

The first problem I see is that the file may be a pipe, and then you will
block until you have enough data to do the auto-detection.

The second problem is that the first N bytes may all be ASCII, and only
later do you see a Windows code page signature (odd lack of a UTF-8
signature).

> That auto-detection behaviour could be enough to differentiate it from 
> the regular open(), thus solving the "but in ten years time it will be 
> redundant and will need to be deprecated" objection.
> 
> Having said that, I can't say I'm very keen on the name "open_text", but 
> I can't think of any other bikeshed colour I prefer.

Given that the function's purpose is to open Unicode text, use a name that
reflects that it is the encoding that is being set, not the mode (binary
vs. text).

open_unicode maybe?

If you are teaching open_text, then do you also need to have open_binary?

Barry

> 
> 
> -- 
> Steve
> ___
> Python-ideas mailing list -- python-ideas@python.org
> To unsubscribe send an email to python-ideas-le...@python.org
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at 
> https://mail.python.org/archives/list/python-ideas@python.org/message/VAWFPIAA4WIVLIF4LFJ4OATJK6JDJS2N/
> Code of Conduct: http://python.org/psf/codeofconduct/
> 
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/4LHLZ5QIBOCLIZUVYQ2UXAU6MEX6VMJH/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Stephen J. Turnbull
Steven D'Aprano writes:
 > On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
 > > On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
 > > > * Default encoding is "utf-8".
 > > 
 > > it might be worthwhile to be a little more sophisticated than this.
 > > 
 > > Notepad itself uses character set detection [it might not be 
 > > reasonable to do this on the whole file as notepad does, but maybe the 
 > > first 512 bytes, or the result of read1(512)?] when opening a file of 
 > > unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at 
 > > least detect at the presence of UTF-8 and UTF-16 BOMs [and treat the 
 > > file as UTF-16 in the latter case].
 > 
 > 
 > I like Random's idea. If we add a new "open text file" builtin
 > function, we should seriously consider having it attempt to
 > auto-detect the encoding. It need not be as sophisticated as
 > `chardet`.

It definitely should not be as sophisticated as chardet.  Detection of
ISO 8859, ISO 2022, and EUC family encodings is reliable as long as
you know that only one of each family is going to be used.  But you
cannot easily tell which of the many ISO 8859 (also Windows-12xx)
family members is present, and similarly for the other families.

I see very little use in detecting the BOMs.  I haven't seen a UTF-16
BOM in the wild in a decade (as usual for me, that's Japan-specific,
and may be limited to the academic community as well), and the UTF-8
BOM is a no-op if the default is UTF-8 anyway.

I'm definitely leaning to the suggestion I made elsewhere (if it's
adopted at all): force UTF-8, and name it 'open_utf8'.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/LPUM3JPQD3RJCYFZ42GWTISCAHKF462C/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Chris Angelico
On Sat, Jan 23, 2021 at 11:34 PM Stephen J. Turnbull
 wrote:
>  > I'd rather focus on just moving to UTF-8 as the default, rather
>  > than bringing in a new function - especially with such a confusing
>  > name.
>
> I expect there are several bodies of users who will experience that as
> quite obnoxious for a long time to come.  I *still* see a ton of stuff
> that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China
> gb18030 isn't just a good idea, it's the law.  (OK, the precise
> statement of the law is "must support", not "must use", but my Chinese
> students all default to GB.)

But "UTF-8 as the default if you don't specify an encoding" doesn't
stop you from using all those other encodings. The only change is
that, if you don't specify an encoding, you get a cross-platform
consistent default that can be easily described, rather than one which
depends on system settings.

> The problem is that these users use some software that will create
> text in a national language encoding by default and other that use
> UTF-8 by default.  So I guess Naoki's hope is that "when I'm
> processing Microsoft/Oracle-generated data, I use 'open_text', when
 > it's local software I use 'open'" becomes an easy and natural response
> in such environments.

Exactly, so no single default will work.

Is there an easy way to say open("filename", encoding="use my system
default") ? Currently encoding=None does that, and maybe that can be
retained (just with the default becoming "utf-8"), or maybe some other
keyword can be used. But that should cover the situations where you
specifically *want* a platform-dependent selection.

>  > What exactly are the blockers on making open(fn) use UTF-8 by
>  > default?
>
> Backward incompatibility with just about every script in existence?

Or for a large number of them, sudden cross-platform compatibility
that they didn't previously have. This is *fixing a bug* for many
scripts.

>  > Can the proposals be written with that as the ultimate goal (even if
>  > it's going to take X versions and multiple deprecation phases), rather
>  > than aiming for a messy goal where people aren't sure which function
>  > to use?
>
> The problem is that on Windows there are a lot of installations that
> continue to use non-UTF-8 encodings enough that users set their
> preferred encoding that way.  I guess that folks where the majority of
> their native-language alphabet is drawn from ASCII are by now almost
> all using UTF-8 by default, but this is not so for East Asians (who
> almost all still use a mixture of several encodings every day because
> email still often defaults to national standard encodings).  I can't
> speak to Cyrillic, Hebrew, Arabic, Indic languages, but I wouldn't be
> surprised if they're somewhere in the middle.

So Windows is being a pain in the behind, once again, because it
doesn't move forward. File names on Mac OS and most Linux systems will
be in UTF-8, regardless of your chosen language. Why stick to other
encodings as the default?

(I repeat: I am NOT advocating abolishing support for all other
encodings. The ONLY thing I want to see is that UTF-8 becomes the
default.)

> Naoki can document that "open(..., encoding='...')" is strongly
> preferred to 'open_text'.  Maybe a better name is "open_utf8", to
> discourage people who want to use non-default encodings, or
> programmatically chosen encodings, in that function.

TBH I don't think a separate built-in is of value here, but perhaps
it'd be beneficial as an alternative to the wall-of-text help info
that open() has. But I do rather like Random's and Steve's suggestion
that the alternate function be specifically documented as magic. It'd
actually tie in very nicely with a change of default: open() does what
it's explicitly told, and has cross-platform defaults, but
open_sesame() probes the file to try to guess at its encoding,
attempting to use a platform-specific eight bit encoding if
applicable. It'd "just work" for reading most text files, regardless
of their source, as long as they came from this current computer. (All
bets are off anyway if they came from some other system and are in an
eight-bit encoding.)

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/PKUN6TDU6R3CDX2LCI34DF5CCLGHMVIX/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Stephen J. Turnbull
Chris Angelico writes:
 > On Sat, Jan 23, 2021 at 12:37 PM Inada Naoki  wrote:

 > > ## 1. Add `io.open_text()`, builtin `open_text()`, and
 > > `pathlib.Path.open_text()`.
 > >
 > > All functions are same to `io.open()` or `Path.open()`, except:
 > >
 > > * Default encoding is "utf-8".

I wonder if it might not be better to remove the encoding parameter
for this version.  Further comments below.

 > > * "b" is not allowed in the mode option.
 > 
 > I *really* don't like this, because it implies that open() will open
 > in binary mode.

I doubt that will be a common misunderstanding, as long as 'open_text'
is documented as a convenience wrapper for 'open' aimed primarily at
Windows programmers.

 > > How do you think about this idea? Is this worth enough to add a new
 > > built-in function?
 > 
 > Highly dubious.

I won't go so far as "highly", but yeah, dubious to me.  In my own
environment, while I still see Shift JIS data quite a bit, the rule is
that this or that correspondent sends it to me.  While a lot of the
University infrastructure used to default to Shift JIS, it now
defaults to UTF-8.  So I don't have a consistent rule by "kind of
data", ie, which scripts use 'open_text' and which 'open'.  If the
script processes data from "JIS users", it needs to accept a
command-line flag because other users *will* be sending that kind of
data in UTF-8.  Naoki's mileage may vary.

See below for additional comments.

 > I'd rather focus on just moving to UTF-8 as the default, rather
 > than bringing in a new function - especially with such a confusing
 > name.

I expect there are several bodies of users who will experience that as
quite obnoxious for a long time to come.  I *still* see a ton of stuff
that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China
gb18030 isn't just a good idea, it's the law.  (OK, the precise
statement of the law is "must support", not "must use", but my Chinese
students all default to GB.)

The problem is that these users use some software that will create
text in a national language encoding by default and other that use
UTF-8 by default.  So I guess Naoki's hope is that "when I'm
processing Microsoft/Oracle-generated data, I use 'open_text', when
it's local software I use 'open'" becomes an easy and natural response
in such environments.

We don't see very many Asian language users on the python-* lists.  We
see a few more Russian users, I suspect quite a few Hebrew and Indic
users, maybe a few Arabic users.  So we should listen very carefully
to the few we do have, since they come from tiny minorities of python-*
subscribers.

 > What exactly are the blockers on making open(fn) use UTF-8 by
 > default?

Backward incompatibility with just about every script in existence?

 > Can the proposals be written with that as the ultimate goal (even if
 > it's going to take X versions and multiple deprecation phases), rather
 > than aiming for a messy goal where people aren't sure which function
 > to use?

The problem is that on Windows there are a lot of installations that
continue to use non-UTF-8 encodings enough that users set their
preferred encoding that way.  I guess that folks where the majority of
their native-language alphabet is drawn from ASCII are by now almost
all using UTF-8 by default, but this is not so for East Asians (who
almost all still use a mixture of several encodings every day because
email still often defaults to national standard encodings).  I can't
speak to Cyrillic, Hebrew, Arabic, Indic languages, but I wouldn't be
surprised if they're somewhere in the middle.

Naoki can document that "open(..., encoding='...')" is strongly
preferred to 'open_text'.  Maybe a better name is "open_utf8", to
discourage people who want to use non-default encodings, or
programmatically chosen encodings, in that function.

As someone who avoids Windows like the plague, I have no real sense of
how important this is, and I like your argument from first
principles.  So on net, I guess I'm +/- 0 only because Naoki thinks it
important enough to spend quite a bit of skull sweat and effort on
this.

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/E2X4QYTOW47BVYVRWACOIBQA3H5BVZMQ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Steven D'Aprano
On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
> On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
> > * Default encoding is "utf-8".
> 
> it might be worthwhile to be a little more sophisticated than this.
> 
> Notepad itself uses character set detection [it might not be 
> reasonable to do this on the whole file as notepad does, but maybe the 
> first 512 bytes, or the result of read1(512)?] when opening a file of 
> unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at 
> least detect at the presence of UTF-8 and UTF-16 BOMs [and treat the 
> file as UTF-16 in the latter case].


I like Random's idea. If we add a new "open text file" builtin function, 
we should seriously consider having it attempt to auto-detect the 
encoding. It need not be as sophisticated as `chardet`.

That auto-detection behaviour could be enough to differentiate it from 
the regular open(), thus solving the "but in ten years time it will be 
redundant and will need to be deprecated" objection.

Having said that, I can't say I'm very keen on the name "open_text", but 
I can't think of any other bikeshed colour I prefer.


-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/VAWFPIAA4WIVLIF4LFJ4OATJK6JDJS2N/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Steven D'Aprano
On Sat, Jan 23, 2021 at 01:31:28PM +0300, Paul Sokolovsky wrote:

> > * Teachers can teach to use `open_text` to open text files. Students
> > can use "utf-8" by default without knowing about what encoding is.
> 
> Let's also add max_int(), min_int(), max_float(), min_float() builtins.
> Teachers can teach that if you need to min ints, then to use min_int(),
> if you need to min floats, then to use min_float(), and otherwise, use
> min(). Bonus point: max_int(), min_int(), max_float(), min_float() are
> all easier to annotate.

Why would we need to do that? The proposed `open_text()` builtin solves 
an actual problem with opening files on one platform. Is there an 
equivalent issue with some platform where min() and max() misbehave by 
default with ints and floats?

If not, then your analogy is invalid.

If so, please raise a bug on the tracker.

Adding this proposed `open_text` function does not require us to add 
multiple redundant functions that solve no problems.


> > So `open_text()` can provide better developer experience, without
> > waiting 10 years.
> 
> Except that in 10 years, when the default encoding is finally changed,
> open_text() is a useless function, which now needs to be deprecated and
> all the fun process repeated again.

It won't be useless. It will still work as well as it ever did, so it 
will still be useful. It might be redundant, in which case we could 
deprecate it in the documentation and take no further action until 
Python 5000.


-- 
Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZM66MNQT32WFABXM6CVEMCTBXDVB5GA4/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Inada Naoki
On Sat, Jan 23, 2021 at 7:31 PM Paul Sokolovsky  wrote:
> >
> > * Replacing open with open_text is easier than adding `,
> > encoding="utf-8"`.
>
> How is it easier, if "open_text" exists only in imagination, while
> encoding="utf-8" has been there all this time?
>

Note that the warning will not be enabled by default anytime soon.
If we decide to change the default encoding and enable the
EncodingWarning by default in Python 3.15, users can use `open_text()`
for 3.10~3.15.
That will be enough backward compatibility for most users.

>
> > * Teachers can teach to use `open_text` to open text files. Students
> > can use "utf-8" by default without knowing about what encoding is.
>
> Let's also add max_int(), min_int(), max_float(), min_float() builtins.

That is off-topic. Please don't compare apples and oranges.

>
> > So `open_text()` can provide better developer experience, without
> > waiting 10 years.
>
> Except that in 10 years, when the default encoding is finally changed,
> open_text() is a useless function, which now needs to be deprecated and
> all the fun process repeated again.

Yes, if we can change the default encoding in 2030, having two open
functions will become messy.
But there is no promise of that change. Without mitigating the pain,
we will never be able to change the default encoding.

Anyway, thank you for your feedback.
Two people prefer `encoding="utf-8"` to `open_text()`.

I will still wait for feedback from more people before updating PEP 597.

Regards,
-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6UKLKB6JRAJZOCSYPTZTS6XA6VJPQYR3/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Inada Naoki
On Sat, Jan 23, 2021 at 7:13 PM Chris Angelico  wrote:
>
> > On the other hand, if we add `open_text()`:
> >
> > * Replacing open with open_text is easier than adding `, encoding="utf-8"`.
> > * Teachers can teach to use `open_text` to open text files. Students
> > can use "utf-8" by default without knowing about what encoding is.
> >
> > So `open_text()` can provide better developer experience, without
> > waiting 10 years.
>
> But this has a far worse end goal - two open functions with subtly
> incompatible defaults, and a big question of "why should I choose this
> over that". And if you start using open_text, suddenly your code won't
> work on older Pythons.
>

Yes, there are cons too.
That's why I posted this thread before including the idea in the PEP.
Thank you for your feedback.


> >
> > Ultimate goal is make the "utf-8" default. But I don't know when we
> > can change it.
> > So I focus on what we can do in near future (< 5 years, I hope).
> >
>
> Okay. If the goal is to make UTF-8 the default, may I request that PEP
> 597 say so, please? With a heading of "deprecation", it's not really
> clear what its actual goal is.

No. I avoided that intentionally.  I am making the PEP useful even if we
cannot change the default encoding.
The PEP can be discussed without deciding whether we can change the
default encoding or not.

Please read the first motivation section in the PEP.
https://www.python.org/dev/peps/pep-0597/#using-the-default-encoding-is-a-common-mistake

Regards,
-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6OIKAWIQ6OPVDJ5ZUJECZPAY4FDUOZVD/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Paul Sokolovsky
Hello,

On Sat, 23 Jan 2021 19:04:08 +0900
Inada Naoki  wrote:

> On Sat, Jan 23, 2021 at 10:47 AM Chris Angelico 
> wrote:
> >
> >
> > Highly dubious. I'd rather focus on just moving to UTF-8 as the
> > default, rather than bringing in a new function - especially with
> > such a confusing name.
> >
> > What exactly are the blockers on making open(fn) use UTF-8 by
> > default?  
> 
> Backward compatibility. That's what PEP 597 tries to solve.
> 
> 1. Add optional warning for `open()` call without specifying
> `encoding` option. (PEP 597)
> 2. (Several years later) Make the warning default.
> 3. (Several years later) Change the default encoding.
> 
> When (2) happens, users are forced to write `encoding="utf-8"` to
> suppress the warning.
> 
> But note that the default encoding is "utf-8" already in (most) Linux
> including WSL, macOS, iOS, and Android.
> And Windows user can read ASCII text files without specifying
> `encoding` regardless default encoding is legacy codec or "utf-8".
> So adding `, encoding="utf-8"` everywhere `open()` is used might be
> tedious job.
> 
> On the other hand, if we add `open_text()`:
> 
> * Replacing open with open_text is easier than adding `,
> encoding="utf-8"`.

How is it easier, if "open_text" exists only in imagination, while
encoding="utf-8" has been there all this time?

The only easier thing than adding 'encoding="utf-8"' would be:

1. Just go ahead and switch the default encoding to utf-8 right away.
2. For backward compatibility, add "python3 --backward-compatibility"
switch. Perhaps even tell users to use it straight in
the UnicodeDecodeError backtrace.

> * Teachers can teach to use `open_text` to open text files. Students
> can use "utf-8" by default without knowing about what encoding is.

Let's also add max_int(), min_int(), max_float(), min_float() builtins.
Teachers can teach that if you need to min ints, then to use min_int(),
if you need to min floats, then to use min_float(), and otherwise, use
min(). Bonus point: max_int(), min_int(), max_float(), min_float() are
all easier to annotate.

> So `open_text()` can provide better developer experience, without
> waiting 10 years.

Except that in 10 years, when the default encoding is finally changed,
open_text() is a useless function, which now needs to be deprecated and
all the fun process repeated again.


[]

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/23HBISVYGAJ5G25ZPXDNLD4YZX2XXZAQ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Chris Angelico
On Sat, Jan 23, 2021 at 9:04 PM Inada Naoki  wrote:
>
> On Sat, Jan 23, 2021 at 10:47 AM Chris Angelico  wrote:
> >
> >
> > Highly dubious. I'd rather focus on just moving to UTF-8 as the
> > default, rather than bringing in a new function - especially with such
> > a confusing name.
> >
> > What exactly are the blockers on making open(fn) use UTF-8 by default?
>
> Backward compatibility. That's what PEP 597 tries to solve.
>
> 1. Add optional warning for `open()` call without specifying
> `encoding` option. (PEP 597)
> 2. (Several years later) Make the warning default.
> 3. (Several years later) Change the default encoding.
>
> When (2) happens, users are forced to write `encoding="utf-8"` to
> suppress the warning.
>
> But note that the default encoding is "utf-8" already in (most) Linux
> including WSL, macOS, iOS, and Android.
> And Windows user can read ASCII text files without specifying
> `encoding` regardless default encoding is legacy codec or "utf-8".
> So adding `, encoding="utf-8"` everywhere `open()` is used might be tedious 
> job.

Okay, but this (a) has a good end goal, and (b) is only
backward-incompatible with its default - adding the encoding parameter
makes your code compatible with all versions of Python.

> On the other hand, if we add `open_text()`:
>
> * Replacing open with open_text is easier than adding `, encoding="utf-8"`.
> * Teachers can teach to use `open_text` to open text files. Students
> can use "utf-8" by default without knowing about what encoding is.
>
> So `open_text()` can provide better developer experience, without
> waiting 10 years.

But this has a far worse end goal - two open functions with subtly
incompatible defaults, and a big question of "why should I choose this
over that". And if you start using open_text, suddenly your code won't
work on older Pythons.

> > Can the proposals be written with that as the ultimate goal (even if
> > it's going to take X versions and multiple deprecation phases), rather
> > than aiming for a messy goal where people aren't sure which function
> > to use?
> >
>
> Ultimate goal is make the "utf-8" default. But I don't know when we
> can change it.
> So I focus on what we can do in near future (< 5 years, I hope).
>

Okay. If the goal is to make UTF-8 the default, may I request that PEP
597 say so, please? With a heading of "deprecation", it's not really
clear what its actual goal is.

From the sound of things - and it's still possible I'm misreading PEP
597, my apologies if so - this open_text function wouldn't really
solve anything much, and the original goal of "change the default
encoding to UTF-8" is better served by 597.

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/U6BL5RWB4OPDZNM3NEFO3UPPZEIVYKYZ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Inada Naoki
On Sat, Jan 23, 2021 at 2:43 PM Random832  wrote:
>
> On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
> > * Default encoding is "utf-8".
>
> it might be worthwhile to be a little more sophisticated than this.
>
> Notepad itself uses character set detection [it might not be reasonable to do 
> this on the whole file as notepad does, but maybe the first 512 bytes, or the 
> result of read1(512)?] when opening a file of unknown encoding, and msvcrt's 
> "ccs=UTF-8" option to fopen will at least detect at the presence of UTF-8 and 
> UTF-16 BOMs [and treat the file as UTF-16 in the latter case].

I meant that Notepad (and VS Code) use UTF-8 without a BOM when creating a new text file.
Students learning Python cannot read it with `open()`.

-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/5WYWXLCHL6MORJDU4V7JFRI2XD7E3G5Z/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-23 Thread Inada Naoki
On Sat, Jan 23, 2021 at 10:47 AM Chris Angelico  wrote:
>
>
> Highly dubious. I'd rather focus on just moving to UTF-8 as the
> default, rather than bringing in a new function - especially with such
> a confusing name.
>
> What exactly are the blockers on making open(fn) use UTF-8 by default?

Backward compatibility. That's what PEP 597 tries to solve.

1. Add an optional warning for `open()` calls that do not specify the
`encoding` option. (PEP 597)
2. (Several years later) Enable the warning by default.
3. (Several years later) Change the default encoding.

When (2) happens, users are forced to write `encoding="utf-8"` to
suppress the warning.
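
In code, the change the warning asks for is just this (same call, with an
explicit encoding):

    # Relies on the locale-dependent default; this is what would warn:
    with open("data.txt") as f:
        text = f.read()

    # Explicit and portable; this is what suppresses the warning:
    with open("data.txt", encoding="utf-8") as f:
        text = f.read()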

But note that the default encoding is already "utf-8" on (most) Linux
including WSL, macOS, iOS, and Android.
And Windows users can read ASCII text files without specifying
`encoding`, regardless of whether the default encoding is a legacy codec
or "utf-8".
So adding `, encoding="utf-8"` everywhere `open()` is used might be a
tedious job.

On the other hand, if we add `open_text()`:

* Replacing open with open_text is easier than adding `, encoding="utf-8"`.
* Teachers can teach students to use `open_text` to open text files. Students
can use "utf-8" by default without knowing what an encoding is.

So `open_text()` can provide a better developer experience, without
waiting 10 years.

> Can the proposals be written with that as the ultimate goal (even if
> it's going to take X versions and multiple deprecation phases), rather
> than aiming for a messy goal where people aren't sure which function
> to use?
>

The ultimate goal is to make "utf-8" the default. But I don't know when
we can change it.
So I am focusing on what we can do in the near future (< 5 years, I hope).

Regards,

-- 
Inada Naoki  
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KGQFKMX2GBDIYITJCM6MHAS5ZGUA6YDL/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-22 Thread Random832
On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
> * Default encoding is "utf-8".

it might be worthwhile to be a little more sophisticated than this.

Notepad itself uses character set detection [it might not be reasonable to do 
this on the whole file as Notepad does, but maybe the first 512 bytes, or the 
result of read1(512)?] when opening a file of unknown encoding, and msvcrt's 
"ccs=UTF-8" option to fopen will at least detect the presence of UTF-8 and 
UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
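
A minimal sketch of that BOM check (an illustration of the idea, not what
msvcrt or Notepad actually do internally; "unknown.txt" is a placeholder):

    import codecs

    def encoding_from_bom(prefix: bytes):
        # Inspect only the first few bytes; None means "no BOM found".
        if prefix.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        if prefix.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        return None

    with open("unknown.txt", "rb") as f:
        detected = encoding_from_bom(f.read1(512))
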
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/7TUNPIXTWSWKTFD2LE4UBV5SOOEUBGMY/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

2021-01-22 Thread Chris Angelico
On Sat, Jan 23, 2021 at 12:37 PM Inada Naoki  wrote:
> ## 1. Add `io.open_text()`, builtin `open_text()`, and
> `pathlib.Path.open_text()`.
>
> All functions are same to `io.open()` or `Path.open()`, except:
>
> * Default encoding is "utf-8".
> * "b" is not allowed in the mode option.

I *really* don't like this, because it implies that open() will open
in binary mode.

> How do you think about this idea? Is this worth enough to add a new
> built-in function?

Highly dubious. I'd rather focus on just moving to UTF-8 as the
default, rather than bringing in a new function - especially with such
a confusing name.

What exactly are the blockers on making open(fn) use UTF-8 by default?
Can the proposals be written with that as the ultimate goal (even if
it's going to take X versions and multiple deprecation phases), rather
than aiming for a messy goal where people aren't sure which function
to use?

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/46RCX23FGYZY7YN4EOUL5QXYTQO6OO2H/
Code of Conduct: http://python.org/psf/codeofconduct/