[Python-Dev] Announcing importlib_resources 0.1

2017-12-07 Thread Barry Warsaw
Brett and I have been working on a little skunkworks project for a few weeks, 
and it’s now time to announce the first release.  We’re calling it 
importlib_resources and its intent is to replace the “Basic Resource Access” 
APIs of pkg_resources with more efficient implementations based directly on 
importlib.

importlib_resources 0.1 provides support for Python 2.7, and 3.4-3.7.  It 
defines an ABC that loaders can implement to provide direct access to resources 
inside packages.  importlib_resources has fallbacks for file system and zip 
file loaders, so it should work out of the box in most of the places that 
pkg_resources is currently used.  We even have a migration guide for folks who 
want to drop pkg_resources altogether and adopt importlib_resources.  
importlib_resources explicitly does not support pkg_resources features like 
entry points, working sets, etc.  Still, we think the APIs provided will be 
good enough for most current use cases.

http://importlib-resources.readthedocs.io/

We are calling it “importlib_resources” because we intend to port this into 
Python 3.7 under a new importlib.resources subpackage, so starting with Python 
3.7, you will get this for free.  The API is going to officially be 
provisional, but I’ve already done an experimental port of at least one big 
application (I’ll let you guess which one :) and it’s fairly straightforward, 
if not completely mechanical unfortunately.  Take a look at the migration guide 
for details:

http://importlib-resources.readthedocs.io/en/latest/migration.html

We also intend to include the ABC in Python 3.7:

http://importlib-resources.readthedocs.io/en/latest/abc.html

You can of course `pip install importlib_resources`.
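
A minimal sketch of what resource access looks like with this style of API. It uses `read_text(package, resource)`, the name that later shipped in Python 3.7's `importlib.resources`; the 0.1 backport may spell its functions differently, so the fallback import is an assumption. The package `demo_pkg` and its `greeting.txt` file are hypothetical and are built in a temporary directory only to keep the snippet self-contained.

```python
import importlib
import sys
import tempfile
from pathlib import Path

try:
    # Python 3.7+ ships the API in the standard library.
    from importlib.resources import read_text
except ImportError:
    # Assumption: the backport exposes the same function name.
    from importlib_resources import read_text


def demo():
    """Build a throwaway package and read a text resource from it."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "greeting.txt").write_text("hello\n", encoding="utf-8")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            # The resource is addressed by package name, not by file path.
            return read_text("demo_pkg", "greeting.txt")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg", None)


greeting = demo()
```

The point of addressing resources by package name is that the same call can work whether the package lives on the file system or inside a zip file.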

We’re hosting the project on GitLab, and welcome feedback, bug fixes, 
improvements, etc!

 * Project home: https://gitlab.com/python-devs/importlib_resources
 * Report bugs at: https://gitlab.com/python-devs/importlib_resources/issues
 * Code hosting: https://gitlab.com/python-devs/importlib_resources.git
 * Documentation: http://importlib_resources.readthedocs.io/

Cheers.
-Barry and Brett



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] iso8601 parsing

2017-12-07 Thread Chris Barker
On Wed, Dec 6, 2017 at 3:07 PM, Paul Ganssle  wrote:

> Here is the PR I've submitted:
>
> https://github.com/python/cpython/pull/4699
>
> The contract that I'm supporting (and, I think it can be argued, the only
> reasonable contract in the initial implementation) is the following:
>
>     dtstr = dt.isoformat(*args, **kwargs)
>     dt_rt = datetime.fromisoformat(dtstr)
>     # The two points represent the same absolute time
>     assert dt_rt == dt
>     # And the same wall time
>     assert dt_rt.replace(tzinfo=None) == dt.replace(tzinfo=None)
>


that looks good.
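
For readers following along, the quoted contract can be exercised roughly as below. This is a sketch that assumes Python 3.7 or later, where `datetime.fromisoformat()` is available; the helper name `round_trips` and the sample values are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def round_trips(dt, *args, **kwargs):
    """Check the isoformat() -> fromisoformat() contract for one datetime."""
    dtstr = dt.isoformat(*args, **kwargs)
    dt_rt = datetime.fromisoformat(dtstr)
    same_instant = dt_rt == dt
    same_wall_time = dt_rt.replace(tzinfo=None) == dt.replace(tzinfo=None)
    return same_instant and same_wall_time


# Sample values without sub-second precision, so nothing is truncated.
naive = datetime(2017, 12, 7, 20, 12, 30)
aware = naive.replace(tzinfo=timezone(timedelta(hours=-8)))

results = [round_trips(naive), round_trips(aware), round_trips(aware, " ")]
```

Note that the check only holds when the arguments passed to `isoformat()` do not drop information, for example a `timespec` that truncates seconds.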

And I'm sorry, I got a bit lost in the PR, but you are attaching an
"offset" tzinfo, when parsing an iso string that has one, yes?

I see this in the comments in the PR:


"""
This does not support parsing arbitrary ISO 8601 strings - it is only
intended
as the inverse operation of :meth:`datetime.isoformat`
"""

I fully agree that that's the MVP -- but is it that hard to parse arbitrary
ISO8601 strings once you've gotten this far? It's a bit uglier than I'd
like, but not THAT bad a spec.

What ISO8601-compatible features are not supported?

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


[Python-Dev] Issues with PEP 526 Variable Notation at the class level

2017-12-07 Thread Raymond Hettinger
Both typing.NamedTuple and dataclasses.dataclass use the somewhat beautiful PEP 
526 variable notations at the class level:

    @dataclasses.dataclass
    class Color:
        hue: int
        saturation: float
        lightness: float = 0.5

and

    class Color(typing.NamedTuple):
        hue: int
        saturation: float
        lightness: float = 0.5

I'm looking for guidance or workarounds for two issues that have arisen.

First, the use of default values seems to completely preclude the use of 
__slots__.  For example, this raises a ValueError:

    class A:
        __slots__ = ['x', 'y']
        x: int = 10
        y: int = 20
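
A small sketch that reproduces the error and shows one possible workaround, namely moving the defaults into `__init__`. The class name `B` and the helper `define_with_defaults` are made up for illustration; the workaround gives up the class-level annotation syntax, which is exactly the trade-off being raised here.

```python
def define_with_defaults():
    """Try to combine __slots__ with class-level defaults; return the error text."""
    try:
        class A:
            __slots__ = ['x', 'y']
            x: int = 10
            y: int = 20
    except ValueError as exc:
        return str(exc)
    return None


class B:
    """Workaround: keep __slots__ and supply the defaults through __init__."""
    __slots__ = ['x', 'y']

    def __init__(self, x: int = 10, y: int = 20):
        self.x = x
        self.y = y


conflict = define_with_defaults()
b = B()
```

The conflict arises because each slot name becomes a descriptor on the class, so a class variable of the same name cannot coexist with it.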

The second issue is that the different annotations give different signatures 
than would be produced for manually written classes.  It is unclear what the best 
practice is for where to put the annotations and their associated docstrings.

In Pydoc for example, this class:

    class A:
        'Class docstring. x is distance in miles'
        x: int
        y: int

gives a different signature and docstring than for this class:

    class A:
        'Class docstring'
        def __init__(self, x: int, y: int):
            'x is distance in kilometers'
            pass

or for this class:

    class A:
        'Class docstring'
        def __new__(cls, x: int, y: int) -> 'A':
            '''x is distance in inches
               A is a singleton (one instance per x,y)
            '''
            # cache is assumed to be a module-level dict
            if (x, y) in cache:
                return cache[x, y]
            self = cache[x, y] = object.__new__(cls)
            return self

The distinction is important because the dataclass decorator allows you to 
suppress the generation of __init__ when you need more control than dataclass 
offers or when you need a __new__ method.  I'm unclear on where the docstring 
and signature for the class is supposed to go so that we get useful signatures 
and matching docstrings.
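
To make the difference concrete, here is a sketch that inspects a generated and a hand-written class side by side. The class names `Generated` and `Manual` are invented for the comparison, and it assumes Python 3.7+ for the `dataclasses` module.

```python
import dataclasses
import inspect


@dataclasses.dataclass
class Generated:
    'Class docstring. x is distance in miles'
    x: int
    y: int


class Manual:
    'Class docstring'

    def __init__(self, x: int, y: int):
        'x is distance in kilometers'
        self.x = x
        self.y = y


generated_sig = inspect.signature(Generated)
manual_sig = inspect.signature(Manual)

# Only the hand-written __init__ has a place for a per-parameter docstring.
manual_init_doc = Manual.__init__.__doc__
```

Both classes accept the same parameters, but only the hand-written one gives the author an `__init__` docstring to document them in.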


Re: [Python-Dev] Issues with PEP 526 Variable Notation at the class level

2017-12-07 Thread Eric V. Smith

On 12/7/17 3:27 PM, Raymond Hettinger wrote:
> ...
>
> I'm looking for guidance or workarounds for two issues that have arisen.
>
> First, the use of default values seems to completely preclude the use of
> __slots__.  For example, this raises a ValueError:
>
>     class A:
>         __slots__ = ['x', 'y']
>         x: int = 10
>         y: int = 20


Hmm, I wasn't aware of that. I'm not sure I understand why that's an 
error. Maybe it could be fixed?


Otherwise, I have a decorator that takes a dataclass and returns a new 
class with slots set:


>>> from dataclasses import dataclass
>>> from dataclass_tools import add_slots
>>> @add_slots
... @dataclass
... class C:
...   x: int = 0
...   y: int = 0
...
>>> c = C()
>>> c
C(x=0, y=0)
>>> c.z = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'C' object has no attribute 'z'

This doesn't help the general case (your class A), but it does at least 
solve it for dataclasses. Whether it should be actually included, and 
what the interface would look like, can be (and I'm sure will be!) argued.


The reason I didn't include it (as @dataclass(slots=True)) is that it 
has to return a new class, while the rest of the dataclass features just 
modify the given class in place. I wanted to maintain that conceptual 
simplicity. But this might be a reason to abandon that. For what it's 
worth, attrs does have an @attr.s(slots=True) that returns a new class 
with __slots__ set.
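
A rough sketch of how such a decorator could be written. This is a reconstruction for illustration and not necessarily the code in `dataclass_tools`; it assumes Python 3.7+ and ignores complications such as inheritance from other slotted classes or methods that rely on zero-argument `super()`.

```python
import dataclasses


def add_slots(cls):
    """Return a copy of the dataclass ``cls`` that defines ``__slots__``."""
    if '__slots__' in cls.__dict__:
        raise TypeError(f'{cls.__name__} already specifies __slots__')
    namespace = dict(cls.__dict__)
    field_names = tuple(f.name for f in dataclasses.fields(cls))
    namespace['__slots__'] = field_names
    for name in field_names:
        # Class-level defaults would clash with the slot descriptors.
        namespace.pop(name, None)
    # Drop the descriptors that belong to the original, unslotted class.
    namespace.pop('__dict__', None)
    namespace.pop('__weakref__', None)
    # __slots__ only takes effect at class creation, hence the new class.
    new_cls = type(cls)(cls.__name__, cls.__bases__, namespace)
    new_cls.__qualname__ = cls.__qualname__
    return new_cls


@add_slots
@dataclasses.dataclass
class C:
    x: int = 0
    y: int = 0


c = C()
try:
    c.z = 3
    rejected = False
except AttributeError:
    rejected = True
```

The defaults survive because the generated `__init__` carries them in its own signature rather than reading them from the class.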



> The second issue is that the different annotations give different signatures
> than would be produced for manually written classes.  It is unclear what the best
> practice is for where to put the annotations and their associated docstrings.


I don't have any suggestions here.

Eric.


Re: [Python-Dev] iso8601 parsing

2017-12-07 Thread Paul G
> And I'm sorry, I got a bit lost in the PR, but you are attaching an
> "offset" tzinfo, when parsing an iso string that has one, yes?

Yes, a fixed offset time zone (since the original zone information is lost):

>>> from dateutil import tz
>>> from datetime import datetime
>>> datetime(2014, 12, 11, 9, 30, tzinfo=tz.gettz('US/Eastern'))
datetime.datetime(2014, 12, 11, 9, 30, 
tzinfo=tzfile('/usr/share/zoneinfo/US/Eastern'))
>>> datetime(2014, 12, 11, 9, 30, tzinfo=tz.gettz('US/Eastern')).isoformat()
'2014-12-11T09:30:00-05:00'
>>> datetime.fromisoformat('2014-12-11T09:30:00-05:00')
datetime.datetime(2014, 12, 11, 9, 30, 
tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=68400)))
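
The `timedelta(days=-1, seconds=68400)` in that repr is simply the normalized form of minus five hours. A short check, assuming Python 3.7+ for `fromisoformat()`:

```python
from datetime import datetime, timedelta, timezone

# timedelta normalizes negative durations to negative days plus positive seconds.
normalized = timedelta(days=-1, seconds=68400)

parsed = datetime.fromisoformat('2014-12-11T09:30:00-05:00')
offset = parsed.utcoffset()
fixed_zone = timezone(timedelta(hours=-5))
```

The parsed value keeps the UTC offset but not the original zone rules, so daylight saving transitions of 'US/Eastern' are no longer known to it.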

> I fully agree that that's the MVP -- but is it that hard to parse arbitrary
> ISO8601 strings in once you've gotten this far? It's a bit uglier than I'd
> like, but not THAT bad a spec.

No, and in fact this PR is adapted from a *more general* ISO-8601 parser that I
wrote (which is now merged into master on python-dateutil). In the CPython PR I
deliberately limited it to be the inverse of `isoformat()` for two major
reasons:

1. It allows us to get something out there that everyone can agree on. Not
only would we have to agree on whether to support arcane ISO8601 formats like
YYYY-Www-D, but we would also have to discuss whether we want to be strict and
disallow YYYYMM like ISO-8601 does. Do we want fractional minute support? What
about different variations? (We're already supporting replacing T with any
character in `.isoformat()` and outputting time zones in the form hh:mm:ss, so
what other non-compliant variations do we want to add... and then maintain?) We
can have these discussions later if we want, but we might as well start with
the part everyone can agree on: if it comes out of `isoformat()` it should be
able to go back in through `fromisoformat()`.

2. It makes it *much* easier to understand what formats are supported. You can
say, "This function is for reading in dates serialized with `.isoformat()`",
and you *immediately* know how to create compliant dates. Not to mention, the
specific formats emitted by `isoformat()` can be written very cleanly as
YYYY-MM-DD[*HH[:MM[:SS[.mmm[mmm]]]][+HH:MM]] (where * means any character).
ISO 8601 supports YYYY-MM-DD and YYYYMMDD but not YYYY-MMDD or YYYYMM-DD.

So, basically, it's not that it's amazingly hard to write a fully-featured
ISO-8601 parser, it's more that it doesn't seem like a great match for the
problem this is intended to solve at this point.
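
As an illustration of the family of strings being discussed, the sketch below generates a few `isoformat()` variants and feeds each back through `fromisoformat()`. It assumes Python 3.7+, and the sample datetime is arbitrary.

```python
from datetime import datetime

dt = datetime(2014, 12, 11, 9, 30, 15, 123456)

# Each timespec drops a different amount of trailing precision.
variants = {
    spec: dt.isoformat(timespec=spec)
    for spec in ("hours", "minutes", "seconds", "milliseconds", "microseconds")
}

# Any single character may stand in for the "T" separator.
custom_sep = dt.isoformat(sep=" ")

parsed = {spec: datetime.fromisoformat(text) for spec, text in variants.items()}
parsed_custom_sep = datetime.fromisoformat(custom_sep)
```

Truncating variants parse back to the truncated value, not to the original datetime, which is why the round-trip guarantee is stated in terms of what `isoformat()` actually emitted.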

Best,
Paul






Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner
While I'm not strongly convinced that the open() error handler must be
changed to surrogateescape, I would first like to make sure that it's
really a very bad idea before changing it :-)


2017-12-07 7:49 GMT+01:00 INADA Naoki :
> I just came up with crazy idea; changing default error handler of open()
> to "surrogateescape" only when open mode is "w" or "a".

The idea is tempting but I'm not sure that it's a good idea. Moreover,
what about "r+" and "w+" modes?

I dislike getting a different behaviour for inputs and outputs. The
motivation for surrogateescape is to "pass through" undecodable bytes:
you need to handle them on the input side and on the output side.

That's why I decided to not only change sys.stdin error handler to
surrogateescape for the POSIX locale, but also sys.stdout:
https://bugs.python.org/issue19977


> When reading, "surrogateescape" error handler is dangerous because
> it can produce arbitrary broken unicode string by mistake.

I'm fine with that. I wouldn't say that it's the purpose of the PEP,
but sadly it's an expected, known and documented side effect.

You get the same behaviour with Unix command line tools and most
Python 2 applications (processing data as bytes). Nothing new under
the sun.

The PEP 540 allows users to write applications behaving like Unix
tools/Python 2 with the power of the Python 3 language and stdlib.

Again, use the Strict UTF8 mode if you prioritize *correctness* over
*usability*.

Honestly, I'm not even sure that the Strict UTF-8 mode is *usable* in
practice, since we are all surrounded by old documents encoded to
various "legacy" encodings (where legacy means: "not UTF-8", like
Latin1 or ShiftJIS). The first non-ASCII character which is not
encoded to UTF-8 is going to "crash" the application (big traceback
with a Unicode error).


Maybe the problem is the feature name: "UTF-8 mode". Users may think
of "strict" when they read "UTF-8", since UTF-8 is known to be a
strict encoding. For example, UTF-8 is much stricter than latin1, which
is unable to tell if a document was encoded to latin1 or whatever else.
UTF-8 is able to tell if a document was actually encoded to UTF-8 or
not, thanks to the design of the encoding itself.



> And it doesn't allow following code:
>
> with open("image.jpg", "r") as f:  # Binary data, not UTF-8
> return f.read()

Using a JPEG image, the example is obviously wrong.

But surrogateescape on open() is meant for reading *text files*
which are mostly correctly encoded to UTF-8, except for a few bytes.

I'm not sure how to explain the issue. The Mercurial wiki page has a
good example of this issue that they call the "Makefile problem":
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22

While it's not exactly the discussed issue, it gives you an idea of
the kind of issues that you have when you use open(filename,
encoding="utf-8", errors="strict") versus open(filename,
encoding="utf-8", errors="surrogateescape")
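
A small sketch of the two behaviours being compared. The file content is invented: a mostly-ASCII, Makefile-like line with one Latin-1 byte in it, written to a temporary file so the snippet is self-contained.

```python
import os
import tempfile


def read_text_file(raw: bytes, errors: str) -> str:
    """Write raw bytes to a temporary file and read them back as UTF-8 text."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        with open(path, encoding="utf-8", errors=errors) as f:
            return f.read()
    finally:
        os.remove(path)


# b"\xe9" is "e acute" in Latin-1 and is not valid UTF-8 here.
raw = b"all: caf\xe9.o\n"

try:
    read_text_file(raw, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The undecodable byte becomes a lone surrogate instead of an error.
escaped = read_text_file(raw, "surrogateescape")

# Encoding with the same handler restores the original bytes.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")
```

With "strict" the whole read fails on one stray byte; with "surrogateescape" the rest of the file remains usable and the original bytes can be written back unchanged.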


> I'm not sure about this is good idea.  And I don't know when is good for
> changing write error handler; only when PEP 538 or PEP 540 is used?
> Or always when os.fsencoding() is UTF-8?
>
> Any thoughts?

The PEP 538 doesn't affect the error handler. The PEP 540 only changes
the error handler for the POSIX locale, it's a deliberate choice. The
PEP 538 is only enabled for the POSIX locale, and the PEP 540 will
also be enabled by default by this locale.

I dislike the idea of changing the error handler if the filesystem
encoding is UTF-8. The UTF-8 mode must be enabled explicitly on
purpose. This reduces any risk of regression, and prepares users who
enable it for any potential issue.

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner
2017-12-06 5:07 GMT+01:00 INADA Naoki :
> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.

To come back to your original point, I didn't know that it was a
common mistake to open binary files in text mode.

Honestly, I didn't try recently. How does Python behave when you do that?

Is it possible to write a full binary parser using the text mode? You
should quickly get issues pointing you to your mistake, no?

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Guido van Rossum
On Thu, Dec 7, 2017 at 3:02 PM, Victor Stinner 
wrote:

> 2017-12-06 5:07 GMT+01:00 INADA Naoki :
> > And opening binary file without "b" option is very common mistake of new
> > developers.  If default error handler is surrogateescape, they lose a
> chance
> > to notice their bug.
>
> To come back to your original point, I didn't know that it was a
> common mistake to open binary files in text mode.
>

It probably is because in Python 2 it makes no difference on UNIX, and on
Windows the only difference is that binary mode preserves \r.


> Honestly, I didn't try recently. How does Python behave when you do that?
>
> Is it possible to write a full binary parser using the text mode? You
> should quickly get issues pointing you to your mistake, no?
>

You will quickly get decoding errors, and that is INADA's point. (Unless
you use encoding='Latin-1'.) His worry is that the surrogateescape error
handler makes it so that you won't get decoding errors, and then the
failure mode is much harder to debug.
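
A sketch of the three outcomes described above, using the first few bytes of a typical JPEG header as stand-in binary data. Decoding the bytes directly mirrors what text-mode open() does with each setting.

```python
# First bytes of a typical JPEG/JFIF file, used here as sample binary data.
header = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Default "strict" handler: the mistake is reported immediately.
try:
    header.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Latin-1 maps every byte to a character, so nothing ever fails.
as_latin1 = header.decode("latin-1")

# surrogateescape also never fails; bad bytes become lone surrogates.
as_escaped = header.decode("utf-8", errors="surrogateescape")
restored = as_escaped.encode("utf-8", errors="surrogateescape")
```

In the last two cases the program carries on with a string that merely looks like text, so the bug surfaces later and further from its cause.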

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner
2017-12-08 0:26 GMT+01:00 Guido van Rossum :
> You will quickly get decoding errors, and that is INADA's point. (Unless you
> use encoding='Latin-1'.) His worry is that the surrogateescape error handler
> makes it so that you won't get decoding errors, and then the failure mode is
> much harder to debug.

Hum, my question was more to know if Python fails because of an
operation failing with strings whereas bytes were expected, or if
Python fails with a decoding error... But now I'm not sure anymore
that this level of detail really matters.


Let me think out loud. To explain unicode issues, I like to use
filenames, since it's something that users view commonly, handle
directly and can modify (and so enter many non-ASCII characters like
diacritics and emojis ;-)).

Filenames can be found on the command line, in environment variables
(PYTHONSTARTUP), stdin (read a list of files from stdin), stdout
(write the list of files into stdout), but also in text files (the
Mercurial "makefile problem").

I consider that the command line and environment variables should
"just work" and so use surrogateescape. It would be too annoying to
not even be able to *start* Python because of a Unicode error. For
example, it wouldn't be easy to identify which environment variable
causes the issue. Hopefully, the UTF-8 mode doesn't change anything here:
surrogateescape is already used since Python 3.3 for the command line
and environment variables.

For stdin/stdout, I think that the main motivation here is to write
Unix command line tools using Python 3: pass-through undecodable bytes
without bugging the user with Unicode. Users don't use stdin and
stdout as regular files, they are more used as pipes to pass data
between programs with the Unix pipe in a shell like "producer |
consumer". Sometimes stdout is redirected to a file, but I consider
that it is expected to behave as a pipe and the regular TTY stdout.
IMHO we are still in the safe surrogateescape area (for the specific
case of the UTF-8 mode).
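
A sketch of the pipe-style pass-through described above. In-memory byte buffers stand in for the real stdin and stdout so that the example is self-contained; the function name `passthrough` and the sample filenames are invented.

```python
import io


def passthrough(raw_in: bytes) -> bytes:
    """Copy text lines from a fake stdin to a fake stdout, Unix-filter style."""
    stdin = io.TextIOWrapper(
        io.BytesIO(raw_in), encoding="utf-8",
        errors="surrogateescape", newline="",
    )
    out_buffer = io.BytesIO()
    stdout = io.TextIOWrapper(
        out_buffer, encoding="utf-8",
        errors="surrogateescape", newline="",
    )
    for line in stdin:
        # Lines are ordinary str objects; bad bytes ride along as surrogates.
        stdout.write(line)
    stdout.flush()
    return out_buffer.getvalue()


# One valid filename and one containing a Latin-1 byte.
payload = b"ok.txt\ncaf\xe9.txt\n"
echoed = passthrough(payload)
```

Because both ends use the same error handler, the consumer on the other side of the pipe receives exactly the bytes the producer wrote.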


Ok, now comes the real question, open().

For open(), I used the example of a code snippet *writing* the content
of a directory (os.listdir) into a text file. Another example is to
read filenames from a text file but pass-through undecodable bytes
thanks to surrogateescape.

But Naoki explained that open() is commonly misused to open binary
files and Python should somehow fail badly to notify the developer of
their mistake.

If I have to choose between the two categories of usage of
open(), "read undecodable bytes in UTF-8 from a text file" versus
"misuse open() on binary file", I expect that the latter is more common
and that open() shouldn't use surrogateescape by default.

While stdin and stdout are usually associated to Unix pipes and Unix
tools working on bytes, files are more commonly associated to
important data that must not be lost nor corrupted. Python is expected
to "help" the developer to use the proper options to read content from
a file and to write content into a file. So I understand that open()
should use the "strict" error handler in the UTF-8 mode, rather than
"surrogateescape".

I can survive this "tiny" change to my PEP. I just posted a 3rd
version of my PEP where the open() error handler remains strict (it is
no longer changed by the PEP).

Victor


[Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread Victor Stinner
Hi,

I made the following two changes to the PEP 540:

* open() error handler remains "strict"
* remove the "Strict UTF8 mode" which doesn't make much sense anymore

I wrote the Strict UTF-8 mode when open() used surrogateescape error
handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is
required just to change the error handler of stdin and stdout. Well,
read the "Passthrough undecodable bytes: surrogateescape" section of
the PEP rationale :-)


https://www.python.org/dev/peps/pep-0540/

Victor


PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner 
BDFL-Delegate: INADA Naoki
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode.


Rationale
=========

Locale encoding and UTF-8
-------------------------

Python 3.6 uses the locale encoding for filenames, environment
variables, standard streams, etc. The locale encoding is inherited from
the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
locale, but are unable to change the locale for different reasons. This
encoding is very limited in terms of Unicode support: any non-ASCII
character is likely to cause trouble.

It is not easy to get the expected locale. Locales don't get the exact
same name on all Linux distributions, FreeBSD, macOS, etc. Some
locales, like the recent ``C.UTF-8`` locale, are only supported by a few
platforms. For example, an SSH connection can use a different encoding
than the filesystem or terminal encoding of the local host.

On the other hand, Python 3.6 is already using UTF-8 by default on
macOS, Android and Windows (PEP 529) for most functions, except
``open()``. UTF-8 is also the default encoding of Python scripts, XML
and JSON file formats. The Go programming language uses UTF-8 for
strings.

When all data are stored as UTF-8 but the locale is often misconfigured,
an obvious solution is to ignore the locale and use UTF-8.

PEP 538 attempts to mitigate this problem by coercing the C locale
to a UTF-8 based locale when one is available, but that isn't a
universal solution. For example, CentOS 7's container images default
to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
locale coercion is ineffective.


Passthrough undecodable bytes: surrogateescape
----------------------------------------------

When decoding bytes from UTF-8 using the ``strict`` error handler, which
is the default, Python 3 raises a ``UnicodeDecodeError`` on the first
undecodable byte.

Unix command line tools like ``cat`` or ``grep`` and most Python 2
applications simply do not have this class of bugs: they don't decode
data, but process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2:
the ``surrogateescape`` error handler (:pep:`383`). It allows processing
data "as bytes" while using Unicode in practice (undecodable bytes are
stored as surrogate characters).

The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
and ``stdout`` since these streams are commonly associated with Unix
command line tools.

However, users have a different expectation on files. Files are expected
to be properly encoded. Python is expected to fail early when ``open()``
is called with the wrong options, like opening a JPEG picture in text
mode. The ``open()`` default error handler remains ``strict`` for these
reasons.


No change by default for best backward compatibility
----------------------------------------------------

While UTF-8 is perfect in most cases, sometimes the locale encoding is
actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale
usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
It does not change the behaviour for other locales to prevent any risk
of regression.

As users are responsible for explicitly enabling the new UTF-8 mode, they
are responsible for any potential mojibake issues caused by this mode.


Proposal
========

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
``PYTHONUTF8=1``.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
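
A sketch for checking the option from Python itself. It assumes an interpreter that implements this proposal (Python 3.7 or later), where the state of the mode is exposed as ``sys.flags.utf8_mode``; the helper name ``utf8_mode_flag`` is made up.

```python
import subprocess
import sys


def utf8_mode_flag(*interpreter_args):
    """Run a child interpreter and report its sys.flags.utf8_mode value."""
    code = "import sys; print(sys.flags.utf8_mode)"
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", code],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return int(result.stdout.strip())


# Explicitly enable, then explicitly disable, the mode on the command line.
enabled = utf8_mode_flag("-X", "utf8")
disabled = utf8_mode_flag("-X", "utf8=0")
```

Running the check in a child process matters because the mode is decided at interpreter startup and cannot be toggled afterwards.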

For standard streams, the ``PYT

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Glenn Linderman

On 12/7/2017 4:48 PM, Victor Stinner wrote:


> Ok, now comes the real question, open().
>
> For open(), I used the example of a code snippet *writing* the content
> of a directory (os.listdir) into a text file. Another example is to
> read filenames from a text file but pass-through undecodable bytes
> thanks to surrogateescape.
>
> But Naoki explained that open() is commonly misused to open binary
> files and Python should somehow fail badly to notify the developer of
> their mistake.


So the real problem here is that open has a default mode of text. 
Instead of forcing the user to specify either "text" or "binary" when 
opening, text is used as a default, binary as an option to be specified.


I understand that default has a long history in Unix-land, dating at 
least as far back as 1977 when I first learned how to use the Unix open() 
function.


And now it would be an incompatible change to change it.

The real question is whether or not it is a good idea to change it... at 
this point in time, with Unicode and UTF-8 so prevalent, text and binary 
modes are far different than back in 1977, when they mostly just 
documented that this was a binary file that was being opened, and that 
one could more likely expect to see read() than fgets() in the following 
code.


If it were to be changed, one could add a text-mode option in 3.7, say 
"t" in the mode string, and a PendingDeprecationWarning for open calls 
without the specification of either t or b in the mode string.


In 3.8, the warning would be changed to DeprecationWarning.

In 3.9, all open calls would need to have either t or b, or would fail.

Meanwhile, back on the PEP 540 ranch, text mode open calls could 
immediately use surrogateescape, binary mode open calls would not, and 
unspecified open calls would not.
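
The first step of that transition plan could be prototyped outside the interpreter with a wrapper like the one below. `checked_open` is a hypothetical name; nothing like it exists in the standard library, and the real proposal would change the built-in open() itself.

```python
import os
import tempfile
import warnings


def checked_open(file, mode="r", *args, **kwargs):
    """Hypothetical wrapper: warn when mode names neither 't' nor 'b'."""
    if "t" not in mode and "b" not in mode:
        warnings.warn(
            "open() mode should name 't' or 'b' explicitly",
            PendingDeprecationWarning,
            stacklevel=2,
        )
    return open(file, mode, *args, **kwargs)


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for mode in ("r", "rt", "rb"):
            checked_open(path, mode).close()
finally:
    os.remove(path)

# Only the call that left the mode unspecified should have warned.
warned = [
    w.category for w in caught
    if issubclass(w.category, PendingDeprecationWarning)
]
```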


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Jonathan Goble
On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman 
wrote:

> If it were to be changed, one could add a text-mode option in 3.7, say "t"
> in the mode string, and a PendingDeprecationWarning for open calls without
> the specification of either t or b in the mode string.
>

"t" is already supported in open()'s mode argument [1] as a way to
explicitly request text mode, though it's essentially ignored right now
since text is the default anyway. So since the option is already present,
the only thing needed at this stage for your plan would be to begin
deprecating not using it.

*goes back to lurking*

[1] https://docs.python.org/3/library/functions.html#open
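
A quick check of that behaviour: "r" and "rt" produce the same kind of file object, while "rb" produces a binary one. The temporary file exists only to make the snippet self-contained.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello\n")

try:
    with open(path, "r") as default_mode, \
            open(path, "rt") as text_mode, \
            open(path, "rb") as binary_mode:
        kinds = (type(default_mode), type(text_mode), type(binary_mode))
        contents = (default_mode.read(), text_mode.read(), binary_mode.read())
finally:
    os.remove(path)
```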


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Glenn Linderman

On 12/7/2017 5:45 PM, Jonathan Goble wrote:
> On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman wrote:
>
>> If it were to be changed, one could add a text-mode option in 3.7,
>> say "t" in the mode string, and a PendingDeprecationWarning for
>> open calls without the specification of either t or b in the mode
>> string.
>
> "t" is already supported in open()'s mode argument [1] as a way to
> explicitly request text mode, though it's essentially ignored right
> now since text is the default anyway. So since the option is already
> present, the only thing needed at this stage for your plan would be to
> begin deprecating not using it.
>
> *goes back to lurking*
>
> [1] https://docs.python.org/3/library/functions.html#open


Thanks for briefly de-lurking.

So then for PEP 540... use surrogateescape immediately for t mode.

Then, when the user encounters an encoding error, there would be three 
solutions: switch to t mode, explicitly switch to surrogateescape, or 
fix the locale.


Re: [Python-Dev] iso8601 parsing

2017-12-07 Thread Chris Barker - NOAA Federal
>but is it that hard to parse arbitrary

ISO8601 strings in once you've gotten this far? It's a bit uglier than I'd

like, but not THAT bad a spec.


No, and in fact this PR is adapted from a *more general* ISO-8601 parser
that I wrote (which is now merged into master on python-dateutil). In the
CPython PR I deliberately limited it to be the inverse of `isoformat()` for
two major reasons:

1. It allows us to get something out there that everyone can agree on - not
only would we have to agree on whether to support arcane ISO8601 formats
like -Www-D,


I don’t know — would anyone complain about it supporting too arcane a
format?

Also — “most ISO compliant “ date time strings would get us a long way.

but we also have to then discuss whether we want to be strict and disallow
MM like ISO-8601 does,


Well, I think disallowing something has little utility - we really don’t
want this to be a validator.

do we want fractional minute support? What about different variations
(we're already supporting replacing T with any character in `.isoformat()`
and outputting time zones in the form hh:mm:ss, so what other non-compliant
variations do we want to add..


Wait — does datetime.isoformat() put out non-compliant strings?

Anyway, supporting all of what .isoformat() puts out, plus most of ISO 8601,
would be a great start.

 - if it comes out of `isoformat()` it should be able to go back in
through `fromisoformat()`.


Yup.

But has anyone raised objections to it being more flexible?

2. It makes it *much* easier to understand what formats are supported. You
can say, "This function is for reading in dates serialized with
`.isoformat()`", you *immediately* know how to create compliant dates.


We could still document that as the preferred form.

You’re writing the code, and I don’t have time to help, so by all means do
what you think is best.

But if you’ve got code that’s more flexible, I can’t imagine anyone
complaining about a more flexible parser.

Though I have a limited imagination about such things.

But I hope it will at least accept both with and without the T.

Thanks for working on this.

-Chris

On 12/07/2017 08:12 PM, Chris Barker wrote:


Here is the PR I've submitted:


https://github.com/python/cpython/pull/4699


The contract that I'm supporting (and, I think it can be argued, the only
reasonable contract in the initial implementation) is the following:

   dtstr = dt.isoformat(*args, **kwargs)
   dt_rt = datetime.fromisoformat(dtstr)
   assert dt_rt == dt   # The two points represent the same absolute time
   assert dt_rt.replace(tzinfo=None) == dt.replace(tzinfo=None)   # And the same wall time
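The round-trip contract above can be tried directly on interpreters that ship `datetime.fromisoformat()` (Python 3.7 and later). This is only a sketch: the helper name `round_trips` and the sample values are illustrative and do not come from the PR. The inputs deliberately avoid a truncating `timespec`, since something like `timespec="minutes"` drops seconds and so cannot round-trip a value that has them.

```python
from datetime import datetime, timedelta, timezone


def round_trips(dt, *args, **kwargs):
    """Return True if dt survives isoformat() -> fromisoformat() unchanged."""
    dtstr = dt.isoformat(*args, **kwargs)
    dt_rt = datetime.fromisoformat(dtstr)
    same_instant = dt_rt == dt
    same_wall_time = dt_rt.replace(tzinfo=None) == dt.replace(tzinfo=None)
    return same_instant and same_wall_time


# Hypothetical sample values, chosen only for illustration.
aware = datetime(2017, 12, 7, 20, 12, tzinfo=timezone(timedelta(hours=-5)))
naive = datetime(2017, 12, 7, 20, 12, 30, 123456)

print(round_trips(aware))            # default "T" separator
print(round_trips(naive, sep=" "))   # a non-default separator
```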




that looks good.



I see this in the comments in the PR:



"""
This does not support parsing arbitrary ISO 8601 strings - it is only intended
as the inverse operation of :meth:`datetime.isoformat`
"""




what ISO8601 compatible features are not supported?


-CHB


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Chris Barker - NOAA Federal
I’m a bit confused:

File names and the like are one thing, and the CONTENTS of files is quite
another.

I get that there is theoretically a “default” encoding for the contents of
text files, but that is SO likely to be wrong as to be ignorable.

open() already defaults to utf-8. Which is a fine default if you are going
to have one, but it seems a bad idea to have it default to surrogateescape
EVER, regardless of the locale or anything else.

If the file is binary, or a different encoding, or simply broken, it’s much
better to get an encoding error as soon as possible.

Why does this have anything to do with the PEP?

Perhaps the issue of reading a filename from the system, writing it to a
file, then reading it back in again.

I actually do that a lot — but mostly so I can pass that file to another
system, so I really don’t want broken encoding in it anyway.

-CHB



On Dec 7, 2017, at 5:53 PM, Glenn Linderman  wrote:

On 12/7/2017 5:45 PM, Jonathan Goble wrote:

On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman 
wrote:

> If it were to be changed, one could add a text-mode option in 3.7, say "t"
> in the mode string, and a PendingDeprecationWarning for open calls without
> the specification of either t or b in the mode string.
>

"t" is already supported in open()'s mode argument [1] as a way to
explicitly request text mode, though it's essentially ignored right now
since text is the default anyway. So since the option is already present,
the only thing needed at this stage for your plan would be to begin
deprecating not using it.

*goes back to lurking*

[1] https://docs.python.org/3/library/functions.html#open


Thanks for briefly de-lurking.

So then for PEP 540... use surrogateescape immediately for t mode.

Then, when the user encounters an encoding error, there would be three
solutions: switch to t mode, explicitly switch to surrogateescape, or fix
the locale.



Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread Chris Barker - NOAA Federal
I made the following two changes to the PEP 540:

* open() error handler remains "strict"
* remove the "Strict UTF8 mode" which doesn't make much sense anymore


+1 — ignore my previous note.

-CHB


I wrote the Strict UTF-8 mode when open() used surrogateescape error
handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is
required just to change the error handler of stdin and stdout. Well,
read the "Passthrough undecodable bytes: surrogateescape" section of
the PEP rationale :-)


https://www.python.org/dev/peps/pep-0540/

Victor


PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner 
BDFL-Delegate: INADA Naoki
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode.


Rationale
=========

Locale encoding and UTF-8
-------------------------

Python 3.6 uses the locale encoding for filenames, environment
variables, standard streams, etc. The locale encoding is inherited from
the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
locale, but are unable to change the locale for various reasons. This
encoding is very limited in terms of Unicode support: any non-ASCII
character is likely to cause trouble.

It is not easy to get the expected locale. Locales don't get the exact
same name on all Linux distributions, FreeBSD, macOS, etc. Some
locales, like the recent ``C.UTF-8`` locale, are only supported by a few
platforms. For example, an SSH connection can use a different encoding
than the filesystem or terminal encoding of the local host.

On the other side, Python 3.6 is already using UTF-8 by default on
macOS, Android and Windows (PEP 529) for most functions, except
``open()``. UTF-8 is also the default encoding of Python scripts, XML
and JSON file formats. The Go programming language uses UTF-8 for
strings.

When all data are stored as UTF-8 but the locale is often misconfigured,
an obvious solution is to ignore the locale and use UTF-8.

PEP 538 attempts to mitigate this problem by coercing the C locale
to a UTF-8 based locale when one is available, but that isn't a
universal solution. For example, CentOS 7's container images default
to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
locale coercion is ineffective.


Passthrough undecodable bytes: surrogateescape
----------------------------------------------

When decoding bytes from UTF-8 using the ``strict`` error handler, which
is the default, Python 3 raises a ``UnicodeDecodeError`` on the first
undecodable byte.

Unix command line tools like ``cat`` or ``grep`` and most Python 2
applications simply do not have this class of bugs: they don't decode
data, but process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2:
the ``surrogateescape`` error handler (:pep:`383`). It allows processing
data "as bytes" while using Unicode in practice (undecodable bytes are
stored as surrogate characters).
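As a small illustration of the difference between the two error handlers, the sketch below decodes a byte string that is not valid UTF-8. The sample bytes are made up for the example; the codec calls are the standard ``bytes.decode()`` and ``str.encode()`` methods.

```python
raw = b"caf\xe9 \xff"  # made-up sample: Latin-1 style bytes, not valid UTF-8

# "strict" is the default error handler, so this decode fails.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate
# U+DCNN, and maps it back to the same byte when encoding.
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

print(strict_failed)
print(ascii(text))
print(restored == raw)
```

The bytes survive the round trip unchanged, which is what lets a program pass data through in the way ``cat`` or ``grep`` would.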

The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
and ``stdout`` since these streams are commonly associated with Unix
command line tools.

However, users have a different expectation on files. Files are expected
to be properly encoded. Python is expected to fail early when ``open()``
is called with the wrong options, like opening a JPEG picture in text
mode. The ``open()`` default error handler remains ``strict`` for these
reasons.


No change by default for best backward compatibility
----------------------------------------------------

While UTF-8 is perfect in most cases, sometimes the locale encoding is
actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale
usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
It does not change the behaviour for other locales to prevent any risk
of regression.

As users are responsible for explicitly enabling the new UTF-8 mode, they
are responsible for any potential mojibake issues caused by this mode.


Proposal
========

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
``PYTHONUTF8=1``.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
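One way to see the effect of these switches is to start a child interpreter and ask it whether the mode is on. The sketch below assumes an interpreter that ships the UTF-8 mode (Python 3.7 or later) and reads ``sys.flags.utf8_mode``, an attribute that is not mentioned in the text above; the helper name ``utf8_mode_flag`` is hypothetical.

```python
import os
import subprocess
import sys


def utf8_mode_flag(*interpreter_args, env_value=None):
    """Run a child interpreter and return its sys.flags.utf8_mode value."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)          # start from a known state
    if env_value is not None:
        env["PYTHONUTF8"] = env_value
    cmd = [sys.executable, *interpreter_args,
           "-c", "import sys; print(sys.flags.utf8_mode)"]
    out = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True).stdout
    return int(out.decode("ascii").strip())


print(utf8_mode_flag("-X", "utf8"))      # enabled on the command line
print(utf8_mode_flag("-X", "utf8=0"))    # explicitly disabled
print(utf8_mode_flag(env_value="1"))     # enabled through the environment
```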

Re: [Python-Dev] iso8601 parsing

2017-12-07 Thread Mike Miller
Guess the argument for limiting what it accepts would be that every funky 
variation will need to be supported until the endtimes, even those of little use 
or utility.


On the other hand, it might be good to keep the two implementations the same for 
consistency reasons.


Thanks either way,
-Mike


On 2017-12-07 17:57, Chris Barker - NOAA Federal wrote:


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread INADA Naoki
Looks nice.

But I want to clarify more about difference/relationship between PEP
538 and 540.

If I understand correctly:

Both PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) share the
same logic to detect the POSIX locale.

When the POSIX locale is detected, locale coercion is tried first. If
locale coercion succeeds, UTF-8 mode is not used because the locale is
no longer POSIX.

If locale coercion is disabled or failed, UTF-8 mode is used automatically,
unless it is disabled explicitly.

UTF-8 mode is similar to C.UTF-8 or other locale coercion target locales.
But UTF-8 mode is different from C.UTF-8 locale in these ways because
actual locale is not changed:

* Libraries using locale (e.g. readline) work as in POSIX locale.  So UTF-8
  cannot be used in such libraries.
* locale.getpreferredencoding() returns 'ASCII' instead of 'UTF-8'.  So
  libraries depending on locale.getpreferredencoding() may raise
  UnicodeErrors.

Am I correct?
Or locale.getpreferredencoding() returns UTF-8 in UTF-8 mode too?
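On interpreters that ended up shipping a UTF-8 mode (Python 3.7 and later), this question can be checked empirically rather than argued. The sketch below is one way to do that, not a statement about the draft under discussion; the helper name `preferred_encoding` is hypothetical, and the spelling of the returned name may differ between versions, so the output is best compared case-insensitively.

```python
import subprocess
import sys


def preferred_encoding(*interpreter_args):
    """Return what locale.getpreferredencoding() reports in a child interpreter."""
    code = "import locale; print(locale.getpreferredencoding())"
    cmd = [sys.executable, *interpreter_args, "-c", code]
    out = subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout
    return out.decode("ascii").strip()


print(preferred_encoding())               # whatever the current locale gives
print(preferred_encoding("-X", "utf8"))   # with UTF-8 mode forced on
```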

INADA Naoki  


On Fri, Dec 8, 2017 at 9:50 AM, Victor Stinner  wrote:
> Hi,
>
> I made the following two changes to the PEP 540:
>
> * open() error handler remains "strict"
> * remove the "Strict UTF8 mode" which doesn't make much sense anymore
>
> I wrote the Strict UTF-8 mode when open() used surrogateescape error
> handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is
> required just to change the error handler of stdin and stdout. Well,
> read the "Passthrough undecodable bytes: surrogateescape" section of
> the PEP rationale :-)
>
>
> https://www.python.org/dev/peps/pep-0540/
>
> Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread INADA Naoki
> Or locale.getpreferredencoding() returns UTF-8 in UTF-8 mode too?

Or should we change locale.getpreferredencoding() to return UTF-8
instead of ASCII always, regardless of PEP 538 and 540?

INADA Naoki  


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Greg Ewing

Victor Stinner wrote:

Users don't use stdin and
stdout as regular files, they are more used as pipes to pass data
between programs with the Unix pipe in a shell like "producer |
consumer". Sometimes stdout is redirected to a file, but I consider
that it is expected to behave as a pipe and the regular TTY stdout.


It seems weird to me to make a distinction between stdin/stdout
connected to a file and accessing the file some other way.

It would be surprising, for example, if the following two
commands behaved differently with respect to encoding:

   cat foo | sort

   cat < foo | sort


But Naoki explained that open() is commonly misused to open binary
files and Python should somehow fail badly to notify the developer of
their mistake.


Maybe if you *explicitly* open the file in text mode it
should default to surrogateescape, but use strict if text
mode is being used by default?

I.e.

   open("foo", "rt") --> surrogateescape
   open("foo")   --> strict

That way you can easily open a file in a way that's
compatible with the way stdin/stdout behave, but you
will get bitten if you mistakenly open a binary file
as text.
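The mode-dependent default above is a proposal, not something open() does. What can already be done is to request the handler explicitly through the existing `errors` argument. A minimal sketch, with a hypothetical helper `read_text` and made-up file contents:

```python
import os
import tempfile


def read_text(path, errors="strict"):
    """Read a file as UTF-8 text with the given error handler."""
    with open(path, "rt", encoding="utf-8", errors=errors) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo")
    with open(path, "wb") as f:
        f.write(b"\x89PNG \xff")   # made-up binary-looking content

    # Default handler: fails early, which catches "binary opened as text".
    try:
        read_text(path)
        strict_ok = True
    except UnicodeDecodeError:
        strict_ok = False

    # Explicit opt-in: undecodable bytes come back as lone surrogates.
    lenient = read_text(path, errors="surrogateescape")

print(strict_ok)
print(ascii(lenient))
```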

--
Greg
