Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread Steven D'Aprano
On Fri, Jan 06, 2017 at 02:54:49AM +0100, Victor Stinner wrote:

> Let's say that you have the filename b'nonascii\xff': it's decoded as
> 'nonascii\udcff' by the UTF-8 mode. How do GUIs handle such a filename?
> (I don't know the answer, it's a real question ;-))

I ran this in Python 2.7 to create the file:

open(b'/tmp/nonascii\xff-', 'w')

and then confirmed the filename:

[steve@ando tmp]$ ls -b nonascii*
nonascii\377-

Konqueror in KDE 3 displays it with *two* "missing character" glyphs 
(small hollow boxes) before the hyphen. The KDE "Open File" dialog box 
shows the file with two blank spaces before the hyphen.

My interpretation of this is that the difference is due to using 
different fonts: the file name is shown the same way, but in one font 
the missing character is a small box and in the other it is a blank 
space.

I cannot tell what KDE is using for the invalid character; if I copy it 
as text and paste it into a file, I just get the original \xFF.

The Geany text editor, which I think uses the same GUI toolkit as Gnome, 
shows the file with a single "missing glyph" character, this time a 
black diamond with a question mark in it.

It looks like Geany (Gnome?) is displaying the invalid byte as U+FFFD, 
the Unicode "REPLACEMENT CHARACTER".

So at least two Linux GUI environments are capable of dealing with 
filenames that are invalid UTF-8, in two different ways.

Does this answer your question about GUIs?
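
For completeness, here is how Python 3 itself sees that file: a minimal
sketch, assuming the file created above still exists and a UTF-8 locale.

import os

# str input: names are decoded with the filesystem encoding using the
# surrogateescape error handler (PEP 383), so nothing is lost.
for name in os.listdir('/tmp'):
    if name.startswith('nonascii'):
        print(ascii(name))                            # 'nonascii\udcff-'
        assert os.fsencode(name) == b'nonascii\xff-'  # lossless round trip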


-- 
Steve


Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Neil Girdhar
On Thu, Jan 5, 2017 at 9:10 PM Stephen J. Turnbull <
turnbull.stephen...@u.tsukuba.ac.jp> wrote:

> Paul Moore writes:
>
>  > The debate here regarding tuple/frozenset indicates that there may not
>  > be a "standard way" of hashing an iterable (should order matter?).
>
> If part of the data structure, yes; if an implementation accident, no.
>
>  > Although I agree that assuming order matters is a reasonable
>  > assumption to make in the absence of any better information.
>
> I don't think so.  Eg, with dicts now ordered by insertion, an
> order-dependent default hash for collections means
>
> a = {}
> b = {}
> a['1'] = 1
> a['2'] = 2
> b['2'] = 2
> b['1'] = 1
> hash(a) != hash(b)  # modulo usual probability of collision
>
> (and modulo normally not hashing mutables).  For the same reason I
> expect I'd disagree with Neil's proposal for an ImmutableWhatever
> default __hash__: although the hash comparison is "cheap", it's still a
> pessimization.  Haven't thought that through, though.
>

I don't understand this.  How is providing a default method in an abstract
base class a pessimization?  If it happens to be slower than the code in
the current methods, it can still be overridden.


>
> BTW, it occurs to me that now that dictionaries are versioned, in some
> cases it *may* make sense to hash dictionaries even though they are
> mutable, although the "hash" would need to somehow account for the
> version changing.  Seems messy but maybe someone has an idea?
>

I think it's important to keep in mind that dictionaries are not versioned
in Python. They happen to be versioned in CPython as an unexposed
implementation detail.  I don't think that such details should have any
bearing on potential changes to Python.


> Steve
>

Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread Victor Stinner
2017-01-06 3:10 GMT+01:00 Stephen J. Turnbull:
> The point of this, I suppose, is that piping to xargs works by
> default.

Please read the second (latest) version of my PEP 540, which contains
a new "Use Cases" section that helps to define the issues and the
behaviour of the different modes.


> I haven't read the PEPs (don't have time, mea culpa), but my ideal
> would be three options:
>
> --transparent ->  errors=surrogateescape on input and output
> --postel ->  errors=surrogateescape on input, =strict on output
> --unicode-me-harder ->  errors=strict on input and output

PEP 540:

--postel is the default
--transparent is the UTF-8 mode
--unicode-me-harder is the UTF-8 mode configured to strict

The POSIX locale enables --transparent.

> with --postel being default.  Unix aficionados with lots of xargs use
> can use --transparent.  Since people have different preferences, I
> guess there should be an envvar for this.

The PEP adds a new -X utf8 command line option and a PYTHONUTF8
environment variable to configure the UTF-8 mode.


> Others probably should configure open() by open().

My PEP 540 does change the encoding used by open() by default:
https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler

Obviously, you can still explicitly set the encoding when calling open().
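
For example (a sketch; the filename and codec here are arbitrary):

# An explicit encoding always wins over the UTF-8 mode and the locale:
with open('report.txt', 'w', encoding='latin-1', errors='strict') as f:
    f.write('text')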


> I'll try to get to the PEPs over the weekend but can't promise.

Please read at least the abstract of my PEP 540 ;-)

Victor


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread Stephen J. Turnbull
Victor Stinner writes:

 > Python 3.6 is not exactly in the first or the latter category: "it
 > depends".
 > 
 > To read data from the operating system, Python 3.6 behaves in "UNIX
 > mode": os.listdir() *does* return invalid filenames; it uses a funny
 > encoding based on surrogates.
 > 
 > To write data back to the operating system, Python 3.6 wears its
 > "Unicode nazi" hat and becomes strict. It's no longer possible to write
 > data from the operating system back to the operating system. Writing a
 > filename read from os.listdir() into stdout or into a text file fails
 > with an encode error.
 > 
 > Subtle behaviour: since Python 3.6, with the POSIX locale, Python
 > uses the "UNIX mode" but only when writing into stdout. It's possible
 > to write a filename into stdout, but not into a text file.

The point of this, I suppose, is that piping to xargs works by
default.

I haven't read the PEPs (don't have time, mea culpa), but my ideal
would be three options:

--transparent ->  errors=surrogateescape on input and output
--postel ->  errors=surrogateescape on input, =strict on output
--unicode-me-harder ->  errors=strict on input and output

with --postel being default.  Unix aficionados with lots of xargs use
can use --transparent.  Since people have different preferences, I
guess there should be an envvar for this.

Others probably should configure open() by open().  I'll try to get to
the PEPs over the weekend but can't promise.

Steve


Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Stephen J. Turnbull
Paul Moore writes:

 > The debate here regarding tuple/frozenset indicates that there may not
 > be a "standard way" of hashing an iterable (should order matter?).

If part of the data structure, yes; if an implementation accident, no.

 > Although I agree that assuming order matters is a reasonable
 > assumption to make in the absence of any better information.

I don't think so.  Eg, with dicts now ordered by insertion, an
order-dependent default hash for collections means

a = {}
b = {}
a['1'] = 1
a['2'] = 2
b['2'] = 2
b['1'] = 1
hash(a) != hash(b)  # modulo usual probability of collision

(and modulo normally not hashing mutables).  For the same reason I
expect I'd disagree with Neil's proposal for an ImmutableWhatever
default __hash__: although the hash comparison is "cheap", it's still a
pessimization.  Haven't thought that through, though.

BTW, it occurs to me that now that dictionaries are versioned, in some
cases it *may* make sense to hash dictionaries even though they are
mutable, although the "hash" would need to somehow account for the
version changing.  Seems messy but maybe someone has an idea?
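
A runnable sketch of the point above: dict equality ignores insertion
order, so an order-sensitive hash over the items would violate "a == b
implies hash(a) == hash(b)", while an order-insensitive one would not.

a = {'1': 1}
a['2'] = 2
b = {'2': 2}
b['1'] = 1
assert a == b                                                    # order is ignored
print(hash(tuple(a.items())) == hash(tuple(b.items())))          # False
print(hash(frozenset(a.items())) == hash(frozenset(b.items())))  # True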

Steve


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread Victor Stinner
2017-01-06 2:15 GMT+01:00 INADA Naoki:
>>> Always use UTF-8 (...)
>>Please don't! (...)
>
> For stdio (including console), PYTHONIOENCODING can be used for
> supporting legacy system.
> e.g. `export PYTHONIOENCODING=$(locale charmap)`

The problem with ignoring the locale by default and forcing UTF-8 is
that Python works with many libraries which use the locale, not UTF-8.
PEP 538 also describes mojibake issues if Python is embedded in an
application.


> For command line arguments and file paths, UTF-8/surrogateescape can round trip.
> But mojibake may happen when passing the path to a GUI.

Let's say that you have the filename b'nonascii\xff': it's decoded as
'nonascii\udcff' by the UTF-8 mode. How do GUIs handle such a filename?
(I don't know the answer, it's a real question ;-))


> If we choose "Always use UTF-8 for fs encoding", I think the
> PYTHONFSENCODING envvar should be added again (it should be used from
> startup, to decode command line arguments).

Last time I implemented PYTHONFSENCODING, I had many major issues:
https://mail.python.org/pipermail/python-dev/2010-October/104509.html

Do you mean that these issues are now outdated and that you have an
idea how to fix them?


> 3) Unzipping a zip file sent from Windows. Windows users use non-ASCII
> filenames, and create legacy (non-UTF-8) zip files very often.
>
> I think people using non-UTF-8 encodings should solve encoding issues by
> themselves. People should always use ASCII or UTF-8 if they don't want to
> see mojibake.

ZIP files are out of the scope of PEPs 538 and 540. Python cannot
guess the encoding, so it was proposed to add an option to give the
user the ability to specify an encoding: see
https://bugs.python.org/issue10614 for example.

But yeah, data encoded with encodings other than UTF-8 is still
common, and that's not going to change any time soon. Since many Windows
applications use the ANSI code page, I can easily imagine that many
documents are encoded in various incompatible code pages...

What I understood is that many users don't want Python to complain about
data encoded in different incompatible encodings: process data as a
stream of bytes or of characters, depending on the case. Something
closer to Python 2 (stream of bytes). That's what I try to describe in
this section:
https://www.python.org/dev/peps/pep-0540/#old-data-stored-in-different-encodings-and-surrogateescape

Victor


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread Victor Stinner
Ok, I modified my PEP: the POSIX locale now enables the UTF-8 mode.

2017-01-05 18:10 GMT+01:00 Victor Stinner:
> A common request is that "Python just works" without having to pass a
> command line option or set an environment variable. Maybe the default
> behaviour should be left unchanged, but the behaviour with the POSIX
> locale should change.

http://bugs.python.org/issue28180 asks to "change the default" to get
a Python which "just works" without any kind of configuration, in the
context of a Docker image (I don't have any details about the image yet).


> Maybe we can enable the UTF-8 mode (or "UNIX mode") of the PEP 540
> when the POSIX locale is used?

I read the other issues again and I confirm that users are looking for
a Python 3 which behaves like Python 2: simply don't bother them with
encodings. I see the UTF-8 mode as an opportunity to answer this
request.

Moreover, the most common cause of encoding issues is a program run
with no locale variable set and so using the POSIX locale.

So I modified my PEP 540: the POSIX locale now enables the UTF-8 mode.
I had to update the "Backward Compatibility" section since the PEP now
introduces a backward incompatible change (POSIX locale), but my bet
is that the new behaviour is the one expected by users and that it
cannot break applications.

I moved my initial proposition to the alternatives section.

I added a "Use Cases" section to explain in depth the "always works"
behaviour, which I called the "UNIX mode" in my previous email.

Latest version of the PEP:
https://github.com/python/peps/blob/master/pep-0540.txt

https://www.python.org/dev/peps/pep-0540/ will be updated shortly.

Victor


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread INADA Naoki
>> Always use UTF-8
>> ================
>>
>> Python already always uses the UTF-8 encoding on Mac OS X, Android and
>> Windows. Since UTF-8 became the de facto encoding, it makes sense to
>> always use it on all platforms with any locale.
>
> Please don't! I use different locales and encodings, sometimes it's
> utf-8, sometimes not - but I have properly configured LC_* settings and
> I prefer Python to follow my command. It'd be disgusting if Python
> starts to bend me to its preferences.

For stdio (including console), PYTHONIOENCODING can be used for
supporting legacy system.
e.g. `export PYTHONIOENCODING=$(locale charmap)`

For command line arguments and file paths, UTF-8/surrogateescape can round trip.
But mojibake may happen when passing the path to a GUI.

If we choose "Always use UTF-8 for fs encoding", I think the
PYTHONFSENCODING envvar should be added again (it should be used from
startup, to decode command line arguments).

>
>> The risk is to introduce mojibake if the locale uses a different encoding,
>> especially for locales other than the POSIX locale.
>
> There is no such risk for me as I already have mojibake in my
> systems. Two most notable sources of mojibake are:
>
> 1) FTP servers - people create files (both names and content) in
>different encodings; w32 FTP clients usually send file names and
>content in cp1251 (Russian Windows encoding), sometimes in cp866
>(Russian Windows OEM encoding).
>
> 2) MP3 tags and play lists - almost always cp1251.
>
>So whatever my personal encoding is - koi8-r or utf-8 - I have to
> deal with file names and content in different encodings.

3) Unzipping a zip file sent from Windows. Windows users use non-ASCII
filenames, and create legacy (non-UTF-8) zip files very often.

I think people using non-UTF-8 encodings should solve encoding issues by
themselves. People should always use ASCII or UTF-8 if they don't want to
see mojibake.

>
> Oleg.
> --
>  Oleg Broytman  http://phdru.name/  p...@phdru.name
> Programmers don't die, they just GOSUB without RETURN.


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread Steven D'Aprano
On Thu, Jan 05, 2017 at 04:38:22PM +0100, Victor Stinner wrote:

[...]
> Python 3 promotes Unicode everywhere including filenames. A solution to
> support filenames not decodable from the locale encoding was found: the
> ``surrogateescape`` error handler (`PEP 393
> `_), store undecodable bytes
> as surrogate characters.

PEP 393 is the Flexible String Representation.

I think you want PEP 383, Non-decodable Bytes in System Character 
Interfaces.

https://www.python.org/dev/peps/pep-0383/

> The problem is that operating system data like filenames are decoded
> using the ``surrogateescape`` error handler (PEP 393).

s/393/383/



-- 
Steve


Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Neil Girdhar
On Thu, Jan 5, 2017 at 10:58 AM Paul Moore  wrote:

> On 5 January 2017 at 13:28, Neil Girdhar  wrote:
> > The point is that the OP doesn't want to write his own hash function, but
> > wants Python to provide a standard way of hashing an iterable.  Today, the
> > standard way is to convert to tuple and call hash on that.  That may not be
> > efficient. FWIW from a style perspective, I agree with OP.
>
> The debate here regarding tuple/frozenset indicates that there may not
> be a "standard way" of hashing an iterable (should order matter?).
> Although I agree that assuming order matters is a reasonable
> assumption to make in the absence of any better information.
>

That's another good point.  In keeping with my abc proposal, why not add
abstract base classes with __hash__:
* ImmutableIterable, and
* ImmutableSet.

ImmutableSet inherits from ImmutableIterable, and overrides __hash__ in
such a way that order is ignored.

This presumably involves very little new code — it's just propagating up
the code that's already in set and tuple.

The advantage is that instead of implementing __hash__ for your type, you
declare your intention by inheriting from an abc and get an
automatically-provided hash function.
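
A minimal sketch of what that could look like (the ABC names are
hypothetical; nothing like this exists in the stdlib today):

from collections.abc import Iterable

class ImmutableIterable(Iterable):
    # Hypothetical ABC: order-sensitive default hash over the contents.
    def __hash__(self):
        return hash(tuple(self))

class ImmutableSet(ImmutableIterable):
    # Hypothetical ABC: overrides __hash__ so that order is ignored.
    def __hash__(self):
        return hash(frozenset(self))

class Colors(ImmutableSet):
    def __init__(self, *items):
        self._items = set(items)

    def __iter__(self):
        return iter(self._items)

assert hash(Colors('red', 'blue')) == hash(Colors('blue', 'red'))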

> Hashing is low enough level that providing helpers in the stdlib is
> not unreasonable. It's not obvious (to me, at least) that it's a
> common enough need to warrant it, though. Do we have any information
> on how often people implement their own __hash__, or how often
> hash(tuple(my_iterable)) would be an acceptable hash, except for the
> cost of creating the tuple? The OP's request is the only time this has
> come up as a requirement, to my knowledge. Hence my suggestion to copy
> the tuple implementation, modify it to work with general iterables,
> and publish it as a 3rd party module - its usage might give us an idea
> of how often this need arises. (The other option would be for someone
> to do some analysis of published code).
>
> Assuming it is a sufficiently useful primitive to add, then we can
> debate naming. But I'd prefer it to be named in such a way that it
> makes it clear that it's a low-level helper for people writing their
> own __hash__ function, and not some sort of variant of hashing (which
> hash.from_iterable implies to me).
>
> Paul
>

Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread Victor Stinner
> https://www.python.org/dev/peps/pep-0540/

I read PEP 538, PEP 540, and the issues related to switching to UTF-8. At
least, I can say one thing: people have different points of view :-)

To understand why people disagree, I tried to categorize the different
points of view and expectations of Python:

"UNIX mode":

   Python 2 developers and long-time UNIX users expect that their code
   "just works". They like Python 3 features, but Python 3 annoys them
   with various encoding errors. The expectation is to be able to read
   data encoded in various incompatible encodings and write it into
   stdout or a text file. In short, mojibake is not a bug but a feature!

"Strict Unicode mode" for real Unicode fans:

   Python 3 is strict and it's a good thing! Strict codecs help to
   detect bugs in the code very early. These developers understand
   Unicode very well and are able to fix complex encoding issues.
   Mojibake is a no-no for them.

Python 3.6 is not exactly in the first or the latter category: "it
depends".

To read data from the operating system, Python 3.6 behaves in "UNIX
mode": os.listdir() *does* return invalid filenames; it uses a funny
encoding based on surrogates.

To write data back to the operating system, Python 3.6 wears its
"Unicode nazi" hat and becomes strict. It's no longer possible to write
data from the operating system back to the operating system. Writing a
filename read from os.listdir() into stdout or into a text file fails
with an encode error.
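
In code, the asymmetry looks like this (a sketch; 'ascii' stands in
for the POSIX locale encoding):

name = b'nonascii\xff'.decode('ascii', 'surrogateescape')  # reading "just works"
try:
    name.encode('ascii')               # what writing to a text file does
except UnicodeEncodeError as exc:
    print('strict output fails:', exc)
print(name.encode('ascii', 'surrogateescape'))   # round trip: b'nonascii\xff'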

Subtle behaviour: since Python 3.6, with the POSIX locale, Python
uses the "UNIX mode" but only when writing into stdout. It's possible
to write a filename into stdout, but not into a text file.

In its current shape, my PEP 540 leaves Python's default unchanged, but
adds two modes: UTF-8 and UTF-8 strict. The UTF-8 mode is more or less
the UNIX mode generalized for all inputs and outputs: mojibake is a
feature, just pass bytes through unchanged. The UTF-8 strict mode is
more extreme than the current "Strict Unicode mode" since it fails on
*decoding* data from the operating system.

Now that I have a better view of what we have and what we want, the
question is if the default behaviour should be changed and if yes,
how.

Nick's PEP 538 doesn't exactly move to the "UNIX mode" (open() doesn't
use surrogateescape) nor to the "Strict Unicode mode" (fsdecode() still
uses surrogateescape); it's still in a grey area. Maybe Nick can
elaborate the use case or update his PEP?

I guess that all users and most developers are more in the "UNIX mode"
camp. *If* we want to change the default, I suggest using the "UNIX
mode" by default.

The question is whether someone relies on, or likes, the current Python
3.6 behaviour: reading "just works", writing is strict.

If you like this behaviour, what do you think of the tiny change in
Python 3.6: using surrogateescape for stdout when the locale is POSIX?

Victor


Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Paul Moore
On 5 January 2017 at 13:28, Neil Girdhar  wrote:
> The point is that the OP doesn't want to write his own hash function, but
> wants Python to provide a standard way of hashing an iterable.  Today, the
> standard way is to convert to tuple and call hash on that.  That may not be
> efficient. FWIW from a style perspective, I agree with OP.

The debate here regarding tuple/frozenset indicates that there may not
be a "standard way" of hashing an iterable (should order matter?).
Although I agree that assuming order matters is a reasonable
assumption to make in the absence of any better information.

Hashing is low enough level that providing helpers in the stdlib is
not unreasonable. It's not obvious (to me, at least) that it's a
common enough need to warrant it, though. Do we have any information
on how often people implement their own __hash__, or how often
hash(tuple(my_iterable)) would be an acceptable hash, except for the
cost of creating the tuple? The OP's request is the only time this has
come up as a requirement, to my knowledge. Hence my suggestion to copy
the tuple implementation, modify it to work with general iterables,
and publish it as a 3rd party module - its usage might give us an idea
of how often this need arises. (The other option would be for someone
to do some analysis of published code).

Assuming it is a sufficiently useful primitive to add, then we can
debate naming. But I'd prefer it to be named in such a way that it
makes it clear that it's a low-level helper for people writing their
own __hash__ function, and not some sort of variant of hashing (which
hash.from_iterable implies to me).

Paul


[Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-05 Thread Victor Stinner
Hi,

Nick Coghlan asked me to review his PEP 538 "Coercing the legacy C
locale to C.UTF-8":
https://www.python.org/dev/peps/pep-0538/

Nick wants to change the default behaviour. I'm not sure that I'm
brave enough to follow this direction, so I proposed my old "-X utf8"
command line idea as a new PEP: add a new UTF-8 mode, *disabled by
default*.

These two PEPs are the follow-up to the Windows PEP 529 (Change Windows
filesystem encoding to UTF-8) and to issue #19977 (Use
"surrogateescape" error handler for sys.stdin and sys.stdout on UNIX
for the C locale).

The topic (switching to UTF-8 on UNIX) is actively discussed on:
http://bugs.python.org/issue28180

Read the PEP online (HTML):
https://www.python.org/dev/peps/pep-0540/

Victor


PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner 
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2017
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode, an opt-in option to use UTF-8 for operating system
data instead of the locale encoding. Add a ``-X utf8`` command line option
and a ``PYTHONUTF8`` environment variable.


Context
=======

Locale and operating system data
--------------------------------

Python uses the ``LC_CTYPE`` locale to decide how to encode and decode
data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages
* name of a timezone
* user name, terminal name: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.

At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
``LC_CTYPE`` locale and then stores the locale encoding as
``sys.getfilesystemencoding()``. For the whole lifetime of a Python
process, the same encoding and error handler are used to encode and
decode data from/to the operating system.
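
For example, both encodings can be inspected at runtime (a sketch)::

    import locale
    import sys

    # Encoding used for operating system data, fixed at startup.
    print(sys.getfilesystemencoding())
    # Locale encoding, derived from the LC_CTYPE locale.
    print(locale.getpreferredencoding(False))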

.. note::
   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
   module uses the *current* ``LC_CTYPE`` locale to decode timezone
   names.


The POSIX locale and its encoding
---------------------------------

The following environment variables are used to configure the locale, in
this preference order:

* ``LC_ALL``, most important variable
* ``LC_CTYPE``
* ``LANG``

The POSIX locale, also known as "the C locale", is used:

* if the first set variable is set to ``"C"``
* if all these variables are unset, for example when a program is
  started in an empty environment.

The encoding of the POSIX locale must be ASCII or a superset of ASCII.

On Linux, the POSIX locale uses the ASCII encoding.

On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
the ASCII encoding, whereas the ``mbstowcs()`` and ``wcstombs()`` functions
use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
``os.fsencode()`` and ``os.fsdecode()`` use the
``locale.getpreferredencoding()`` codec. For example, if command line
arguments are decoded by ``mbstowcs()`` and encoded back by
``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
of retrieving the original byte string.

To fix this issue, since Python 3.4, Python checks whether ``mbstowcs()``
really uses the ASCII encoding if the ``LC_CTYPE`` uses the POSIX locale
and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an alias of ASCII).
If not (the effective encoding is not ASCII), Python uses its own ASCII
codec instead of the ``mbstowcs()`` and ``wcstombs()`` functions for
operating system data.

See the `POSIX locale (2016 Edition)
`_.


C.UTF-8 and C.utf8 locales
--------------------------

Some operating systems provide a variant of the POSIX locale using the
UTF-8 encoding:

* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011): ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``

It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
proposal `_.


Popularity of the UTF-8 encoding
--------------------------------

Python 3 uses UTF-8 by default for Python source files.

On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
system data instead of the locale encoding. For Windows, see the `PEP
529: Change Windows filesystem encoding to UTF-8
`_.

On Linux, UTF-8 became the de facto standard encoding,
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
using different encodings for filenames and standard streams is likely
to create mojibake, so UTF-8 is now used *everywhere*.

The UTF-8 encoding is the default encoding of XML and JSON file formats.

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Random832
On Thu, Jan 5, 2017, at 04:00, Matt Gilson wrote:
> But, I think that the problem with adding `__hash__` to
> collections.abc.Iterable is that not all iterables are immutable -- And
> if
> they aren't immutable, then allowing them to be hashed is likely to be a
> pretty bad idea...

Why? This should never cause an interpreter-crashing bug, because
user-defined types can have bad hash methods anyway. And without that,
the reason for not applying the "consenting adults" principle and
allowing people to add mutable objects to a *short-lived* dict without
intending to change them while the dict is in use has never been clear
to me. I think mutable types not having a hash method was a mistake in
the first place.


Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Neil Girdhar
On Thu, Jan 5, 2017 at 4:00 AM Matt Gilson  wrote:

> But, I think that the problem with adding `__hash__` to
> collections.abc.Iterable is that not all iterables are immutable -- And if
> they aren't immutable, then allowing them to be hashed is likely to be a
> pretty bad idea...
>

Good point.  A better option is to add collections.abc.ImmutableIterable
that derives from Iterable and provides __hash__.  Since tuple inherits
from it, it can choose to delegate up.  Then I think everyone is happy.

>
> I'm still having a hard time being convinced that this is very much of an
> optimization at all ...
>
> If you start hashing tuples that are large enough that memory is a
> concern, then that's going to also take a *really* long time and probably
> be prohibitive anyway.  Just for kicks, I decided to throw together a
> simple script to time how much penalty you pay for hashing a tuple:
>
> class F(object):
>     def __init__(self, arg):
>         self.arg = arg
>
>     def __hash__(self):
>         return hash(tuple(self.arg))
>
>
> class T(object):
>     def __init__(self, arg):
>         self.arg = tuple(arg)
>
>     def __hash__(self):
>         return hash(self.arg)
>
>
> class C(object):
>     def __init__(self, arg):
>         self.arg = tuple(arg)
>         self._hash = None
>
>     def __hash__(self):
>         # Cache the hash the first time it is computed.
>         if self._hash is None:
>             self._hash = hash(tuple(self.arg))
>         return self._hash
>
> import timeit
>
> print(timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(500)))'))
> print(timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(500)))'))
> print(timeit.timeit('hash(c)', 'from __main__ import C; c = C(list(range(500)))'))
>
> results = []
> for i in range(1, 11):
>     n = i * 100
>     # Note: n (100..1000 elements), not i, matching "up to 1000 elements" below.
>     t1 = timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(%d)))' % n)
>     t2 = timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(%d)))' % n)
>     results.append(t1 / t2)
> print(results)
>
>
> F is going to create a new tuple each time and then hash it.  T already
> has a tuple, so we'll only pay the cost of hashing a tuple, not the cost of
> constructing a tuple and C caches the hash value and re-uses it once it is
> known.  C is the winner by a factor of 10 or more (no surprise there).  But
> the real interesting thing is that the ratio of the timing results from
> hashing `F` vs. `T` is relatively constant in the range of my test (up to
> 1000 elements) and that ratio's value is approximately 1.3.  For most
> applications, that seems reasonable.  If you really need a speed-up, then I
> suppose you could recode the thing in Cython and see what happens, but I
> doubt that will be frequently necessary.  If you _do_ code it up in Cython,
> put it up on Pypi and see if people use it...
>
>
> On Wed, Jan 4, 2017 at 5:04 PM, Neil Girdhar 
> wrote:
>
> Couldn't you add __hash__ to collections.abc.Iterable ?  Essentially,
> expose __hash__ there; then all iterables automatically have a default hash
> that hashes their ordered contents.
>
> On Wednesday, January 4, 2017 at 7:37:26 PM UTC-5, Steven D'Aprano wrote:
>
> On Wed, Jan 04, 2017 at 04:38:05PM -0500, j...@math.brown.edu wrote:
> > Instead of the proposals like "hash.from_iterable()", would it make
> sense
> > to allow tuple.__hash__() to accept any iterable, when called as a
> > classmethod?
>
> The public API for calculating the hash of something is to call the
> hash() builtin function on some object, e.g. to call tuple.__hash__ you
> write hash((a, b, c)). The __hash__ dunder method is implementation, not
> interface, and normally shouldn't be called directly.
>
> Unless I'm missing something obvious, your proposal would require the
> caller to call the dunder methods directly:
>
> class X:
>     def __hash__(self):
>         return tuple.__hash__(iter(self))
>
> I consider that a poor interface design.
>
> But even if we decide to make an exception in this case, tuple.__hash__
> is currently an ordinary instance method right now. There's probably
> code that relies on that fact and expects that:
>
> tuple.__hash__((a, b, c))
>
> is currently the same as
>
> (a, b, c).__hash__()
>
>
> (Starting with the hash() builtin itself, I expect, although that is
> easy enough to fix if needed.) Your proposal will break backwards
> compatibility, as it requires a change in semantics:
>
> (1) (a, b, c).__hash__() must keep the current behaviour, which
> means behaving like a bound instance method;
>
> (2) But tuple.__hash__ will no longer return an unbound method (actually
> a function object, but the difference is unimportant) and instead will
> return something that behaves like a bound class method.
>
> Here's an implementation which does this:
>
> http://code.activestate.com/recipes/577030-dualmethod-descriptor/
>
> so such a thing is possible. But it breaks backwards-compatibility and
> introduces something which I consider to be an unclean API (calling a
> dunder method directly).

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread M.-A. Lemburg
On 28.12.2016 04:13, j...@math.brown.edu wrote:
> Suppose you have implemented an immutable Position type to represent
> the state of a game played on an MxN board, where the board size can
> grow quite large.
> ...
> 
> According to
> https://docs.python.org/3/reference/datamodel.html#object.__hash__ :
> 
> 
> """
> it is advised to mix together the hash values of the components of the
> object that also play a part in comparison of objects by packing them
> into a tuple and hashing the tuple. Example:
> 
> def __hash__(self):
> return hash((self.name, self.nick, self.color))
> 
> """
> 
> 
> Applying this advice to the use cases above would require creating an
> arbitrarily large tuple in memory before passing it to hash(), which
> is then just thrown away. It would be preferable if there were a way
> to pass multiple values to hash() in a streaming fashion, such that
> the overall hash were computed incrementally, without building up a
> large object in memory first.

I think there's a misunderstanding here: the hash(obj) built-in
merely interfaces to the obj.__hash__() method (or the tp_hash slot
for C types) and returns whatever these methods give.

It doesn't implement any logic by itself.

If you would like to implement a more efficient hash algorithm
for your types, just go ahead and write it as a .__hash__()
method or tp_hash slot method and you're done.

The example from the docs is just to showcase an example of
how such a hash function should work, i.e. to mix in all
relevant data attributes.

In your case, you'd probably use a simple for loop to calculate
the hash without creating tuples or any other temporary
structures.
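
For example, a sketch along those lines (the Position class is
illustrative; the mixing constants mirror the tuple hash shown below):

class Position:
    def __init__(self, cells):
        self._cells = list(cells)   # board state, treated as immutable

    def __hash__(self):
        # Incremental mix over the elements: no temporary tuple is built.
        x = 0x345678
        mult = 1000003              # _PyHASH_MULTIPLIER
        n = len(self._cells)
        for item in self._cells:
            n -= 1
            x = ((x ^ hash(item)) * mult) & 0xFFFFFFFFFFFFFFFF
            mult += 82520 + 2 * n
        return x + 97531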

Here's the hash implementation tuples use as an example

/* The addend 82520, was selected from the range(0, 1000000) for
   generating the greatest number of prime multipliers for tuples
   up to length eight:

     1082527, 1165049, 1082531, 1165057, 1247581, 1330103, 1082533,
     1330111, 1412633, 1165069, 1247599, 1495177, 1577699

   Tests have shown that it's not worth to cache the hash value, see
   issue #9685.
*/

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    Py_uhash_t x;  /* Unsigned for defined overflow behavior. */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    Py_uhash_t mult = _PyHASH_MULTIPLIER;
    x = 0x345678UL;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (Py_hash_t)(82520UL + len + len);
    }
    x += 97531UL;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}

As you can see, there's some magic going on there to make
sure that the hash values behave well when used as "keys"
for the dictionary implementation (which is their main
purpose in Python).

You are free to create your own hash implementation.
The only requirement is that objects which compare equal
give the same hash value. This is needed to be able to map
such objects to the same dictionary slots.

There should be no need to have a special hash function which
works on iterables. As long as those iterable objects define
their own .__hash__() method or tp_slot, the hash() built-in
(and Python's dict implementation) will use these and, if needed,
those methods can then use an approach to build hash values
using iterators on the object's internal data along similar
lines as the above tuple implementation.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jan 05 2017)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...   http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...   http://zope.egenix.com/


::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
  http://www.malemburg.com/



Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Matt Gilson
But, I think that the problem with adding `__hash__` to
collections.abc.Iterable is that not all iterables are immutable -- And if
they aren't immutable, then allowing them to be hashed is likely to be a
pretty bad idea...

I'm still having a hard time being convinced that this is very much of an
optimization at all ...

If you start hashing tuples that are large enough that memory is a concern,
then that's going to also take a *really* long time and probably be
prohibitive anyway.  Just for kicks, I decided to throw together a simple
script to time how much penalty you pay for hashing a tuple:

class F(object):
    def __init__(self, arg):
        self.arg = arg

    def __hash__(self):
        return hash(tuple(self.arg))


class T(object):
    def __init__(self, arg):
        self.arg = tuple(arg)

    def __hash__(self):
        return hash(self.arg)


class C(object):
    def __init__(self, arg):
        self.arg = tuple(arg)
        self._hash = None

    def __hash__(self):
        # Cache the hash the first time it is computed.
        if self._hash is None:
            self._hash = hash(tuple(self.arg))
        return self._hash

import timeit

print(timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(500)))'))
print(timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(500)))'))
print(timeit.timeit('hash(c)', 'from __main__ import C; c = C(list(range(500)))'))

results = []
for i in range(1, 11):
    n = i * 100
    # Note: n (100..1000 elements), not i, matching "up to 1000 elements" below.
    t1 = timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(%d)))' % n)
    t2 = timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(%d)))' % n)
    results.append(t1 / t2)
print(results)


F is going to create a new tuple each time and then hash it.  T already has
a tuple, so we'll only pay the cost of hashing a tuple, not the cost of
constructing a tuple and C caches the hash value and re-uses it once it is
known.  C is the winner by a factor of 10 or more (no surprise there).  But
the real interesting thing is that the ratio of the timing results from
hashing `F` vs. `T` is relatively constant in the range of my test (up to
1000 elements) and that ratio's value is approximately 1.3.  For most
applications, that seems reasonable.  If you really need a speed-up, then I
suppose you could recode the thing in Cython and see what happens, but I
doubt that will be frequently necessary.  If you _do_ code it up in Cython,
put it up on Pypi and see if people use it...


On Wed, Jan 4, 2017 at 5:04 PM, Neil Girdhar  wrote:

> Couldn't you add __hash__ to collections.abc.Iterable ?  Essentially,
> expose __hash__ there; then all iterables automatically have a default hash
> that hashes their ordered contents.
>
> On Wednesday, January 4, 2017 at 7:37:26 PM UTC-5, Steven D'Aprano wrote:
>>
>> On Wed, Jan 04, 2017 at 04:38:05PM -0500, j...@math.brown.edu wrote:
>> > Instead of the proposals like "hash.from_iterable()", would it make
>> sense
>> > to allow tuple.__hash__() to accept any iterable, when called as a
>> > classmethod?
>>
>> The public API for calculating the hash of something is to call the
>> hash() builtin function on some object, e.g. to call tuple.__hash__ you
>> write hash((a, b, c)). The __hash__ dunder method is implementation, not
>> interface, and normally shouldn't be called directly.
>>
>> Unless I'm missing something obvious, your proposal would require the
>> caller to call the dunder methods directly:
>>
>> class X:
>>     def __hash__(self):
>>         return tuple.__hash__(iter(self))
>>
>> I consider that a poor interface design.
>>
>> But even if we decide to make an exception in this case, tuple.__hash__
>> is currently an ordinary instance method right now. There's probably
>> code that relies on that fact and expects that:
>>
>> tuple.__hash__((a, b, c))
>>
>> is currently the same as
>>
>> (a, b, c).__hash__()
>>
>>
>> (Starting with the hash() builtin itself, I expect, although that is
>> easy enough to fix if needed.) Your proposal will break backwards
>> compatibility, as it requires a change in semantics:
>>
>> (1) (a, b, c).__hash__() must keep the current behaviour, which
>> means behaving like a bound instance method;
>>
>> (2) But tuple.__hash__ will no longer return an unbound method (actually
>> a function object, but the difference is unimportant) and instead will
>> return something that behaves like a bound class method.
>>
>> Here's an implementation which does this:
>>
>> http://code.activestate.com/recipes/577030-dualmethod-descriptor/
>>
>> so such a thing is possible. But it breaks backwards-compatibility and
>> introduces something which I consider to be an unclean API (calling a
>> dunder method directly). Unless there's a *really* strong advantage to
>>
>> tuple.__hash__(...)
>>
>> over
>>
>> hash.from_iterable(...)
>>
>> (or equivalent), I would be against this change.
>>
>>
>>
>> > (And similarly with frozenset.__hash__(), so that the fast C
>> > implementation of that algorithm could be used, 

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Paul Moore
On 5 January 2017 at 00:31, Steven D'Aprano  wrote:
> This is a good point. Until now, I've been assuming that
> hash.from_iterable should consider order. But frozenset shows us that
> sometimes the hash should *not* consider order.
>
> This hints that perhaps the hash.from_iterable() should have its own
> optional dunder method. Or maybe we need two functions: an ordered
> version and an unordered version.
>
> Hmmm... just tossing out a wild idea here... let's get rid of the dunder
> method part of your suggestion, and add new public class methods to
> tuple and frozenset:
>
> tuple.hash_from_iter(iterable)
> frozenset.hash_from_iter(iterable)
>
>
> That gets rid of all the objections about backwards compatibility, since
> these are new methods. They're not dunder names, so there are no
> objections to being used as part of the public API.
>
> A possible objection is the question, is this functionality *actually*
> important enough to bother?
>
> Another possible objection: are these methods part of the sequence/set
> API? If not, do they really belong on the tuple/frozenset? Maybe they
> belong elsewhere?

At this point I'd be inclined to say that a 3rd party hashing_utils
module would be a reasonable place to thrash out these design
decisions before committing to a permanent design in the stdlib. The
popularity of such a module would also give a level of indication as
to whether this is an important optimisation in practice.
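
Something like the following could seed such a module (a sketch; the
names and the mixing scheme are placeholders, not a committed design):

def hash_ordered(iterable):
    # Order-sensitive: mixes like tuple hashing, without building a tuple.
    x = 0x345678
    for item in iterable:
        x = ((x ^ hash(item)) * 1000003) & 0xFFFFFFFFFFFFFFFF
    return x

def hash_unordered(iterable):
    # Order-insensitive: XOR is commutative, so element order cannot matter.
    x = 0
    for item in iterable:
        x ^= hash(item)
    return x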

Paul