subject:"hash\(\)"

Re: Password Hash Validation (Posting On Python-List Prohibited)

2024-07-12 Thread Lawrence D'Oliveiro via Python-list

On Fri, 21 Jun 2024 06:32:58 - (UTC), I wrote:

> On Fri, 21 Jun 2024 03:40:55 - (UTC), I wrote:
> 
>> I think I will create my own wrapper using ctypes.
> 
> Done .

The repo now includes an example script that exercises the various 
functions of the module.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-16 Thread Cameron Simpson

On 16Mar2022 14:00, Marco Sulla  wrote:
>On Wed, 16 Mar 2022 at 00:42, Cameron Simpson  wrote:
>> >In this case I currently cache the value -1. The subsequent calls to
>> >__hash__() will check if the value is -1. If so, a TypeError is
>> >immediately raised.
>>
>> This will also make these values behave badly in dicts/sets, as they all
>> hash to the same bucket.
>
>Not sure to understand. If the hash is -1, it's not hashable, so it
>can't be a member of a dict or set.

Sorry, I misread you and took -1 to be a valid hash code, not a signal 
to raise TypeError.

Cheers,
Cameron Simpson 
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-16 Thread Marco Sulla

On Wed, 16 Mar 2022 at 09:11, Chris Angelico  wrote:
> Caching the hash of a
> string is very useful; caching the hash of a tuple, not so much; again
> quoting from the CPython source code:
>
> /* Tests have shown that it's not worth to cache the hash value, see
>https://bugs.python.org/issue9685 */

This is really interesting. Unluckily I can't use the pyperformance
benchmarks. I should use the code that uses frozendict, but I suppose
it's really hard...
Anyway this discourages me to continue to store unashable value, since
I should also store the error message. Storing only the hash when the
object is hashable is much cheaper, and maybe the extra field is not
so much a problem, since dict consumes more space than a tuple:

>>> sys.getsizeof({})
64
>>> sys.getsizeof(())
40
>>> sys.getsizeof({1:1})
232
>>> sys.getsizeof((1,))
48

> I don't know what use-cases frozendicts have, but I would
> suspect that if they are used at all, they'll often be used in cases
> where their keys are identical (for instance, the __dict__ of an
> immutable object type, where the keys are extremely stable across
> objects of the same type).

Well, I tried to implement them as dicts with shared keys, but I
abandoned it when Inada optimized dict(another_dict), where
another_dict is a compact dict. Since it's compact, you have "only" to
memcopy the entries (oversimplification).

I tried to do the same trick for the sparse dict structure, but
memcopy the keys and the values was not enough. I had to incref all
value *two* times and this slowed down the creation a lot. So I
decided to move to compact structure.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-16 Thread Greg Ewing

On 16/03/22 6:58 pm, Chris Angelico wrote:
And Python's own integers hash to themselves, 

> which isn't what I'd call "widely distributed", but which
> works well in practice.

Not exactly, they seem to be limited to 60 bits:

>>> hex(hash(0xfff))
'0xfff'
>>> hex(hash(0x1fff))
'0x0'

And up to that limit they're as widely distributed as you
can get -- each integer hashes to a unique value!

But keep in mind that the hash value itself is not directly
used to locate a dict slot -- there is extra scrambling that
goes on in the dict code.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-16 Thread Marco Sulla

On Wed, 16 Mar 2022 at 00:59, Chris Angelico  wrote:
>
> (Though it's a little confusing; a frozendict has to have nothing but
> immutable objects, yet it permits them to be unhashable?

It can have mutable objects. For example, a key k can have a list v as
value. You can modify v, but you can't assign to the key k another
value w. It's the same with the tuples, as you said. An index i can
contain a list l. Since it's a tuple, you can't set another object at
the index i, but you can modify the list l.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-16 Thread Marco Sulla

On Wed, 16 Mar 2022 at 00:42, Cameron Simpson  wrote:
>
> Is it sensible to compute the hash only from the immutable parts?
> Bearing in mind that usually you need an equality function as well and
> it may have the same stability issues.

[...]

> In that case I would be inclined to never raise TypeError at all. I'd
> compute the hash entirely from the keys of the dict and compute equality
> in the normal fashion: identical keys and then equal corresponding
> values. That removes the requirement that values be immutable and/or
> hashable.

Well, I followed PEP 416, so I allowed immutable types for values, as
tuple does. Also tuple is hashable only if all its values are
hashable.

The equality function is the same as dict, with a little modification.
I do not check for hash in equality. I could add that, if both the
hash are calculated and different from -1 and they differ, False is
returned.

> >In this case I currently cache the value -1. The subsequent calls to
> >__hash__() will check if the value is -1. If so, a TypeError is
> >immediately raised.
>
> This will also make these values behave badly in dicts/sets, as they all
> hash to the same bucket.

Not sure to understand. If the hash is -1, it's not hashable, so it
can't be a member of a dict or set.

> You could, you know, cache the original exception.

I thought about it :) What prevented me is that is another PySsize_t
to store in memory
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-16 Thread Cameron Simpson

On 16Mar2022 19:09, Chris Angelico  wrote:
>On Wed, 16 Mar 2022 at 18:48, Cameron Simpson  wrote:
>> This may be because the "raw" hash (in this case the int value) is
>> itself further hashed (giving more evenness) and then moduloed into the
>> finite number of slots in the dict.
>
>Not in a CPython dict, no. (There is some collision management that
>uses upper bits, but the initial selection of bucket simply uses the
>raw hash.) See comments at the top of dictobject.c.

Thank you, I stand corrected. That's a very interesting read.

Cheers,
Cameron Simpson 
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-16 Thread Chris Angelico

On Wed, 16 Mar 2022 at 18:48, Cameron Simpson  wrote:
>
> On 16Mar2022 16:58, Chris Angelico  wrote:
> >On Wed, 16 Mar 2022 at 14:00, Cameron Simpson  wrote:
> >> On 16Mar2022 10:57, Chris Angelico  wrote:
> >> A significant difference is that tuples have no keys, unlike a dict.
> >>
> >> A hash does not have to hash all the internal state, ony the relevant
> >> state, and not even all of that. The objective of the hash is twofold to
> >> my mind:
> >> - that "equal" objects (in the `==` sense) have the same hash, so that
> >>   they hash to the same backet in dicts and can therefore be found
> >> - that hash values are widely distributed, to maximise the evenness of
> >>   the object distribution in buckets
> >>
> >> For dicts to work, the former has to hold.
> >
> >Python only demands the first one.
>
> I think we're in violent agreement here.
>
> >And Python's own integers hash to
> >themselves, which isn't what I'd call "widely distributed", but which
> >works well in practice.
>
> This may be because the "raw" hash (in this case the int value) is
> itself further hashed (giving more evenness) and then moduloed into the
> finite number of slots in the dict.

Not in a CPython dict, no. (There is some collision management that
uses upper bits, but the initial selection of bucket simply uses the
raw hash.) See comments at the top of dictobject.c.

"""
This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints. So
this gives better-than-random behavior in common cases, and that's very
desirable.
"""

> >> The latter has more flexibility. A dict has keys. If the dicts are quite
> >> varied (sets of tags for example), it may be effective to just hash the
> >> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
> >> likewise with database rows) this will go badly because the hashes will
> >> all (or maybe mostly) collide.
> >
> >The general recommendation is to use all the same data for hashing as
> >you use for equality checking. What's the point of just hashing the
> >keys, if the values also contribute to equality? I'd just keep it
> >simple, and hash both.
>
> Well, hash functions want to be fast. The case where you look up a key
> in a dict (which needs both the hash and a comparison) is not the only
> situation. When a dict's bucket array is resized, the hash functions of
> all stored objects require recomputation, but the equality test is _not_
> required. Is the hash is very cheap, that is a large benefit here.

Fast is less important than accurate. Hashing functions are usually
*fast enough*, and since the vast majority of the work of hashing is
usually going to be strings, every other data type can just
recalculate its hash based on its children. Caching the hash of a
string is very useful; caching the hash of a tuple, not so much; again
quoting from the CPython source code:

/* Tests have shown that it's not worth to cache the hash value, see
   https://bugs.python.org/issue9685 */

> I just took a look at the CPython hashtable.c, you can see the rehashing
> process here:
>
> 
> https://github.com/python/cpython/blob/7c776521418676c074a483266339d31c950f516e/Python/hashtable.c#L280

Not sure where that file is used; it's purely internal to CPython and
seems to be used for tracemalloc. It's not the dictionary type -
that's in Objects/dictobject.c.

In any case, the rehash operation is basically for after a resize on
the hashtable, and it doesn't rely on the objects caching their hashes
- it retains them in the hashtable.

> Just an aside, that's got a bunch of interesting opimitsations in it,
> people has put serious work into making this fast and it remains very
> readable!
>
> Suppose I have a nested tree of (hashable) frozen dicts of whatever
> flavour. An equality test requires a deep comparison all the way down
> including the values. You could make a much cheaper hash function which
> hashed just keys, yea, even only unto the upper layer or 2, and still
> meet the primary criterion: all equal items have the same hash value.

Yes, but "def __hash__(self): return 42" also meets the primary
criterion. I don't know what use-cases frozendicts have, but I would
suspect that if they are used at all, they'll often be used in cases
where their keys are identical (for instance, the __dict__ of an
immutable object type, where the keys are extremely stable across
objects of the same type).

> My recommendation is not inherently to has

Re: Best practice for caching hash

2022-03-16 Thread Cameron Simpson

On 16Mar2022 16:58, Chris Angelico  wrote:
>On Wed, 16 Mar 2022 at 14:00, Cameron Simpson  wrote:
>> On 16Mar2022 10:57, Chris Angelico  wrote:
>> A significant difference is that tuples have no keys, unlike a dict.
>>
>> A hash does not have to hash all the internal state, ony the relevant
>> state, and not even all of that. The objective of the hash is twofold to
>> my mind:
>> - that "equal" objects (in the `==` sense) have the same hash, so that
>>   they hash to the same backet in dicts and can therefore be found
>> - that hash values are widely distributed, to maximise the evenness of
>>   the object distribution in buckets
>>
>> For dicts to work, the former has to hold.
>
>Python only demands the first one.

I think we're in violent agreement here.

>And Python's own integers hash to
>themselves, which isn't what I'd call "widely distributed", but which
>works well in practice.

This may be because the "raw" hash (in this case the int value) is 
itself further hashed (giving more evenness) and then moduloed into the 
finite number of slots in the dict.

>> The latter has more flexibility. A dict has keys. If the dicts are quite
>> varied (sets of tags for example), it may be effective to just hash the
>> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
>> likewise with database rows) this will go badly because the hashes will
>> all (or maybe mostly) collide.
>
>The general recommendation is to use all the same data for hashing as
>you use for equality checking. What's the point of just hashing the
>keys, if the values also contribute to equality? I'd just keep it
>simple, and hash both.

Well, hash functions want to be fast. The case where you look up a key 
in a dict (which needs both the hash and a comparison) is not the only 
situation. When a dict's bucket array is resized, the hash functions of 
all stored objects require recomputation, but the equality test is _not_ 
required. Is the hash is very cheap, that is a large benefit here.

I just took a look at the CPython hashtable.c, you can see the rehashing 
process here:

https://github.com/python/cpython/blob/7c776521418676c074a483266339d31c950f516e/Python/hashtable.c#L280

Just an aside, that's got a bunch of interesting opimitsations in it, 
people has put serious work into making this fast and it remains very 
readable!

Suppose I have a nested tree of (hashable) frozen dicts of whatever 
flavour. An equality test requires a deep comparison all the way down 
including the values. You could make a much cheaper hash function which 
hashed just keys, yea, even only unto the upper layer or 2, and still 
meet the primary criterion: all equal items have the same hash value.

My recommendation is not inherently to hash everything involved in the 
equality test. If that's cheap, then sure. But it isn't necessary and 
(given a suitable use domain) it may even be very detrimental to hash 
everything involved in equality, as in the example above.

Cheers,
Cameron Simpson 
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-16 Thread Chris Angelico

On Wed, 16 Mar 2022 at 14:00, Cameron Simpson  wrote:
>
> On 16Mar2022 10:57, Chris Angelico  wrote:
> >> Is it sensible to compute the hash only from the immutable parts?
> >> Bearing in mind that usually you need an equality function as well and
> >> it may have the same stability issues.
> >
> >My understanding - and I'm sure Marco will correct me if I'm wrong
> >here - is that this behaves like a tuple: if it contains nothing but
> >hashable objects, it is itself hashable, but if it contains anything
> >unhashable, the entire tuple isn't hashable.
>
> A significant difference is that tuples have no keys, unlike a dict.
>
> A hash does not have to hash all the internal state, ony the relevant
> state, and not even all of that. The objective of the hash is twofold to
> my mind:
> - that "equal" objects (in the `==` sense) have the same hash, so that
>   they hash to the same backet in dicts and can therefore be found
> - that hash values are widely distributed, to maximise the evenness of
>   the object distribution in buckets
>
> For dicts to work, the former has to hold.

Python only demands the first one. And Python's own integers hash to
themselves, which isn't what I'd call "widely distributed", but which
works well in practice.

> The latter has more flexibility. A dict has keys. If the dicts are quite
> varied (sets of tags for example), it may be effective to just hash the
> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
> likewise with database rows) this will go badly because the hashes will
> all (or maybe mostly) collide.

The general recommendation is to use all the same data for hashing as
you use for equality checking. What's the point of just hashing the
keys, if the values also contribute to equality? I'd just keep it
simple, and hash both.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-15 Thread Cameron Simpson

On 16Mar2022 10:57, Chris Angelico  wrote:
>> Is it sensible to compute the hash only from the immutable parts?
>> Bearing in mind that usually you need an equality function as well and
>> it may have the same stability issues.
>
>My understanding - and I'm sure Marco will correct me if I'm wrong
>here - is that this behaves like a tuple: if it contains nothing but
>hashable objects, it is itself hashable, but if it contains anything
>unhashable, the entire tuple isn't hashable.

A significant difference is that tuples have no keys, unlike a dict.

A hash does not have to hash all the internal state, ony the relevant 
state, and not even all of that. The objective of the hash is twofold to 
my mind:
- that "equal" objects (in the `==` sense) have the same hash, so that 
  they hash to the same backet in dicts and can therefore be found
- that hash values are widely distributed, to maximise the evenness of 
  the object distribution in buckets

For dicts to work, the former has to hold.

The latter has more flexibility. A dict has keys. If the dicts are quite 
varied (sets of tags for example), it may be effective to just hash the 
keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or 
likewise with database rows) this will go badly because the hashes will 
all (or maybe mostly) collide.

>As such, any valid hash value will be stable, and "asking for a hash
>will raise TypeError" is also stable.

I would seek to avoid a TypeError for a frozendict, but as you can see 
above I have not thought of a way to do that which would also have 
desireable hash characteristics in almost all circumstances. (I think we 
can accept that almost anything will have pathological cases, but the 
bad cases in my hash-the-keys notion are not, to my mind, rare.)

>> >The problem is the first time I get an error with details, for example:
>> >TypeError: unhashable type: 'list'
>> >The subsequent times I simply raise a generic error:
>> >TypeError
>>
>> You could, you know, cache the original exception. That does keep links
>> to the traceback and therefore all the associates stack frames, so that
>> isn't cheap (it is peerfectly fast - just the reference, t just prevents
>> those things from being reclaimed).
>
>I don't like that idea myself, for that exact reason - it'll keep
>arbitrary numbers of objects alive.

I don't like it either, for that exact reason. That reason is the same 
reason which has Python 3 exception variables get unset as you leave an 
`except` clause. I'm sure it irks everyone, but the memory penalty of 
not doing so is high.

>But caching the stringified form
>would be more reasonable here, and have similar effect.

Mmm, yes.

Cheers,
Cameron Simpson 
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-15 Thread Chris Angelico

On Wed, 16 Mar 2022 at 10:42, Cameron Simpson  wrote:
>
> On 12Mar2022 21:45, Marco Sulla  wrote:
> >I have a custom immutable object, and I added a cache for its hash
> >value. The problem is the object can be composed of mutable or
> >immutable objects, so the hash can raise TypeError.
>
> Is it sensible to compute the hash only from the immutable parts?
> Bearing in mind that usually you need an equality function as well and
> it may have the same stability issues.

My understanding - and I'm sure Marco will correct me if I'm wrong
here - is that this behaves like a tuple: if it contains nothing but
hashable objects, it is itself hashable, but if it contains anything
unhashable, the entire tuple isn't hashable.

(Though it's a little confusing; a frozendict has to have nothing but
immutable objects, yet it permits them to be unhashable? I know of
hashable mutable objects, but can't think of any unhashable immutable
objects. And I'm not sure whether a hashable-mutable would be
permitted in a frozendict, or whether it'd even know.)

As such, any valid hash value will be stable, and "asking for a hash
will raise TypeError" is also stable.

> >The problem is the first time I get an error with details, for example:
> >
> >TypeError: unhashable type: 'list'
> >
> >The subsequent times I simply raise a generic error:
> >
> >TypeError
>
> You could, you know, cache the original exception. That does keep links
> to the traceback and therefore all the associates stack frames, so that
> isn't cheap (it is peerfectly fast - just the reference, t just prevents
> those things from being reclaimed).

I don't like that idea myself, for that exact reason - it'll keep
arbitrary numbers of objects alive. But caching the stringified form
would be more reasonable here, and have similar effect.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-15 Thread Cameron Simpson

On 12Mar2022 21:45, Marco Sulla  wrote:
>I have a custom immutable object, and I added a cache for its hash
>value. The problem is the object can be composed of mutable or
>immutable objects, so the hash can raise TypeError.

Is it sensible to compute the hash only from the immutable parts?  
Bearing in mind that usually you need an equality function as well and 
it may have the same stability issues.

>In this case I currently cache the value -1. The subsequent calls to
>__hash__() will check if the value is -1. If so, a TypeError is
>immediately raised.

This will also make these values behave badly in dicts/sets, as they all 
hash to the same bucket. The performance of a hash index is pretty 
dependent on having roughly even distribution of objects amongst the 
hash slots.

>The problem is the first time I get an error with details, for example:
>
>TypeError: unhashable type: 'list'
>
>The subsequent times I simply raise a generic error:
>
>TypeError

You could, you know, cache the original exception. That does keep links 
to the traceback and therefore all the associates stack frames, so that 
isn't cheap (it is peerfectly fast - just the reference, t just prevents 
those things from being reclaimed).

>Ok, I can improve it by raising, for example, TypeError: not all
>values are hashable. But do you think this is acceptable? Now I'm
>thinking about it and it seems a little hacky to me.

That's a policy issue. "Acceptable" depends on the imagined use cases.  
Im' just having a read of your intro: 
https://github.com/Marco-Sulla/python-frozendict/blob/35611f4cd869383678104dc94f82aa636c20eb24/README.md

So your objective is that a frozendict can itself be hashable, allowing, 
say, sets of frozendicts?

In that case I would be inclined to never raise TypeError at all. I'd 
compute the hash entirely from the keys of the dict and compute equality 
in the normal fashion: identical keys and then equal corresponding 
values. That removes the requirement that values be immutable and/or 
hashable.

It that workable?

Cheers,
Cameron Simpson 
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-15 Thread Marco Sulla

On Sat, 12 Mar 2022 at 22:37, <2qdxy4rzwzuui...@potatochowder.com> wrote:
> Once hashing an object fails, why would an application try again?  I can
> see an application using a hashable value in a hashable situation again
> and again and again (i.e., taking advantage of the cache), but what's
> the use case for *repeatedly* trying to use an unhashable value again
> and again and again (i.e., taking advantage of a cached failure)?

Honestly? Don't know. Maybe because the object is passed to different
functions and all of them independently test the hashability? I'm
clutching at straws.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Best practice for caching hash

2022-03-12 Thread 2QdxY4RzWzUUiLuE

On 2022-03-12 at 21:45:56 +0100,
Marco Sulla  wrote:

[ ... ]

> So if I do not cache if the object is unhashable, I save a little
> memory per object (1 int) and I get a better error message every time.

> On the other hand, if I leave the things as they are, testing the
> unhashability of the object multiple times is faster. The code:
> 
> try:
> hash(o)
> except TypeError:
> pass
> 
> execute in nanoseconds, if called more than 1 time, even if o is not
> hashable. Not sure if this is a big advantage.

Once hashing an object fails, why would an application try again?  I can
see an application using a hashable value in a hashable situation again
and again and again (i.e., taking advantage of the cache), but what's
the use case for *repeatedly* trying to use an unhashable value again
and again and again (i.e., taking advantage of a cached failure)?

So I think that caching the failure is a lot of extra work for no
benefit.
-- 
https://mail.python.org/mailman/listinfo/python-list

Best practice for caching hash

2022-03-12 Thread Marco Sulla

I have a custom immutable object, and I added a cache for its hash
value. The problem is the object can be composed of mutable or
immutable objects, so the hash can raise TypeError.

In this case I currently cache the value -1. The subsequent calls to
__hash__() will check if the value is -1. If so, a TypeError is
immediately raised.

The problem is the first time I get an error with details, for example:

TypeError: unhashable type: 'list'

The subsequent times I simply raise a generic error:

TypeError

Ok, I can improve it by raising, for example, TypeError: not all
values are hashable. But do you think this is acceptable? Now I'm
thinking about it and it seems a little hacky to me.

Furthermore, in the C extension I have to define another property in
the struct, ma_hash_calculated, to track if the hash value is cached
or not, since there's no bogus value I can use in cache property,
ma_hash, to signal this. If I don't cache unhashable values, -1 can be
used to signal that ma_hash contains no cached value.

So if I do not cache if the object is unhashable, I save a little
memory per object (1 int) and I get a better error message every time.

On the other hand, if I leave the things as they are, testing the
unhashability of the object multiple times is faster. The code:

try:
hash(o)
except TypeError:
pass

execute in nanoseconds, if called more than 1 time, even if o is not
hashable. Not sure if this is a big advantage.

What do you think about? Here is the python code:
https://github.com/Marco-Sulla/python-frozendict/blob/35611f4cd869383678104dc94f82aa636c20eb24/frozendict/src/3_10/frozendictobject.c#L652-L697
--
https://mail.python.org/mailman/listinfo/python-list

[issue33857] python exception on Solaris : code for hash blake2b was not found

2022-01-06 Thread Irit Katriel



Irit Katriel  added the comment:

Python 3.6 is no longer maintained. Please create a new issue if you are seeing 
this problem on a current (>= 3.9) version of Python.

--
nosy: +iritkatriel
resolution:  -> out of date
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-11-24 Thread Christoph Groth



Christoph Groth  added the comment:

> What concrete action would you propose that the Python core devs take at this 
> point?

Nothing for now.

I stumbled across this issue through 
https://gitlab.kwant-project.org/kwant/tinyarray/-/issues/20 and had the 
impression that the aspect that I raised (that, for example, hash values of 
immutable built-in objects now no longer survive pickling) was not examined in 
this otherwise in-depth discussion.  So I added it for reference.

If problems come up that are caused by this change, I would consider reverting 
it a possible solution.

> The result of the change linked to this PR is that the hash now also reflects 
> that containment depends on object identity, not just object value.

This is a nice way to phrase it.  Thanks for the link to the entertaining talk. 
:-)

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-11-24 Thread Mark Dickinson



Mark Dickinson  added the comment:

Just for fun: I gave a somewhat ranty 10-minute talk on this topic at a 
(virtual) conference a few months ago: 
https://www.youtube.com/watch?v=01oeosRVwgY

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-11-24 Thread Mark Dickinson



Mark Dickinson  added the comment:

@cwg: Yep, we're aware of this. There are no good solutions here - only a mass 
of constraints, compromises and trade-offs. I think we're already somewhere on 
the Pareto boundary of the "best we can do" given the constraints. Moving to 
another point on the boundary doesn't seem worth the code churn.

What concrete action would you propose that the Python core devs take at this 
point?

> it was possible to convert a tuple of floats into a numpy array and back into 
> a tuple, and the hash values of both tuples would be equal.  This is no 
> longer the case.

Sure, but the problem isn't really with hash; that's just a detail. It lies 
deeper than that - it's with containment itself:

>>> import numpy as np
>>> import math
>>> x = math.nan
>>> some_list = [1.5, 2.3, x]
>>> x in some_list
True
>>> x in list(np.array(some_list))  # expect True, get False
False

The result of the change linked to this PR is that the hash now also reflects 
that containment depends on object identity, not just object value. Reverting 
the change would solve the superficial hash problem, but not the underlying 
containment problem, and would re-introduce the performance issue that was 
fixed here.

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-11-24 Thread Christoph Groth



Christoph Groth  added the comment:

Hello.  I would like to point out a possible problem that this change to 
CPython has introduced.

This change looks innocent, but it breaks the unwritten rule that the hash 
value of a number (or actually any built-in immutable type!) in Python depends 
only on its value.

Thus, before this change, it was possible to convert a tuple of floats into a 
numpy array and back into a tuple, and the hash values of both tuples would be 
equal.  This is no longer the case.

Or, more generally, any hashable tuple could be pickled and unpickled, without 
changing its hash value.  I could well imagine that this breaks real code in 
subtle ways.

Likewise, it is now no longer possible to provide a library of sequences of 
numbers that always hashes like built-in tuples.  (As "tinyarray", of which I 
am the author, did.)

--
nosy: +cwg

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue13703] Hash collision security issue

2021-11-08 Thread STINNER Victor



Change by STINNER Victor :


--
nosy:  -vstinner

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue13703] Hash collision security issue

2021-11-04 Thread Terry J. Reedy



Terry J. Reedy  added the comment:

Because today's spammer, whose message was removed, deleted us all.  Restoring 
the version to 3.3 is not possible.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue13703] Hash collision security issue

2021-11-04 Thread Guido van Rossum



Guido van Rossum  added the comment:

Hey Erlend, why did you add so many people to the nosy list of this old
issue?

On Thu, Nov 4, 2021 at 07:33 Erlend E. Aasland 
wrote:

>
> Change by Erlend E. Aasland :
>
>
> --
> components: +Interpreter Core -Argument Clinic
> nosy: +Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon,
> PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson,
> christian.heimes, cvrebert, dmalcolm, eric.araujo, eric.snow, fx5,
> georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, jsvaughan,
> lemburg, loewis, mark.dickinson, neologix, pitrou, python-dev, roger.serwy,
> skorgu, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
> -ahmedsayeed1982, larry
>
> ___
> Python tracker 
> 
> ___
>
-- 
--Guido (mobile)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue13703] Hash collision security issue

2021-11-04 Thread Erlend E. Aasland



Change by Erlend E. Aasland :


--
components: +Interpreter Core -Argument Clinic
nosy: +Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, 
PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, 
cvrebert, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, 
gregory.p.smith, gvanrossum, gz, jcea, jsvaughan, lemburg, loewis, 
mark.dickinson, neologix, pitrou, python-dev, roger.serwy, skorgu, skrah, 
terry.reedy, tim.peters, v+python, vstinner, zbysz -ahmedsayeed1982, larry

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue13703] Hash collision security issue

2021-11-04 Thread Erlend E. Aasland



Change by Erlend E. Aasland :


--
Removed message: https://bugs.python.org/msg405707

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue13703] Hash collision security issue

2021-11-04 Thread Ahmed Sayeed



Ahmed Sayeed  added the comment:

In collect_register() function of arc-linux-tdep.c, the "eret" 
http://www-look-4.com/travel/london/
(exception return) register value is not being reported correctly.

Background: https://komiya-dental.com/shopping/buy-android/
When asked for the "pc" value, we have to update the "eret" register
with GDB's STOP_PC.  The "eret" instructs the kernel code where to
jump back http://www.iu-bloomington.com/shopping/hatchback-cars/ when an 
instruction has stopped due to a breakpoint.  This
is how collect_register() is doing so: 
https://waytowhatsnext.com/shopping/xbox-release-date/

--8<--
  if (regnum == gdbarch_pc_regnum (gdbarch)) 
http://www.wearelondonmade.com/travel/london/
regnum = ARC_ERET_REGNUM;
  regcache->raw_collect (regnum, buf + arc_linux_core_reg_offsets[regnum]);
-->8-- http://www.jopspeech.com/travel/london/

Root cause:
Although this is using the correct offset (ERET register's), it is also 
http://joerg.li/travel/london/ 
changing the REGNUM itself.  Therefore, raw_collect (regnum, ...) is
not reading from "pc" anymore. http://connstr.net/travel/london/

Consequence:
This bug affects the "native ARC gdb" badly and causes kernel code to jump
to addresses after the breakpoint and not executing the "breakpoint"ed 
http://embermanchester.uk/travel/london/ 
instructions at all.  That "native ARC gdb" feature is not upstream yet and
is in review at the time of writing [1]. 
http://www.slipstone.co.uk/travel/london/
In collect_register() function of arc-linux-tdep.c, the "eret"
(exception return) register value is not being reported correctly. 
http://www.logoarts.co.uk/travel/london/

Background:
When asked for the "pc" value, we have to update the "eret" register
with GDB's STOP_PC. http://www.acpirateradio.co.uk/travel/good/  The "eret" 
instructs the kernel code where to
jump back when an instruction has stopped due to a breakpoint.  This
is how collect_register() is doing so:
http://www.compilatori.com/travel/london/
--8<--
  if (regnum == gdbarch_pc_regnum (gdbarch))
regnum = ARC_ERET_REGNUM;
  regcache->raw_collect (regnum, buf + arc_linux_core_reg_offsets[regnum]);
-->8--

Root cause: https://www.webb-dev.co.uk/shopping/shopping-during-corona/
Although this is using the correct offset (ERET register's), it is also
changing the REGNUM itself.  Therefore, raw_collect (regnum, ...) is
not reading from "pc" anymore.

Consequence:
This bug affects the "native ARC gdb" badly and causes kernel code to jump
to addresses after the breakpoint and not executing the "breakpoint"ed
instructions at all.  That "native ARC gdb" feature is not upstream yet and
is in review at the time of writing [1].

--
components: +Argument Clinic -Interpreter Core
nosy: +ahmedsayeed1982, larry -Arach, Arfrever, Huzaifa.Sidhpurwala, 
Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, 
benjamin.peterson, christian.heimes, cvrebert, dmalcolm, eric.araujo, 
eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, 
jsvaughan, lemburg, loewis, mark.dickinson, neologix, pitrou, python-dev, 
roger.serwy, skorgu, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
versions: +Python 3.11 -Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 
3.3

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue19463] assertGdbRepr depends on hash randomization / endianess

2021-09-10 Thread Irit Katriel



Change by Irit Katriel :


--
resolution:  -> fixed
stage: needs patch -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44140] WeakKeyDictionary should support lookup by id instead of hash

2021-08-26 Thread Alessio Bogon



Change by Alessio Bogon :


--
nosy: +youtux

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44843] Add CLI flag to disable hash randomization

2021-08-05 Thread Filipe Laíns


New submission from Filipe Laíns :

There are select use-cases where hash randomization is undesirable, having a 
CLI option to switch it off would be very helpful.
One example would be packaging, where hash randomization will make the bytecode 
unreproducible.

Currently, we have to set PYTHONHASHSEED to a constant value. Having a CLI 
option (lets say -Z) would allow use to do

  python -Zm install artifact.whl

instead of

  PYTHONHASHSEED=0 python -m install artifact.whl

Which is something that I have to do lots of places.

--
messages: 399026
nosy: FFY00
priority: normal
severity: normal
status: open
title: Add CLI flag to disable hash randomization
versions: Python 3.11

___
Python tracker 
<https://bugs.python.org/issue44843>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44646] hash() of the unity type is not consistent with equality

2021-07-16 Thread Łukasz Langa


Change by Łukasz Langa :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44646] hash() of the unity type is not consistent with equality

2021-07-16 Thread Łukasz Langa


Łukasz Langa  added the comment:


New changeset 705988056e028bab3dbc5cff3671a8ddefc88ec7 by Miss Islington (bot) 
in branch '3.10':
bpo-44646: Fix the hash of the union type. (GH-27179) (#27180)
https://github.com/python/cpython/commit/705988056e028bab3dbc5cff3671a8ddefc88ec7


--

___
Python tracker 
<https://bugs.python.org/issue44646>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44646] hash() of the unity type is not consistent with equality

2021-07-16 Thread miss-islington



Change by miss-islington :


--
nosy: +miss-islington
nosy_count: 5.0 -> 6.0
pull_requests: +25717
pull_request: https://github.com/python/cpython/pull/27180

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44646] hash() of the unity type is not consistent with equality

2021-07-16 Thread Łukasz Langa


Łukasz Langa  added the comment:


New changeset aeaa553d650786afc6e68df1f4813ae1a5b71d05 by Serhiy Storchaka in 
branch 'main':
bpo-44646: Fix the hash of the union type. (#27179)
https://github.com/python/cpython/commit/aeaa553d650786afc6e68df1f4813ae1a5b71d05


--
nosy: +lukasz.langa

___
Python tracker 
<https://bugs.python.org/issue44646>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44646] hash() of the unity type is not consistent with equality

2021-07-16 Thread Serhiy Storchaka



Change by Serhiy Storchaka :


--
keywords: +patch
pull_requests: +25716
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/27179

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44646] hash() of the unity type is not consistent with equality

2021-07-15 Thread Jelle Zijlstra



Change by Jelle Zijlstra :


--
nosy: +Jelle Zijlstra

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44646] hash() of the unity type is not consistent with equality

2021-07-15 Thread Serhiy Storchaka



New submission from Serhiy Storchaka :

There is a rule: equal hashable objects should have the same hash. The unity 
type violates it.

>>> x = int | str
>>> y = str | int
>>> x == y
True
>>> hash(x) == hash(y)
False

And hashes of equal unity type and typing.Union are different too.

>>> import typing
>>> z = typing.Union[int, str]
>>> x == z
True
>>> hash(x) == hash(z)
False

There is also a problem with a single type (see issue44636).

--
components: Interpreter Core
messages: 397567
nosy: gvanrossum, kj, serhiy.storchaka
priority: normal
severity: normal
status: open
title: hash() of the unity type is not consistent with equality
type: behavior
versions: Python 3.10, Python 3.11

___
Python tracker 
<https://bugs.python.org/issue44646>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44140] WeakKeyDictionary should support lookup by id instead of hash

2021-06-23 Thread Andrei Kulakov



Andrei Kulakov  added the comment:

Josh: thanks for the explanation, this makes sense.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44140] WeakKeyDictionary should support lookup by id instead of hash

2021-06-23 Thread Josh Rosenberg



Josh Rosenberg  added the comment:

Andrei: If designed appropriately, a weakref callback attached to the actual 
object would delete the associated ID from the dictionary when the object was 
being deleted to avoid that problem. That's basically how WeakKeyDictionary 
works already; it doesn't store the object itself (if it did, that strong 
reference could never be deleted), it just stores a weak reference for it that 
ensures that when the real object is deleted, a callback removes the weak 
reference from the WeakKeyDictionary; this just adds another layer to that work.

I don't think this would make sense as a mere argument to WeakKeyDictionary; 
the implementation would differ significantly, and probably deserves a separate 
class.

--
nosy: +josh.r

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44140] WeakKeyDictionary should support lookup by id instead of hash

2021-06-22 Thread Andrei Kulakov



Andrei Kulakov  added the comment:

Maybe I am misunderstanding, but if an object is deleted, and another object 
created with the same ID, wouldn't WeakRefDict now be pointing to the wrong 
object?

--
nosy: +andrei.avk

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-16 Thread Mark Dickinson



Mark Dickinson  added the comment:

I think this can be closed now.

--
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-15 Thread Mark Dickinson



Mark Dickinson  added the comment:


New changeset 8d0b2ca493e236fcad8709a622c1ac8ad29c372d by Mark Dickinson in 
branch '3.10':
[3.10] bpo-43475: Add what's new entry for NaN hash changes (GH-26725) 
(GH-26743)
https://github.com/python/cpython/commit/8d0b2ca493e236fcad8709a622c1ac8ad29c372d


--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-15 Thread Mark Dickinson



Change by Mark Dickinson :


--
pull_requests: +25328
pull_request: https://github.com/python/cpython/pull/26743

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-15 Thread Mark Dickinson



Mark Dickinson  added the comment:


New changeset 1d10bf0bb9409a406c56b0de8870df998637fd0f by Mark Dickinson in 
branch 'main':
bpo-43475: Add what's new entry for NaN hash changes (GH-26725)
https://github.com/python/cpython/commit/1d10bf0bb9409a406c56b0de8870df998637fd0f


--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-14 Thread Mark Dickinson



Change by Mark Dickinson :


--
pull_requests: +25315
pull_request: https://github.com/python/cpython/pull/26725

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-14 Thread Mark Dickinson



Mark Dickinson  added the comment:

> Does this change need to be mentioned in What's New?

Yes, I think so, given that it's a change to documented behavior. It's also 
something that third party packages (like NumPy) potentially need to be aware 
of.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

> If one wants to have all NaNs in one equivalency class
> (e.g. if used as a key-value for example in pandas) it
> is almost impossible to do so in a consistent way 
> without taking a performance hit.

ISTM the performance of the equivalent class case is far less important than 
the one we were trying to solve.  Given a choice we should prefer helping 
normal unadorned instances rather than giving preference to a subclass that 
redefines the usual behaviors.  

In CPython, it is a fact of life that overriding builtin behaviors with pure 
python code always incurs a performance hit.  Also, in your example, the 
subclass isn't technically correct because it relies on a non-guaranteed 
implementation details.  It likely isn't even the fastest approach.

The only guaranteed behaviors are that math.isnan(x) reliably detects a NaN and 
that x!=x when x is a NaN.  Those are the only assured tools in the uphill 
battle to fight the weird intrinsic nature of NaNs.

So one possible solution is to replace all the NaNs with a canonical 
placeholder value that doesn't have undesired properties:

{None if isnan(x) else x for x in arr}

That relies on guaranteed behaviors and is reasonably fast.  IMO that beats 
trying to reprogram float('NaN') to behave the opposite of how it was designed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread realead



realead  added the comment:

@mark.dickinson

> ...my expectation was that there would be few cases of breakage, and that for 
> those few cases it shouldn't be difficult to fix the breakage.

This expectation is probably correct.

My issue is somewhat only partly on-topic here: If one wants to have all NaNs 
in one equivalency class (e.g. if used as a key-value for example in pandas) it 
is almost impossible to do so in a consistent way without taking a performance 
hit. It seems to be possible builtin-types (even if frozenset won't be pretty), 
but already something like


class A:
def __init__(self, a):
self.a=a
def __hash__(self):
return hash(self.a)
def __eq__(self, other):
return self.a == other.a

is not easy to handle.

A special comparator for containers would be an ultimative solution, but I see 
how this could be too much hassle for a corner case.

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Serhiy Storchaka



Serhiy Storchaka  added the comment:

Does this change need to be mentioned in What's New?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread miss-islington



miss-islington  added the comment:


New changeset 128899d8b8d65d86bd9bbea6801e9f36e6f409f2 by Miss Islington (bot) 
in branch '3.10':
bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
https://github.com/python/cpython/commit/128899d8b8d65d86bd9bbea6801e9f36e6f409f2


--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Serhiy Storchaka



Serhiy Storchaka  added the comment:

> Maybe a new comparator is called for (like PY_EQ_FOR_HASH_COLLECTION), which 
> would yield float("nan") equivalent to float("nan") and which should be used 
> in hash-set/dict?

It is difficult to do from technical point of view because all rich comparison 
implementations support currently only 6 operations. If you call tp_richcomp 
with something else it will crash. So we need to add new slot tp_richcomp_ex 
and complex logic to call either tp_richcomp_ex or tp_richcomp and correctly 
handle inheritance. It is too high bar.

> Does GH-26679 need to be backported to the 3.10 branch?

I thought that the original change was 3.11 only, but seems it was merged 
before the split of the 3.10 branch. So PR 26679 needs to be backported. I 
would add a NEWS entry if I knew this.

--
versions: +Python 3.10

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread miss-islington



Change by miss-islington :


--
nosy: +miss-islington
nosy_count: 6.0 -> 7.0
pull_requests: +25292
pull_request: https://github.com/python/cpython/pull/26706

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Mark Dickinson



Mark Dickinson  added the comment:

@Serhiy: can this be closed again? Does GH-26679 need to be backported to the 
3.10 branch?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Mark Dickinson



Mark Dickinson  added the comment:

@realead

> This change makes life harder for people trying to get sane behavior with 
> sets [...]

Yes, code that assumes that all NaNs have the same hash value will need to 
change. But that doesn't seem unreasonable for objects that are already 
considered distinct with respect to the containment equivalence relation ("is" 
followed by "=="). I think this change was the right one to make, and my 
expectation was that there would be few cases of breakage, and that for those 
few cases it shouldn't be difficult to fix the breakage. If that's not true 
(either there are lots of cases of breakage, or the breakage is hard to fix), 
perhaps we should reconsider. I haven't yet seen evidence that either of those 
things is true, though.

> Maybe a new comparator is called for [...]

Agreed that that seems like a possible technical solution: Python could add a 
new special method __container_eq__ which is required to provide (at the least) 
a reflexive and symmetric relation, and whose default implementation does the 
same is-followed-by-== check that's already used for list, dict and set 
containment. For what it's worth, this is close to the approach that Julia 
takes: they have an "is_equal" function that's mostly the same as the normal 
equality == but differs for NaNs (and incidentally for signed zeros). But 
whether this would make sense from a language design perspective is another 
matter, and it would be a significant language-level change (certainly 
requiring a PEP). It feels like an enormous conceptual change to introduce to 
the language just to deal with the annoyance of NaNs. And that really is all it 
would be good for, unless perhaps it also provides a solution to the problem of 
NumPy arrays in containers, again caused by NumPy arrays overriding __eq__ in a
  nonstandard way.

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-12 Thread Serhiy Storchaka



Serhiy Storchaka  added the comment:


New changeset 9f1c5f6e8af6ba3f659b2aea1e221ac9695828ba by Serhiy Storchaka in 
branch 'main':
bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
https://github.com/python/cpython/commit/9f1c5f6e8af6ba3f659b2aea1e221ac9695828ba


--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-11 Thread Serhiy Storchaka



Change by Serhiy Storchaka :


--
pull_requests: +25265
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/26679

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-11 Thread Serhiy Storchaka



Serhiy Storchaka  added the comment:

There is an error in the Python implementation of Decimal.__hash__. It calls 
super().__hash__(), but the C implementation calls object.__hash__().

Also, the documentation for floating point hash has the same error.

--
stage: resolved -> 
status: closed -> open
versions: +Python 3.11

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-11 Thread realead



realead  added the comment:

This change makes life harder for people trying to get sane behavior with sets 
for float-, complex- and tuple-objects containing NaNs. With "sane" meaning, 
that 

set([nan, nan, nan]) => {nan}

Until now, one has only to override the equal-comparison for these types but 
now also hash-function needs to be overriden (which is more complicated e.g. 
for tuple).



On a more abstract level: there are multiple possible equivalence relations. 

The one used right now for floats is Py_EQ and results in 
float("nan)!=float("nan") due to IEEE 754.

In hash-set/dict we would like to use an equivalence relation for which all 
NaNs would be in the same equivalence class. Maybe a new comparator is called 
for (like PY_EQ_FOR_HASH_COLLECTION), which would yield float("nan") equivalent 
to float("nan") and which should be used in hash-set/dict?

--
nosy: +realead

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44048] test_hashlib failure for "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" buildbot

2021-06-04 Thread Pablo Galindo Salgado



Pablo Galindo Salgado  added the comment:


New changeset 3f4d801bf907a5fcab50f3b64475d1410b90a80f by Miss Islington (bot) 
in branch '3.10':
bpo-44048: Fix two hashlib test cases under FIPS mode (GH-26470) (GH-26531)
https://github.com/python/cpython/commit/3f4d801bf907a5fcab50f3b64475d1410b90a80f


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44048] test_hashlib failure for "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" buildbot

2021-06-04 Thread miss-islington



Change by miss-islington :


--
nosy: +miss-islington
nosy_count: 4.0 -> 5.0
pull_requests: +25124
pull_request: https://github.com/python/cpython/pull/26531

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44048] test_hashlib failure for "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" buildbot

2021-06-04 Thread Pablo Galindo Salgado



Change by Pablo Galindo Salgado :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44048] test_hashlib failure for "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" buildbot

2021-06-04 Thread Pablo Galindo Salgado



Pablo Galindo Salgado  added the comment:


New changeset a46c220edc5cf716d0b71eb80ac29ecdb4ebb430 by stratakis in branch 
'main':
bpo-44048: Fix two hashlib test cases under FIPS mode (GH-26470)
https://github.com/python/cpython/commit/a46c220edc5cf716d0b71eb80ac29ecdb4ebb430


--
nosy: +pablogsal

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44048] test_hashlib failure for "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" buildbot

2021-05-31 Thread Charalampos Stratakis



Change by Charalampos Stratakis :


--
keywords: +patch
pull_requests: +25065
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/26470

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue40695] hashlib: OpenSSL hash detection should obey security policy

2021-05-22 Thread Christian Heimes



Christian Heimes  added the comment:

No, nothing left to do. Thanks for the ping!

--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue40695] hashlib: OpenSSL hash detection should obey security policy

2021-05-21 Thread Ned Deily



Ned Deily  added the comment:

Is there anything more that needs to be done for this issue?

--
nosy: +ned.deily

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44140] WeakKeyDictionary should support lookup by id instead of hash

2021-05-21 Thread Brandt Bucher



Change by Brandt Bucher :


--
nosy: +brandtbucher

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44140] WeakKeyDictionary should support lookup by id instead of hash

2021-05-15 Thread conchylicultor



New submission from conchylicultor :

WeakKeyDictionary are great to "associate additional data with an object owned 
by other parts of an application", as quoted from the doc: 
https://docs.python.org/3/library/weakref.html#weakref.WeakKeyDictionary

However, this currently only works for hashable types. Non-hashables are not 
supported:

```
@dataclass
class A:
  pass

a = A()

d = weakref.WeakKeyDictionary()
d[a] = 3  # TypeError: unhashable type: 'A'
```

With regular dict, this could be easilly solved by using `d[id(a)] = 3`, but 
WeakKeyDictionary don't accept `int` of course. I cannot wrap the object either 
as the weakref would not be attached to the original object, but the wrapper.

It would be great to be able to force WeakKeyDictionary to perform lookup on 
`id` internally. Like `d = WeakKeyDictionary(use_id_lookup=True)`

--
components: Library (Lib)
messages: 393707
nosy: conchylicultor
priority: normal
severity: normal
status: open
title: WeakKeyDictionary should support lookup by id instead of hash
type: enhancement
versions: Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 
3.9

___
Python tracker 
<https://bugs.python.org/issue44140>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44048] test_hashlib failure for "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" buildbot

2021-05-05 Thread Shreyan Avigyan



Shreyan Avigyan  added the comment:

The errors are occurring because the code before these commits checked whether 
those algorithms were present or not. If an algorithm was not present it was 
not tested. The new code doesn't check and therefore if even one of the 
algorithm modules are not present raises an error.

--
nosy: +shreyanavigyan

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue44048] test_hashlib failure for "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" buildbot

2021-05-05 Thread Charalampos Stratakis



New submission from Charalampos Stratakis :

The buildbot started experiencing some failures. First after 
https://github.com/python/cpython/commit/ddbef71a2c166a5d5dd168e26493973053a953d6
 this test started failing with:

==
ERROR: test_disallow_instantiation (test.test_hashlib.HashLibTestCase)
--
Traceback (most recent call last):
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/hashlib.py",
 line 160, in __hash_new
return _hashlib.new(name, data, **kwargs)
ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/test/test_hashlib.py",
 line 912, in test_disallow_instantiation
h = constructor()
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/test/test_hashlib.py",
 line 133, in _test_algorithm_via_hashlib_new
return hashlib.new(_alg, **kwargs)
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/hashlib.py",
 line 166, in __hash_new
return __get_builtin_constructor(name)(data)
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/hashlib.py",
 line 123, in __get_builtin_constructor
raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type md5


And then after 
https://github.com/python/cpython/commit/91554e4c5ca3c762998296522f854a7166ba84f0
 there is another failure:

==
ERROR: test_readonly_types (test.test_hashlib.HashLibTestCase)
--
Traceback (most recent call last):
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/hashlib.py",
 line 160, in __hash_new
return _hashlib.new(name, data, **kwargs)
ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/test/test_hashlib.py",
 line 933, in test_readonly_types
hash_type = type(constructor())
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/test/test_hashlib.py",
 line 133, in _test_algorithm_via_hashlib_new
return hashlib.new(_alg, **kwargs)
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/hashlib.py",
 line 166, in __hash_new
return __get_builtin_constructor(name)(data)
  File 
"/home/buildbot/buildarea/3.x.cstratak-RHEL8-fips-x86_64.no-builtin-hashes-except-blake2/build/Lib/hashlib.py",
 line 123, in __get_builtin_constructor
raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type md5

--
assignee: christian.heimes
components: SSL
messages: 393003
nosy: christian.heimes, cstratak
priority: normal
severity: normal
status: open
title: test_hashlib failure for "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" 
buildbot
versions: Python 3.10, Python 3.11

___
Python tracker 
<https://bugs.python.org/issue44048>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-22 Thread Mark Dickinson



Mark Dickinson  added the comment:

Opened https://github.com/numpy/numpy/issues/18833

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-22 Thread Mark Dickinson



Mark Dickinson  added the comment:

Thanks, Raymond. I'll open something on the NumPy bug tracker, too, since they 
may want to consider doing something similar for numpy.float64 and friends.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-22 Thread Raymond Hettinger



Change by Raymond Hettinger :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-22 Thread Raymond Hettinger



Raymond Hettinger  added the comment:


New changeset a07da09ad5bd7d234ccd084a3a0933c290d1b592 by Raymond Hettinger in 
branch 'master':
bpo-43475:  Fix worst case collision behavior for NaN instances (GH-25493)
https://github.com/python/cpython/commit/a07da09ad5bd7d234ccd084a3a0933c290d1b592


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-20 Thread Raymond Hettinger



Change by Raymond Hettinger :


--
keywords: +patch
pull_requests: +24216
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/25493

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43829] MappingProxyType cannot hash a hashable underlying mapping

2021-04-14 Thread Serhiy Storchaka



Serhiy Storchaka  added the comment:

But there is an issue with comparison implementation in MappingProxyType (see 
issue43838). Resolving that issue can affect the decision about hashability.

There should be always true the following predicate: if x == y then hash(x) == 
hash(y).

--

___
Python tracker 
<https://bugs.python.org/issue43829>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43829] MappingProxyType cannot hash a hashable underlying mapping

2021-04-14 Thread Serhiy Storchaka



Serhiy Storchaka  added the comment:

Perhaps MappingProxyType is unhashable by accident. It implements __eq__, and 
it makes it unhashable by default. And nobody made request for this feature 
before. I think that implementing __hash__ would not make anything wrong.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43829] MappingProxyType cannot hash a hashable underlying mapping

2021-04-13 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

Serhiy, what do you think?  Would it make sense to pass through the underlying 
hash?  I don't think there is much use for this but don't see any particular 
reason to block it.

--
nosy: +rhettinger, serhiy.storchaka
priority: normal -> low
versions: +Python 3.10 -Python 3.9

___
Python tracker 
<https://bugs.python.org/issue43829>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43829] MappingProxyType cannot hash a hashable underlying mapping

2021-04-13 Thread Andy Maier



New submission from Andy Maier :

Objects of MappingProxyType do expose a __hash__() method, but if the 
underlying mapping is hashable, it still does not support hashing it.

Example:

Content of mp_hash.py:

--
#!/usr/bin/env python

from nocasedict import NocaseDict, HashableMixin
from types import MappingProxyType

class HashableDict(HashableMixin, NocaseDict):
"""A hashable dictionary"""
pass

hd = HashableDict({'a': 1, 'b': 2})
print("hash(hd): {}".format(hash(hd)))

mp = MappingProxyType(hd)
print("hash(mp): {}".format(hash(mp)))
---

Running the mp_hash.py script:

hash(hd): 3709951335832776636
Traceback (most recent call last):
  File 
"/Users/maiera/Projects/Python/cpython/issues/mappingproxy/./mp_hash.py", line 
14, in 
print("hash(mp): {}".format(hash(mp)))
TypeError: unhashable type: 'mappingproxy'

There are use cases where a function wants to return an immutable view on an 
internal dictionary, and the caller of the function should be able to use the 
returned object like a dictionary, except that it is read-only.

Note there is https://bugs.python.org/issue31209 on the inability to pickle 
MappingProxyType objects which was closed without adding the capability. That 
would fall under the same argument.

--
components: Library (Lib)
messages: 390936
nosy: andymaier
priority: normal
severity: normal
status: open
title: MappingProxyType cannot hash a hashable underlying mapping
type: behavior
versions: Python 3.9

___
Python tracker 
<https://bugs.python.org/issue43829>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

> I'm loathe to guarantee anything about this in the language itself. 

There aren't language any guarantees being proposed.  Letting the hash depend 
on the object id just helps avoid quadratic behavior.  Making float('NaN') a 
singleton is also perfectly reasonable behavior for an immutable type.  Neither 
is a guaranteed behavior, just a convenient one.


> I'm not convinced it addresses a real-world problem

Maybe yes, maybe no.  I would hope that NaNs arising from bogus calculations 
would be rare.  OTOH, they are commonly used for missing values in Pandas where 
internal dict/set operations abound.  Either way, I would like to close off a 
trivially easy way to invoke quadratic behavior unexpectedly.

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Tim Peters



Tim Peters  added the comment:

I agree hashing a NaN acting like the generic object hash (return rotated 
address) is a fine workaround, although I'm not convinced it addresses a 
real-world problem ;-) But why not? It might.

But that's for CPython. I'm loathe to guarantee anything about this in the 
language itself. If an implementation wants `__contains__()` tests to treat all 
NaNs as equal instead, or consider them equal only if "is" holds, or never 
considers them equal, or resolves equality as bitwise representation equality 
... all are fine by me. There's no truly compelling case to made for any of 
them, although "never considers them equal" is least "surprising" given the 
insane requirement that NaNs never compare equal under IEEE rules, and that 
Python-the-language doesn't formally support different notions of equality in 
different contexts.

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Raymond Hettinger



Change by Raymond Hettinger :


--
assignee:  -> rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Mark Dickinson



Mark Dickinson  added the comment:

> Why not have hash() return the id() like we do for instances of object?

I think that's fine, and IIUC that's what Cong Ma was proposing. It seems like 
the least invasive potential fix.

In principle I find the idea of making NaN a singleton rather attractive - the 
performance hit is likely negligible, and it solves a bunch of subtle 
NaN-related issues. (For example, there's still a proposal to provide an IEEE 
754 total_ordering key so that lists of floats can be ordered sanely in the 
presence of nans, but even if you use that you don't know whether your nans 
will end up at the start or the end of the list, and you could potentially have 
nans in both places; fixing a single nan and its sign bit would fix that.) But 
there are bound to be some people somewhere making use of the NaN payload and 
sign bit, however inadvisedly, and Serhiy's right that this couldn't be 
extended to Decimal, where the sign bit and payload of the NaN are mandated by 
the standard.

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

Mark, is there any reason hash(float('NaN')) and hash(Decimal('NaN')) have to 
be a constant?  Since NaNs don't compare equal, the hash homomorphism has no 
restrictions.  Why not have hash() return the id() like we do for instances of 
object?

I understand that sys.hash_info.nan would be invalidated, but that was likely 
useless anyway.

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43164] test_nntplib.NetworkedNNTP_SSLTests fails on "AMD64 RHEL8 FIPS Only Blake2 Builtin Hash" buildbot

2021-04-07 Thread Charalampos Stratakis



Charalampos Stratakis  added the comment:

The issue seems to be resolved and the buildbot is green.

--
resolution:  -> third party
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-15 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

> And making float('nan') returning a singleton, 
> but 1e1000 * 0 returning different NaN would cause large confusion.

Not really, it would be just be an implementation detail, no different than int 
and strings being sometimes unique and sometimes not.  Historically, immutable 
objects are allowed to be reused when it is convenient for the implementation.

> What about Decimal NaN?

Decimal isn't used as much, so the need is less pressing, but we can do 
whatever is allowed by the spec.  Presumably, that would be easier than with 
floats because we control all possible ways to construct Decimals.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-15 Thread Serhiy Storchaka



Serhiy Storchaka  added the comment:

What about Decimal NaN? Even if make float NaN a singleton, there will be other 
NaNs.

And making float('nan') returning a singleton, but 1e1000 * 0 returning 
different NaN would cause large confusion.

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-14 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

> The performance thoughts were motivated by the idea of
> making NaN a singleton: adding a check to 
> PyFloat_FromDouble would mean that almost every operation
> that produced a float would have to pass through that check.

It may suffice to move the check upstream from PyFloat_FromDouble so that 
float('NaN') alway produces identically the same object as math.nan.

That would handle the common cases where NaN is used for missing values or is 
generated from string conversions.  We don't need a bullet-proof solution, just 
mitigation of harm.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Mark Dickinson



Mark Dickinson  added the comment:

@Cong Ma: Yes, I'm not worried about performance for the change you're 
proposing (though as you say, we should benchmark anyway). The performance 
thoughts were motivated by the idea of making NaN a singleton: adding a check 
to PyFloat_FromDouble would mean that almost every operation that produced a 
float would have to pass through that check.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Cong Ma



Cong Ma  added the comment:

> Idea:  We could make this problem go away by making NaN a singleton.

Apart from the fact that NaN is not a specific value/object (as pointed out in 
the previous message by @mark.dickinson), currently each builtin singleton 
(None, True, False, etc.) in Python satisfies the following predicate:

`if s is a singleton, then s == s`

This is also satisfied by "flyweight" objects such as small ints.

It feels natural and unsurprising to have this, if not officially promised. 
Requiring NaN to be a singleton would violate this implicit understanding about 
singletons, and cause another surprise on top of the other surprises with NaNs 
(or with rich comparison in general).

Performance-wise, I think the current behaviour (returning _PyHASH_NAN == 0) is 
already nested inside two layers of conditionals in `_Py_HashDouble()` in 
Python/pyhash.c:

```
if (!Py_IS_FINITE(v)) {
if (Py_IS_INFINITY(v))
return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
else
return _PyHASH_NAN;
}
```
(v is the underlying C `double`, field `ob_fval` of `PyFloatObject`).

I don't feel performance would be hurt by rewriting `float_hash()` in 
Objects/floatobject.c to the effect of

```
if (!Py_IS_NAN(v->ob_fval)) {
return _Py_HashDouble(v->ob_fval)
}
else {
return _Py_HashPointer(v);
}
```
and simplify the conditionals in `_Py_HashDouble()`. But of course, only real 
benchmarking could tell. (Also, other float-like types would have to be 
adjusted, too).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Mark Dickinson



Mark Dickinson  added the comment:

> signalling NaNs are quietly silenced on my machine

I just tested this assertion, and it appears that I was lying: this doesn't 
seem to be true. I'm almost *sure* it used to be true, and I'm not sure what 
changed, and when. (FWIW: Python 3.9.2 from macPorts on macOS 10.14.6 + Intel 
hardware. Earlier versions of Python on the same machine give similar results 
to those below, so it's not a core Python change.)

Here's an attempt to create an snan Python float:

Python 3.9.2 (default, Feb 28 2021, 13:47:30) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> snan_bits = 0x7ff000123456
>>> snan = struct.unpack(">> snan
nan

Converting back shows that the attempt was successful: the payload, including 
the signalling bit (bit 51, counting with the lsb as bit 0), was preserved:

>>> hex(struct.unpack(">> hex(struct.unpack("

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Mark Dickinson



Mark Dickinson  added the comment:

> We could make this problem go away by making NaN a singleton.

That could work, though we'd annoy anyone who cared about preserving NaN 
payloads and signs in Python. I don't know if such people exist. Currently the 
sign is accessible through things like math.copysign, and the payload through 
struct manipulations (though on most platforms we still don't see the full 
range of NaNs: signalling NaNs are quietly silenced on my machine).

We'd also want to do some performance checks: the obvious way to do this would 
be to have an "is_nan" check in PyFloat_FromDouble. I'd *guess* that a typical 
CPU would do a reasonable job of branch prediction on that check so that it had 
minimal impact in normal use, but only timings will tell for sure.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-12 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

Idea:  We could make this problem go away by making NaN a singleton.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson



Mark Dickinson  added the comment:

> I also wonder if there's security implication for servers that process 
> user-submitted input

Yes, the "malicious actor" scenario is another one to consider. But unlike the 
string hashing attack, I'm not seeing a realistic way for the nan hash 
collisions to be used in attacks, and I'm content not to worry about that until 
someone gives an actual proof of concept. Many of Python's hash functions are 
fairly predictable (by design!) and there are already lots of other ways to 
deliberately construct lots of hash collisions with non-string non-float values.

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma



Cong Ma  added the comment:

Sorry, please ignore my rambling about "float() returning aliased object" -- in 
that case the problem with hashing doesn't arise.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma



Cong Ma  added the comment:

Thank you @mark.dickinson for the detailed analysis.

In addition to your hypothetical usage examples, I am also trying to understand 
the implications for user code.

If judging by the issues people open on GitHub like this: 
https://github.com/pandas-dev/pandas/issues/28363 yes apparently people do run 
into the "NaN as key" problem every now and then, I guess. (I'm not familiar 
enough with the poster's real problem in the Pandas setting to comment about 
that one, but it seems to suggest people do run into "real world" problems that 
shares some common features with this one). There's also StackOverflow threads 
like this (https://stackoverflow.com/a/17062985) where people discuss hashing a 
data table that explicitly use NaN for missing values. The reason seems to be 
that "[e]mpirical evidence suggests that hashing is an effective strategy for 
dimensionality reduction and practical nonparametric estimation." 
(https://arxiv.org/pdf/0902.2206.pdf).

I cannot imagine whether the proposed change would make life easier for people 
who really want NaN keys to begin with. However I think it might reduce the 
exposure to worst-case performances in dict (and maybe set/frozenset), while 
not hurting Python's own self-consistency.

Maybe there's one thing to consider about future compatibility though... 
because the proposed fix depends on the current behaviour that floats (and by 
extension, built-in float-like objects such as Decimal and complex) are not 
cached, unlike small ints and interned Unicode objects. So when you explicitly 
construct a NaN by calling float(), you always get a new Python object. Is this 
guaranteed now on different platforms with different compilers? I'm trying to 
parse the magic in Include/pymath.h about the definition of macro `Py_NAN`, 
where it seems to me that for certain configs it could be a `static const 
union` translation-unit-level constant. Does this affect the code that actually 
builds a Python object from an underlying NaN? (I apologize if this is a bad 
question). But if it doesn't and we're guaranteed to really have Python 
NaN-objects that don't alias if explicitly constructed, is this something 
unlikely to change in the future?

I also wonder if there's security implication for servers that process 
user-submitted input, maybe running a float() constructor on some string list, 
checking exceptions but silently succeeding with "nan": arguably this is not 
going to be as common an concern as the string-key collision DoS now foiled by 
hash randomization, but I'm not entirely sure.

On "theoretical interest".. yes theoretical interests can also be "real world" 
if one teaches CS theory in real world using Python, see 
https://bugs.python.org/issue43198#msg386849

So yes, I admit we're in an obscure corner of Python here. It's a tricky, 
maybe-theoretical, but seemingly annoying but easy-to-fix issue..

--

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson



Mark Dickinson  added the comment:

On third thoughts, of course it *would* help, because the Counter is keeping 
references to the realised NaN values. I think I'll go away now and come back 
when my brain is working.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson



Mark Dickinson  added the comment:

Hmm. On second thoughts, the proposed solution wouldn't actually *help* with 
the situation I gave: the elements (lazily) realised from the NumPy array are 
highly likely to all end up with the same address in RAM. :-(

>>> x = np.full(10, np.nan)
>>> for v in x:
... print(id(v))
... del v
... 
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson



Mark Dickinson  added the comment:

Sigh. When I'm undisputed ruler of the multiverse, I'm going to make "NaN == 
NaN" return True, IEEE 754 be damned. NaN != NaN is fine(ish) at the level of 
numerics; the problems start when the consequences of that choice leak into the 
other parts of the language that care about equality. NaNs just shouldn't be 
considered "special enough to break the rules" (where the rules here are that 
the equivalence relation being used as the basis of equality for a general 
container should actually *be* an equivalence relation - reflexive, symmetric, 
and transitive).

Anyway, more constructively ...

I agree with the analysis, and the proposed solution seems sound: if we're 
going to change this, then using the object hash seems like a workable 
solution. The question is whether we actually do need to change this.

I'm not too worried about backwards compatibility here: if we change this, 
we're bound to break *someone's* code somewhere, but it's hard to imagine that 
there's *that* much code out there that makes useful use of the property that 
hash(nan) == 0. The most likely source of breakage I can think of is in test 
code that checks that 3rd-party float-like things (NumPy's float64, or gmpy2 
floats, or ...) behave like Python's floats.

@Cong Ma: do you have any examples of cases where this has proved, or is likely 
to prove, a problem in real-world code, or is this purely theoretical at this 
point?

I'm finding it hard to imagine cases where a developer *intentionally* has a 
dictionary with several NaN values as keys. (Even a single NaN as a key seems a 
stretch; general floats as dictionary keys are rare in real-world code that 
I've encountered.) But it's somewhat easier to imagine situations where one 
could run into this accidentally. Here's one such:

>>> import collections, numpy as np
>>> x = np.full(10, np.nan)
>>> c = collections.Counter(x)

That took around 4 minutes of non-ctrl-C-interruptible time on my laptop. (I 
was actually expecting it not to "work" as a bad example: I was expecting that 
NaNs realised from a NumPy array would all be the same NaN object, but that's 
not what happens.) For comparison, this appears instant:

>>> x = np.random.rand(10)
>>> c = collections.Counter(x)

And it's not a stretch to imagine having a NumPy array representing a masked 
binary 1024x1024-pixel image (float values of NaN, 0.0 and 1.0) and using 
Counter to find out how many 0s and 1s there were.

On the other hand, we've lived with the current behaviour for some decades now 
without it apparently ever being a real issue.

On balance, I'm +1 for fixing this. It seems like a failure mode that would be 
rarely encountered, but it's quite unpleasant when it *is* encountered.

--
nosy: +rhettinger, tim.peters

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson



Change by Mark Dickinson :


--
nosy: +mark.dickinson

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma



New submission from Cong Ma :

Summary: CPython hash all NaN values to 0. This guarantees worst-case behaviour 
for dict if numerous existing keys are NaN. I think by hashing NaN using the 
generic object (or "pointer") hash instead, the worst-case situation can be 
alleviated without changing the semantics of either dict or float. However, 
this also requires changes to how complex and Decimal objects hash, and 
moreover incompatible change to sys.hash_info. I would like to hear how Python 
developers think about this matter.



Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in 
Include/pyhash.h) for the hash value of all floating point NaNs. The value is 
part of the sys.hashinfo API and is re-used by complex and Decimal in computing 
its hash in accordance with Python builtin-type documentation. [0]

(The doc [0] specifically says that "[a]ll hashable nans have the same hash 
value.")

This is normally not a great concern, except for the worst case performance. 
The problem is that, since they hash to the same value and they're guaranteed 
to compare unequal to any compatible numeric value -- not even to themselves, 
this means they're guaranteed to collide.

For this reason I'd like to question whether it is a good idea to have all 
hashable NaNs with the same hash value.

There has been some discussions about this over the Web for some time (see 
[1]). In [1] the demo Python script times the insertion of k distinct NaN keys 
(as different objects) into a freshly created dict. Since the keys are distinct 
and are guaranteed to collide with each other (if any), the run time of a 
single lookup/insertion is roughly linear to the existing number of NaN keys. 
I've recreated the same script using with a more modern Python (attached).

I'd suggest a fix for this worst-case behaviour: instead of returning the hash 
value 0 for all NaNs, use the generic object (pointer) hash for these objects. 
As a PoC (also in the attached script), it roughly means

```
class myfloat(float):
def __hash__(self):
if self != self:  # i.e., math.isnan(self)
return object.__hash__(self)
return super().__hash__(self)
```

This will
- keep the current behaviour of dict intact;
- keep the invariant `a == b implies hash(a) == hash(b)` intact, where 
applicable;
- uphold all the other rules for Python numeric objects listed in [0];
- make hash collisions no more likely than with object() instances (dict lookup 
time is amortized constant w.r.t. existing number of NaN keys).

However, it will
- not keep the current rule "All hashable nans have the same hash value";
- not be compatible with the current sys.hash_info API (requires the removal of 
the "nan" attribute from there and documenting the change);
- require congruent modifications in complex and Decimal too.

Additionally, I don't think this will affect module-level NaN "constants" such 
as math.nan and how they behave. The "NaN constant" has never been a problem to 
begin with. It's only the *distinct* NaN objects that may cause the worst-case 
behaviour.



Just for the record I'd also like to clear some outdated info or misconception 
about NaN keys in Python dicts. It's not true that NaN keys, once inserted, 
cannot be retrieved (e.g., as claimed in [1][2]). In Python, they can be, if 
you keep the original key *object* around by keeping a reference to it (or 
obtaining a new one from the dict by iterating over it). This, I think, is 
because Python dict compares for object identity before rich-comparing for 
equality in `lookdict()` in Objects/dictobject.c, so this works for `d = 
dict()`:

```
f = float("nan")
d[f] = "value"
v = d[f]
```

but this fails with `KeyError`, as it should:

```
d[float("nan")] = "value"
v = d[float("nan")]
```

In this regard the NaN float object behaves exactly like the object() instance 
as keys -- except for the collisions. That's why I think at least for floats 
the "object" hash is likely to work. The solution using PRNG [1] (implemented 
with the Go language) is not necessary for CPython because the live objects are 
already distinct.



Links:

[0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
[1] https://research.swtch.com/randhash
[2] 
https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/

--
components: Interpreter Core, Library (Lib)
files: nan_key.py
messages: 388508
nosy: congma
priority: normal
severity: normal
status: open
title: Worst-case behaviour of hash collision with float NaN
type: performance
Added file: https://bugs.python.org/file49869/nan_key.py

___
Python tracker 
<https://bugs.python.org/issue43475>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 2372 matches

Mail list logo