Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Neil Girdhar
On Fri, Jan 6, 2017 at 3:59 AM Paul Moore  wrote:

> On 6 January 2017 at 07:26, Neil Girdhar  wrote:
> > On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull
> >  wrote:
> >>
> >> Neil Girdhar writes:
> >>
> >>  > I don't understand this?  How is providing a default method in an
> >>  > abstract base class a pessimization?  If it happens to be slower
> >>  > than the code in the current methods, it can still be overridden.
> >>
> >> How often will people override until it's bitten them?  How many
> >> people will not even notice until they've lost business due to slow
> >> response?  If you don't have a default, that's much more obvious.
> >> Note that if there is a default, the collections are "large", and
> >> equality comparisons are "rare", this could be a substantial overhead.
> >
> >
> > I still don't understand what you're talking about here.  You're saying
> that
> > we shouldn't provide a __hash__ in case the default hash happens to be
> > slower than what the user wants and so by not providing it, we force the
> > user to write a fast one?  Doesn't that argument apply to all methods
> > provided by abcs?
>
> The point here is that ABCs should provide defaults for methods where
> there is an *obvious* default. It's not at all clear that there's an
> obvious default for __hash__.



>
> Unless I missed a revision of your proposal, what you suggested was:
>
>
Yeah, looks like you missed a revision.  There were two emails.  I
suggested adding ImmutableIterable and ImmutableSet, and so there is an
obvious implementation of __hash__ for both.


> > A better option is to add collections.abc.ImmutableIterable that derives
> from Iterable and provides __hash__.
>
> So what classes would derive from ImmutableIterable? Frozenset? A
> user-defined frozendict? There's no "obvious" default that would work
> for both those cases. And that's before we even get to the question of
> whether the default has the right performance characteristics, which
> is highly application-dependent.
>
> It's not clear to me if you expect ImmutableIterable to provide
> anything other than a default implementation of hash. If not, then how
> is making it an ABC any better than simply providing a helper function
> that people can use in their own __hash__ implementation? That would
> make it explicit what people are doing, and avoid any tendency towards
> people thinking they "should" inherit from ImmutableIterable and yet
> needing to override the only method it provides.
>
> Paul
>

Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Paul Moore
On 6 January 2017 at 07:26, Neil Girdhar  wrote:
> On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull
>  wrote:
>>
>> Neil Girdhar writes:
>>
>>  > I don't understand this?  How is providing a default method in an
>>  > abstract base class a pessimization?  If it happens to be slower
>>  > than the code in the current methods, it can still be overridden.
>>
>> How often will people override until it's bitten them?  How many
>> people will not even notice until they've lost business due to slow
>> response?  If you don't have a default, that's much more obvious.
>> Note that if there is a default, the collections are "large", and
>> equality comparisons are "rare", this could be a substantial overhead.
>
>
> I still don't understand what you're talking about here.  You're saying that
> we shouldn't provide a __hash__ in case the default hash happens to be
> slower than what the user wants and so by not providing it, we force the
> user to write a fast one?  Doesn't that argument apply to all methods
> provided by abcs?

The point here is that ABCs should provide defaults for methods where
there is an *obvious* default. It's not at all clear that there's an
obvious default for __hash__.

Unless I missed a revision of your proposal, what you suggested was:

> A better option is to add collections.abc.ImmutableIterable that derives from 
> Iterable and provides __hash__.

So what classes would derive from ImmutableIterable? Frozenset? A
user-defined frozendict? There's no "obvious" default that would work
for both those cases. And that's before we even get to the question of
whether the default has the right performance characteristics, which
is highly application-dependent.

It's not clear to me if you expect ImmutableIterable to provide
anything other than a default implementation of hash. If not, then how
is making it an ABC any better than simply providing a helper function
that people can use in their own __hash__ implementation? That would
make it explicit what people are doing, and avoid any tendency towards
people thinking they "should" inherit from ImmutableIterable and yet
needing to override the only method it provides.
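
To make that concrete, usage of such a helper might look like the sketch
below (the helper name hash_from_iterable and the Position class are made up
for illustration, echoing the OP's example; the helper body is only a
stand-in, since the point of a real one would be to avoid the temporary
tuple):

def hash_from_iterable(iterable):
    # Stand-in body only: a real helper would feed the items through the
    # tuple hashing algorithm incrementally instead of building a tuple.
    return hash(tuple(iterable))


class Position:
    """Immutable game position over a potentially large board."""

    def __init__(self, cells):
        self._cells = tuple(cells)

    def __eq__(self, other):
        return isinstance(other, Position) and self._cells == other._cells

    def __hash__(self):
        # The hashing strategy is chosen explicitly in the class itself,
        # rather than inherited implicitly from an ABC.
        return hash_from_iterable(self._cells)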

Paul


Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Stephen J. Turnbull
Neil Girdhar writes:

 > I don't understand this?  How is providing a default method in an
 > abstract base class a pessimization?  If it happens to be slower
 > than the code in the current methods, it can still be overridden.

How often will people override until it's bitten them?  How many
people will not even notice until they've lost business due to slow
response?  If you don't have a default, that's much more obvious.
Note that if there is a default, the collections are "large", and
equality comparisons are "rare", this could be a substantial overhead.

 > > BTW, it occurs to me that now that dictionaries are versioned, in some
 > > cases it *may* make sense to hash dictionaries even though they are
 > > mutable, although the "hash" would need to somehow account for the
 > > version changing.  Seems messy but maybe someone has an idea?

 > I think it's important to keep in mind that dictionaries are not versioned
 > in Python. They happen to be versioned in CPython as an unexposed
 > implementation detail.  I don't think that such details should have any
 > bearing on potential changes to Python.

AFAIK the use of the hash member for equality checking is an
implementation detail too, although the language reference does
mention that set, frozenset and dict are "hashed collections".  The
basic requirements on hashes are that (1) objects that compare equal
must hash to the same value, and (2) the hash bucket must not change
over an object's lifetime (this is the "messy" aspect that probably
kills the idea -- you'd need to fix up all hashed collections that
contain the object as a key).



Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Paul Moore
On 6 January 2017 at 09:02, Neil Girdhar  wrote:
>
> Yeah, looks like you missed a revision.  There were two emails.  I suggested
> adding ImmutableIterable and ImmutableSet, and so there is an obvious
> implementation of __hash__ for both.

OK, sorry.

The proposal is still getting more complicated, though, and I really
don't see how it's better than having some low-level helper functions
for people who need to build custom __hash__ implementations. The "one
obvious way" to customise hashing is to implement __hash__, not to
derive from a base class that says my class is an
Immutable{Iterable,Set}.

Paul


Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Neil Girdhar
On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull <
turnbull.stephen...@u.tsukuba.ac.jp> wrote:

> Neil Girdhar writes:
>
>  > I don't understand this?  How is providing a default method in an
>  > abstract base class a pessimization?  If it happens to be slower
>  > than the code in the current methods, it can still be overridden.
>
> How often will people override until it's bitten them?  How many
> people will not even notice until they've lost business due to slow
> response?  If you don't have a default, that's much more obvious.
> Note that if there is a default, the collections are "large", and
> equality comparisons are "rare", this could be a substantial overhead.
>

I still don't understand what you're talking about here.  You're saying
that we shouldn't provide a __hash__ in case the default hash happens to be
slower than what the user wants and so by not providing it, we force the
user to write a fast one?  Doesn't that argument apply to all methods
provided by abcs?


>  > > BTW, it occurs to me that now that dictionaries are versioned, in some
>  > > cases it *may* make sense to hash dictionaries even though they are
>  > > mutable, although the "hash" would need to somehow account for the
>  > > version changing.  Seems messy but maybe someone has an idea?
>
>  > I think it's important to keep in mind that dictionaries are not
> versioned
>  > in Python. They happen to be versioned in CPython as an unexposed
>  > implementation detail.  I don't think that such details should have any
>  > bearing on potential changes to Python.
>
> AFAIK the use of the hash member for equality checking is an
> implementation detail too, although the language reference does
> mention that set, frozenset and dict are "hashed collections".  The
> basic requirements on hashes are that (1) objects that compare equal
> must hash to the same value, and (2) the hash bucket must not change
> over an object's lifetime (this is the "messy" aspect that probably
> kills the idea -- you'd need to fix up all hashed collections that
> contain the object as a key).
>

Are you saying that __hash__ is called by __eq__?

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Neil Girdhar
On Thu, Jan 5, 2017 at 9:10 PM Stephen J. Turnbull <
turnbull.stephen...@u.tsukuba.ac.jp> wrote:

> Paul Moore writes:
>
>  > The debate here regarding tuple/frozenset indicates that there may not
>  > be a "standard way" of hashing an iterable (should order matter?).
>
> If part of the data structure, yes, if an implementation accident, no.
>
>  > Although I agree that assuming order matters is a reasonable
>  > assumption to make in the absence of any better information.
>
> I don't think so.  Eg, with dicts now ordered by insertion, an
> order-dependent default hash for collections means
>
> a = {}
> b = {}
> a['1'] = 1
> a['2'] = 2
> b['2'] = 2
> b['1'] = 1
> hash(a) != hash(b)  # modulo usual probability of collision
>
> (and modulo normally not hashing mutables).  For the same reason I
> expect I'd disagree with Neil's proposal for an ImmutableWhatever
> default __hash__ although the hash comparison is "cheap", it's still a
> pessimization.  Haven't thought that through, though.
>

I don't understand this?  How is providing a default method in an abstract
base class a pessimization?  If it happens to be slower than the code in
the current methods, it can still be overridden.


>
> BTW, it occurs to me that now that dictionaries are versioned, in some
> cases it *may* make sense to hash dictionaries even though they are
> mutable, although the "hash" would need to somehow account for the
> version changing.  Seems messy but maybe someone has an idea?
>

I think it's important to keep in mind that dictionaries are not versioned
in Python. They happen to be versioned in CPython as an unexposed
implementation detail.  I don't think that such details should have any
bearing on potential changes to Python.


> Steve
>

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Stephen J. Turnbull
Paul Moore writes:

 > The debate here regarding tuple/frozenset indicates that there may not
 > be a "standard way" of hashing an iterable (should order matter?).

If part of the data structure, yes, if an implementation accident, no.

 > Although I agree that assuming order matters is a reasonable
 > assumption to make in the absence of any better information.

I don't think so.  Eg, with dicts now ordered by insertion, an
order-dependent default hash for collections means

a = {}
b = {}
a['1'] = 1
a['2'] = 2
b['2'] = 2
b['1'] = 1
hash(a) != hash(b)  # modulo usual probability of collision

(and modulo normally not hashing mutables).  For the same reason I
expect I'd disagree with Neil's proposal for an ImmutableWhatever
default __hash__ although the hash comparison is "cheap", it's still a
pessimization.  Haven't thought that through, though.

BTW, it occurs to me that now that dictionaries are versioned, in some
cases it *may* make sense to hash dictionaries even though they are
mutable, although the "hash" would need to somehow account for the
version changing.  Seems messy but maybe someone has an idea?

Steve


Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Neil Girdhar
On Thu, Jan 5, 2017 at 10:58 AM Paul Moore  wrote:

> On 5 January 2017 at 13:28, Neil Girdhar  wrote:
> > The point is that the OP doesn't want to write his own hash function, but
> > wants Python to provide a standard way of hashing an iterable.  Today,
> the
> > standard way is to convert to tuple and call hash on that.  That may not
> be
> > efficient. FWIW from a style perspective, I agree with OP.
>
> The debate here regarding tuple/frozenset indicates that there may not
> be a "standard way" of hashing an iterable (should order matter?).
> Although I agree that assuming order matters is a reasonable
> assumption to make in the absence of any better information.
>

That's another good point.  In keeping with my abc proposal, why not add
abstract base classes with __hash__:
* ImmutableIterable, and
* ImmutableSet.

ImmutableSet inherits from ImmutableIterable, and overrides __hash__ in
such a way that order is ignored.

This presumably involves very little new code; it's just propagating up the
code that's already in set and tuple.

The advantage is that instead of implementing __hash__ for your type, you
declare your intention by inheriting from an abc and get an
automatically-provided hash function.
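
A rough sketch of what those two ABCs could look like (the bodies below are
only placeholders that reuse existing machinery; a real version would
presumably share the streaming implementations with tuple and frozenset):

from collections.abc import Iterable, Set


class ImmutableIterable(Iterable):
    """Hypothetical ABC: order-sensitive default hash over the contents."""

    __slots__ = ()

    def __hash__(self):
        # Placeholder: delegates to tuple's hash, which copies the contents;
        # a real implementation would stream instead of building a tuple.
        return hash(tuple(self))


class ImmutableSet(ImmutableIterable, Set):
    """Hypothetical ABC: order-insensitive default hash."""

    __slots__ = ()

    def __hash__(self):
        # Set._hash() is the order-insensitive algorithm already present
        # in _collections_abc.py.
        return self._hash()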

> Hashing is low enough level that providing helpers in the stdlib is
> not unreasonable. It's not obvious (to me, at least) that it's a
> common enough need to warrant it, though. Do we have any information
> on how often people implement their own __hash__, or how often
> hash(tuple(my_iterable)) would be an acceptable hash, except for the
> cost of creating the tuple? The OP's request is the only time this has
> come up as a requirement, to my knowledge. Hence my suggestion to copy
> the tuple implementation, modify it to work with general iterables,
> and publish it as a 3rd party module - its usage might give us an idea
> of how often this need arises. (The other option would be for someone
> to do some analysis of published code).
>
> Assuming it is a sufficiently useful primitive to add, then we can
> debate naming. But I'd prefer it to be named in such a way that it
> makes it clear that it's a low-level helper for people writing their
> own __hash__ function, and not some sort of variant of hashing (which
> hash.from_iterable implies to me).
>
> Paul
>

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Matt Gilson
I agree with Paul -- I'm not convinced that this is common enough or that
the benefits are big enough to warrant something builtin.  However, I did
decide to dust off some of my old skills and I threw together a simple gist
to see how hard it would be to create something using Cython based on the
CPython tuple hash algorithm.  I don't know how well it works for arbitrary
iterables without a `__length_hint__`, but it seems to work as intended for
iterables that have the length hint.


https://gist.github.com/mgilson/129859a79487a483163980db25b709bf

If you're interested, or want to pick this up and actually do something
with it, feel free... Also, I haven't written anything using Cython for
ages, so if this could be further optimized, feel free to let me know.

On Thu, Jan 5, 2017 at 7:58 AM, Paul Moore  wrote:

> On 5 January 2017 at 13:28, Neil Girdhar  wrote:
> > The point is that the OP doesn't want to write his own hash function, but
> > wants Python to provide a standard way of hashing an iterable.  Today,
> the
> > standard way is to convert to tuple and call hash on that.  That may not
> be
> > efficient. FWIW from a style perspective, I agree with OP.
>
> The debate here regarding tuple/frozenset indicates that there may not
> be a "standard way" of hashing an iterable (should order matter?).
> Although I agree that assuming order matters is a reasonable
> assumption to make in the absence of any better information.
>
> Hashing is low enough level that providing helpers in the stdlib is
> not unreasonable. It's not obvious (to me, at least) that it's a
> common enough need to warrant it, though. Do we have any information
> on how often people implement their own __hash__, or how often
> hash(tuple(my_iterable)) would be an acceptable hash, except for the
> cost of creating the tuple? The OP's request is the only time this has
> come up as a requirement, to my knowledge. Hence my suggestion to copy
> the tuple implementation, modify it to work with general iterables,
> and publish it as a 3rd party module - its usage might give us an idea
> of how often this need arises. (The other option would be for someone
> to do some analysis of published code).
>
> Assuming it is a sufficiently useful primitive to add, then we can
> debate naming. But I'd prefer it to be named in such a way that it
> makes it clear that it's a low-level helper for people writing their
> own __hash__ function, and not some sort of variant of hashing (which
> hash.from_iterable implies to me).
>
> Paul




Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Paul Moore
On 5 January 2017 at 13:28, Neil Girdhar  wrote:
> The point is that the OP doesn't want to write his own hash function, but
> wants Python to provide a standard way of hashing an iterable.  Today, the
> standard way is to convert to tuple and call hash on that.  That may not be
> efficient. FWIW from a style perspective, I agree with OP.

The debate here regarding tuple/frozenset indicates that there may not
be a "standard way" of hashing an iterable (should order matter?).
Although I agree that assuming order matters is a reasonable
assumption to make in the absence of any better information.

Hashing is low enough level that providing helpers in the stdlib is
not unreasonable. It's not obvious (to me, at least) that it's a
common enough need to warrant it, though. Do we have any information
on how often people implement their own __hash__, or how often
hash(tuple(my_iterable)) would be an acceptable hash, except for the
cost of creating the tuple? The OP's request is the only time this has
come up as a requirement, to my knowledge. Hence my suggestion to copy
the tuple implementation, modify it to work with general iterables,
and publish it as a 3rd party module - its usage might give us an idea
of how often this need arises. (The other option would be for someone
to do some analysis of published code).

Assuming it is a sufficiently useful primitive to add, then we can
debate naming. But I'd prefer it to be named in such a way that it
makes it clear that it's a low-level helper for people writing their
own __hash__ function, and not some sort of variant of hashing (which
hash.from_iterable implies to me).

Paul


Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Random832
On Thu, Jan 5, 2017, at 04:00, Matt Gilson wrote:
> But, I think that the problem with adding `__hash__` to
> collections.abc.Iterable is that not all iterables are immutable -- And
> if
> they aren't immutable, then allowing them to be hashed is likely to be a
> pretty bad idea...

Why? This should never cause an interpreter-crashing bug, because
user-defined types can have bad hash methods anyway. And without that,
the reason for not applying the "consenting adults" principle and
allowing people to add mutable objects to a *short-lived* dict without
intending to change them while the dict is in use has never been clear
to me. I think mutable types not having a hash method was a mistake in
the first place.


Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Neil Girdhar
The point is that the OP doesn't want to write his own hash function, but
wants Python to provide a standard way of hashing an iterable.  Today, the
standard way is to convert to tuple and call hash on that.  That may not be
efficient. FWIW from a style perspective, I agree with OP.

On Thu, Jan 5, 2017 at 4:30 AM M.-A. Lemburg  wrote:

> On 28.12.2016 04:13, j...@math.brown.edu wrote:
> > Suppose you have implemented an immutable Position type to represent
> > the state of a game played on an MxN board, where the board size can
> > grow quite large.
> > ...
> >
> > According to
> https://docs.python.org/3/reference/datamodel.html#object.__hash__
> > :
> >
> >
> > """
> > it is advised to mix together the hash values of the components of the
> > object that also play a part in comparison of objects by packing them
> > into a tuple and hashing the tuple. Example:
> >
> > def __hash__(self):
> > return hash((self.name, self.nick, self.color))
> >
> > """
> >
> >
> > Applying this advice to the use cases above would require creating an
> > arbitrarily large tuple in memory before passing it to hash(), which
> > is then just thrown away. It would be preferable if there were a way
> > to pass multiple values to hash() in a streaming fashion, such that
> > the overall hash were computed incrementally, without building up a
> > large object in memory first.
>
> I think there's a misunderstanding here: the hash(obj) built-in
> merely interfaces to the obj.__hash__() method (or the tp_hash slot
> for C types) and returns whatever these methods give.
>
> It doesn't implement any logic by itself.
>
> If you would like to implement a more efficient hash algorithm
> for your types, just go ahead and write them as .__hash__()
> method or tp_hash slot method and you're done.
>
> The example from the docs is just to showcase an example of
> how such a hash function should work, i.e. to mix in all
> relevant data attributes.
>
> In your case, you'd probably use a simple for loop to calculate
> the hash without creating tuples or any other temporary
> structures.
>
> Here's the hash implementation tuples use as an example
>
> /* The addend 82520, was selected from the range(0, 100) for
>generating the greatest number of prime multipliers for tuples
>upto length eight:
>
>  1082527, 1165049, 1082531, 1165057, 1247581, 1330103, 1082533,
>  1330111, 1412633, 1165069, 1247599, 1495177, 1577699
>
>Tests have shown that it's not worth to cache the hash value, see
>issue #9685.
> */
>
> static Py_hash_t
> tuplehash(PyTupleObject *v)
> {
> Py_uhash_t x;  /* Unsigned for defined overflow behavior. */
> Py_hash_t y;
> Py_ssize_t len = Py_SIZE(v);
> PyObject **p;
> Py_uhash_t mult = _PyHASH_MULTIPLIER;
> x = 0x345678UL;
> p = v->ob_item;
> while (--len >= 0) {
> y = PyObject_Hash(*p++);
> if (y == -1)
> return -1;
> x = (x ^ y) * mult;
> /* the cast might truncate len; that doesn't change hash
> stability */
> mult += (Py_hash_t)(82520UL + len + len);
> }
> x += 97531UL;
> if (x == (Py_uhash_t)-1)
> x = -2;
> return x;
> }
>
> As you can see, there's some magic going on there to make
> sure that the hash values behave well when used as "keys"
> for the dictionary implementation (which is their main
> purpose in Python).
>
> You are free to create your own hash implementation.
> The only characteristic to pay attention to is to have
> objects which compare equal give the same hash value.
> This is needed to be able to map such objects to the same
> dictionary slots.
>
> There should be no need to have a special hash function which
> works on iterables. As long as those iterable objects define
> their own .__hash__() method or tp_slot, the hash() built-in
> (and Python's dict implementation) will use these and, if needed,
> those methods can then use an approach to build hash values
> using iterators on the object's internal data along similar
> lines as the above tuple implementation.
>

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Neil Girdhar
On Thu, Jan 5, 2017 at 4:00 AM Matt Gilson  wrote:

> But, I think that the problem with adding `__hash__` to
> collections.abc.Iterable is that not all iterables are immutable -- And if
> they aren't immutable, then allowing them to be hashed is likely to be a
> pretty bad idea...
>

Good point.  A better option is to add collections.abc.ImmutableIterable
that derives from Iterable and provides __hash__.  Since tuple inherits
from it, it can choose to delegate up.  Then I think everyone is happy.

>
> I'm still having a hard time being convinced that this is very much of an
> optimization at all ...
>
> If you start hashing tuples that are large enough that memory is a
> concern, then that's going to also take a *really* long time and probably
> be prohibitive anyway.  Just for kicks, I decided to throw together a
> simple script to time how much penalty you pay for hashing a tuple:
>
> class F(object):
> def __init__(self, arg):
> self.arg = arg
>
> def __hash__(self):
> return hash(tuple(self.arg))
>
>
> class T(object):
> def __init__(self, arg):
> self.arg = tuple(arg)
>
> def __hash__(self):
> return hash(self.arg)
>
>
> class C(object):
> def __init__(self, arg):
> self.arg = tuple(arg)
> self._hash = None
>
> def __hash__(self):
> if self._hash is None:
> self._hash = hash(tuple(self.arg))
> return self._hash
>
> import timeit
>
> print(timeit.timeit('hash(f)', 'from __main__ import F; f =
> F(list(range(500)))'))
> print(timeit.timeit('hash(t)', 'from __main__ import T; t =
> T(list(range(500)))'))
> print(timeit.timeit('hash(c)', 'from __main__ import C; c =
> C(list(range(500)))'))
>
> results = []
> for i in range(1, 11):
> n = i * 100
> t1 = timeit.timeit('hash(f)', 'from __main__ import F; f =
> F(list(range(%d)))' % i)
> t2 = timeit.timeit('hash(t)', 'from __main__ import T; t =
> T(list(range(%d)))' % i)
> results.append(t1/t2)
> print(results)
>
>
> F is going to create a new tuple each time and then hash it.  T already
> has a tuple, so we'll only pay the cost of hashing a tuple, not the cost of
> constructing a tuple and C caches the hash value and re-uses it once it is
> known.  C is the winner by a factor of 10 or more (no surprise there).  But
> the real interesting thing is that the the ratio of the timing results from
> hashing `F` vs. `T` is relatively constant in the range of my test (up to
> 1000 elements) and that ratio's value is approximately 1.3.  For most
> applications, that seems reasonable.  If you really need a speed-up, then I
> suppose you could recode the thing in Cython and see what happens, but I
> doubt that will be frequently necessary.  If you _do_ code it up in Cython,
> put it up on Pypi and see if people use it...
>
>
> On Wed, Jan 4, 2017 at 5:04 PM, Neil Girdhar 
> wrote:
>
> Couldn't you add __hash__ to collections.abc.Iterable ?  Essentially,
> expose __hash__ there; then all iterables automatically have a default hash
> that hashes their ordered contents.
>
> On Wednesday, January 4, 2017 at 7:37:26 PM UTC-5, Steven D'Aprano wrote:
>
> On Wed, Jan 04, 2017 at 04:38:05PM -0500, j...@math.brown.edu wrote:
> > Instead of the proposals like "hash.from_iterable()", would it make
> sense
> > to allow tuple.__hash__() to accept any iterable, when called as a
> > classmethod?
>
> The public API for calculating the hash of something is to call the
> hash() builtin function on some object, e.g. to call tuple.__hash__ you
> write hash((a, b, c)). The __hash__ dunder method is implementation, not
> interface, and normally shouldn't be called directly.
>
> Unless I'm missing something obvious, your proposal would require the
> caller to call the dunder methods directly:
>
> class X:
> def __hash__(self):
> return tuple.__hash__(iter(self))
>
> I consider that a poor interface design.
>
> But even if we decide to make an exception in this case, tuple.__hash__
> is currently an ordinary instance method right now. There's probably
> code that relies on that fact and expects that:
>
> tuple.__hash__((a, b, c))
>
> is currently the same as
>
> (a, b, c).__hash__()
>
>
> (Starting with the hash() builtin itself, I expect, although that is
> easy enough to fix if needed.) Your proposal will break backwards
> compatibility, as it requires a change in semantics:
>
> (1) (a, b, c).__hash__() must keep the current behaviour, which
> means behaving like a bound instance method;
>
> (2) But tuple.__hash__ will no longer return an unbound method (actually
> a function object, but the difference is unimportant) and instead will
> return something that behaves like a bound class method.
>
> Here's an implementation which does this:
>
> http://code.activestate.com/recipes/577030-dualmethod-descriptor/
>
> so such a thing is possible. But it breaks backwards-compatability and
> introduces something which I consider to be an unclean API (calling a
> dun

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread M.-A. Lemburg
On 28.12.2016 04:13, j...@math.brown.edu wrote:
> Suppose you have implemented an immutable Position type to represent
> the state of a game played on an MxN board, where the board size can
> grow quite large.
> ...
> 
> According to 
> https://docs.python.org/3/reference/datamodel.html#object.__hash__
> :
> 
> 
> """
> it is advised to mix together the hash values of the components of the
> object that also play a part in comparison of objects by packing them
> into a tuple and hashing the tuple. Example:
> 
> def __hash__(self):
> return hash((self.name, self.nick, self.color))
> 
> """
> 
> 
> Applying this advice to the use cases above would require creating an
> arbitrarily large tuple in memory before passing it to hash(), which
> is then just thrown away. It would be preferable if there were a way
> to pass multiple values to hash() in a streaming fashion, such that
> the overall hash were computed incrementally, without building up a
> large object in memory first.

I think there's a misunderstanding here: the hash(obj) built-in
merely interfaces to the obj.__hash__() method (or the tp_hash slot
for C types) and returns whatever these methods give.

It doesn't implement any logic by itself.

If you would like to implement a more efficient hash algorithm
for your types, just go ahead and write them as .__hash__()
method or tp_hash slot method and you're done.

The example from the docs is just to showcase an example of
how such a hash function should work, i.e. to mix in all
relevant data attributes.

In your case, you'd probably use a simple for loop to calculate
the hash without creating tuples or any other temporary
structures.

Here's the hash implementation tuples use as an example

/* The addend 82520, was selected from the range(0, 1000000) for
   generating the greatest number of prime multipliers for tuples
   upto length eight:

     1082527, 1165049, 1082531, 1165057, 1247581, 1330103, 1082533,
     1330111, 1412633, 1165069, 1247599, 1495177, 1577699

   Tests have shown that it's not worth to cache the hash value, see
   issue #9685.
*/

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    Py_uhash_t x;  /* Unsigned for defined overflow behavior. */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    Py_uhash_t mult = _PyHASH_MULTIPLIER;
    x = 0x345678UL;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (Py_hash_t)(82520UL + len + len);
    }
    x += 97531UL;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}

As you can see, there's some magic going on there to make
sure that the hash values behave well when used as "keys"
for the dictionary implementation (which is their main
purpose in Python).

You are free to create your own hash implementation.
The only characteristic to pay attention to is to have
objects which compare equal give the same hash value.
This is needed to be able to map such objects to the same
dictionary slots.

There should be no need to have a special hash function which
works on iterables. As long as those iterable objects define
their own .__hash__() method or tp_slot, the hash() built-in
(and Python's dict implementation) will use these and, if needed,
those methods can then use an approach to build hash values
using iterators on the object's internal data along similar
lines as the above tuple implementation.
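
For example, a rough pure-Python transcription of the C code above, written
to consume any sized iterable (a list, a dict view, a custom container)
without building a temporary tuple, could look like the sketch below. The
function name and the explicit masking are mine, and the length is needed up
front because the algorithm folds the remaining length into the multiplier;
treat it as an illustration of the approach rather than a guaranteed
bit-for-bit match for hash():

import sys

_MASK = 2 * sys.maxsize + 1   # width of Py_uhash_t, for wrap-around arithmetic


def hash_sized_iterable(items):
    """Order-sensitive hash of a sized iterable, modelled on tuplehash above."""
    n = len(items)            # remaining length feeds into the multiplier
    x = 0x345678
    mult = 1000003            # _PyHASH_MULTIPLIER
    for item in items:
        n -= 1
        x = ((x ^ hash(item)) * mult) & _MASK
        mult = (mult + 82520 + n + n) & _MASK
    x = (x + 97531) & _MASK
    # Reinterpret as a signed Py_hash_t and avoid -1, which is reserved.
    if x > sys.maxsize:
        x -= _MASK + 1
    if x == -1:
        x = -2
    return x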



Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Matt Gilson
But, I think that the problem with adding `__hash__` to
collections.abc.Iterable is that not all iterables are immutable -- And if
they aren't immutable, then allowing them to be hashed is likely to be a
pretty bad idea...

I'm still having a hard time being convinced that this is very much of an
optimization at all ...

If you start hashing tuples that are large enough that memory is a concern,
then that's going to also take a *really* long time and probably be
prohibitive anyway.  Just for kicks, I decided to throw together a simple
script to time how much penalty you pay for hashing a tuple:

class F(object):
    def __init__(self, arg):
        self.arg = arg

    def __hash__(self):
        return hash(tuple(self.arg))


class T(object):
    def __init__(self, arg):
        self.arg = tuple(arg)

    def __hash__(self):
        return hash(self.arg)


class C(object):
    def __init__(self, arg):
        self.arg = tuple(arg)
        self._hash = None

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(self.arg))
        return self._hash

import timeit

print(timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(500)))'))
print(timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(500)))'))
print(timeit.timeit('hash(c)', 'from __main__ import C; c = C(list(range(500)))'))

results = []
for i in range(1, 11):
    n = i * 100   # collection sizes 100 .. 1000
    t1 = timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(%d)))' % n)
    t2 = timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(%d)))' % n)
    results.append(t1/t2)
print(results)


F is going to create a new tuple each time and then hash it.  T already has
a tuple, so we'll only pay the cost of hashing a tuple, not the cost of
constructing a tuple, and C caches the hash value and re-uses it once it is
known.  C is the winner by a factor of 10 or more (no surprise there).  But
the really interesting thing is that the ratio of the timing results from
hashing `F` vs. `T` is relatively constant in the range of my test (up to
1000 elements) and that ratio's value is approximately 1.3.  For most
applications, that seems reasonable.  If you really need a speed-up, then I
suppose you could recode the thing in Cython and see what happens, but I
doubt that will be frequently necessary.  If you _do_ code it up in Cython,
put it up on Pypi and see if people use it...


On Wed, Jan 4, 2017 at 5:04 PM, Neil Girdhar  wrote:

> Couldn't you add __hash__ to collections.abc.Iterable ?  Essentially,
> expose __hash__ there; then all iterables automatically have a default hash
> that hashes their ordered contents.
>
> On Wednesday, January 4, 2017 at 7:37:26 PM UTC-5, Steven D'Aprano wrote:
>>
>> On Wed, Jan 04, 2017 at 04:38:05PM -0500, j...@math.brown.edu wrote:
>> > Instead of the proposals like "hash.from_iterable()", would it make
>> sense
>> > to allow tuple.__hash__() to accept any iterable, when called as a
>> > classmethod?
>>
>> The public API for calculating the hash of something is to call the
>> hash() builtin function on some object, e.g. to call tuple.__hash__ you
>> write hash((a, b, c)). The __hash__ dunder method is implementation, not
>> interface, and normally shouldn't be called directly.
>>
>> Unless I'm missing something obvious, your proposal would require the
>> caller to call the dunder methods directly:
>>
>> class X:
>> def __hash__(self):
>> return tuple.__hash__(iter(self))
>>
>> I consider that a poor interface design.
>>
>> But even if we decide to make an exception in this case, tuple.__hash__
>> is currently an ordinary instance method right now. There's probably
>> code that relies on that fact and expects that:
>>
>> tuple.__hash__((a, b, c))
>>
>> is currently the same as
>>
>> (a, b, c).__hash__()
>>
>>
>> (Starting with the hash() builtin itself, I expect, although that is
>> easy enough to fix if needed.) Your proposal will break backwards
>> compatibility, as it requires a change in semantics:
>>
>> (1) (a, b, c).__hash__() must keep the current behaviour, which
>> means behaving like a bound instance method;
>>
>> (2) But tuple.__hash__ will no longer return an unbound method (actually
>> a function object, but the difference is unimportant) and instead will
>> return something that behaves like a bound class method.
>>
>> Here's an implementation which does this:
>>
>> http://code.activestate.com/recipes/577030-dualmethod-descriptor/
>>
>> so such a thing is possible. But it breaks backwards-compatability and
>> introduces something which I consider to be an unclean API (calling a
>> dunder method directly). Unless there's a *really* strong advantage to
>>
>> tuple.__hash__(...)
>>
>> over
>>
>> hash.from_iterable(...)
>>
>> (or equivalent), I would be against this change.
>>
>>
>>
>> > (And similarly with frozenset.__hash__(), so that the fast C
>> > implementation of that algorithm could be used, rather than the slow
>> > col

Re: [Python-ideas] incremental hashing in __hash__

2017-01-05 Thread Paul Moore
On 5 January 2017 at 00:31, Steven D'Aprano  wrote:
> This is a good point. Until now, I've been assuming that
> hash.from_iterable should consider order. But frozenset shows us that
> sometimes the hash should *not* consider order.
>
> This hints that perhaps the hash.from_iterable() should have its own
> optional dunder method. Or maybe we need two functions: an ordered
> version and an unordered version.
>
> Hmmm... just tossing out a wild idea here... let's get rid of the dunder
> method part of your suggestion, and add new public class methods to
> tuple and frozenset:
>
> tuple.hash_from_iter(iterable)
> frozenset.hash_from_iter(iterable)
>
>
> That gets rid of all the objections about backwards compatibility, since
> these are new methods. They're not dunder names, so there are no
> objections to being used as part of the public API.
>
> A possible objection is the question, is this functionality *actually*
> important enough to bother?
>
> Another possible objection: are these methods part of the sequence/set
> API? If not, do they really belong on the tuple/frozenset? Maybe they
> belong elsewhere?

At this point I'd be inclined to say that a 3rd party hashing_utils
module would be a reasonable place to thrash out these design
decisions before committing to a permanent design in the stdlib. The
popularity of such a module would also give a level of indication as
to whether this is an important optimisation in practice.

Paul


Re: [Python-ideas] incremental hashing in __hash__

2017-01-04 Thread Neil Girdhar
Couldn't you add __hash__ to collections.abc.Iterable ?  Essentially, 
expose __hash__ there; then all iterables automatically have a default hash 
that hashes their ordered contents.

On Wednesday, January 4, 2017 at 7:37:26 PM UTC-5, Steven D'Aprano wrote:
>
> On Wed, Jan 04, 2017 at 04:38:05PM -0500, j...@math.brown.edu 
>  wrote: 
> > Instead of the proposals like "hash.from_iterable()", would it make 
> sense 
> > to allow tuple.__hash__() to accept any iterable, when called as a 
> > classmethod? 
>
> The public API for calculating the hash of something is to call the 
> hash() builtin function on some object, e.g. to call tuple.__hash__ you 
> write hash((a, b, c)). The __hash__ dunder method is implementation, not 
> interface, and normally shouldn't be called directly. 
>
> Unless I'm missing something obvious, your proposal would require the 
> caller to call the dunder methods directly: 
>
> class X: 
> def __hash__(self): 
> return tuple.__hash__(iter(self)) 
>
> I consider that a poor interface design. 
>
> But even if we decide to make an exception in this case, tuple.__hash__ 
> is currently an ordinary instance method right now. There's probably 
> code that relies on that fact and expects that: 
>
> tuple.__hash__((a, b, c)) 
>
> is currently the same as 
>
> (a, b, c).__hash__() 
>
>
> (Starting with the hash() builtin itself, I expect, although that is 
> easy enough to fix if needed.) Your proposal will break backwards 
> compatibility, as it requires a change in semantics: 
>
> (1) (a, b, c).__hash__() must keep the current behaviour, which 
> means behaving like a bound instance method; 
>
> (2) But tuple.__hash__ will no longer return an unbound method (actually 
> a function object, but the difference is unimportant) and instead will 
> return something that behaves like a bound class method. 
>
> Here's an implementation which does this: 
>
> http://code.activestate.com/recipes/577030-dualmethod-descriptor/ 
>
> so such a thing is possible. But it breaks backwards-compatability and 
> introduces something which I consider to be an unclean API (calling a 
> dunder method directly). Unless there's a *really* strong advantage to 
>
> tuple.__hash__(...) 
>
> over 
>
> hash.from_iterable(...) 
>
> (or equivalent), I would be against this change. 
>
>
>
> > (And similarly with frozenset.__hash__(), so that the fast C 
> > implementation of that algorithm could be used, rather than the slow 
> > collections.Set._hash() implementation. Then the duplicated 
> implementation 
> > in _collections_abc.py's Set._hash() could be removed completely, 
> > delegating to frozenset.__hash__() instead.) 
>
> This is a good point. Until now, I've been assuming that 
> hash.from_iterable should consider order. But frozenset shows us that 
> sometimes the hash should *not* consider order. 
>
> This hints that perhaps the hash.from_iterable() should have its own 
> optional dunder method. Or maybe we need two functions: an ordered 
> version and an unordered version. 
>
> Hmmm... just tossing out a wild idea here... let's get rid of the dunder 
> method part of your suggestion, and add new public class methods to 
> tuple and frozenset: 
>
> tuple.hash_from_iter(iterable) 
> frozenset.hash_from_iter(iterable) 
>
>
> That gets rid of all the objections about backwards compatibility, since 
> these are new methods. They're not dunder names, so there are no 
> objections to being used as part of the public API. 
>
> A possible objection is the question, is this functionality *actually* 
> important enough to bother? 
>
> Another possible objection: are these methods part of the sequence/set 
> API? If not, do they really belong on the tuple/frozenset? Maybe they 
> belong elsewhere? 
>
>
>
> > Would this API more cleanly communicate the algorithm being used and the 
> > implementation, 
>
> No. If you want to communicate the algorithm being used, write some 
> documentation. 
>
> Seriously, the public API doesn't communicate the algorithm used for the 
> implementation. How can it? We can keep the same interface and change 
> the implementation, or change the interface and keep the implementation. 
> The two are (mostly) independent. 
>
>
>
> > while making a smaller increase in API surface area 
> > compared to introducing a new function? 
>
> It's difficult to quantify "API surface area". On the one hand, we have 
> the addition of one or two new functions or methods. Contrast with: 
>
> * introducing a new kind of method into the built-ins (one which 
>   behaves like a classmethod when called from the class, and like 
>   an instance method when called from an instance); 
>
> * changing tuple.__hash__ from an ordinary method to one of the 
>   above special methods; 
>
> * and likewise for frozenset.__hash__; 
>
> * change __hash__ from "only used as implementation, not as 
>   interface" to "sometimes used as interface". 
>
>
> To me, 

Re: [Python-ideas] incremental hashing in __hash__

2017-01-04 Thread Steven D'Aprano
On Wed, Jan 04, 2017 at 04:38:05PM -0500, j...@math.brown.edu wrote:
> Instead of the proposals like "hash.from_iterable()", would it make sense
> to allow tuple.__hash__() to accept any iterable, when called as a
> classmethod? 

The public API for calculating the hash of something is to call the 
hash() builtin function on some object, e.g. to call tuple.__hash__ you 
write hash((a, b, c)). The __hash__ dunder method is implementation, not 
interface, and normally shouldn't be called directly.

Unless I'm missing something obvious, your proposal would require the 
caller to call the dunder methods directly:

class X:
def __hash__(self):
return tuple.__hash__(iter(self))

I consider that a poor interface design.

But even if we decide to make an exception in this case, tuple.__hash__ 
is currently an ordinary instance method right now. There's probably 
code that relies on that fact and expects that:

tuple.__hash__((a, b, c))

is currently the same as 

(a, b, c).__hash__()


(Starting with the hash() builtin itself, I expect, although that is 
easy enough to fix if needed.) Your proposal will break backwards 
compatibility, as it requires a change in semantics:

(1) (a, b, c).__hash__() must keep the current behaviour, which 
means behaving like a bound instance method;

(2) But tuple.__hash__ will no longer return an unbound method (actually 
a function object, but the difference is unimportant) and instead will 
return something that behaves like a bound class method.

Here's an implementation which does this:

http://code.activestate.com/recipes/577030-dualmethod-descriptor/
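
The general shape of such a descriptor, sketched from memory here rather
than copied from the recipe, is roughly:

import functools


class dualmethod:
    """Pass the class to the wrapped function when accessed on the class,
    and the instance when accessed on an instance."""

    def __init__(self, func):
        self.func = func

    def __get__(self, obj, cls=None):
        return functools.partial(self.func, cls if obj is None else obj)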

so such a thing is possible. But it breaks backwards-compatability and 
introduces something which I consider to be an unclean API (calling a 
dunder method directly). Unless there's a *really* strong advantage to

tuple.__hash__(...)

over 

hash.from_iterable(...)

(or equivalent), I would be against this change.



> (And similarly with frozenset.__hash__(), so that the fast C
> implementation of that algorithm could be used, rather than the slow
> collections.Set._hash() implementation. Then the duplicated implementation
> in _collections_abc.py's Set._hash() could be removed completely,
> delegating to frozenset.__hash__() instead.)

This is a good point. Until now, I've been assuming that 
hash.from_iterable should consider order. But frozenset shows us that 
sometimes the hash should *not* consider order.

This hints that perhaps the hash.from_iterable() should have its own 
optional dunder method. Or maybe we need two functions: an ordered 
version and an unordered version.
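
Concretely (the first comparison assumes no accidental collision, which is
the overwhelmingly likely case):

>>> hash((1, 2, 3)) == hash((3, 2, 1))   # tuples: order matters
False
>>> frozenset({1, 2, 3}) == frozenset({3, 2, 1})
True
>>> hash(frozenset({1, 2, 3})) == hash(frozenset({3, 2, 1}))   # equal, so equal hashes
True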

Hmmm... just tossing out a wild idea here... let's get rid of the dunder 
method part of your suggestion, and add new public class methods to 
tuple and frozenset:

tuple.hash_from_iter(iterable)
frozenset.hash_from_iter(iterable)


That gets rid of all the objections about backwards compatibility, since 
these are new methods. They're not dunder names, so there are no 
objections to being used as part of the public API.

A possible objection is the question, is this functionality *actually* 
important enough to bother?

Another possible objection: are these methods part of the sequence/set 
API? If not, do they really belong on the tuple/frozenset? Maybe they 
belong elsewhere?



> Would this API more cleanly communicate the algorithm being used and the
> implementation, 

No. If you want to communicate the algorithm being used, write some 
documentation.

Seriously, the public API doesn't communicate the algorithm used for the 
implementation. How can it? We can keep the same interface and change 
the implementation, or change the interface and keep the implementation. 
The two are (mostly) independent.



> while making a smaller increase in API surface area
> compared to introducing a new function?

It's difficult to quantify "API surface area". On the one hand, we have 
the addition of one or two new functions or methods. Contrast with:

* introducing a new kind of method into the built-ins (one which 
  behaves like a classmethod when called from the class, and like
  an instance method when called from an instance);

* changing tuple.__hash__ from an ordinary method to one of the
  above special methods;

* and likewise for frozenset.__hash__;

* change __hash__ from "only used as implementation, not as 
  interface" to "sometimes used as interface".


To me, adding one or two new methods/functions is the smaller, or at 
least less disruptive, change.



-- 
Steve


Re: [Python-ideas] incremental hashing in __hash__

2017-01-04 Thread Stephen J. Turnbull
j...@math.brown.edu writes:

 > Instead of the proposals like "hash.from_iterable()", would it make sense
 > to allow tuple.__hash__() to accept any iterable, when called as a
 > classmethod?
[...]

 > Would this API more cleanly communicate the algorithm being used and the
 > implementation, while making a smaller increase in API surface area
 > compared to introducing a new function?

I don't understand what you're proposing.  There are three problems
with the API.  First, the "obvious" meaning of
"tuple.__hash__(iterable)" is "hash(tuple(iterable))", and that's the
obvious spelling.  Second, there's no reason for a dunder method to be
polymorphic; this is hardly discoverable.  Third, dunders are normally
not called from user code (including class implementations, although
they're less ugly there), suggesting that there should be a helper for
this.

I'm unclear on how the function is supposed to be implemented, since
presumably tuple.__hash__ knows about the tuple data structure's
implementation (memory layout), but it can't know about an arbitrary
iterable.

Steve




Re: [Python-ideas] incremental hashing in __hash__

2017-01-04 Thread jab
Instead of the proposals like "hash.from_iterable()", would it make sense
to allow tuple.__hash__() to accept any iterable, when called as a
classmethod? (And similarly with frozenset.__hash__(), so that the fast C
implementation of that algorithm could be used, rather than the slow
collections.Set._hash() implementation. Then the duplicated implementation
in _collections_abc.py's Set._hash() could be removed completely,
delegating to frozenset.__hash__() instead.)

Would this API more cleanly communicate the algorithm being used and the
implementation, while making a smaller increase in API surface area
compared to introducing a new function?

Re: [Python-ideas] incremental hashing in __hash__

2017-01-01 Thread jab
On Sat, Dec 31, 2016 at 5:39 PM,  wrote:

> a set hashing algorithm is exposed as collections.Set._hash() in
> _collections_abc.py, which can be passed an iterable


(By which I meant to say, "which can be passed a set-like iterable such as
a Keys- or ItemsView of an existing mapping, so that the hash can be
computed from the existing data rather than copying it all into a new
frozenset." Should have been more precise.)

Re: [Python-ideas] incremental hashing in __hash__

2016-12-31 Thread jab
On Sat, Dec 31, 2016 at 7:17 AM, Nick Coghlan  wrote:

> On 31 December 2016 at 15:13,  wrote:
>
>> On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico 
>> wrote:
>>
>>> How likely is it that you'll have this form of collision, rather than
>>> some other? Remember, collisions *will* happen, so don't try to eliminate
>>> them all; just try to minimize the chances of *too many* collisions. So if
>>> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
>>> and (3,1,2), then sure, you need to care about order; but otherwise, one
>>> possible cause of a collision is no worse than any other. Keep your
>>> algorithm simple, and don't sweat the details that you aren't sure matter.
>>
>>
>> In the context of writing a collections library, and not an application,
>> it should work well for a diversity of workloads that your users could
>> throw at it. In that context, it's hard to know what to do with advice like
>> this. "Heck, just hash the first three items and call it a day!"
>>
>
> Yes, this is essentially what we're suggesting you do - start with a "good
> enough" hash that may have scalability problems (e.g. due to memory
> copying) or mathematical distribution problems (e.g. due to a naive
> mathematical combination of values), and then improve it over time based on
> real world usage of the library.
>
> Alternatively, you could take the existing tuple hashing algorithm and
> reimplement that in pure Python:
> https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336
>
> Based on microbenchmarks, you could then find the size breakpoint where it
> makes sense to switch between "hash(tuple(self))" (with memory copying, but
> a more optimised implementation of the algorithm) and a pure Python
> "tuple_hash(self)". In either case, caching the result on the instance
> would minimise the computational cost over the lifetime of the object.
>
> Cheers,
> Nick.
>
> P.S. Having realised that the tuple hash *algorithm* can be re-used
> without actually creating a tuple, I'm more amenable to the idea of
> exposing a "hash.from_iterable" callable that produces the same result as
> "hash(tuple(iterable))" without actually creating the intermediate tuple.
>

Nice!

I just realized, similar to tupleobject.c's "tuplehash" routine, I think
the frozenset_hash algorithm (implemented in setobject.c) can't be reused
without actually creating a frozenset either. As mentioned, a set hashing
algorithm is exposed as collections.Set._hash() in _collections_abc.py,
which can be passed an iterable, but that routine is implemented in Python.
So here again it seems like you have to choose between either creating an
ephemeral copy of your data so you can use the fast C routine, or streaming
your data to a slower Python implementation. At least in this case the
Python implementation is built-in though.

Given my current shortage of information, for now I'm thinking of handling
this problem in my library by exposing a parameter that users can tune if
needed. See bidict/_frozen.py in
https://github.com/jab/bidict/commit/485bf98#diff-215302d205b9f3650d58ee0337f77297,
and check out the _HASH_NITEMS_MAX attribute.
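
Roughly, the idea is something like this sketch (illustrative only, not the
actual bidict code; a None cutoff means hash everything):

    from itertools import islice

    class FrozenOrderedCollection:
        _HASH_NITEMS_MAX = None  # tune to an int to cap how many items get hashed

        def __init__(self, items):
            self._items = tuple(items)

        def __iter__(self):
            return iter(self._items)

        def __hash__(self):
            if not hasattr(self, '_hashval'):
                head = tuple(islice(self._items, self._HASH_NITEMS_MAX))
                self._hashval = hash((self.__class__, head))
            return self._hashval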

I have to run for now, but thanks again everyone, and happy new year!
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-31 Thread Brendan Barnwell

On 2016-12-30 17:04, Ethan Furman wrote:

On 12/30/2016 04:31 PM,j...@math.brown.edu  wrote:

On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman wrote:

If you are relying on an identity check for equality then no
two FrozenOrderedCollection instances can be equal.  Was that
 your intention?  If it was, then just hash the instance's
id() and you're done.


No, I was talking about the identity check done by a set or dict
when doing a lookup to check if the object in a hash bucket is
identical to the object being looked up. In that case, there is
no need for the set or dict to even call __eq__.  Right?

No.  It is possible to have two keys be equal but different -- an
easy example is 1 and 1.0; they both hash the same, equal the same,
but are not identical.  dict has to check equality when two different
objects hash the same but have non-matching identities.


	I think that is the same as what he said.  The point is that if they 
*are* the same object, you *don't* need to check equality.


--
Brendan Barnwell
"Do not follow where the path may lead.  Go, instead, where there is no
path, and leave a trail."
   --author unknown
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-31 Thread Nick Coghlan
On 31 December 2016 at 15:13,  wrote:

> On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico  wrote:
>
>> How likely is it that you'll have this form of collision, rather than
>> some other? Remember, collisions *will* happen, so don't try to eliminate
>> them all; just try to minimize the chances of *too many* collisions. So if
>> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
>> and (3,1,2), then sure, you need to care about order; but otherwise, one
>> possible cause of a collision is no worse than any other. Keep your
>> algorithm simple, and don't sweat the details that you aren't sure matter.
>
>
> In the context of writing a collections library, and not an application,
> it should work well for a diversity of workloads that your users could
> throw at it. In that context, it's hard to know what to do with advice like
> this. "Heck, just hash the first three items and call it a day!"
>

Yes, this is essentially what we're suggesting you do - start with a "good
enough" hash that may have scalability problems (e.g. due to memory
copying) or mathematical distribution problems (e.g. due to a naive
mathematical combination of values), and then improve it over time based on
real world usage of the library.

Alternatively, you could take the existing tuple hashing algorithm and
reimplement that in pure Python:
https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336

Based on microbenchmarks, you could then find the size breakpoint where it
makes sense to switch between "hash(tuple(self))" (with memory copying, but
a more optimised implementation of the algorithm) and a pure Python
"tuple_hash(self)". In either case, caching the result on the instance
would minimise the computational cost over the lifetime of the object.
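
For reference, a rough pure-Python port of that routine might look like the
following; this is an untested sketch that emulates the C unsigned arithmetic
by masking, not a drop-in replacement:

    import sys

    _WIDTH = 8 * sys.hash_info.width
    _MASK = (1 << _WIDTH) - 1

    def tuple_hash(collection):
        # Needs only len() and iteration, so no intermediate tuple is built.
        remaining = len(collection)
        x = 0x345678
        mult = 1000003                      # _PyHASH_MULTIPLIER
        for item in collection:
            remaining -= 1
            y = hash(item) & _MASK          # emulate the cast to Py_uhash_t
            x = ((x ^ y) * mult) & _MASK
            mult = (mult + 82520 + 2 * remaining) & _MASK
        x = (x + 97531) & _MASK
        if x == _MASK:                      # (Py_uhash_t)-1 is reserved
            x = _MASK - 1                   # CPython returns -2 here
        if x >= 1 << (_WIDTH - 1):          # back to a signed Py_hash_t
            x -= 1 << _WIDTH
        return x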

Cheers,
Nick.

P.S. Having realised that the tuple hash *algorithm* can be re-used without
actually creating a tuple, I'm more amenable to the idea of exposing a
"hash.from_iterable" callable that produces the same result as
"hash(tuple(iterable))" without actually creating the intermediate tuple.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Steven D'Aprano
On Fri, Dec 30, 2016 at 07:08:27PM -0800, Ethan Furman wrote:

> So maybe this will work?
> 
> def __hash__(self):
>     return hash(self.name) * hash(self.nick) * hash(self.color)

I don't like the multiplications. If any of the three hashes return 
zero, the overall hash will be zero. I think you need better mixing than 
that. Look at tuple:

py> hash((0, 1, 2))
-421559672
py> hash(0) * hash(1) * hash(2)
0



-- 
Steve
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Steven D'Aprano
On Fri, Dec 30, 2016 at 09:47:54PM -0500, j...@math.brown.edu wrote:

> __eq__ only has to be called when a hash bucket is non-empty. In that case,
> it may be O(n) in pathological cases, but it could also be O(1) every time.
> On the other hand, __hash__ has to be called on every lookup, is O(n) on
> the first call, and ideally O(1) after. I'd expect that __eq__ may often
> not dominate, and avoiding an unnecessary large tuple allocation on the
> first __hash__ call could be helpful.

Sorry to be the broken record repeating himself, but this sounds 
*exactly* like premature optimization here.

My suggestion is that you are overthinking things, or at least worrying 
about issues before you've got any evidence that they are going to be 
real issues. I expect that the amount of time and effort you've spent in 
analysing the theorectical problems here and writing to this list is 
more than the time it would have taken for you to write the simplest 
__hash__ method that could work (using the advice in the docs to make a 
tuple). You could have implemented that in five minutes, and be running 
code by now.

Of course I understand that performance issues may not be visible when 
you have 100 or 100 thousand items but may be horrific when you have 100 
million. I get that. But you're aware of the possibility, so you can 
write a performance test that generates 100 million objects and tests 
performance. *If* you find an actual problem, then you can look into 
changing your __hash__ method. You could come back here and talk about 
actual performance issues instead of hypothetical issues, or you could 
hire an expert to tune your hash function. (And if you do pay for an 
expert to solve this, please consider giving back to the community.)

Remember that the specific details of __hash__ should be an internal 
implementation detail for your class. You shouldn't fear changing the 
hash algorithm as often as you need to, including in bug fix releases. 
You don't have to wait for the perfect hash function, just a "good 
enough for now" one to get started.

I'm not trying to be dismissive of your concerns. They may be real 
issues that you have to solve. I'm just saying, you should check your 
facts first rather than solve hypothetic problems. I have seen, and even 
written, far too much pessimized code in the name of "this must be 
better, it stands to reason" to give much credence to theoretical 
arguments about performance.



-- 
Steve
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread jab
On Fri, Dec 30, 2016 at 10:30 PM, Nathaniel Smith  wrote:

> ...

"Most hash schemes depend on having a "good" hash function, in the sense of
> simulating randomness. Python doesn't ..."

https://github.com/python/cpython/blob/d0a2f68a/Objects/dictobject.c#L133

...


Thanks for that link, fascinating to read the rest of that comment!!

Btw, the part you quoted seemed like more a defense for what followed, i.e.
the choice to make hash(some_int) == some_int. I'm not sure how much the
part you quoted applies generally. e.g. The
https://docs.python.org/3/reference/datamodel.html#object.__hash__
docs don't say, "Don't worry about
your __hash__ implementation, dict's collision resolution strategy is so
good it probably doesn't matter anyway."

But it also doesn't have any discussion of any of the tradeoffs you
mentioned that might be worth considering. What should you do when there
are arbitrarily many "components of the object that play a part in
comparison of objects"? The "hash((self._name, self._nick, self._color))"
code snippet is the only example it gives. This leaves people who have use
cases like mine wondering whether it's still advised to scale this up to
the arbitrarily many items that instances of their class can contain.

If not, then what is advised? Passing a tuple of fewer items to a single
hash() call, e.g. hash(tuple(islice(self, CUTOFF)))? Ned's recipe of
pairwise-accumulating hash() results over all the items? Or only
pairwise-accumulating up to a certain cutoff? Stephen J. Turnbull's idea to
use fewer accumulation steps and larger-than-2-tuples? Passing all the
items into some other cheap, built-in hash algorithm that actually has an
incremental update API (crc32?)?
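
(For concreteness, the crc32 option would use zlib.crc32's running-value form,
something like the sketch below; the repr()-based encoding of items is just
this sketch's choice, and the result is only 32 bits:)

    import zlib

    def crc32_of_items(items):
        crc = 0
        for item in items:
            crc = zlib.crc32(repr(item).encode('utf-8'), crc)  # incremental update
        return crc

    print(crc32_of_items(['spam', 42, 3.14]))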

Still hoping someone can give some authoritative advice, and hope it's
still reasonable to be asking. If not, I'll cut it out.


On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico  wrote:

> How likely is it that you'll have this form of collision, rather than some
> other? Remember, collisions *will* happen, so don't try to eliminate them
> all; just try to minimize the chances of *too many* collisions. So if
> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
> and (3,1,2), then sure, you need to care about order; but otherwise, one
> possible cause of a collision is no worse than any other. Keep your
> algorithm simple, and don't sweat the details that you aren't sure matter.


In the context of writing a collections library, and not an application, it
should work well for a diversity of workloads that your users could throw
at it. In that context, it's hard to know what to do with advice like this.
"Heck, just hash the first three items and call it a day!"
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Chris Angelico
On Sat, Dec 31, 2016 at 2:24 PM,   wrote:
> See the "Simply XORing such results together would not be order-sensitive,
> and so wouldn't work" from my original post. (Like XOR, multiplication is
> also commutative.)
>
> e.g. Since FrozenOrderedCollection([1, 2]) != FrozenOrderedCollection([2,
> 1]), we should try to avoid making their hashes equal, or else we increase
> collisions unnecessarily.

How likely is it that you'll have this form of collision, rather than
some other? Remember, collisions *will* happen, so don't try to
eliminate them all; just try to minimize the chances of *too many*
collisions. So if you're going to be frequently dealing with (1,2,3)
and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about
order; but otherwise, one possible cause of a collision is no worse
than any other. Keep your algorithm simple, and don't sweat the
details that you aren't sure matter.

ChrisA
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Nathaniel Smith
On Fri, Dec 30, 2016 at 9:29 AM,   wrote:
> Updating the docs sounds like the more important change for now, given 3.7+.
> But before the docs make an official recommendation for that recipe, were
> the analyses that Steve and I did sufficient to confirm that its hash
> distribution and performance is good enough at scale, or is more rigorous
> analysis necessary?
>
> I've been trying to find a reasonably detailed and up-to-date reference on
> Python hash() result requirements and analysis methodology, with
> instructions on how to confirm if they're met, but am still looking. Would
> find that an interesting read if it's out there. But I'd take just an
> authoritative thumbs up here too. Just haven't heard one yet.

I'm not an expert on hash table implementation subtleties, but I can
quote Objects/dictobject.c: "Most hash schemes depend on having a
"good" hash function, in the sense of simulating randomness. Python
doesn't ..."

https://github.com/python/cpython/blob/d0a2f68a5f972e4474f5ecce182752f6f7a22178/Objects/dictobject.c#L133

(Basically IIUC the idea is that Python dicts use a relatively
sophisticated probe sequence to resolve collisions which allows the
use of relatively "poor" hashes. There are interesting trade-offs
here...)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread jab
On Fri, Dec 30, 2016 at 10:08 PM, Ethan Furman  wrote:

> So maybe this will work?
>
> def __hash__(self):
>     return hash(self.name) * hash(self.nick) * hash(self.color)
>
> In other words, don't create a new tuple, just use the ones you already
> have and toss in a couple maths operations.  (and have somebody vet that ;)



See the "Simply XORing such results together would not be order-sensitive,
and so wouldn't work" from my original post. (Like XOR, multiplication is
also commutative.)

e.g. Since FrozenOrderedCollection([1, 2]) != FrozenOrderedCollection([2,
1]), we should try to avoid making their hashes equal, or else we increase
collisions unnecessarily.

More generally, I think the trick is to get an even, chaotic distribution
into sys.hash_info.hash_bits out of this hash algorithm, and I think simply
multiplying hash values together like this wouldn't do that.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Ethan Furman

On 12/30/2016 06:47 PM, j...@math.brown.edu wrote:


__eq__ only has to be called when a hash bucket is non-empty. In that case,
 it may be O(n) in pathological cases, but it could also be O(1) every time.
 On the other hand, __hash__ has to be called on every lookup, is O(n) on
 the first call, and ideally O(1) after. I'd expect that __eq__ may often
 not dominate, and avoiding an unnecessary large tuple allocation on the
 first __hash__ call could be helpful. If a __hash__ implementation can
 avoid creating an arbitrarily large tuple unnecessarily and still perform
 well, why not save the space? If the resulting hash distribution is worse,
 that's another story, but it seems worth documenting one way or the other,
 since the current docs give memory-conscious Python programmers pause for
 this use case.


A quick list of what we have agreed on:

- __hash__ is called once for every dict/set lookup
- __hash__ is calculated once because we are caching the value
- __eq__ is being called an unknown number of times, but can be quite
  expensive in terms of time
- items with the same hash are probably cheap in terms of __eq__ (?)

So maybe this will work?

def __hash__(self):
    return hash(self.name) * hash(self.nick) * hash(self.color)

In other words, don't create a new tuple, just use the ones you already have 
and toss in a couple maths operations.  (and have somebody vet that ;)

--
~Ethan~

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread jab
On Fri, Dec 30, 2016 at 9:21 PM, Ethan Furman  wrote:

> I don't think so.  As someone else said, a hash can be calculated once and
> then cached, but __eq__ has to be called every time.  Depending on the
> clustering of your data that could be quick... or not.
>

__eq__ only has to be called when a hash bucket is non-empty. In that case,
it may be O(n) in pathological cases, but it could also be O(1) every time.
On the other hand, __hash__ has to be called on every lookup, is O(n) on
the first call, and ideally O(1) after. I'd expect that __eq__ may often
not dominate, and avoiding an unnecessary large tuple allocation on the
first __hash__ call could be helpful. If a __hash__ implementation can
avoid creating an arbitrarily large tuple unnecessarily and still perform
well, why not save the space? If the resulting hash distribution is worse,
that's another story, but it seems worth documenting one way or the other,
since the current docs give memory-conscious Python programmers pause for
this use case.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Ethan Furman

On 12/30/2016 06:12 PM, j...@math.brown.edu wrote:


... your point stands that Python had to call __eq__ in these cases.

But with instances of large, immutable, ordered collections, an
 application could either:

1. accept that it might create a duplicate, equivalent instance of
 an existing instance with sufficient infrequency that it's okay
 taking the performance hit, or

2. try to avoid creating duplicate instances in the first place,
 using the existing, equivalent instance it created as a singleton.
 Then a set or dict lookup could use the identity check, and avoid
 the expensive __eq__ call.



I think it's much more important to focus on what happens with
 unequal instances here, since there are usually many more of them.
 And with them, the performance hit of the __eq__ calls definitely
 does not necessarily dominate that of __hash__.  Right?


I don't think so.  As someone else said, a hash can be calculated once and then 
cached, but __eq__ has to be called every time.  Depending on the clustering of 
your data that could be quick... or not.

--
~Ethan~
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread jab
On Fri, Dec 30, 2016 at 8:10 PM,  wrote:

> On Fri, Dec 30, 2016 at 8:04 PM, Ethan Furman  wrote:
>
>> No.  It is possible to have two keys be equal but different -- an easy
>> example is 1 and 1.0; they both hash the same, equal the same, but are not
>> identical.  dict has to check equality when two different objects hash the
>> same but have non-matching identities.
>>
> ...
> I'm having trouble showing that two equal but nonidentical objects can
> both be in the same dict.
>

(Because they can't, though) your point stands that Python had to call
__eq__ in these cases.

But with instances of large, immutable, ordered collections, an application
could either:

1. accept that it might create a duplicate, equivalent instance of an
existing instance with sufficient infrequency that it's okay taking the
performance hit, or

2. try to avoid creating duplicate instances in the first place, using the
existing, equivalent instance it created as a singleton. Then a set or dict
lookup could use the identity check, and avoid the expensive __eq__ call.
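
A bare-bones sketch of option 2, an interning cache keyed by contents
(illustrative only; note it keeps every canonical instance alive):

    _interned = {}

    def intern_collection(cls, items):
        # Return one canonical `cls` instance per distinct contents.
        key = (cls, tuple(items))
        try:
            return _interned[key]
        except KeyError:
            instance = _interned[key] = cls(key[1])
            return instance

    a = intern_collection(tuple, [1, 2, 3])
    b = intern_collection(tuple, [1, 2, 3])
    assert a is b  # set/dict lookups can now short-circuit on identity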


I think it's much more important to focus on what happens with unequal
instances here, since there are usually many more of them. And with them,
the performance hit of the __eq__ calls definitely does not necessarily
dominate that of __hash__. Right?
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread jab
On Fri, Dec 30, 2016 at 8:04 PM, Ethan Furman  wrote:

> No.  It is possible to have two keys be equal but different -- an easy
> example is 1 and 1.0; they both hash the same, equal the same, but are not
> identical.  dict has to check equality when two different objects hash the
> same but have non-matching identities.
>


Python 3.6.0 (default, Dec 24 2016, 00:01:50)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> d = {1: 'int', 1.0: 'float'}
>>> d
{1: 'float'}


IPython 5.1.0 -- An enhanced Interactive Python.

In [1]: class Foo:
   ...:     def __eq__(self, other):
   ...:         return True
   ...:     def __init__(self, val):
   ...:         self.val = val
   ...:     def __repr__(self):
   ...:         return '<Foo %r>' % self.val
   ...:     def __hash__(self):
   ...:         return 42
   ...:

In [2]: f1 = Foo(1)

In [3]: f2 = Foo(2)

In [4]: x = {f1: 1, f2: 2}

In [5]: x
Out[5]: {<Foo 1>: 2}



I'm having trouble showing that two equal but nonidentical objects can both
be in the same dict.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Ethan Furman

On 12/30/2016 04:31 PM, j...@math.brown.edu wrote:

On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman wrote:



If you are relying on an identity check for equality then no two
 FrozenOrderedCollection instances can be equal.  Was that your
 intention?  If it was, then just hash the instance's id() and
 you're done.


No, I was talking about the identity check done by a set or dict
 when doing a lookup to check if the object in a hash bucket is
 identical to the object being looked up. In that case, there is
 no need for the set or dict to even call __eq__.  Right?


No.  It is possible to have two keys be equal but different -- an easy example 
is 1 and 1.0; they both hash the same, equal the same, but are not identical.  
dict has to check equality when two different objects hash the same but have 
non-matching identities.

--
~Ethan~
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread jab
On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman  wrote:

> On 12/30/2016 03:36 PM, j...@math.brown.edu wrote:
>
> In the use cases I described, the objects' members are ordered. So in the
>> unlikely event that two objects hash to the same value but are unequal, the
>> __eq__ call should be cheap, because they probably differ in length or on
>> their first member, and can skip further comparison. After a successful
>> hash lookup of an object that's already in a set or dict, a successful
>> identity check avoids an expensive equality check. Perhaps I misunderstood?
>>
>
> If you are relying on an identity check for equality then no two
> FrozenOrderedCollection instances can be equal.  Was that your intention?
> If it was, then just hash the instance's id() and you're done.
>


No, I was talking about the identity check done by a set or dict when doing
a lookup to check if the object in a hash bucket is identical to the object
being looked up. In that case, there is no need for the set or dict to even
call __eq__. Right?

The FrozenOrderedCollection.__eq__ implementation I sketched out was as
intended -- no identity check there.


If that wasn't your intention then, while there can certainly be a few
> quick checks to rule out equality (such as length) if those match, the
> expensive full equality check will have to happen.


I think we're misunderstanding each other, but at the risk of saying the
same thing again: Because it's an ordered collection, the equality check
for two unequal instances with the same hash value will very likely be able
to bail after comparing lengths or the first items. With a good hash
function, the odds of two unequal instances both hashing to the same value
and having their first N items equal should be vanishingly small, no?
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Ethan Furman

On 12/30/2016 03:36 PM, j...@math.brown.edu wrote:


In the use cases I described, the objects' members are ordered. So in the 
unlikely event that two objects hash to the same value but are unequal, the 
__eq__ call should be cheap, because they probably differ in length or on their 
first member, and can skip further comparison. After a successful hash lookup 
of an object that's already in a set or dict, a successful identity check 
avoids an expensive equality check. Perhaps I misunderstood?


If you are relying on an identity check for equality then no two 
FrozenOrderedCollection instances can be equal.  Was that your intention?  If
it was, then just hash the instance's id() and you're done.

If that wasn't your intention then, while there can certainly be a few quick 
checks to rule out equality (such as length) if those match, the expensive full 
equality check will have to happen.

--
~Ethan~
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread jab
On Fri, Dec 30, 2016 at 3:54 PM, Christian Heimes 
wrote:

> Hi,
>
> I'm the author of PEP 456 (SipHash24) and one of the maintainers of the
> hashlib module.
>
> Before we come up with a new API or recipe, I would like to understand
> the problem first. Why does the initial op consider hash(large_tuple) a
> performance issue? If you have an object with lots of members that
> affect both __hash__ and __eq__, then __hash__ is really least of your
> concern. The hash has to be computed just once and then will stay the
> same over the life time of the object. Once computed the hash can be
> cached.
>
> On the other hand __eq__ is at least called once for every successful
> hash lookup. In the worst case it is called n-1 times for a dict of size n,
> for a match *and* a hashmap miss. Every __eq__ call has to compare between 1
> and m member attributes. For a dict with 1,000 elements with 1,000
> members each, that's just 1,000 hash computations with roughly 8 kB
> memory allocation but almost a million comparisons in the worst case.
>
> A hasher object adds further overhead, e.g. object allocation, creation
> of a bound method for each call, etc. It's also less CPU cache friendly
> than the linear data structure of a tuple.



Thanks for joining the discussion, great to have your input.

In the use cases I described, the objects' members are ordered. So in the
unlikely event that two objects hash to the same value but are unequal, the
__eq__ call should be cheap, because they probably differ in length or on
their first member, and can skip further comparison. After a successful
hash lookup of an object that's already in a set or dict, a successful
identity check avoids an expensive equality check. Perhaps I misunderstood?

Here is some example code:

from functools import reduce

class FrozenOrderedCollection:
    ...
    def __eq__(self, other):
        if not isinstance(other, FrozenOrderedCollection):
            return NotImplemented
        if len(self) != len(other):
            return False
        return all(i == j for (i, j) in zip(self, other))

    def __hash__(self):
        if hasattr(self, '_hashval'):
            return self._hashval
        hash_initial = hash(self.__class__)
        self._hashval = reduce(lambda h, i: hash((h, i)), self, hash_initial)
        return self._hashval



Is it the case that internally, the code is mostly there to compute the
hash of such an object in incremental fashion, and it's just a matter of
figuring out the right API to expose it?

If creating an arbitrarily large tuple, passing that to hash(), and then
throwing it away really is the best practice, then perhaps that should be
explicitly documented, since I think many would find that counterintuitive?

@Stephen J. Turnbull, thank you for your input -- still digesting, but very
interesting.

Thanks again to everyone for the helpful discussion.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Christian Heimes
On 2016-12-30 20:59, Matthias Bussonnier wrote:
> On Fri, Dec 30, 2016 at 5:24 PM, Nick Coghlan  wrote:
>>
>> I understood the "iterhash" suggestion to be akin to itertools.accumulate:
>>
>> >>> for value, tally in enumerate(accumulate(range(10))): print(value, 
>> ...
> 
> It reminds me of hmac[1]/hashlib[2], with the API :  h.update(...)
> before a .digest().
> It is slightly lower level than a `from_iterable`, but would be a bit
> more flexible.
> If the API were kept similar things would be easier to remember.

Hi,

I'm the author of PEP 456 (SipHash24) and one of the maintainers of the
hashlib module.

Before we come up with a new API or recipe, I would like to understand
the problem first. Why does the initial op consider hash(large_tuple) a
performance issue? If you have an object with lots of members that
affect both __hash__ and __eq__, then __hash__ is really least of your
concern. The hash has to be computed just once and then will stay the
same over the life time of the object. Once computed the hash can be
cached.

On the other hand __eq__ is at least called once for every successful
hash lookup. In the worst case it is called n-1 times for a dict of size n,
for a match *and* a hashmap miss. Every __eq__ call has to compare between 1
and m member attributes. For a dict with 1,000 elements with 1,000
members each, that's just 1,000 hash computations with roughly 8 kB
memory allocation but almost a million comparisons in the worst case.

A hasher object adds further overhead, e.g. object allocation, creation
of a bound method for each call, etc. It's also less CPU cache friendly
than the linear data structure of a tuple.

Christian

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Matthias Bussonnier
On Fri, Dec 30, 2016 at 5:24 PM, Nick Coghlan  wrote:
>
> I understood the "iterhash" suggestion to be akin to itertools.accumulate:
>
> >>> for value, tally in enumerate(accumulate(range(10))): print(value, ...

It reminds me of hmac[1]/hashlib[2], with the API :  h.update(...)
before a .digest().
It is slightly lower level than a `from_iterable`, but would be a bit
more flexible.
If the API were kept similar things would be easier to remember.
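
For comparison, that style looks like this with hashlib (a sketch; the digest
size and the repr()-based encoding of items are this example's own choices):

    import hashlib

    h = hashlib.blake2b(digest_size=8)        # 64-bit digest
    for item in ('spam', 42, 3.14):
        h.update(repr(item).encode('utf-8'))  # incremental update
    print(int.from_bytes(h.digest(), 'big'))
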
-- 
M

[1]: https://docs.python.org/3/library/hmac.html#hmac.HMAC.update
[2]: https://docs.python.org/3/library/hashlib-blake2.html#module-hashlib
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Stephen J. Turnbull
j...@math.brown.edu writes:

 > But as you showed, it's certainly possible to do some exploration in the
 > meantime. Prompted by your helpful comparison, I just put together
 > https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further
 > compare hash_tuple vs. hash_incremental.

It's been a while :-) since I read Knuth[1] (and that was when Knuth
was authoritative on this subject 8^O), but neither methodology is
particularly helpful here.  The ideal is effectively a uniform
distribution on a circle, which has no mean.  Therefore standard
deviation is also uninteresting, since its definition uses the mean.
The right test is something like a χ² test against a uniform
distribution.

The scatter plots (of hash against test data) simply show the same
thing, without the precision of a statistical test.  (Note: do not
deprecate the computer visualization + human eye + human brain system.
It is the best known machine for detecting significant patterns and
anomolies, though YMMV.)  But it's not very good at detecting high-
dimensional patterns.

However, it's nowhere near good enough for a hash function to have a
uniform distribution on random data.  It actually needs to be
"chaotic" in the sense that it also spreads "nearby" data out, where
"nearby" here would probably mean that as 4000-bit strings less than
m% (for small m) differ, as in real data you usually expect a certain
amount of structure (continuity in time, clustering around a mean,
line noise in a communication channel) so that you'd be likely to get
lots of "clusters" of very similar data.  You don't want them pounding
on a small subset of buckets.

 > I'm not sure if the methodology there is sound, as I'm new to
 > analysis like this.

Even decades later, starting with Knuth[1] can't hurt. :-)

 > Given sufficiently good distribution, I'd expect there to be unanimous
 > agreement that an immutable collection, which could contain arbitrarily
 > many items, should strongly prefer hash_incremental(self) over
 > hash(tuple(self)), for the same reason we use generator comprehensions
 > instead of list comprehensions when appropriate. Please correct me if I'm
 > wrong.

I think that's a misleading analogy.  Just stick to

For very large collections where we already have the data,
duplicating the data is very expensive.  Furthermore, since the
purpose of hashing is making equality comparisons cheap, this is
likely to happen in a loop.

On the other hand, if there are going to be a *lot* of "large"
collections being stored, then they're actually not all that large
compared to the system, and you might not actually care that much
about the cost of the emphemeral tuples, because the real cost is in
an n^2 set of comparisons.  From the point of view of NumPy, this is
an "are you kidding?" argument because large datasets are its stock in
trade, but for core Python, this may be sufficiently esoteric that it
should be delegated to 

On balance, the arguments that Steven d'Aprano advanced for having a
statistics module in the stdlib vs. importing pandas apply here.  In
particular, I think there are a huge number of options for an
iterative hash.  For example, Ned chained 2-tuples.  But perhaps for
efficient time in bounded space you want to use bounded but larger
tuples.  I don't know -- and that's the point.  If we have a TOOWTDI
API like hash.from_iterable then smart people (and people with time on
their hands to do exhaustive experiments!) can tune that over time, as
has been done with the sort API.

Another option is yielding partials, as Steven says:

 > > itertools.iterhash(iterable)  # yield incremental hashes

That's a very interesting idea, though I suspect it rarely would make
a big performance improvement.

I'm not sure I like the "hash.from_iterable" name for this API.  The
problem is that if it's not a concrete collection[3], then you throw
away the data.  If the hash algorithm happens to suck for certain
data, you'll get a lot of collisions, and conclude that your data is
much more concentrated than it actually is.  I find it hard to imagine
a use case where you have large data where you only care about whether
two data points are different (cf. equality comparisons for floats).
You want to know how they're different.  So I think I would prefer to
name it "hash.from_collection" or similar.  Of course whether the
implementation actually raises on a generator or other non-concrete
iterable is a fine point.

Footnotes: 
[1]  Of course I mean The Art of Computer Programming, Ch. 3.

[2]  Including the newly ordered dicts, maybe?  Somebody tweet
@dabeaz!  What evil can he do with this?

[3]  Yeah, I know, "re-iterables".  But we don't have a definition,
let alone an API for identifying, those Things.


___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread jab
Updating the docs sounds like the more important change for now, given
3.7+. But before the docs make an official recommendation for that recipe,
were the analyses that Steve and I did sufficient to confirm that its hash
distribution and performance is good enough at scale, or is more rigorous
analysis necessary?

I've been trying to find a reasonably detailed and up-to-date reference on
Python hash() result requirements and analysis methodology, with
instructions on how to confirm if they're met, but am still looking. Would
find that an interesting read if it's out there. But I'd take just an
authoritative thumbs up here too. Just haven't heard one yet.

And regarding any built-in support that might get added, I just want to
make sure Ryan Gonzalez's proposal (the first reply on this thread) didn't
get buried:

hasher = IncrementalHasher()
hasher.add(one_item_to_hash)  # updates hasher.hash property with result
# repeat
return hasher.hash

I think this is the only proposal so far that actually adds an explicit API
for performing an incremental update. (i.e. The other
"hash_stream(iterable)" -style proposals are all-or-nothing.) This would
bring Python's built-in hash() algorithm's support up to parity with the
other algorithms in the standard library (hashlib, crc32). Maybe that's
valuable?
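
In the meantime, something with that shape can be approximated in user code by
wrapping Ned's pairwise recipe (a sketch only; its distribution properties are
exactly the open question above):

    class IncrementalHasher:
        def __init__(self):
            self.hash = hash(())              # starting value

        def add(self, one_item_to_hash):
            # fold the next item into the running hash (Ned's pairwise recipe)
            self.hash = hash((self.hash, one_item_to_hash))

    hasher = IncrementalHasher()
    for item in ('spam', 42, 3.14):
        hasher.add(item)
    print(hasher.hash)
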
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-30 Thread Nick Coghlan
On 29 December 2016 at 18:35, Chris Angelico  wrote:

> On Thu, Dec 29, 2016 at 7:20 PM, Steven D'Aprano 
> wrote:
> > I'd rather add a generator to the itertools
> > module:
> >
> > itertools.iterhash(iterable)  # yield incremental hashes
> >
> > or, copying the API of itertools.chain, add a method to hash:
> >
> > hash.from_iterable(iterable)  # return hash calculated incrementally
>
> The itertools module is mainly designed to be consumed lazily. The
> hash has to be calculated eagerly, so it's not really a good fit for
> itertools.


I understood the "iterhash" suggestion to be akin to itertools.accumulate:

>>> for value, tally in enumerate(accumulate(range(10))): print(value, tally)
...
0 0
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45

However, I think including Ned's recipe (or something like it) in
https://docs.python.org/3/reference/datamodel.html#object.__hash__ as a
tool for avoiding large temporary tuple allocations may be a better way to
start off as:

1. It's applicable to all currently released versions of Python, not just
to 3.7+
2. It provides more scope for people to experiment with their own variants
of the idea before committing to a *particular* version somewhere in the
standard library

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-29 Thread Tim Delaney
On 29 December 2016 at 19:20, Steven D'Aprano  wrote:

>
> With respect Josh, I feel that this thread is based on premature
> optimization. It seems to me that you're *assuming* that anything less
> than some theoretically ideal O(1) space O(N) time hash function is
> clearly and obviously unsuitable.
>
> Of course I might be completely wrong. Perhaps you have implemented your
> own __hash__ methods as suggested by the docs, as well as Ned's version,
> and profiled your code and discovered that __hash__ is a significant
> bottleneck. In which case, I'll apologise for doubting you, but in my
> defence I'll say that the language you have used in this thread so far
> gives no hint that you've actually profiled your code and found the
> bottleneck.
>

In Josh's defence, the initial use case he put forward is exactly the kind
of scenario where it's worthwhile optimising ahead of time. Quite often a
poorly implemented hash function doesn't manifest as a problem until you
scale up massively - and a developer may not have the capability to scale
up to a suitable level in-house, resulting in performance issues at
customer sites.

I had one particular case (fortunately discovered before going to
customers) where a field was included in the equality check, but wasn't
part of the hash. Unfortunately, the lack of this one field resulted in
objects only being allocated to a few buckets (in a Java HashMap),
resulting in every access having to walk a potentially very long chain
doing equality comparisons - O(N) behaviour from an amortised O(1) data
structure.
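
(A toy Python reproduction of that pathology, purely for illustration; the
hash omits a field that __eq__ uses, so every instance with the same name
lands in one bucket:)

    class Record:
        def __init__(self, name, version):
            self.name = name
            self.version = version

        def __hash__(self):
            return hash(self.name)  # `version` forgotten here...

        def __eq__(self, other):
            return (isinstance(other, Record)
                    and self.name == other.name
                    and self.version == other.version)  # ...but compared here

    # Deliberately slow: building this set degrades to roughly n**2/2 __eq__
    # calls, since every Record('config', v) hashes identically.
    records = {Record('config', v) for v in range(2000)}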

Unit tests - no worries. Small-scale tests - everything looked fine. Once
we started our load tests though everything slowed to a crawl. 100% CPU,
throughput at a standstill ... it didn't look good.

Adding that one field to the hash resulted in the ability to scale up to
hundreds of thousands of objects with minimal CPU. I can't remember if it
was millions we tested to (it was around 10 years ago ...).

Tim Delaney
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-29 Thread jab
Thanks for the thoughtful discussion, it's been very interesting.

Hash algorithms seem particularly sensitive and tricky to get right, with a
great deal of research going into choices of constants, etc. and lots of
gotchas. So it seemed worth asking about. If I had to bet on whether
repeatedly accumulating pairwise hash() results would maintain the same
desired properties that hash(tuple(items)) guarantees, I'd want to get
confirmation from someone with expertise in this first, hence my starting
this thread.

But as you showed, it's certainly possible to do some exploration in the
meantime. Prompted by your helpful comparison, I just put together
https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further
compare hash_tuple vs. hash_incremental.

Based on that, hash_incremental seems to have a comparable distribution to
hash_tuple. I'm not sure if the methodology there is sound, as I'm new to
analysis like this. So I'd still welcome confirmation from someone with
actual expertise in Python's internal hash algorithms. But so far this sure
seems good enough for the use cases I described.

Given sufficiently good distribution, I'd expect there to be unanimous
agreement that an immutable collection, which could contain arbitrarily
many items, should strongly prefer hash_incremental(self) over
hash(tuple(self)), for the same reason we use generator comprehensions
instead of list comprehensions when appropriate. Please correct me if I'm
wrong.

+1 for the "hash.from_iterable" API you proposed, if some additional
support for this is added to Python. Otherwise +1 for including Ned's
recipe in the docs. Again, happy to submit a patch for either of these if
it would be helpful.

And to be clear, I really appreciate the time that contributors have put
into this thread, and into Python in general. Thoughtful responses are
always appreciated, and never expected. I'm just interested in learning and
in helping improve Python when I might have an opportunity. My Python open
source work has been done on a voluntary basis too, and I haven't even
gotten to use Python for paid/closed source work in several years, alas.

Thanks again,
Josh

On Thu, Dec 29, 2016 at 3:20 AM, Steven D'Aprano 
wrote:

> On Wed, Dec 28, 2016 at 12:44:55PM -0500, j...@math.brown.edu wrote:
> > On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman 
> wrote:
> >
> > > In other words, objects that do not compare equal can also have the
> same
> > > hash value (although too much of that will reduce the efficiency of
> > > Python's containers).
> > >
> >
> > Yes, I realize that unequal objects can return the same hash value with
> > only performance, and not correctness, suffering. It's the performance
> I'm
> > concerned about. That's what I meant by "...to keep from unnecessarily
> > causing hash collisions..." in my original message, but sorry this wasn't
> > clearer. We should be able to do this in a way that doesn't increase hash
> > collisions unnecessarily.
>
> With respect Josh, I feel that this thread is based on premature
> optimization. It seems to me that you're *assuming* that anything less
> than some theoretically ideal O(1) space O(N) time hash function is
> clearly and obviously unsuitable.
>
> Of course I might be completely wrong. Perhaps you have implemented your
> own __hash__ methods as suggested by the docs, as well as Ned's version,
> and profiled your code and discovered that __hash__ is a significant
> bottleneck. In which case, I'll apologise for doubting you, but in my
> defence I'll say that the language you have used in this thread so far
> gives no hint that you've actually profiled your code and found the
> bottleneck.
>
> As I see it, this thread includes a few questions:
>
> (1) What is a good way to generate a hash one item at a time?
>
> I think Ned's answer is "good enough", subject to evidence to the
> contrary. If somebody cares to spend the time to analyse it, that's
> excellent, but we're volunteers and its the holiday period and most
> people have probably got better things to do. But we shouldn't let the
> perfect be the enemy of the good.
>
> But for what it's worth, I've done an *extremely* quick and dirty test
> to see whether the incremental hash function gives a good spread of
> values, by comparing it to the standard hash() function.
>
>
> py> import statistics
> py> def incrhash(iterable):
> ...     h = hash(())
> ...     for x in iterable:
> ...         h = hash((h, x))
> ...     return h
> ...
> py>
> py> data1 = []
> py> data2 = []
> py> for i in range(1000):
> ...     it = range(i, i+100)
> ...     data1.append(hash(tuple(it)))
> ...     data2.append(incrhash(it))
> ...
> py> # Are there any collisions?
> ... len(set(data1)), len(set(data2))
> (1000, 1000)
> py> # compare spread of values
> ... statistics.stdev(data1), statistics.stdev(data2)
> (1231914201.0980585, 1227850884.443638)
> py> max(data1)-min(data1), max(data2)-min(data2)
> (4287398438, 4287569008)
>
>
> Neither the built-in hash

Re: [Python-ideas] incremental hashing in __hash__

2016-12-29 Thread Chris Angelico
On Thu, Dec 29, 2016 at 7:20 PM, Steven D'Aprano  wrote:
> I'd rather add a generator to the itertools
> module:
>
> itertools.iterhash(iterable)  # yield incremental hashes
>
> or, copying the API of itertools.chain, add a method to hash:
>
> hash.from_iterable(iterable)  # return hash calculated incrementally

The itertools module is mainly designed to be consumed lazily. The
hash has to be calculated eagerly, so it's not really a good fit for
itertools. The only real advantage of this "hash from iterable" over
hash(tuple(it)) is avoiding the intermediate tuple, so I'd want to see
evidence that that's actually significant.

ChrisA
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-29 Thread Steven D'Aprano
On Wed, Dec 28, 2016 at 12:44:55PM -0500, j...@math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman  wrote:
> 
> > In other words, objects that do not compare equal can also have the same
> > hash value (although too much of that will reduce the efficiency of
> > Python's containers).
> >
> 
> Yes, I realize that unequal objects can return the same hash value with
> only performance, and not correctness, suffering. It's the performance I'm
> concerned about. That's what I meant by "...to keep from unnecessarily
> causing hash collisions..." in my original message, but sorry this wasn't
> clearer. We should be able to do this in a way that doesn't increase hash
> collisions unnecessarily.

With respect Josh, I feel that this thread is based on premature 
optimization. It seems to me that you're *assuming* that anything less 
than some theoretically ideal O(1) space O(N) time hash function is 
clearly and obviously unsuitable.

Of course I might be completely wrong. Perhaps you have implemented your 
own __hash__ methods as suggested by the docs, as well as Ned's version, 
and profiled your code and discovered that __hash__ is a significant 
bottleneck. In which case, I'll apologise for doubting you, but in my 
defence I'll say that the language you have used in this thread so far 
gives no hint that you've actually profiled your code and found the 
bottleneck.

As I see it, this thread includes a few questions:

(1) What is a good way to generate a hash one item at a time?

I think Ned's answer is "good enough", subject to evidence to the 
contrary. If somebody cares to spend the time to analyse it, that's 
excellent, but we're volunteers and its the holiday period and most 
people have probably got better things to do. But we shouldn't let the 
perfect be the enemy of the good.

But for what it's worth, I've done an *extremely* quick and dirty test 
to see whether the incremental hash function gives a good spread of 
values, by comparing it to the standard hash() function.


py> import statistics
py> def incrhash(iterable):
...     h = hash(())
...     for x in iterable:
...         h = hash((h, x))
...     return h
...
py>
py> data1 = []
py> data2 = []
py> for i in range(1000):
...     it = range(i, i+100)
...     data1.append(hash(tuple(it)))
...     data2.append(incrhash(it))
...
py> # Are there any collisions?
... len(set(data1)), len(set(data2))
(1000, 1000)
py> # compare spread of values
... statistics.stdev(data1), statistics.stdev(data2)
(1231914201.0980585, 1227850884.443638)
py> max(data1)-min(data1), max(data2)-min(data2)
(4287398438, 4287569008)


Neither the built-in hash() nor the incremental hash gives any 
collisions over this (admittedly small) data set, and both have very 
similar spreads of values as measured by either the standard deviation 
or the statistical range. The means are quite different:

py> statistics.mean(data1), statistics.mean(data2)
(-8577110.944, 2854438.568)

but I don't think that matters. So that's good enough for me.


(2) Should Ned's incremental hash, or some alternative with better 
properties, go into the standard library? 

I'm not convinced that your examples are common enough that the stdlib 
should be burdened with supporting it. On the other hand, I don't think 
it is an especially *large* burden, so perhaps it could be justified. 
Count me as sitting on the fence on this one.

Perhaps a reasonable compromise would be to include it as a recipe in 
the docs.


(3) If it does go in the stdlib, where should it go?

I'm suspicious of functions that change their behaviour depending on how 
they are called, so I'm not keen on your suggestion of adding a second 
API to the hash built-in:

hash(obj)  # return hash of obj

hash(iterable=obj)  # return incrementally calculated hash of obj

That feels wrong to me. I'd rather add a generator to the itertools 
module:

itertools.iterhash(iterable)  # yield incremental hashes

or, copying the API of itertools.chain, add a method to hash:

hash.from_iterable(iterable)  # return hash calculated incrementally



-- 
Steve
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-28 Thread Ned Batchelder
On 12/28/16 12:27 PM, j...@math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder
> <n...@nedbatchelder.com> wrote:
>
> You can write a simple function to use hash iteratively to hash
> the entire stream in constant space and linear time:
>
> def hash_stream(them):
>     val = 0
>     for it in them:
>         val = hash((val, it))
>     return val
>
> Although this creates N 2-tuples, they come and go, so the memory
> use won't grow.  Adjust the code as needed to achieve
> canonicalization before iterating.
>
> Or maybe I am misunderstanding the requirements?
>
>
> This is better than solutions like
> http://stackoverflow.com/a/27952689/161642
>  in the sense that it's a
> little higher level (no bit shifting or magic numbers).
>
> But it's not clear that it's any better in the sense that you're still
> rolling your own incremental hash algorithm out of a lower-level
> primitive that doesn't document support for this, and therefore taking
> responsibility yourself for how well it distributes values into buckets.
>
> Are you confident this results in good hash performance? Is this
> better than a solution built on top of a hash function with an
> explicit API for calculating a hash incrementally, such as the crc32
> example I included? (And again, this would ideally be
> a sys.hash_info.hash_bits -bit algorithm.)

I don't have the theoretical background to defend this function.  But it
seems to me that if we believe that hash((int, thing)) distributes well,
then how could this function not distribute well?

--Ned.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-28 Thread jab
On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman  wrote:

> In other words, objects that do not compare equal can also have the same
> hash value (although too much of that will reduce the efficiency of
> Python's containers).
>

Yes, I realize that unequal objects can return the same hash value with
only performance, and not correctness, suffering. It's the performance I'm
concerned about. That's what I meant by "...to keep from unnecessarily
causing hash collisions..." in my original message, but sorry this wasn't
clearer. We should be able to do this in a way that doesn't increase hash
collisions unnecessarily.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-28 Thread jab
On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder 
wrote:

> You can write a simple function to use hash iteratively to hash the entire
> stream in constant space and linear time:
>
> def hash_stream(them):
>     val = 0
>     for it in them:
>         val = hash((val, it))
>     return val
>
> Although this creates N 2-tuples, they come and go, so the memory use
> won't grow.  Adjust the code as needed to achieve canonicalization before
> iterating.
>
> Or maybe I am misunderstanding the requirements?
>

This is better than solutions like http://stackoverflow.com/a/27952689/161642
in the sense that it's a little higher level (no bit shifting or magic numbers).

But it's not clear that it's any better in the sense that you're still
rolling your own incremental hash algorithm out of a lower-level primitive
that doesn't document support for this, and therefore taking responsibility
yourself for how well it distributes values into buckets.

Are you confident this results in good hash performance? Is this better
than a solution built on top of a hash function with an explicit API for
calculating a hash incrementally, such as the crc32 example I included?
(And again, this would ideally be a sys.hash_info.hash_bits -bit algorithm.)

Don't we still probably want either:

1) Python to provide some such hash_stream() function as a built-in,

or failing that,

2) the https://docs.python.org/3/reference/datamodel.html#object.__hash__
documentation to bless this as the recommended solution to this problem,
thereby providing assurance of its performance?

If that makes sense, I'd be happy to file an issue, and include the start
of a patch providing either 1 or 2.
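
For concreteness, here is roughly how a library class might lean on such a
blessed helper; the class is hypothetical and hash_stream() is just Ned's
sketch restated:

from itertools import chain

def hash_stream(them):  # Ned's helper from earlier in the thread
    val = 0
    for it in them:
        val = hash((val, it))
    return val

class FrozenOrderedThing:
    # Hypothetical frozen, ordered collection delegating __hash__ to the helper.
    def __init__(self, items):
        self._items = tuple(items)
    def __iter__(self):
        return iter(self._items)
    def __eq__(self, other):
        if isinstance(other, FrozenOrderedThing):
            return self._items == other._items
        return NotImplemented
    def __hash__(self):
        # Prime the stream with the class name so another type hashing the
        # same items is unlikely to collide with us.
        return hash_stream(chain([type(self).__name__], self._items))

print(hash(FrozenOrderedThing([1, 2])) == hash(FrozenOrderedThing([2, 1])))  # almost certainly False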

Thanks very much for the helpful response.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-28 Thread Ethan Furman

On 12/27/2016 07:13 PM, j...@math.brown.edu wrote:


According to the docs [6]:

"""
it is advised to mix together the hash values of the components of the
object that also play a part in comparison of objects by packing them
into a tuple and hashing the tuple. Example:

def __hash__(self):
    return hash((self.name, self.nick, self.color))

"""


Applying this advice to the use cases above would require creating an
arbitrarily large tuple in memory before passing it to hash(), which
is then just thrown away. It would be preferable if there were a way
to pass multiple values to hash() in a streaming fashion, such that
the overall hash were computed incrementally, without building up a
large object in memory first.


Part of the reason for creating __hash__ like above is that:

- it's simple
- it's reliable

However, it's not the only way to have a hash algorithm that works; in fact, 
the beginning of the sentence you quoted says:


The only required property is that objects which compare equal have
 the same hash value;


In other words, objects that do not compare equal can also have the same hash 
value (although too much of that will reduce the efficiency of Python's 
containers).
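
A tiny concrete illustration (a CPython implementation detail, not a language
guarantee):

# In CPython, hash(-1) is forced to -2 because -1 signals an error in the
# C-level hash protocol, so these unequal integers share a hash value:
print(hash(-1) == hash(-2))  # True in CPython
print(-1 == -2)              # False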


(ii) priming the overall hash value with some class-specific initial
value, so that if an instance of a different type of collection, which
comprised the same items but which compared unequal, were to compute
its hash value out of the same constituent items, we make sure our
hash value differs.


This is unnecessary:  hashes are compared first as a way to weed out impossible 
matches, but when the hashes are the same an actual __eq__ test is still done 
[7].

--
~Ethan~


[6] https://docs.python.org/3/reference/datamodel.html#object.__hash__
[7] some test code to prove above points:

--- 8< 
from unittest import main, TestCase

class Eggs(object):
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return hash(self.value)
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.value == other.value
        return NotImplemented

class Spam(object):
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return hash(self.value)
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.value == other.value
        return NotImplemented

e10 = Eggs(1)
e20 = Eggs(2)
e11 = Eggs(1)
e21 = Eggs(2)

s10 = Spam(1)
s20 = Spam(2)
s11 = Spam(1)
s21 = Spam(2)

bag = {}
bag[e10] = 1
bag[s10] = 2
bag[e20] = 3
bag[s20] = 4

class TestEqualityAndHashing(TestCase):

    def test_equal(self):
        # same class, same value --> equal
        self.assertEqual(e10, e11)
        self.assertEqual(e20, e21)
        self.assertEqual(s10, s11)
        self.assertEqual(s20, s21)

    def test_not_equal(self):
        # different class, same value --> not equal
        self.assertEqual(e10.value, s10.value)
        self.assertNotEqual(e10, s10)
        self.assertEqual(e20.value, s20.value)
        self.assertNotEqual(e20, s20)

    def test_same_hash(self):
        # same class, same value, same hash
        self.assertEqual(hash(e10), hash(e11))
        self.assertEqual(hash(e20), hash(e21))
        self.assertEqual(hash(s10), hash(s11))
        self.assertEqual(hash(s20), hash(s21))

        # different class, same value, same hash
        self.assertEqual(hash(e10), hash(s10))
        self.assertEqual(hash(e11), hash(s11))
        self.assertEqual(hash(e20), hash(s20))
        self.assertEqual(hash(e21), hash(s21))

    def test_as_key(self):
        # different objects from different classes with same hash should still be distinct
        self.assertEqual(len(bag), 4)
        self.assertEqual(bag[e10], 1)
        self.assertEqual(bag[s10], 2)
        self.assertEqual(bag[e20], 3)
        self.assertEqual(bag[s20], 4)

        # different objects from same classes with same hash should not be distinct
        self.assertEqual(bag[e11], 1)
        self.assertEqual(bag[s11], 2)
        self.assertEqual(bag[e21], 3)
        self.assertEqual(bag[s21], 4)

main()
--- 8< 
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] incremental hashing in __hash__

2016-12-28 Thread Ned Batchelder
On 12/27/16 10:13 PM, j...@math.brown.edu wrote:
> Applying this advice to the use cases above would require creating an
> arbitrarily large tuple in memory before passing it to hash(), which
> is then just thrown away. It would be preferable if there were a way
> to pass multiple values to hash() in a streaming fashion, such that
> the overall hash were computed incrementally, without building up a
> large object in memory first.
>
> Should there be better support for this use case? Perhaps hash() could
> support an alternative signature, allowing it to accept a stream of
> values whose combined hash would be computed incrementally in
> *constant* space and linear time, e.g. "hash(items=iter(self))".
You can write a simple function to use hash iteratively to hash the
entire stream in constant space and linear time:

def hash_stream(them):
    val = 0
    for it in them:
        val = hash((val, it))
    return val

Although this creates N 2-tuples, they come and go, so the memory use
won't grow.  Adjust the code as needed to achieve canonicalization
before iterating.
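
For example, a collection whose equality ignores ordering might canonicalize by
streaming its items in a deterministic order first (a sketch assuming the
hash_stream() above is in scope and the items are sortable):

def hash_unordered_items(items):
    # Sorting removes any dependence on insertion/iteration order.
    return hash_stream(sorted(items))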

Or maybe I am misunderstanding the requirements?

--Ned.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-28 Thread jab
I actually have been poking around that code already. I also found
https://github.com/vperron/python-superfasthash/blob/master/superfasthash.py
in case of interest.

But it still seems like library authors with this use case should keep
their library code free of implementation details like this, and instead
use a higher-level API provided by Python.

Thanks,
Josh

On Tue, Dec 27, 2016 at 10:28 PM, Ryan Gonzalez  wrote:

> You could always try to make a Python version of the C tuple hashing
> function[1] (requires the total # of elements) or PyPy's[2] (seems like it
> would allow true incremental hashing). API idea:
>
>
> hasher = IncrementalHasher()
> hasher.add(one_item_to_hash)  # updates hasher.hash property with result
> # repeat
> return hasher.hash
>
>
> [1]: https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupleobject.c#l331
> [2]: https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a345d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562
>
> --
> Ryan (ライアン)
> Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else
> http://kirbyfan64.github.io/
>
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] incremental hashing in __hash__

2016-12-27 Thread Ryan Gonzalez
You could always try to make a Python version of the C tuple hashing
function[1] (requires the total # of elements) or PyPy's[2] (seems like it
would allow true incremental hashing). API idea:


hasher = IncrementalHasher()
hasher.add(one_item_to_hash)  # updates hasher.hash property with result
# repeat
return hasher.hash


[1]:
https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupleobject.c#l331
[2]:
https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a345d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562
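
For what it's worth, a rough pure-Python sketch of that IncrementalHasher API,
using the simple tuple-chaining trick rather than a port of either linked
function (class and attribute names are only illustrative):

class IncrementalHasher:
    def __init__(self, seed=0):
        self.hash = seed
    def add(self, item):
        # Mix the running value with the next item's hash via a 2-tuple.
        self.hash = hash((self.hash, item))

# Usage, as in the API idea above:
hasher = IncrementalHasher()
for piece in ('spam', 'eggs', 42):
    hasher.add(piece)
print(hasher.hash)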

--
Ryan (ライアン)
Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else
http://kirbyfan64.github.io/


On Dec 27, 2016 9:15 PM,  wrote:

Suppose you have implemented an immutable Position type to represent
the state of a game played on an MxN board, where the board size can
grow quite large.

Or suppose you have implemented an immutable, ordered collection type.
For example, the collections-extended package provides a
frozensetlist[1]. One of my own packages provides a frozen, ordered
bidirectional mapping type.[2]

These types should be hashable so that they can be inserted into sets
and mappings. The order-sensitivity of the contents prevents them from
using the built-in collections.Set._hash() helper in their __hash__
implementations, to keep from unnecessarily causing hash collisions
for objects that compare unequal due only to having a different
ordering of the same set of contained items.

According to https://docs.python.org/3/reference/datamodel.html#object.__hash__ :


"""
it is advised to mix together the hash values of the components of the
object that also play a part in comparison of objects by packing them
into a tuple and hashing the tuple. Example:

def __hash__(self):
    return hash((self.name, self.nick, self.color))

"""


Applying this advice to the use cases above would require creating an
arbitrarily large tuple in memory before passing it to hash(), which
is then just thrown away. It would be preferable if there were a way
to pass multiple values to hash() in a streaming fashion, such that
the overall hash were computed incrementally, without building up a
large object in memory first.

Should there be better support for this use case? Perhaps hash() could
support an alternative signature, allowing it to accept a stream of
values whose combined hash would be computed incrementally in
*constant* space and linear time, e.g. "hash(items=iter(self))".

In the meantime, what is the best way to incrementally compute a good
hash value for such objects using built-in Python routines? (As a
library author, it would be preferable to use a routine with explicit
support for computing a hash incrementally, rather than having to
worry about how to correctly combine results from multiple calls to
hash(contained_item) in library code. (Simply XORing such results
together would not be order-sensitive, and so wouldn't work.) Using a
routine with explicit support for incremental hashing would allow
libraries to focus on doing one thing well.[3,4,5])

I know that hashlib provides algorithms that support incremental
hashing, but those use at least 128 bits. Since hash() throws out
anything beyond sys.hash_info.hash_bits (e.g. 64) bits, anything in
hashlib seems like overkill. Am I right in thinking that's the wrong
tool for the job?

On the other hand, would binascii.crc32 be suitable, at least for
32-bit systems? (And is there some 64-bit incremental hash algorithm
available for 64-bit systems? It seems Python has no support for crc64
built in.) For example:

import binascii, struct

class FrozenOrderedCollection:
    def __hash__(self):
        # Computed lazily on first use, then cached.  (A single underscore is
        # used because a '__hashval' attribute check would be defeated by name
        # mangling and the cache would never be hit.)
        if hasattr(self, '_hashval'):
            return self._hashval
        hv = binascii.crc32(b'FrozenOrderedCollection')
        for i in self:
            hv = binascii.crc32(struct.pack('@l', hash(i)), hv)
            hv &= 0xffffffff
        self._hashval = hv
        return hv


Note that this example illustrates two other common requirements of
these use cases:

(i) lazily computing the hash value on first use, and then caching it
for future use

(ii) priming the overall hash value with some class-specific initial
value, so that if an instance of a different type of collection, which
comprised the same items but which compared unequal, were to compute
its hash value out of the same constituent items, we make sure our
hash value differs. (On that note, should the documentation in
https://docs.python.org/3/reference/datamodel.html#object.__hash__
quoted above be updated to add this advice? The current advice to
"return hash((self.name, self.nick, self.color))" would cause a hash
collision with a tuple of the same values, even though the tuple
should presumably compare unequal with this object.)

To summarize these questions:

1. Should hash() add support for incremental hashing?

2. In the meantime, what is the best way to compute a hash of a
combination of many values incrementally (in constant space and linear
time), using only what's available in the standard library?

[Python-ideas] incremental hashing in __hash__

2016-12-27 Thread jab
Suppose you have implemented an immutable Position type to represent
the state of a game played on an MxN board, where the board size can
grow quite large.

Or suppose you have implemented an immutable, ordered collection type.
For example, the collections-extended package provides a
frozensetlist[1]. One of my own packages provides a frozen, ordered
bidirectional mapping type.[2]

These types should be hashable so that they can be inserted into sets
and mappings. The order-sensitivity of the contents prevents them from
using the built-in collections.Set._hash() helper in their __hash__
implementations, to keep from unnecessarily causing hash collisions
for objects that compare unequal due only to having a different
ordering of the same set of contained items.

According to https://docs.python.org/3/reference/datamodel.html#object.__hash__
:


"""
it is advised to mix together the hash values of the components of the
object that also play a part in comparison of objects by packing them
into a tuple and hashing the tuple. Example:

def __hash__(self):
    return hash((self.name, self.nick, self.color))

"""


Applying this advice to the use cases above would require creating an
arbitrarily large tuple in memory before passing it to hash(), which
is then just thrown away. It would be preferable if there were a way
to pass multiple values to hash() in a streaming fashion, such that
the overall hash were computed incrementally, without building up a
large object in memory first.

Should there be better support for this use case? Perhaps hash() could
support an alternative signature, allowing it to accept a stream of
values whose combined hash would be computed incrementally in
*constant* space and linear time, e.g. "hash(items=iter(self))".

In the meantime, what is the best way to incrementally compute a good
hash value for such objects using built-in Python routines? (As a
library author, it would be preferable to use a routine with explicit
support for computing a hash incrementally, rather than having to
worry about how to correctly combine results from multiple calls to
hash(contained_item) in library code. (Simply XORing such results
together would not be order-sensitive, and so wouldn't work.) Using a
routine with explicit support for incremental hashing would allow
libraries to focus on doing one thing well.[3,4,5])
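
To make the XOR point concrete, a two-line check shows the combined value is
completely insensitive to ordering:

from functools import reduce
from operator import xor

items = ['a', 'b', 'c']
forward = reduce(xor, map(hash, items), 0)
backward = reduce(xor, map(hash, reversed(items)), 0)
print(forward == backward)  # always True: XOR is commutative, so order is lost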

I know that hashlib provides algorithms that support incremental
hashing, but those use at least 128 bits. Since hash() throws out
anything beyond sys.hash_info.hash_bits (e.g. 64) bits, anything in
hashlib seems like overkill. Am I right in thinking that's the wrong
tool for the job?

On the other hand, would binascii.crc32 be suitable, at least for
32-bit systems? (And is there some 64-bit incremental hash algorithm
available for 64-bit systems? It seems Python has no support for crc64
built in.) For example:

import binascii, struct

class FrozenOrderedCollection:
    def __hash__(self):
        # Computed lazily on first use, then cached.  (A single underscore is
        # used because a '__hashval' attribute check would be defeated by name
        # mangling and the cache would never be hit.)
        if hasattr(self, '_hashval'):
            return self._hashval
        hv = binascii.crc32(b'FrozenOrderedCollection')
        for i in self:
            hv = binascii.crc32(struct.pack('@l', hash(i)), hv)
            hv &= 0xffffffff
        self._hashval = hv
        return hv


Note that this example illustrates two other common requirements of
these use cases:

(i) lazily computing the hash value on first use, and then caching it
for future use

(ii) priming the overall hash value with some class-specific initial
value, so that if an instance of a different type of collection, which
comprised the same items but which compared unequal, were to compute
its hash value out of the same constituent items, we make sure our
hash value differs. (On that note, should the documentation in
https://docs.python.org/3/reference/datamodel.html#object.__hash__
quoted above be updated to add this advice? The current advice to
"return hash((self.name, self.nick, self.color))" would cause a hash
collision with a tuple of the same values, even though the tuple
should presumably compare unequal with this object.)
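
That collision is easy to demonstrate with the documented recipe (a hypothetical
Person class, following the attribute names used in the docs' example):

class Person:
    def __init__(self, name, nick, color):
        self.name, self.nick, self.color = name, nick, color
    def __eq__(self, other):
        if isinstance(other, Person):
            return ((self.name, self.nick, self.color)
                    == (other.name, other.nick, other.color))
        return NotImplemented
    def __hash__(self):
        return hash((self.name, self.nick, self.color))

p = Person('Ada', 'ada', 'green')
t = ('Ada', 'ada', 'green')
print(hash(p) == hash(t))  # True: guaranteed collision with the plain tuple
print(p == t)              # False: the two objects never compare equal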

To summarize these questions:

1. Should hash() add support for incremental hashing?

2. In the meantime, what is the best way to compute a hash of a
combination of many values incrementally (in constant space and linear
time), using only what's available in the standard library? Ideally
there is some routine available that uses exactly hash_info.hash_bits
number of bits, and that does the combining of incremental results for
you.

3. Should the https://docs.python.org/3/reference/datamodel.html#object.__hash__
documentation be updated to include suitable advice for these use
cases, in particular, that the overall hash value should be computed
lazily, incrementally, and should be primed with a class-unique value?

Thanks in advance for a helpful discussion, and best wishes.

Josh


References:

[1] http://collections-extended.lenzm.net/api.html#collections_extended.frozensetlist

[2] https://