[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Stephen J. Turnbull
Hao Hu writes:
 > > On 17 Dec 2021, at 15:28, Chris Angelico  wrote:

 > > The built-in hash() function is extremely generic, so it can't really
 > > work that way. Adding a parameter to it would require (a) adding the
 > > parameter to every __hash__ method of every object, including
 > > user-defined objects;
 > 
 > I would not say the opposite; however, maybe it appears to be more
 > complicated than it really is. Probably it is worth a small
 > analysis?

It's the user-defined objects that are the killer here.  We don't want
to go wrecking dozens of projects' objects.

 > >> For instance, if we create a caching programming interface that
 > >> relies on a distributed kv store,

I would be very suspicious of using Python's hash builtin for such a
purpose.  The Python hash functions are very carefully tuned for high
performance in one application only: equality testing in Python,
especially for dicts.  Many __hash__ methods omit much of the object
being hashed; if the variation in your keys depends only on those
attributes, you'll get a lot of collisions.  Others are extremely
predictable.  E.g., most integers and other numbers equal to integers
hash to themselves mod 2**61 - 1; I believe -1 is the only exception.
Being predictable as such may not be a problem for your kv store
cache, but predictable == pattern, and if your application happens to
match that pattern, you could again end up with a massive collision
problem.  I imagine this is much less likely to be a problem than the
case where keys depend on omitted attributes, since presumably the
__hash__ method is designed to cover the whole range.  And numbers are
the only case I know of offhand.
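
The predictability of numeric hashes can be checked directly. This is a minimal sketch that assumes CPython; the modulus is read from `sys.hash_info` rather than hard-coded, because it is 2**61 - 1 only on 64-bit builds.

```python
import sys

# Modulus CPython uses for numeric hashes (2**61 - 1 on 64-bit builds).
M = sys.hash_info.modulus

# Small non-negative integers hash to themselves.
small_ints_hash_to_self = all(hash(n) == n for n in range(1000))

# A number equal to an integer shares that integer's hash.
float_matches_int = hash(3.0) == hash(3)

# hash() never returns -1 in CPython; hash(-1) is -2 instead.
minus_one_hash = hash(-1)

# Integers wrap around at the modulus, so these pairs of keys collide.
wraps_at_modulus = hash(M) == hash(0) and hash(M + 5) == hash(5)

print(small_ints_hash_to_self, float_matches_int, minus_one_hash, wraps_at_modulus)
```

Keys that differ only by a multiple of the modulus are one concrete instance of the "pattern" collision risk mentioned above.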

 > > I'd recommend hashlib:

+1

 > Otherwise, would it be useful to add siphash24 or fnv to hashlib
 > as well?

I think that is a good idea.  To me, it seems relatively likely to be
accepted quickly.  However, many cryptographic algorithms are delicate
(eg, to avoid timing attacks), so I could be wrong about that.  Folks
like Christian Heimes might be very concerned about the implementation
as well as the algorithm.

Note that Python/pyhash.c seems to have implementations of both of
these algorithms, although I don't know if these implementations
satisfy cryptographic needs.
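
To see which string-hash algorithm a given interpreter was built with, `sys.hash_info` can be inspected. This is only a sketch: the reported name depends on the Python version and build options (recent CPython releases may report `siphash13` rather than `siphash24`), and it says nothing about whether the implementation is suitable for cryptographic use.

```python
import sys

info = sys.hash_info

# Name of the algorithm used for hashing str and bytes objects.
algorithm = info.algorithm

# Width of a hash value and size of the seed key, both in bits.
print(algorithm, info.width, info.seed_bits)
```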

Steve
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/H3BBUYLMJAPGGD66MN3R7A7M7SEYAX66/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Hao Hu


> On 17 Dec 2021, at 15:49, Chris Angelico  wrote:
> 
> On Sat, Dec 18, 2021 at 1:44 AM Hao Hu  wrote:
>>> For that sort of thing, it may be more practical to use your own
>>> hashing function, possibly a cryptographically secure one. The precise
>>> hashing function used by Python isn't guaranteed, so if you need it to
>>> be stable across different runs, and especially if you need to seed it
>>> in a specific way, I'd recommend hashlib:
>>> 
>>> https://docs.python.org/3/library/hashlib.html
>> 
>> I’ve explored that option; however, the siphash24 or fnv under the hood of
>> *hash* seems better suited to this type of use case in terms of
>> *performance*.
>> Otherwise, would it be useful to add siphash24 or fnv to hashlib as
>> well?
> 
> That's a more viable option, although maybe it wouldn't even matter.
> How does hashlib.sha1() performance stack up, and what about a
> handrolled simple string hashing function in Python? Is performance
> actually going to be a problem with one of those?
> 

Great question. I agree that there could be other factors which slow things
down much more than the hash function.
I assume that this is a function that’ll potentially be called many times,
and the cumulative cost won’t be negligible.
Perhaps for a similar reason, some high-performance networking tools adopt the
same algorithm.

> ChrisA

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/4P6EZRWWNFYS3PFRNZSW3NDLY2AGYSFK/


[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Chris Angelico
On Sat, Dec 18, 2021 at 2:21 AM Hao Hu  wrote:
> Great question. I agree that there could be other factors which slow things
> down much more than the hash function.
> I assume that this is a function that’ll potentially be called many
> times, and the cumulative cost won’t be negligible.
> Perhaps for a similar reason, some high-performance networking tools adopt
> the same algorithm.

Yup. It's always better to measure than to assume, though :)
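
One way to take that measurement, as a rough sketch. The payload, the repeat count and the choice of hashlib algorithms are arbitrary assumptions, and the absolute numbers will vary by machine and build.

```python
import hashlib
import timeit

payload = b"user:12345:profile" * 4  # hypothetical key material
N = 100_000                          # arbitrary repeat count

candidates = {
    "sha1": lambda: hashlib.sha1(payload).digest(),
    "blake2b-8": lambda: hashlib.blake2b(payload, digest_size=8).digest(),
}

# Total seconds for N calls of each candidate.
timings = {name: timeit.timeit(fn, number=N) for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds / N * 1e6:.2f} microseconds per call")
```

Comparing these per-call figures with the cost of a round trip to the kv store shows whether the hash function matters at all.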

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/ISK4YY4QEPS6QMY64HX4EBOWEV7AB3CI/


[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Hao Hu



> On 17 Dec 2021, at 15:42, Steven D'Aprano  wrote:
> 
> On Fri, Dec 17, 2021 at 02:07:38PM -, Hao Hu wrote:
>> Hi,
>> 
>> I am wondering if it would be good to add an additional keyword `seed`
>> to the builtin function *hash* to allow us to set an arbitrary seed to
>> ensure reproducible results.
>
> I assume you are talking about hashing strings; I believe they are the
> only objects that are affected by hash randomization.

I am not sure about this, though I agree with you that strings are affected by 
this.

> 
> 
>> In parallel, we have identified a couple of real use cases that 
>> require that an arbitrary seed is used for a limited scope.
>> For instance, if we create a caching programming interface that relies 
>> on a distributed kv store, it would be very important to make sure 
>> that the hash key stays the same when the application is rebooted or 
>> replicated.
> 
> Does your distributed key/value store have to use the built-in hash 
> function?

If you limit the use case strictly to a string key and a value, I agree with
you that I can rely on the hashing capability of the kv store.
However, sometimes I expect something more generic: the key might be a tuple
that mixes numbers, strings and other hashable types.
Another option might be to use serialisation to produce the key, but there are
two things to tackle:
- serialisation vs hash in terms of speed
- a key of dynamic size vs a key of fixed size (I tend to believe that a
fixed-size key is friendlier for random access on the kv storage side)
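
The two options can be combined: serialise the tuple, then hash the bytes down to a fixed-size key. The following is a minimal sketch, not a recommendation from the thread. `stable_key` and its `seed` parameter are hypothetical names, JSON is assumed as the serialisation (so it only covers JSON-serialisable values and does not distinguish tuples from lists), and BLAKE2's `key` argument stands in for the requested seed.

```python
import hashlib
import json

def stable_key(obj, seed=b"cache-v1"):
    """Return a 16-hex-digit key that does not depend on PYTHONHASHSEED.

    `seed` is passed as the BLAKE2 key, so different seeds give
    unrelated key spaces.
    """
    # Canonical serialisation: tuples become JSON arrays, dict keys are sorted.
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.blake2b(data, digest_size=8, key=seed).hexdigest()

key = stable_key(("user", 42, 3.5))
print(key)
```

The key has a fixed size (8 bytes here) whatever the size of the input, and the seed is scoped to the caching code rather than to the whole interpreter.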

> 
> 
> -- 
> Steve

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/OSGNNVTHVU4BSMC65CGBFZ75PQETKPTA/


[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Chris Angelico
On Sat, Dec 18, 2021 at 1:44 AM Hao Hu  wrote:
> > For that sort of thing, it may be more practical to use your own
> > hashing function, possibly a cryptographically secure one. The precise
> > hashing function used by Python isn't guaranteed, so if you need it to
> > be stable across different runs, and especially if you need to seed it
> > in a specific way, I'd recommend hashlib:
> >
> > https://docs.python.org/3/library/hashlib.html
>
> I’ve explored that option; however, the siphash24 or fnv under the hood of
> *hash* seems better suited to this type of use case in terms of
> *performance*.
> Otherwise, would it be useful to add siphash24 or fnv to hashlib as
> well?

That's a more viable option, although maybe it wouldn't even matter.
How does hashlib.sha1() performance stack up, and what about a
handrolled simple string hashing function in Python? Is performance
actually going to be a problem with one of those?
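
For reference, a hand-rolled string hash can be very short. The sketch below is a pure-Python 64-bit FNV-1a using the standard published offset basis and prime; it is an illustration only, not necessarily the same variant as the one in Python/pyhash.c, not cryptographically secure, and likely much slower than any C implementation.

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep the state at 64 bits
    return h

# Strings must be encoded first; the encoding is part of the key definition.
print(hex(fnv1a_64("some cache key".encode("utf-8"))))
```

The result depends only on the input bytes, so it is stable across runs and machines.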

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/LKXP3P7J7RBHC57SVS4VASDTVACIEM5S/


[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Steven D'Aprano
On Fri, Dec 17, 2021 at 02:07:38PM -, Hao Hu wrote:
> Hi,
> 
> I am wondering if it would be good to add an additional keyword `seed`
> to the builtin function *hash* to allow us to set an arbitrary seed to
> ensure reproducible results.

I assume you are talking about hashing strings; I believe they are the
only objects that are affected by hash randomization.
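
Which objects are affected can be checked empirically by running fresh interpreters with different PYTHONHASHSEED values. This is a sketch assuming CPython, whose documentation describes the hashes of both str and bytes objects as salted; `hashes_with_seed` is a hypothetical helper name.

```python
import os
import subprocess
import sys

def hashes_with_seed(seed):
    """Run a fresh interpreter with PYTHONHASHSEED=seed and return its hashes."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "print(hash('spam'), hash(b'spam'), hash(12345), hash((1, 2.5)))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    # Order: str, bytes, int, tuple of numbers.
    return result.stdout.split()

seed1 = hashes_with_seed(1)
seed1_again = hashes_with_seed(1)
seed2 = hashes_with_seed(2)
print(seed1, seed2, sep="\n")
```

With the same seed the output is reproducible. With a different seed the str hash changes, while the int and the all-numeric tuple keep the same hash. A tuple that contains a string inherits the randomization from that element.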


> In parallel, we have identified a couple of real use cases that 
> require that an arbitrary seed is used for a limited scope.
> For instance, if we create a caching programming interface that relies 
> on a distributed kv store, it would be very important to make sure 
> that the hash key stays the same when the application is rebooted or 
> replicated.

Does your distributed key/value store have to use the built-in hash 
function?


-- 
Steve
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/ZEKR2Y673GSHYBU5MBIHHLCSFF5YUALJ/


[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Hao Hu

> On 17 Dec 2021, at 15:28, Chris Angelico  wrote:
> 
> On Sat, Dec 18, 2021 at 1:21 AM Hao Hu  wrote:
>> 
>> Hi,
>> 
>> I am wondering if it would be good to add an additional keyword `seed` to
>> the builtin function *hash* to allow us to set an arbitrary seed to ensure
>> reproducible results.
>> 
> 
> The built-in hash() function is extremely generic, so it can't really
> work that way. Adding a parameter to it would require (a) adding the
> parameter to every __hash__ method of every object, including
> user-defined objects; and (b) defining what that would mean when
> multiple objects' hashes are combined (eg hashing a tuple).
> 

I would not say the opposite; however, maybe it appears to be more complicated
than it really is. Probably it is worth a small analysis?

>> 
>> As far as I know, the environment variable PYTHONHASHSEED already allows us
>> to set an arbitrary seed, or to disable hash randomization, globally for the
>> Python interpreter.
>> However, it seems too bold to use that environment variable to change the
>> default behavior, because random seed generation improves security by
>> reducing the risk of hash flooding.
>>
>> In parallel, we have identified a couple of real use cases that require that
>> an arbitrary seed is used for a limited scope.
>> For instance, if we create a caching programming interface that relies on a
>> distributed kv store, it would be very important to make sure that the hash
>> key stays the same when the application is rebooted or replicated. It would
>> generally be more cautious to limit the scope of the seed to the caching
>> library itself, instead of applying the same seed to all the hash functions
>> of all the Python interpreters.
>> 
> 
> For that sort of thing, it may be more practical to use your own
> hashing function, possibly a cryptographically secure one. The precise
> hashing function used by Python isn't guaranteed, so if you need it to
> be stable across different runs, and especially if you need to seed it
> in a specific way, I'd recommend hashlib:
> 
> https://docs.python.org/3/library/hashlib.html

I’ve explored that option; however, the siphash24 or fnv under the hood of
*hash* seems better suited to this type of use case in terms of
*performance*.
Otherwise, would it be useful to add siphash24 or fnv to hashlib as well?
There are obviously also third-party libraries such as *mmh*, but they
would introduce unnecessary, immature dependencies.

WDYT? 
Thank you.

> 
> ChrisA

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/HNMTY3SSKLAYNCLC5CN5SPUOMZP7IYGJ/


[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Chris Angelico
On Sat, Dec 18, 2021 at 1:21 AM Hao Hu  wrote:
>
> Hi,
>
> I am wondering if it would be good to add an additional keyword `seed` to the
> builtin function *hash* to allow us to set an arbitrary seed to ensure
> reproducible results.
>

The built-in hash() function is extremely generic, so it can't really
work that way. Adding a parameter to it would require (a) adding the
parameter to every __hash__ method of every object, including
user-defined objects; and (b) defining what that would mean when
multiple objects' hashes are combined (eg hashing a tuple).
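
Both points can be seen with a small user-defined class. In this sketch `Point` is a hypothetical example class; the call counter is only there to show when `__hash__` runs.

```python
class Point:
    """Hypothetical user-defined class with a conventional __hash__."""

    calls = 0  # counts how often __hash__ runs

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):  # the protocol passes nothing besides self
        Point.calls += 1
        return hash((self.x, self.y))

p = Point(1, 2)
hash(p)             # (a) hash() calls Point.__hash__ with no extra arguments
hash((p, "label"))  # (b) the tuple hash calls Point.__hash__ for its element
print(Point.calls)
```

A `seed` argument would have to be accepted by every such `__hash__`, and containers like the tuple above would have to decide how to forward it to their elements.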

>
> As far as I know, the environment variable PYTHONHASHSEED already allows us
> to set an arbitrary seed, or to disable hash randomization, globally for the
> Python interpreter.
> However, it seems too bold to use that environment variable to change the
> default behavior, because random seed generation improves security by
> reducing the risk of hash flooding.
>
> In parallel, we have identified a couple of real use cases that require that
> an arbitrary seed is used for a limited scope.
> For instance, if we create a caching programming interface that relies on a
> distributed kv store, it would be very important to make sure that the hash
> key stays the same when the application is rebooted or replicated. It would
> generally be more cautious to limit the scope of the seed to the caching
> library itself, instead of applying the same seed to all the hash functions
> of all the Python interpreters.
>

For that sort of thing, it may be more practical to use your own
hashing function, possibly a cryptographically secure one. The precise
hashing function used by Python isn't guaranteed, so if you need it to
be stable across different runs, and especially if you need to seed it
in a specific way, I'd recommend hashlib:

https://docs.python.org/3/library/hashlib.html

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/CU4TI3AEMI7Y6USIIHDWWSQW7WGGPNJ7/


[Python-ideas] set arbitrary hash random seed to ensure reproducible results

2021-12-17 Thread Hao Hu
Hi,

I am wondering if it would be good to add an additional keyword `seed` to the
builtin function *hash* to allow us to set an arbitrary seed to ensure
reproducible results.
As far as I know, the environment variable PYTHONHASHSEED already allows us to
set an arbitrary seed, or to disable hash randomization, globally for the
Python interpreter.
However, it seems too bold to use that environment variable to change the
default behavior, because random seed generation improves security by
reducing the risk of hash flooding.

In parallel, we have identified a couple of real use cases that require that an
arbitrary seed is used for a limited scope.
For instance, if we create a caching programming interface that relies on a
distributed kv store, it would be very important to make sure that the hash key
stays the same when the application is rebooted or replicated. It would
generally be more cautious to limit the scope of the seed to the caching
library itself, instead of applying the same seed to all the hash functions of
all the Python interpreters.

WDYT?

Kind regards,

Hao
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/RRPEOREIWJ4AGDOYWQKIT6SRTC5ISDP3/