Re: [Numpy-discussion] SeedSequence.spawn()

Stig Korsnes Sat, 28 Aug 2021 02:55:54 -0700

Thank you again Robert.
I am using a NamedTuple for my keys, which are also keys in a dictionary.
Each key will be unique (a tuple of a distinct int and an enum), so I am
thinking the risk of producing a duplicate hash may not be present, but I
could as always be wrong :)
For positive ints I followed this tip:
https://stackoverflow.com/questions/18766535/positive-integer-from-python-hash-function
and did:


import ctypes

def stronghash(key: ComponentId) -> int:
    return ctypes.c_size_t(hash(key)).value
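
One caveat with built-in hash(): as far as I know, a plain Enum member hashes through its name, and string hashes are salted per interpreter run unless PYTHONHASHSEED is fixed, so the value above may differ from run to run. Below is a minimal sketch of a run-independent alternative, assuming the key is an int plus an enum member. The names `Kind`, `stable_hash`, the `ComponentId` field names and the seed value are made up for illustration, and SHA-256 is just one possible cryptographic hash.

```python
import hashlib
from enum import Enum
from typing import NamedTuple

from numpy.random import SeedSequence, default_rng


class Kind(Enum):  # hypothetical enum, stands in for the real one
    SPOT = 1
    FORWARD = 2


class ComponentId(NamedTuple):  # hypothetical layout of the key
    index: int
    kind: Kind


def stable_hash(key: ComponentId) -> int:
    """Return a non-negative 128-bit int that is the same in every run."""
    # Encode only the int and the enum member's qualified name, so nothing
    # depends on Python's per-process hash salt.
    text = f"{key.index}:{type(key.kind).__name__}.{key.kind.name}"
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep the first 16 bytes (128 bits) and read them as an unsigned int.
    return int.from_bytes(digest[:16], "big")


seed = 12345  # root seed chosen by the user (illustrative value)
key = ComponentId(7, Kind.SPOT)

# Job-specific data first, root seed second, as suggested in the thread.
child_ss = SeedSequence([stable_hash(key), seed])
rng = default_rng(child_ss)
```

With this, the stream for a given key depends only on the key and the root seed, not on the order in which the processes are created.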

Since I will be using each process/random sample several times, and keeping
all of them in memory at once is not feasible (dimensionality), I did the
following:

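        self._rng = default_rng(cs)
        # Snapshot the initial state so it can be restored later.
        self._state = dict(self._rng.bit_generator.state)

    def scenarios(self) -> npt.NDArray[np.float64]:
        # Rewind the generator so each call replays the same stream.
        self._rng.bit_generator.state = self._state
        ....
        return ....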
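
A self-contained version of the fragment above might look like the following sketch. The class name `Process`, the parameter `n`, the entropy values and the use of `standard_normal` are assumptions for illustration only; the elided body of the real method is not reproduced here.

```python
import copy

import numpy as np
from numpy.random import SeedSequence, default_rng


class Process:  # hypothetical class name
    def __init__(self, child_seed: SeedSequence) -> None:
        self._rng = default_rng(child_seed)
        # Snapshot the freshly seeded state. deepcopy is used because the
        # state dict contains a nested dict.
        self._state = copy.deepcopy(self._rng.bit_generator.state)

    def scenarios(self, n: int = 4) -> np.ndarray:
        # Rewind to the snapshot so every call replays the same draws.
        self._rng.bit_generator.state = self._state
        return self._rng.standard_normal(n)


proc = Process(SeedSequence([42, 12345]))  # illustrative entropy values
first = proc.scenarios()
second = proc.scenarios()  # same values as `first`
```

Building a new generator with default_rng() from a SeedSequence that has the same entropy should replay the same draws as well, so storing the seed instead of the state is another option, at the cost of re-seeding on every call.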

Would you consider this bad practice, or an ok solution?


In Norway we have a saying which translates directly as: "He asked for the
finger... and took the whole arm".

Best,
Stig


On Fri, 27 Aug 2021 at 17:01, Robert Kern <robert.k...@gmail.com> wrote:

> joblib is a library that uses clever caching of function call results to
> make the development of certain kinds of data-heavy computational pipelines
> easier. In order to derive the key to be used to check the cache, joblib
> has to look at the arguments passed to the function, which may
> involve usually-nonhashable things like large numpy arrays.
>
>   https://joblib.readthedocs.io/en/latest/
>
> So they constructed joblib.hash() which basically takes the arguments,
> pickles them into a bytestring (with some implementation details), then
> computes an MD5 hash on that. It's probably overkill for your keys, but
> it's easily available and quite generic. It returns a hex-encoded string of
> the 128-bit MD5 hash. `int(..., 16)` will convert that to a non-negative
> (almost-certainly positive!) integer that can be fed into SeedSequence.
>
> On Fri, Aug 27, 2021 at 5:03 AM Stig Korsnes <stigkors...@gmail.com>
> wrote:
>
>> Thank you Robert!
>> This scheme fits perfectly into what I'm trying to accomplish! :) The
>> "smooshing" of ints by supplying a list of ints had eluded me. Thank you
>> also for the pointer about built-in hash(). I would not be able to rely on
>> it anyway, because it can return negative ints, which SeedSequence does
>> not accept. If you have a minute to spare: could you briefly explain
>> "int(joblib.hash(key), 16)"
>> (https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html),
>> and would this always return non-negative integers?
>> Thanks again!
>>
>> On Thu, 26 Aug 2021 at 22:59, Robert Kern <robert.k...@gmail.com> wrote:
>>
>>> On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes <stigkors...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> Is there a way to uniquely spawn child seeds?
>>>> I'm doing Monte Carlo analysis, where I have n random processes, each
>>>> with their own generator.
>>>> All process models instantiate a generator with default_rng(), i.e.
>>>> ss=SeedSequence(); cs=ss.spawn(n), using cs[i] for process i. Now, the
>>>> problem I'm facing is that the results for an individual process depend
>>>> on the order of process initialization and the number of processes used.
>>>> However, if I could spawn children with a unique identifier, I would be
>>>> able to reproduce my individual results without having to pickle/log
>>>> states. For example, all my models have an id (tuple) field which is
>>>> hashable.
>>>> If I had the ability to call SeedSequence(x).spawn([objects]), where the
>>>> objects support hash(object), I would have reproducibility for all my
>>>> processes. I could do without the spawning, but then I would probably
>>>> lose independence when I do multiprocessing? Is there a way to achieve
>>>> my goal in the current version 1.21 of numpy?
>>>>
>>>
>>> I would probably not rely on `hash()` as it is only intended to be
>>> pretty good at getting distinct values from distinct inputs. If you can
>>> combine the tuple objects into a string of bytes in a reliable,
>>> collision-free way and use one of the cryptographic hashes to get them down
>>> to a 128-bit number, that'd be ideal. `int(joblib.hash(key), 16)`
>>> (https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html)
>>> should do nicely. You can combine that with your main process's seed
>>> easily. SeedSequence can take arbitrary amounts of integer data and smoosh
>>> them all together. The spawning functionality builds off of that, but you
>>> can also just manually pass in lists of integers.
>>>
>>> Let's call that function `stronghash()`. Let's call your main process
>>> seed number `seed` (this is the thing that the user can set on the
>>> command-line or something you get from `secrets.randbits(128)` if you need
>>> a fresh one). Let's call the unique tuple `key`. You can build the
>>> `SeedSequence` for each job according to the `key` like so:
>>>
>>> root_ss = SeedSequence(seed)
>>> for key, data in jobs:
>>>     child_ss = SeedSequence([stronghash(key), seed])
>>>     submit_job(key, data, seed=child_ss)
>>>
>>> Now each job will get its own unique stream regardless of the order the
>>> job is assigned. When the user reruns it with the same root `seed`, they
>>> will get the same results. When the user chooses a different `seed`, they
>>> will get another set of results (this is why you don't want to just use
>>> `SeedSequence(stronghash(key))` all by itself).
>>>
>>> I put the job-specific seed data ahead of the main program's seed to be
>>> on the super-safe side. The spawning mechanism will append integers to the
>>> end, so there's a super-tiny chance somewhere down a long line of
>>> `root_ss.spawn()`s that there would be a collision (and I mean
>>> super-extra-tiny). But best practices cost nothing.
>>>
>>> I hope that helps and is not too confusing!
>>>
>>> --
>>> Robert Kern
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@python.org
>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>
>>
>
> --
> Robert Kern
>

