Re: [Numpy-discussion] SeedSequence.spawn()
Thanks again Robert! Got rid of dict(state).

Not sure I followed you completely on the test case. The "calculator" I am writing will, for this specific use case, depend on ~200-1000 processes. Each process object will return, say, 1M floats when its scenarios() method is called. If I am not mistaken, that would require 7-8 GiB just to keep these in memory. Furthermore, I would possibly have to add the size of the calculations that depend on them (but would likely aggregate outside of testing). A given object that depends on processes will calculate its results based on 1-4 of them (1-4 * 1M samples, without multiprocessing), and I will loop over objects with a process pool. So my reasoning is that running memory consumption would then be (1-4) * the size of 1M floats, times the number of pool processes, plus all the other overhead. Since sampling 1M normals is pretty fast, I can happily live with sampling (versus lookup in a presampled array), but since two objects might depend on the same process, they need the exact same array of samples. Hence the state. If I understood you correctly, another solution is to add a duplicate process with the same seed, instead of using one where I "reset" state.

I promised that this could run on any laptop...

On Sun, Aug 29, 2021 at 02:42 Robert Kern wrote:

> On Sat, Aug 28, 2021 at 5:56 AM Stig Korsnes wrote:
>
>> Thank you again Robert. I am using NamedTuple for my keys, which are also keys in a dictionary. Each key will be unique (a tuple of a distinct int and an enum), so I am thinking maybe the risk of producing duplicate hashes is not present, but as always I could be wrong :)
>
> Present, but possibly ignorably small. 128-bit spaces give enough breathing room for me to be comfortable; 64-bit spaces like what hash() will use for its results make me just a little claustrophobic.
>
> If the structure of the keys is pretty fixed, just these two integers (counting the enum as an integer), then I might just use both in the seeding material.
>
>     def get_key_seed(key: ComponentId, root_seed: int):
>         return np.random.SeedSequence([key.the_int, int(key.the_enum), root_seed])
>
>> For positive ints I followed this tip https://stackoverflow.com/questions/18766535/positive-integer-from-python-hash-function and did:
>>
>>     def stronghash(key: ComponentId):
>>         return ctypes.c_size_t(hash(key)).value
>
> np.uint64(possibly_negative_integer) will also work for this purpose (somewhat more reliably).
>
>> Since I will be using each process/random sample several times, and keeping all of them in memory at once is not feasible (dimensionality), I did the following:
>>
>>     self._rng = default_rng(cs)
>>     self._state = dict(self._rng.bit_generator.state)
>>
>>     def scenarios(self) -> npt.NDArray[np.float64]:
>>         self._rng.bit_generator.state = self._state
>>         return ...
>>
>> Would you consider this bad practice, or an OK solution?
>
> It's what that property is there for. No need to copy; `.state` creates a new dict each time.
>
> In a quick test, I measured a process with 1 million Generator instances to use ~1.5 GiB, while 1 million state dicts used ~1.0 GiB (including all of the other overhead of Python and numpy; not a scientific test). Storing just the BitGenerator is halfway in between. That's something, but not a huge win. If that is really crossing the border from feasible to infeasible, you may be about to run into your limits anyway for other reasons. So balance that out against the complications of swapping state in and out of a single instance.
>
>> In Norway we have a saying which translates directly: "He asked for the finger... and took the whole arm."
>
> Well, when I craft an overly-complicated system, I feel responsible to help shepherd people along in using it well. :-)
>
> --
> Robert Kern
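Putting the snippets from this exchange together, a minimal sketch of the save-and-restore pattern might look like the following. The class name Process, the key fields, the 1M draw size, and the use of standard_normal are assumptions for illustration; the thread does not show the actual class.

    import numpy as np
    import numpy.typing as npt
    from numpy.random import default_rng

    class Process:
        """Hypothetical component that returns the same 1M draws on every call."""

        def __init__(self, key_int: int, enum_val: int, root_seed: int):
            # Mix the two key integers with the root seed, as suggested above.
            ss = np.random.SeedSequence([key_int, enum_val, root_seed])
            self._rng = default_rng(ss)
            # .state already returns a fresh dict, so no dict(...) copy is needed.
            self._state = self._rng.bit_generator.state

        def scenarios(self) -> npt.NDArray[np.float64]:
            # Rewind to the saved state so two callers see identical samples.
            self._rng.bit_generator.state = self._state
            return self._rng.standard_normal(1_000_000)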
Re: [Numpy-discussion] SeedSequence.spawn()
And big kudos for building AND shepherding :)

On Sun, Aug 29, 2021 at 12:56 Stig Korsnes wrote:

> Thanks again Robert! Got rid of dict(state). [...]
Re: [Numpy-discussion] SeedSequence.spawn()
On Sun, Aug 29, 2021 at 6:58 AM Stig Korsnes wrote:

> Thanks again Robert! Got rid of dict(state).
>
> Not sure I followed you completely on the test case.

In the code that you showed, you were pulling out and storing the `.state` dict and then punching that back into a single `Generator` instance. Instead, you can just make the ~200-1000 `Generator` instances.

--
Robert Kern
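A sketch of that suggestion, assuming each component is identified by an (int, enum-as-int) pair as earlier in the thread; the key values and root seed here are made up:

    import numpy as np

    root_seed = 20210829  # hypothetical root seed
    keys = [(i, i % 3) for i in range(500)]  # stand-in (int, enum-as-int) keys

    # One Generator per component; a few hundred of these is cheap to hold.
    rngs = {
        key: np.random.default_rng(
            np.random.SeedSequence([key[0], key[1], root_seed])
        )
        for key in keys
    }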
Re: [Numpy-discussion] SeedSequence.spawn()
I am indeed making ~200-1000 generator instances, as many as I have processes. Each process is an instance of a component class, which has a generator. Every time I ask a process for 1M numbers, I need the same 1M numbers. I could instead make a new generator with the same seed every time I ask for the 1M numbers, but I presumed that this would be more computationally expensive than setting state on an existing generator.

Thank you Robert.

Best,
Stig

On Sun, Aug 29, 2021 at 16:08 Robert Kern wrote:

> [...] Instead, you can just make the ~200-1000 `Generator` instances.
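That presumption is easy to measure; a rough sketch (the numbers will vary by machine, and the seed here is arbitrary):

    import timeit
    import numpy as np

    ss = np.random.SeedSequence(12345)
    rng = np.random.default_rng(ss)
    state = rng.bit_generator.state

    # Building a fresh Generator from the same SeedSequence...
    t_new = timeit.timeit(lambda: np.random.default_rng(ss), number=10_000)
    # ...versus resetting an existing Generator's state...
    t_reset = timeit.timeit(
        lambda: setattr(rng.bit_generator, "state", state), number=10_000
    )
    # ...versus actually drawing 1M normals (the dominant cost).
    t_draw = timeit.timeit(lambda: rng.standard_normal(1_000_000), number=10)

    print(t_new / 10_000, t_reset / 10_000, t_draw / 10)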
Re: [Numpy-discussion] SeedSequence.spawn()
On Sun, Aug 29, 2021 at 10:55 AM Stig Korsnes wrote:

> [...] I could instead make a new generator with the same seed every time I ask for the 1M numbers, but I presumed that this would be more computationally expensive than setting state on an existing generator.

Nominally, but it's overwhelmed by the actual computation. You will have less to juggle if you just compute it from the key each time.

--
Robert Kern
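Concretely, the scenarios() method from earlier in the thread could then drop the stored state entirely; a sketch, assuming the object keeps its SeedSequence in a hypothetical self._seed_seq attribute:

    def scenarios(self) -> npt.NDArray[np.float64]:
        # A fresh Generator from the same SeedSequence yields the same stream,
        # and its construction cost is negligible next to drawing 1M normals.
        rng = np.random.default_rng(self._seed_seq)
        return rng.standard_normal(1_000_000)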
Re: [Numpy-discussion] SeedSequence.spawn()
Agreed. I already have a flag on the class to toggle fixed "state"; I could just set self._rng instead of its state. Will check it out.

Must say, I had not in my wildest dreams expected such help on any given Sunday. Have a great day and week, sir.

Best,
Stig

On Sun, Aug 29, 2021 at 18:29 Robert Kern wrote:

> Nominally, but it's overwhelmed by the actual computation. You will have less to juggle if you just compute it from the key each time. [...]
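A sketch of that toggle, with the flag name and class shape assumed for illustration:

    import numpy as np

    class Process:
        def __init__(self, seed_seq: np.random.SeedSequence, fixed: bool = True):
            self._seed_seq = seed_seq
            self._fixed = fixed  # True: identical draws on every call
            self._rng = np.random.default_rng(seed_seq)

        def scenarios(self):
            if self._fixed:
                # Re-seed a fresh Generator instead of resetting state.
                self._rng = np.random.default_rng(self._seed_seq)
            return self._rng.standard_normal(1_000_000)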