On Wed, May 18, 2016 at 7:56 PM, Nathaniel Smith <n...@pobox.com> wrote:
>
> On Wed, May 18, 2016 at 5:07 AM, Robert Kern <robert.k...@gmail.com>
wrote:
> > On Wed, May 18, 2016 at 1:14 AM, Nathaniel Smith <n...@pobox.com> wrote:

> >> ...anyway, the real reason I'm a bit grumpy is because there are solid
> >> engineering reasons why users *want* this API,
> >
> > I remain unconvinced on this mark. Grumpily.
>
> Sorry for getting grumpy :-).

And my apologies for some unwarranted hyperbole. I think we're both
converging on a reasonable approach, though.

> The engineering reasons seem pretty
> obvious to me though?

Well, I mean, engineers want lots of things. I suspect that most engineers
*really* just want to call `numpy.random.seed(8675309)` at the start and
never explicitly pass around separate streams. There's an upside to that in
terms of code simplicity. There are also significant limitations and
constraints. Ultimately, the upside against the alternative of passing
around RandomState objects is usually overweighed by the limitations, so
best practice is to pass around RandomState objects.

I acknowledge that there exists an upside to the splitting API, but I don't
think it's a groundbreaking improvement over the alternative current best
practice. It's also unclear to me how often situations that really
demonstrate the upside come into play; in my experience a lot of these
situations are already structured such that preallocating N streams is the
natural thing to do. The limitations and constraints are currently
underexplored, IMO; and in this conservative field, pessimism is warranted.

> If you have any use case for independent streams
> at all, and you're writing code that's intended to live inside a
> library's abstraction barrier, then you need some way to choose your
> streams to avoid colliding with arbitrary other code that the end-user
> might assemble alongside yours as part of their final program. So
> AFAICT you have two options: either you need a "tree-style" API for
> allocating these streams, or else you need to add some explicit API to
> your library that lets the end-user control in detail which streams
> you use. Both are possible, but the latter is obviously undesireable
> if you can avoid it, since it breaks the abstraction barrier, making
> your library more complicated to use and harder to evolve.

ACK

> >> so whether or not it
> >> turns out to be possible I think we should at least be allowed to have
> >> a discussion about whether there's some way to give it to them.
> >
> > I'm not shutting down discussion of the option. I *implemented* the
option.
> > I think that discussing whether it should be part of the main API is
> > premature. There probably ought to be a paper or three out there
supporting
> > its safety and utility first. Let the utility function version flourish
> > first.
>
> OK -- I guess this particularly makes sense given how
> extra-tightly-constrained we currently are in fixing mistakes in
> np.random. But I feel like in the end the right place for this really
> is inside the RandomState interface, because the person implementing
> RandomState is the one best placed to understand (a) the gnarly
> technical details here, and (b) how those change depending on the
> particular PRNG in use. I don't want to end up with a bunch of
> subtly-buggy utility functions in non-specialist libraries like dask
> -- so we should be trying to help downstream users figure out how to
> actually get this into np.random?

I think this is an open research area. An enterprising grad student could
milk this for a couple of papers analyzing how to do this safely for a
variety of PRNGs. I don't think we can hash this out in an email thread or
PR. So yeah, eventually there might be an API on RandomState for this, but
it's way too premature to do so right now, IMO. Maybe start with a
specialized subclass of RandomState that adds this experimental API. In
ng-numpy-randomstate. ;-)

But if someone has spare time to work on numpy.random, for God's sake, use
it to review @gfyoung's PRs instead.

> >> It's
> >> not even 100% out of the question that we conclude that existing PRNGs
> >> are buggy because they don't take this use case into account -- it
> >> would be far from the first time that numpy found itself going beyond
> >> the limits of older numerical tools that weren't designed to build the
> >> kind of large composable systems that numpy gets used for.
> >>
> >> MT19937's state space is large enough that you could explicitly encode
> >> a "tree seed" into it, even if you don't trust the laws of probability
> >> -- e.g., you start with a RandomState with id [], then its children
> >> have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
> >> 1], ..., [1, 0], ..., and you write these into the state (probably
> >> after sending them through some bit-mixing permutation), to guarantee
> >> non-collision.
> >
> > Sure. Not entirely sure if that can be done without preallocating the
> > branching factor or depth, but I'm sure there's some fancy combinatoric
> > representation of an unbounded tree that could be exploited here. It
seems
> > likely to me that such a thing could be done with the stream IDs of
PRNGs
> > that support that.
>
> I'm pretty sure you do have to preallocate both the branching factor
> and depth, since getting the collision-free guarantee requires that
> across the universe of possible tree addresses, each state id gets
> used at most once -- and there are finitely many state ids. But as a
> practical matter, saying "you can only sprout up to 2**32 new states
> out of any given state, and you can only nest them 600 deep, and
> exceeding these bounds is an error" would still be "composable enough"
> for ~all practical purposes.

To be honest, I'd even be satisfied with taking the treepath breadcrumbs
and hashing them with a suitable cryptographic hash to get a stream ID.
For, say, PCG64 which has 2**127 settable streams, do
sha256(treepath.asbytes()) % (2**127). I'm okay with spray-and-pray if it's
using well-analyzed cryptographic primitives and enough settable streams.
Subject to actual testing, of course.

And that last point concerns me. PRNG quality testing is a dodgy subject as
it is, and the state of the art for testing parallel streams is even worse.
To be honest, making it more convenient to make new streams at a low
library level makes me leery. I do think this is something that needs to be
considered and designed at a high level, irrespective of the available APIs.

--
Robert Kern
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion

Reply via email to