Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-06-10 Thread Marten van Kerkwijk
OK, I spent my Sunday morning writing a NEP. I hope this can lead to some
closure...
See https://github.com/numpy/numpy/pull/11297
-- Marten
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-06-10 Thread Eric Wieser
Rendered here:
https://github.com/mhvk/numpy/blob/nep-gufunc-signature-enhancement/doc/neps/nep-0020-gufunc-signature-enhancement.rst


Eric

On Sun, 10 Jun 2018 at 09:37 Marten van Kerkwijk 
wrote:

> OK, I spent my Sunday morning writing a NEP. I hope this can lead to some
> closure...
> See https://github.com/numpy/numpy/pull/11297
> -- Marten


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-06-10 Thread Eric Wieser
Thanks for the writeup Marten,

Nathaniel:

Output shape feels very similar to
output dtype to me, so maybe the general way to handle this would be
to make the first callback take the input shapes+dtypes and return the
desired output shapes+dtypes?

This hits on an interesting alternative to frozen dimensions - np.cross
could just become a regular ufunc with signature np.dtype((float64, 3)),
np.dtype((float64, 3)) → np.dtype((float64, 3))
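For readers unfamiliar with subarray dtypes, the sketch below illustrates what such a signature would mean. Note that this is only an illustration of the dtype machinery; ufuncs do not currently accept subarray dtypes in their signatures, which is exactly the gap being discussed.

```python
import numpy as np

# A subarray dtype: each "scalar" of this dtype is a length-3 float64 vector,
# i.e. the dimension lives in the dtype rather than in the array shape.
vec3 = np.dtype((np.float64, 3))
print(vec3.shape)  # (3,)

# In a structured array, a subarray field carries that extra axis along:
a = np.zeros(4, dtype=[('xyz', vec3)])
print(a['xyz'].shape)  # the field view exposes the subarray axis: (4, 3)
```

Under this alternative, np.cross would consume and produce such vec3 "scalars" instead of declaring a frozen core dimension in a gufunc signature.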

Furthermore, the expansion quickly becomes cumbersome. For instance, for
the all_equal signature of (n|1),(n|1)->() …

I think this is only a good argument when used in conjunction with the
broadcasting syntax. I don’t think it’s a reason for matmul not to have
multiple signatures. Having multiple signatures is a disincentive to
introducing too many overloads of the same function, which seems like a good
thing to me.

Summarizing my overall opinions:

   - I’m +0.5 on frozen dimensions. The use-cases seem reasonable, and it
   seems like an easy-ish way to get them. Allowing ufuncs to natively support
   subarray types might be a tidier solution, but that could come down the road
   - I’m -1 on optional dimensions: they seem to legitimize creating many
   overloads of gufuncs. I’m already not a fan of how matmul has special cases
   for lower dimensions that don’t generalize well. To me, the best way to
   handle matmul would be to use the proposed __array_function__ to handle
   the shape-based special-case dispatching, either by:
  - Inserting dimensions, and calling the true gufunc
  np.linalg.matmul_2d (which is a function I’d like direct access to
  anyway).
  - Dispatching to one of four ufuncs
   - Broadcasting dimensions:
  - I know you’re not suggesting this but: enabling broadcasting
  unconditionally for all gufuncs would be a bad idea, masking linalg bugs.
  (although einsum does support broadcasting…)
  - Does it really need a per-dimension flag, rather than a global one?
  Can you give a case where that’s useful?
  - If we’d already made all_equal a gufunc, I’d be +1 on adding
  broadcasting support to it
  - I’m -0.5 on the all_equal path in the first place. I think we
  either should have a more generic approach to combined ufuncs, or just
  declare them numba’s job.
  - Can you come up with a broadcasting use-case that isn’t just
  chaining a reduction with a broadcasting ufunc?
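To make the last question concrete, the all_equal under discussion is expressible today as exactly such a chain: a broadcasting ufunc followed by a reduction over the core axis. A sketch (the function name is just illustrative):

```python
import numpy as np

def all_equal(x, y):
    """Rough equivalent of a hypothetical (n|1),(n|1)->() gufunc:
    broadcast the elementwise comparison, then reduce away the core axis."""
    return np.logical_and.reduce(np.equal(x, y), axis=-1)

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3])       # (n) broadcasts against both rows of (2, n)
print(all_equal(a, b))        # [ True False]
```

The gufunc version would avoid materializing the intermediate boolean array and could short-circuit, which is the main argument for having it at all.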

Eric

On Sun, 10 Jun 2018 at 16:02 Eric Wieser 
wrote:

Rendered here:
> https://github.com/mhvk/numpy/blob/nep-gufunc-signature-enhancement/doc/neps/nep-0020-gufunc-signature-enhancement.rst
>
>
> Eric
>
> On Sun, 10 Jun 2018 at 09:37 Marten van Kerkwijk <
> m.h.vankerkw...@gmail.com> wrote:
>
>> OK, I spent my Sunday morning writing a NEP. I hope this can lead to some
>> closure...
>> See https://github.com/numpy/numpy/pull/11297
>> -- Marten


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Ralf Gommers
On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern  wrote:

> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers 
> wrote:
>
>> It may be worth having a look at test suites for scipy, statsmodels,
>> scikit-learn, etc. and estimate how much work this NEP causes those
>> projects. If the devs of those packages are forced to do large scale
>> migrations from RandomState to StableState, then why not instead keep
>> RandomState and just add a new API next to it?
>>
>
> The problem is that we can't really have an ecosystem with two different
> general purpose systems.
>

Can't = prefer not to. But yes, that's true. That's not what I was saying
though. We want one generic one, and one meant for unit testing only. You
can achieve that in two ways:
1. Change the current np.random API to new generic, and add a new
RandomStable for unit tests.
2. Add a new generic API, and document the current np.random API as being
meant for unit tests only; for other usage the new API should be preferred.

(2) has a couple of pros:
- you're not forcing almost every library and end user out there to migrate
their unit tests.
- more design freedom for the new generic API. The current one is clearly
sub-optimal; in a new one you wouldn't have to expose all the global
state/functions that np.random exposes now. You could even restrict it to a
single class and put that in the main numpy namespace.
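To illustrate the global state being referred to (this is current behavior, not part of any proposal): the np.random module-level functions all draw from one hidden, shared RandomState, while the class-based API keeps its state in an instance:

```python
import numpy as np

# Module-level functions share a single hidden global RandomState:
np.random.seed(12345)
x = np.random.uniform()

# The class-based API keeps the same state machinery local to an instance:
rs = np.random.RandomState(12345)
y = rs.uniform()

# Same seed, same algorithm, so the two draws coincide:
print(x == y)  # True
```

A new API restricted to a single class would drop the hidden-global half of this entirely.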

Ralf


To properly use pseudorandom numbers, I need to instantiate a PRNG and
> thread it through all of the code in my program: both the parts that I
> write and the third party libraries that I don't write.
>
> Generating test data for unit tests is separable, though. That's why I
> propose having a StableRandom built on the new architecture. Its purpose
> would be well-documented, and in my proposal is limited in features such
> that it will be less likely to be abused outside of that purpose. If you
> make it fully-featured, it is more likely to be abused by building library
> code around it. But even if it is so abused, because it is built on the new
> architecture, at least I can thread the same core PRNG state through the
> StableRandom distributions from the abusing library and use the better
> distributions class elsewhere (randomgen names it "Generator"). Just
> keeping RandomState around can't work like that because it doesn't have a
> replaceable core PRNG.
>
> But that does suggest another alternative that we should explore:
>
> The new architecture separates the core uniform PRNG from the wide variety
> of non-uniform probability distributions. That is, the core PRNG state is
> encapsulated in a discrete object that can be shared between instances of
> different distribution-providing classes. numpy.random should provide two
> such distribution-providing classes. The main one (let us call it
> ``Generator``, as it is called in the prototype) will follow the new
> policy: distribution methods can break the stream in feature releases.
> There will also be a secondary distributions class (let us call it
> ``LegacyGenerator``) which contains distribution methods exactly as they
> exist in the current ``RandomState`` implementation. When one combines
> ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the
> exact same stream as ``RandomState`` for all distribution methods. The
> ``LegacyGenerator`` methods will be forever frozen.
> ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with
> the MT19937 core PRNG, and whatever tricks needed to make
> ``isinstance(prng, RandomState)`` and unpickling work should be done. This
> way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be
> deprecated, becoming progressively noisier over a number of release cycles,
> in favor of explicitly instantiating ``LegacyGenerator``.
>
> ``LegacyGenerator`` CAN be used during this deprecation period in library
> and application code until libraries and applications can migrate to the
> new ``Generator``. Libraries and applications SHOULD migrate but MUST NOT
> be forced to. ``LegacyGenerator`` CAN be used to generate test data for
> unit tests where cross-release stability of the streams is important. Test
> writers SHOULD consider ways to mitigate their reliance on such stability
> and SHOULD limit their usage to distribution methods that have fewer
> cross-platform stability risks.
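The architecture quoted above existed only in the randomgen prototype at the time of this thread; a very similar design later shipped in NumPy itself (1.17+), with RandomState playing the role of the frozen legacy distributions class. A sketch using those later names:

```python
from numpy.random import Generator, RandomState, MT19937

# One core PRNG ("bit generator") object encapsulates the stream state...
bg = MT19937(seed=42)

# ...and can be shared between distribution-providing classes: the modern
# Generator and the legacy RandomState both draw from, and advance, the
# very same underlying state.
modern = Generator(bg)
legacy = RandomState(bg)

a = modern.standard_normal()  # advances the shared MT19937 state
b = legacy.random_sample()    # continues from where `modern` left off

# Seeding the *core PRNG*, not the distribution class, is what gives
# reproducibility:
assert Generator(MT19937(42)).standard_normal() == a
```

This is the "thread the same core PRNG state through both distribution classes" property that plain RandomState, with its baked-in Mersenne Twister, cannot offer.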
>
> --
> Robert Kern
>


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Ralf Gommers
On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser  wrote:

>
>
> On Sun, Jun 3, 2018 at 11:20 PM, Ralf Gommers 
> wrote:
>
>>
>>
>> On Sun, Jun 3, 2018 at 6:54 PM,  wrote:
>>
>>>
>>>
>>> On Sun, Jun 3, 2018 at 9:08 PM, Robert Kern 
>>> wrote:
>>>
 On Sun, Jun 3, 2018 at 5:46 PM  wrote:

>
>
> On Sun, Jun 3, 2018 at 8:21 PM, Robert Kern 
> wrote:
>
>>
>> The list of ``StableRandom`` methods should be chosen to support unit
>>> tests:
>>>
>>> * ``.randint()``
>>> * ``.uniform()``
>>> * ``.normal()``
>>> * ``.standard_normal()``
>>> * ``.choice()``
>>> * ``.shuffle()``
>>> * ``.permutation()``
>>>
>>
>> https://github.com/numpy/numpy/pull/11229#discussion_r192604311
>> @bashtage writes:
>> > standard_gamma and standard_exponential are important enough to be
>> included here IMO.
>>
>> "Importance" was not my criterion, only whether they are used in unit
>> test suites. This list was just off the top of my head for methods that I
>> think were actually used in test suites, so I'd be happy to be shown live
>> tests that use other methods. I'd like to be a *little* conservative 
>> about
>> what methods we stick in here, but we don't have to be *too* 
>> conservative,
>> since we are explicitly never going to be modifying these.
>>
>
> That's one area where I thought the selection is too narrow.
> We should be able to get a stable stream from the uniform for some
> distributions.
>
> However, according to the Wikipedia description Poisson doesn't look
> easy. I just wrote a unit test for statsmodels using Poisson random 
> numbers
> with hard coded numbers for the regression tests.
>

 I'd really rather people do this than use StableRandom; this is best
 practice, as I see it, if your tests involve making precise comparisons to
 expected results.

>>>
>>> I hardcoded the results not the random data. So the unit tests rely on a
>>> reproducible stream of Poisson random numbers.
>>> I don't want to save 500 (100 or 1000) observations in a csv file for
>>> every variation of the unit test that I run.
>>>
>>
>> I agree, hardcoding numbers in every place where seeded random numbers
>> are now used is quite unrealistic.
>>
>> It may be worth having a look at test suites for scipy, statsmodels,
>> scikit-learn, etc. and estimate how much work this NEP causes those
>> projects. If the devs of those packages are forced to do large scale
>> migrations from RandomState to StableState, then why not instead keep
>> RandomState and just add a new API next to it?
>>
>>
>
> As a quick and imperfect test, I monkey-patched numpy so that a call to
> numpy.random.seed(m) actually uses m+1000 as the seed.  I ran the tests
> using the `runtests.py` script:
>
> *seed+1000, using 'python runtests.py -n' in the source directory:*
>
>   236 failed, 12881 passed, 1248 skipped, 585 deselected, 84 xfailed, 7
> xpassed
>
>
> Most of the failures are in scipy.stats:
>
> *seed+1000, using 'python runtests.py -n -s stats' in the source
> directory:*
>
>   203 failed, 1034 passed, 4 skipped, 370 deselected, 4 xfailed, 1 xpassed
>
>
> Changing the amount added to the seed or running the tests using the
> function `scipy.test("full")` gives different (but similar magnitude)
> results:
>
> *seed+1000, using 'import scipy; scipy.test("full")' in an ipython shell:*
>
>   269 failed, 13359 passed, 1271 skipped, 134 xfailed, 8 xpassed
>
> *seed+1, using 'python runtests.py -n' in the source directory:*
>
>   305 failed, 12812 passed, 1248 skipped, 585 deselected, 84 xfailed, 7
> xpassed
>
>
> I suspect many of the tests will be easy to update, so fixing 300 or so
> tests does not seem like a monumental task.
>

It's all not monumental, but it adds up quickly. In addition to changing
tests, one will also need compatibility code when supporting multiple numpy
versions (e.g. scipy would need to carry a copy of RandomStable in
scipy/_lib/_numpy_compat.py).
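Such compatibility code would presumably follow the usual try/except pattern. A hypothetical sketch (`StableRandom` does not exist in any released numpy; the fallback name is the point):

```python
# Hypothetical scipy/_lib/_numpy_compat.py-style shim: use the
# stream-stable class where available, fall back to the frozen legacy
# RandomState on numpy versions that predate it.
import numpy as np

try:
    from numpy.random import StableRandom  # hypothetical new class
except ImportError:
    StableRandom = np.random.RandomState   # older numpy: streams frozen anyway

rng = StableRandom(1234)
data = rng.uniform(size=5)  # reproducible across versions either way
```

Every downstream project supporting more than one numpy release would carry some variant of this for the duration of the transition.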

A quick count of just np.random.seed occurrences with ``$ grep -roh
--include \*.py np.random.seed . | wc -w`` for some packages:
numpy: 77
scipy: 462
matplotlib: 204
statsmodels: 461
pymc3: 36
scikit-image: 63
scikit-learn: 69
keras: 46
pytorch: 0
tensorflow: 368
astropy: 24

And note, these are *not* incorrect/broken usages, this is code that works
and has done so for years.

Conclusion: the current proposal will cause work for the vast majority of
libraries that depend on numpy. The total amount of that work will
certainly not be counted in person-days/weeks, and more likely in years
than months. So I'm not convinced yet that the current proposal is the best
way forward.

Ralf


I haven't looked into why there are 585 deselected tests; maybe there are
> many more tests lurking there that will have to be updated.
>
> Warren
>
>
>
> Ralf
>>
>>
>>
>>>
>>>

 StableRandom is intended as a

Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-06-10 Thread Stephan Hoyer
On Sun, Jun 10, 2018 at 4:31 PM Eric Wieser 
wrote:

> Thanks for the writeup Marten,
>
Indeed, thank you Marten!

> This hits on an interesting alternative to frozen dimensions - np.cross
> could just become a regular ufunc with signature np.dtype((float64, 3)),
> np.dtype((float64, 3)) → np.dtype((float64, 3))
>
> Another alternative to mention is returning multiple arrays, e.g., two
arrays for a fixed dimension of size 2.

That said, I still think frozen dimension are a better proposal than either
of these.


>- I’m -1 on optional dimensions: they seem to legitimize creating many
>overloads of gufuncs. I’m already not a fan of how matmul has special cases
>for lower dimensions that don’t generalize well. To me, the best way to
>handle matmul would be to use the proposed __array_function__ to
>handle the shape-based special-case dispatching, either by:
>   - Inserting dimensions, and calling the true gufunc
>   np.linalg.matmul_2d (which is a function I’d like direct access to
>   anyway).
>   - Dispatching to one of four ufuncs
>
> I don't understand your alternative here. If we overload np.matmul using
__array_function__, then it would not use *either* of these options for
writing the operation in terms of other gufuncs. It would simply look for
an __array_function__ attribute, and call that method instead.

My concern with either inserting dimensions or dispatching to one of four
ufuncs is that some objects (e.g., xarray.DataArray) define matrix
multiplication, but in an incompatible way with NumPy (e.g., xarray sums
over axes with the same name, instead of last / second-to-last axes). NumPy
really ought to provide a way to overload the entire operation, without either
inserting/removing dummy dimensions or inspecting input shapes to dispatch
to other gufuncs.

That said, if you don't want to make np.matmul a gufunc, then I would much
rather use Python's standard overloading rules with __matmul__/__rmatmul__
than use __array_function__, for two reasons:
1. You *already* need to use __matmul__/__rmatmul__ if you want to support
matrix multiplication with @ on your class, so __array_function__ would be
additional and redundant. __array_function__ is really intended as a
fall-back, for cases where there is no other alternative.
2. With the current __array_function__ proposal, this would imply that
calling other unimplemented NumPy functions on your object would raise
TypeError rather than doing coercion. This sort of additional coupled
behavior is probably not what an implementor of operator.matmul/@ is
looking for.
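On point 1, the operator protocol alone already suffices for `@` support; a minimal sketch of an array-like class that relies on it (the class and its names are purely illustrative):

```python
import numpy as np

class LabeledVector:
    """Toy array-like that customizes @ via the standard operator protocol."""

    # Tell ndarray binary ops to return NotImplemented for us, so Python
    # falls through to our __rmatmul__ (the __array_ufunc__ = None opt-out
    # from NEP 13, as discussed elsewhere in this thread).
    __array_ufunc__ = None

    def __init__(self, values):
        self.values = np.asarray(values)

    def __matmul__(self, other):   # self @ other
        other = other.values if isinstance(other, LabeledVector) else other
        return float(np.dot(self.values, other))

    def __rmatmul__(self, other):  # other @ self, e.g. ndarray @ self
        return float(np.dot(other, self.values))

v = LabeledVector([1.0, 2.0])
print(v @ np.array([3.0, 4.0]))   # 11.0, via __matmul__
print(np.array([3.0, 4.0]) @ v)   # 11.0, via __rmatmul__
```

Any __array_function__-based overload of np.matmul would sit on top of, and duplicate, this already-required machinery.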

In summary, I would either support:
1. (This proposal) Adding additional optional dimensions to gufuncs for
np.matmul/operator.matmul, or
2. Making operator.matmul a special case for mathematical operators that
always checks overloads with __matmul__/__rmatmul__ even if __array_ufunc__
is defined.

Either way, matrix-multiplication becomes somewhat of a special case. It's
just a matter of whether it's a special case for gufuncs (using optional
dimensions) or a special case for arithmetic overloads in NumPy (not using
__array_ufunc__). Given that I think optional dimensions have other
conceivable uses in gufuncs (for row/column vectors), I think that's the
better option.

I would not support either expand dimensions or dispatch to multiple
gufuncs in NumPy's implementation of operator.matmul (i.e.,
ndarray.__matmul__). We could potentially only do this for numpy.matmul
rather than operator.matmul/@, but that opens the door to potential
inconsistency between the NumPy version of an operator and Python's version
of an operator, which is something we tried very hard to avoid with
__array_ufunc__.


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Robert Kern
On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers  wrote:
>
> On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser <
warren.weckes...@gmail.com> wrote:

>> I suspect many of the tests will be easy to update, so fixing 300 or so
tests does not seem like a monumental task.
>
> It's all not monumental, but it adds up quickly. In addition to changing
tests, one will also need compatibility code when supporting multiple numpy
versions (e.g. scipy would need to carry a copy of RandomStable in
scipy/_lib/_numpy_compat.py).
>
> A quick count of just np.random.seed occurrences with ``$ grep -roh
--include \*.py np.random.seed . | wc -w`` for some packages:
> numpy: 77
> scipy: 462
> matplotlib: 204
> statsmodels: 461
> pymc3: 36
> scikit-image: 63
> scikit-learn: 69
> keras: 46
> pytorch: 0
> tensorflow: 368
> astropy: 24
>
> And note, these are *not* incorrect/broken usages, this is code that
works and has done so for years.

Yes, some of them are incorrect and broken. Failure can be difficult to
detect. This module from keras is particularly problematic:


https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image.py

> Conclusion: the current proposal will cause work for the vast majority of
libraries that depend on numpy. The total amount of that work will
certainly not be counted in person-days/weeks, and more likely in years
than months. So I'm not convinced yet that the current proposal is the best
way forward.

The mere usage of np.random.seed() doesn't imply that these packages
actually require stream-compatibility. Some might, for sure, like where
they are used in the unit tests, but that's not what you counted. At best,
these numbers just mean that we can't eliminate np.random.seed() in a new
system right away.

--
Robert Kern


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Robert Kern
On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers  wrote:
>
> On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern  wrote:
>>
>> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers 
wrote:
>>>
>>> It may be worth having a look at test suites for scipy, statsmodels,
scikit-learn, etc. and estimate how much work this NEP causes those
projects. If the devs of those packages are forced to do large scale
migrations from RandomState to StableState, then why not instead keep
RandomState and just add a new API next to it?
>>
>> The problem is that we can't really have an ecosystem with two different
general purpose systems.
>
> Can't = prefer not to.

I meant what I wrote. :-)

> But yes, that's true. That's not what I was saying though. We want one
generic one, and one meant for unit testing only. You can achieve that in
two ways:
> 1. Change the current np.random API to new generic, and add a new
RandomStable for unit tests.
> 2. Add a new generic API, and document the current np.random API as being
meant for unit tests only; for other usage the new API should be preferred.
>
> (2) has a couple of pros:
> - you're not forcing almost every library and end user out there to
migrate their unit tests.

But it has the cons that I talked about. RandomState *is* a fully
functional general purpose PRNG system. After all, that's its current use.
Documenting it as intended to be something else will not change that fact.
Documentation alone provides no real impetus to move to the new system
outside of the unit tests. And the community does need to move together to
the new system in their library code, or else we won't be able to combine
libraries together; these PRNG objects need to thread all the way through
between code from different authors if we are to write programs with a
controlled seed. The failure mode when people don't pay attention to the
documentation is that I can no longer write programs that compose these
libraries together. That's why I wrote "can't". It's not a mere preference
for not having two systems to maintain. It has binary Go/No Go implications
for building reproducible programs.

> - more design freedom for the new generic API. The current one is clearly
sub-optimal; in a new one you wouldn't have to expose all the global
state/functions that np.random exposes now. You could even restrict it to a
single class and put that in the main numpy namespace.

I'm not sure why you are talking about the global state and np.random.*
convenience functions. What we do with those functions is out of scope for
this NEP and would be talked about in another NEP fully introducing the new
system.

>> To properly use pseudorandom numbers, I need to instantiate a PRNG and
thread it through all of the code in my program: both the parts that I
write and the third party libraries that I don't write.
>>
>> Generating test data for unit tests is separable, though. That's why I
propose having a StableRandom built on the new architecture. Its purpose
would be well-documented, and in my proposal is limited in features such
that it will be less likely to be abused outside of that purpose. If you
make it fully-featured, it is more likely to be abused by building library
code around it. But even if it is so abused, because it is built on the new
architecture, at least I can thread the same core PRNG state through the
StableRandom distributions from the abusing library and use the better
distributions class elsewhere (randomgen names it "Generator"). Just
keeping RandomState around can't work like that because it doesn't have a
replaceable core PRNG.
>>
>> But that does suggest another alternative that we should explore:
>>
>> The new architecture separates the core uniform PRNG from the wide
variety of non-uniform probability distributions. That is, the core PRNG
state is encapsulated in a discrete object that can be shared between
instances of different distribution-providing classes. numpy.random should
provide two such distribution-providing classes. The main one (let us call
it ``Generator``, as it is called in the prototype) will follow the new
policy: distribution methods can break the stream in feature releases.
There will also be a secondary distributions class (let us call it
``LegacyGenerator``) which contains distribution methods exactly as they
exist in the current ``RandomState`` implementation. When one combines
``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the
exact same stream as ``RandomState`` for all distribution methods. The
``LegacyGenerator`` methods will be forever frozen.
``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with
the MT19937 core PRNG, and whatever tricks needed to make
``isinstance(prng, RandomState)`` and unpickling work should be done. This
way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be
deprecated, becoming progressively noisier over a number of release cycles,
in favor of explicitly instantiating ``LegacyGenerator``.
>>
>> ``LegacyGenerator`` CAN be use

Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread josef . pktd
On Sun, Jun 10, 2018 at 9:08 PM, Robert Kern  wrote:

> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers 
> wrote:
> >
> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern 
> wrote:
> >>
> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers 
> wrote:
> >>>
> >>> It may be worth having a look at test suites for scipy, statsmodels,
> scikit-learn, etc. and estimate how much work this NEP causes those
> projects. If the devs of those packages are forced to do large scale
> migrations from RandomState to StableState, then why not instead keep
> RandomState and just add a new API next to it?
> >>
> >> The problem is that we can't really have an ecosystem with two
> different general purpose systems.
> >
> > Can't = prefer not to.
>
> I meant what I wrote. :-)
>
> > But yes, that's true. That's not what I was saying though. We want one
> generic one, and one meant for unit testing only. You can achieve that in
> two ways:
> > 1. Change the current np.random API to new generic, and add a new
> RandomStable for unit tests.
> > 2. Add a new generic API, and document the current np.random API as
> being meant for unit tests only; for other usage the new API should be
> preferred.
> >
> > (2) has a couple of pros:
> > - you're not forcing almost every library and end user out there to
> migrate their unit tests.
>
> But it has the cons that I talked about. RandomState *is* a fully
> functional general purpose PRNG system. After all, that's its current use.
> Documenting it as intended to be something else will not change that fact.
> Documentation alone provides no real impetus to move to the new system
> outside of the unit tests. And the community does need to move together to
> the new system in their library code, or else we won't be able to combine
> libraries together; these PRNG objects need to thread all the way through
> between code from different authors if we are to write programs with a
> controlled seed. The failure mode when people don't pay attention to the
> documentation is that I can no longer write programs that compose these
> libraries together. That's why I wrote "can't". It's not a mere preference
> for not having two systems to maintain. It has binary Go/No Go implications
> for building reproducible programs.
>

I don't understand this part.
For example, scipy.stats and scikit-learn allow the user to provide a
RandomState instance to the functions. I don't see why you want to force
down stream libraries to change this. A random state argument should be
(essentially) compatible with whatever the user uses, and there is no
reason to force packages to update their internal use, like in unit tests, if
they don't want to, e.g. because of the instability.
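The random_state argument convention described here is typically implemented with a small helper; the sketch below mirrors the pattern used by scipy.stats and scikit-learn (e.g. sklearn.utils.check_random_state), written out locally for illustration:

```python
import numbers
import numpy as np

def check_random_state(seed):
    """Turn `seed` into a RandomState: the pattern behind the
    `random_state` arguments in scipy.stats and scikit-learn."""
    if seed is None:
        return np.random.mtrand._rand       # the module-level global instance
    if isinstance(seed, (numbers.Integral, np.integer)):
        return np.random.RandomState(seed)  # fresh, locally seeded stream
    if isinstance(seed, np.random.RandomState):
        return seed                         # caller-provided PRNG: use as-is
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState")

rs = np.random.RandomState(0)
assert check_random_state(rs) is rs  # user-supplied instance passes through
```

Since the user's instance is passed through untouched, this interface is agnostic about what numpy does internally, which is Josef's point.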

Aside to statsmodels: We currently have very few user facing random
functions, those are just in maybe 3 to 5 places where we have simulated or
bootstrap values.
Most of the other uses of np.random are in unit tests and some in the
documentation examples.

Josef



>
> > - more design freedom for the new generic API. The current one is
> clearly sub-optimal; in a new one you wouldn't have to expose all the
> global state/functions that np.random exposes now. You could even restrict
> it to a single class and put that in the main numpy namespace.
>
> I'm not sure why you are talking about the global state and np.random.*
> convenience functions. What we do with those functions is out of scope for
> this NEP and would be talked about in another NEP fully introducing the new
> system.
>
> >> To properly use pseudorandom numbers, I need to instantiate a PRNG and
> thread it through all of the code in my program: both the parts that I
> write and the third party libraries that I don't write.
> >>
> >> Generating test data for unit tests is separable, though. That's why I
> propose having a StableRandom built on the new architecture. Its purpose
> would be well-documented, and in my proposal is limited in features such
> that it will be less likely to be abused outside of that purpose. If you
> make it fully-featured, it is more likely to be abused by building library
> code around it. But even if it is so abused, because it is built on the new
> architecture, at least I can thread the same core PRNG state through the
> StableRandom distributions from the abusing library and use the better
> distributions class elsewhere (randomgen names it "Generator"). Just
> keeping RandomState around can't work like that because it doesn't have a
> replaceable core PRNG.
> >>
> >> But that does suggest another alternative that we should explore:
> >>
> >> The new architecture separates the core uniform PRNG from the wide
> variety of non-uniform probability distributions. That is, the core PRNG
> state is encapsulated in a discrete object that can be shared between
> instances of different distribution-providing classes. numpy.random should
> provide two such distribution-providing classes. The main one (let us call
> it ``Generator``, as it is called in the prototype) will follow the new
>

Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Ralf Gommers
On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern  wrote:

> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers 
> wrote:
> >
> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern 
> wrote:
> >>
> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers 
> wrote:
> >>>
> >>> It may be worth having a look at test suites for scipy, statsmodels,
> scikit-learn, etc. and estimate how much work this NEP causes those
> projects. If the devs of those packages are forced to do large scale
> migrations from RandomState to StableState, then why not instead keep
> RandomState and just add a new API next to it?
> >>
> >> The problem is that we can't really have an ecosystem with two
> different general purpose systems.
> >
> > Can't = prefer not to.
>
> I meant what I wrote. :-)
>
> > But yes, that's true. That's not what I was saying though. We want one
> generic one, and one meant for unit testing only. You can achieve that in
> two ways:
> > 1. Change the current np.random API to new generic, and add a new
> RandomStable for unit tests.
> > 2. Add a new generic API, and document the current np.random API as
> being meant for unit tests only; for other usage the new API should be
> preferred.
> >
> > (2) has a couple of pros:
> > - you're not forcing almost every library and end user out there to
> migrate their unit tests.
>
> But it has the cons that I talked about. RandomState *is* a fully
> functional general purpose PRNG system. After all, that's its current use.
> Documenting it as intended to be something else will not change that fact.
> Documentation alone provides no real impetus to move to the new system
> outside of the unit tests. And the community does need to move together to
> the new system in their library code, or else we won't be able to combine
> libraries together; these PRNG objects need to thread all the way through
> between code from different authors if we are to write programs with a
> controlled seed. The failure mode when people don't pay attention to the
> documentation is that I can no longer write programs that compose these
> libraries together. That's why I wrote "can't". It's not a mere preference
> for not having two systems to maintain. It has binary Go/No Go implications
> for building reproducible programs.
>

I strongly suspect you are right, but only because you're asserting "can't"
so heavily. I have trouble formulating what would go wrong in case there's
two PRNGs used in a single program. It's not described in the NEP, nor in
the numpy.random docs (those don't even have any recommendations for best
practices listed as far as I can tell - that needs fixing). All you explain
in the NEP is that reproducible research isn't helped by the current
stream-compat guarantee. So a bit of (probably incorrect) devil's advocate
reasoning:
- If there's no stream-compat guarantee, all a user can rely on is the
properties of drawing from a seeded PRNG.
- Any use of a PRNG in library code can also only rely on properties
- So now whether in a user's program libraries draw from one or two seeded
PRNGs doesn't matter for reproducibility, because those properties don't
change.


Also, if there is to be a multi-year transitioning to the new API, would
there be two PRNG systems anyway during those years?



> > - more design freedom for the new generic API. The current one is
> clearly sub-optimal; in a new one you wouldn't have to expose all the
> global state/functions that np.random exposes now. You could even restrict
> it to a single class and put that in the main numpy namespace.
>
> I'm not sure why you are talking about the global state and np.random.*
> convenience functions. What we do with those functions is out of scope for
> this NEP and would be talked about it another NEP fully introducing the new
> system.
>

To quote you from one of the first emails in this thread: "
I deliberately left it out of this one as it may, depending on our choices,
impinge upon the design of the new PRNG subsystem, which I declared out of
scope for this NEP. I have ideas (besides the glib "Let them eat
AttributeErrors!"), and now that I think more about it, that does seem like
it might be in scope just like the discussion of freezing RandomState and
StableRandom are. But I think I'd like to hold that thought a little bit
and get a little more screaming^Wfeedback on the core proposal first. I'll
return to this in a few days if not sooner.
"

So consider this some screaming^Wfeedback:)



>
> >> To properly use pseudorandom numbers, I need to instantiate a PRNG and
> thread it through all of the code in my program: both the parts that I
> write and the third party libraries that I don't write.
> >>
> >> Generating test data for unit tests is separable, though. That's why I
> propose having a StableRandom built on the new architecture. Its purpose
> would be well-documented, and in my proposal is limited in features such
> that it will be less likely to be abused outside of that purpose. If you
> make it fully-featured, it is more likely to 

Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Ralf Gommers
On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern  wrote:

> On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers 
> wrote:
> >
> > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser <
> warren.weckes...@gmail.com> wrote:
>
> >> I suspect many of the tests will be easy to update, so fixing 300 or so
> tests does not seem like a monumental task.
> >
> > It's all not monumental, but it adds up quickly. In addition to changing
> tests, one will also need compatibility code when supporting multiple numpy
> versions (e.g. scipy would get a copy of RandomStable in
> scipy/_lib/_numpy_compat.py).
> >
> > A quick count of just np.random.seed occurrences with ``$ grep -roh
> --include \*.py np.random.seed . | wc -w`` for some packages:
> > numpy: 77
> > scipy: 462
> > matplotlib: 204
> > statsmodels: 461
> > pymc3: 36
> > scikit-image: 63
> > scikit-learn: 69
> > keras: 46
> > pytorch: 0
> > tensorflow: 368
> > astropy: 24
> >
> > And note, these are *not* incorrect/broken usages, this is code that
> works and has done so for years.
>
> Yes, some of them are incorrect and broken. Failure can be difficult to
> detect. This module from keras is particularly problematic:
>
>   https://github.com/keras-team/keras-preprocessing/blob/
> master/keras_preprocessing/image.py
>

You have to appreciate that we're not all thinking at lightning speed and
in the same direction. If there is a difficult to detect problem, it may be
useful to give a brief code example (or even line of reasoning) of how this
actually breaks something.


>
> > Conclusion: the current proposal will cause work for the vast majority
> of libraries that depend on numpy. The total amount of that work will
> certainly not be counted in person-days/weeks, and more likely in years
> than months. So I'm not convinced yet that the current proposal is the best
> way forward.
>
> The mere usage of np.random.seed() doesn't imply that these packages
> actually require stream-compatibility. Some might, for sure, like where
> they are used in the unit tests, but that's not what you counted. At best,
> these numbers just mean that we can't eliminate np.random.seed() in a new
> system right away.
>

Well, mere usage has been called an antipattern (also on your behalf), plus
for scipy over half of the usages do give test failures (Warren's quick
test). So I'd say that counting usages is a decent proxy for the work that
has to be done.

Cheers,
Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Stephan Hoyer
On Sun, Jun 10, 2018 at 8:10 PM Ralf Gommers  wrote:

> On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern 
> wrote:
>
>> > Conclusion: the current proposal will cause work for the vast majority
>> of libraries that depend on numpy. The total amount of that work will
>> certainly not be counted in person-days/weeks, and more likely in years
>> than months. So I'm not convinced yet that the current proposal is the best
>> way forward.
>>
>
>> The mere usage of np.random.seed() doesn't imply that these packages
>> actually require stream-compatibility. Some might, for sure, like where
>> they are used in the unit tests, but that's not what you counted. At best,
>> these numbers just mean that we can't eliminate np.random.seed() in a new
>> system right away.
>>
>
> Well, mere usage has been called an antipattern (also on your behalf),
> plus for scipy over half of the usages do give test failures (Warren's
> quick test). So I'd say that counting usages is a decent proxy for the work
> that has to be done.
>

Let me suggest another possible concession for backwards compatibility. We
should make a dedicated module, e.g., "numpy.random.stable" that contains
functions implemented as methods on StableRandom. These functions should
include "seed", which is too pervasive to justify removing.

Transitioning to the new module should be as simple as mechanistically
replacing all uses of "numpy.random" with "numpy.random.stable".

This module would add virtually no maintenance overhead, because the
implementations would be entirely contained on StableRandom, and would
simply involve creating a single top-level StableRandom object (like what
is currently done in numpy.random).
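Concretely, such a module could be little more than method re-exports. A hedged sketch follows: no `StableRandom` class exists yet, so `RandomState` stands in for it, and `numpy.random.stable` is only a proposed name, not a real module:

```python
# Sketch of a hypothetical numpy/random/stable.py. StableRandom does not
# exist yet, so RandomState stands in for it here.
import numpy as np

_stable = np.random.RandomState()

# Re-export the instance's bound methods as module-level functions, the
# same trick numpy.random itself uses for its hidden global RandomState.
seed = _stable.seed
uniform = _stable.uniform
normal = _stable.normal
randint = _stable.randint
shuffle = _stable.shuffle
```

A downstream test suite would then only need to swap `np.random.seed(0)` for `stable.seed(0)`.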
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Robert Kern
On Sun, Jun 10, 2018 at 7:46 PM  wrote:
>
> On Sun, Jun 10, 2018 at 9:08 PM, Robert Kern 
wrote:
>>
>> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers 
wrote:
>> >
>> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern 
wrote:
>> >>
>> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers 
wrote:
>> >>>
>> >>> It may be worth having a look at test suites for scipy, statsmodels,
scikit-learn, etc. and estimate how much work this NEP causes those
projects. If the devs of those packages are forced to do large scale
migrations from RandomState to StableState, then why not instead keep
RandomState and just add a new API next to it?
>> >>
>> >> The problem is that we can't really have an ecosystem with two
different general purpose systems.
>> >
>> > Can't = prefer not to.
>>
>> I meant what I wrote. :-)
>>
>> > But yes, that's true. That's not what I was saying though. We want one
generic one, and one meant for unit testing only. You can achieve that in
two ways:
>> > 1. Change the current np.random API to new generic, and add a new
RandomStable for unit tests.
>> > 2. Add a new generic API, and document the current np.random API as
being meant for unit tests only, for other usage  should be
preferred.
>> >
>> > (2) has a couple of pros:
>> > - you're not forcing almost every library and end user out there to
migrate their unit tests.
>>
>> But it has the cons that I talked about. RandomState *is* a fully
functional general purpose PRNG system. After all, that's its current use.
Documenting it as intended to be something else will not change that fact.
Documentation alone provides no real impetus to move to the new system
outside of the unit tests. And the community does need to move together to
the new system in their library code, or else we won't be able to combine
libraries together; these PRNG objects need to thread all the way through
between code from different authors if we are to write programs with a
controlled seed. The failure mode when people don't pay attention to the
documentation is that I can no longer write programs that compose these
libraries together. That's why I wrote "can't". It's not a mere preference
for not having two systems to maintain. It has binary Go/No Go implications
for building reproducible programs.
>
> I don't understand this part.
> For example, scipy.stats and scikit-learn allow the user to provide a
RandomState instance to the functions. I don't see why you want to force
down stream libraries to change this. A random state argument should be
(essentially) compatible with whatever the user uses, and there is no
reason to force packages to update there internal use like in unit tests if
they don't want to, e.g. because of the instability.
>
> Aside to statsmodels: We currently have very few user facing random
functions, those are just in maybe 3 to 5 places where we have simulated or
bootstrap values.
> Most of the other uses of np.random are in unit tests and some in the
documentation examples.

Please consider my alternative proposal. Your feedback has convinced me
that that's a better approach than the StableRandom as laid out in the NEP.
I'm even willing to not deprecate the name RandomState.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Robert Kern
On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers  wrote:
>
> On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern 
wrote:
>>
>> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers 
wrote:
>> >
>> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern 
wrote:
>> >>
>> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers 
wrote:
>> >>>
>> >>> It may be worth having a look at test suites for scipy, statsmodels,
scikit-learn, etc. and estimate how much work this NEP causes those
projects. If the devs of those packages are forced to do large scale
migrations from RandomState to StableState, then why not instead keep
RandomState and just add a new API next to it?
>> >>
>> >> The problem is that we can't really have an ecosystem with two
different general purpose systems.
>> >
>> > Can't = prefer not to.
>>
>> I meant what I wrote. :-)
>>
>> > But yes, that's true. That's not what I was saying though. We want one
generic one, and one meant for unit testing only. You can achieve that in
two ways:
>> > 1. Change the current np.random API to new generic, and add a new
RandomStable for unit tests.
>> > 2. Add a new generic API, and document the current np.random API as
being meant for unit tests only, for other usage  should be
preferred.
>> >
>> > (2) has a couple of pros:
>> > - you're not forcing almost every library and end user out there to
migrate their unit tests.
>>
>> But it has the cons that I talked about. RandomState *is* a fully
functional general purpose PRNG system. After all, that's its current use.
Documenting it as intended to be something else will not change that fact.
Documentation alone provides no real impetus to move to the new system
outside of the unit tests. And the community does need to move together to
the new system in their library code, or else we won't be able to combine
libraries together; these PRNG objects need to thread all the way through
between code from different authors if we are to write programs with a
controlled seed. The failure mode when people don't pay attention to the
documentation is that I can no longer write programs that compose these
libraries together. That's why I wrote "can't". It's not a mere preference
for not having two systems to maintain. It has binary Go/No Go implications
for building reproducible programs.
>
> I strongly suspect you are right, but only because you're asserting
"can't" so heavily. I have trouble formulating what would go wrong in case
there's two PRNGs used in a single program. It's not described in the NEP,
nor in the numpy.random docs (those don't even have any recommendations for
best practices listed as far as I can tell - that needs fixing). All you
explain in the NEP is that reproducible research isn't helped by the
current stream-compat guarantee. So a bit of (probably incorrect) devil's
advocate reasoning:
> - If there's no stream-compat guarantee, all a user can rely on is the
properties of drawing from a seeded PRNG.
> - Any use of a PRNG in library code can also only rely on properties
> - So now whether in a user's program libraries draw from one or two
seeded PRNGs doesn't matter for reproducibility, because those properties
don't change.

Correctly making a stochastic program reproducible while retaining good
statistical properties is difficult. People don't do it well in the best of
circumstances. The best way that we've found to manage that difficulty is
to instantiate a single stream and use it all throughout your code. Every
new stream requires the management of more seeds (unless if we use the
fancy new algorithms that have settable stream IDs, but by stipulation, we
don't have these in this case). And now I have to thread both of these
objects through my code, and pass the right object to each third-party
library. These third-party libraries don't know anything about this weird
2-stream workaround that you are doing, so we now have libraries that can't
build on each other unless if they are using the same compatible API, even
if I can make workarounds to build a program that combines two libraries
side-to-side.

So yeah, people "can" do this. "It's just a matter of code" as my boss
likes to say. But it's making an already-difficult task more difficult.
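The single-stream discipline described above, sketched minimally (`bootstrap_mean` is a made-up stand-in for third-party library code, not a real API):

```python
import numpy as np

def bootstrap_mean(data, n_resamples, random_state):
    # Library code draws only from the PRNG object it is handed,
    # never from the hidden global np.random state.
    idx = random_state.randint(0, len(data), size=(n_resamples, len(data)))
    return data[idx].mean(axis=1)

# User code instantiates exactly one stream and threads it through
# everything: both its own draws and the third-party calls.
rng = np.random.RandomState(12345)
data = rng.normal(size=100)
means = bootstrap_mean(data, n_resamples=50, random_state=rng)
```

With a second independent stream in play, the user would instead have to manage two seeds and know which libraries expect which object.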

> Also, if there is to be a multi-year transitioning to the new API, would
there be two PRNG systems anyway during those years?

Sure, but with a deadline and not-just-documentation to motivate
transitioning.

But if we follow my alternative proposal, there'll be no need for
deprecation! You've convinced me to not deprecate RandomState. I just want
to change some of its internal implementation details, add a less-stable
set of distributions on the side, and a framework of core uniform PRNGs
that can be shared by both.

>> > - more design freedom for the new generic API. The current one is
clearly sub-optimal; in a new one you wouldn't have to expose all the
global state/functions that np.random exposes now. You could even restrict
it to a single class and put that in the main numpy namespace.
>>
>> I

Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Robert Kern
On Sun, Jun 10, 2018 at 8:11 PM Ralf Gommers  wrote:
>
> On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern 
wrote:
>>
>> On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers 
wrote:
>> >
>> > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser <
warren.weckes...@gmail.com> wrote:
>>
>> >> I suspect many of the tests will be easy to update, so fixing 300 or
so tests does not seem like a monumental task.
>> >
>> > It's all not monumental, but it adds up quickly. In addition to
changing tests, one will also need compatibility code when supporting
multiple numpy versions (e.g. scipy would get a copy of RandomStable in
scipy/_lib/_numpy_compat.py).
>> >
>> > A quick count of just np.random.seed occurrences with ``$ grep -roh
--include \*.py np.random.seed . | wc -w`` for some packages:
>> > numpy: 77
>> > scipy: 462
>> > matplotlib: 204
>> > statsmodels: 461
>> > pymc3: 36
>> > scikit-image: 63
>> > scikit-learn: 69
>> > keras: 46
>> > pytorch: 0
>> > tensorflow: 368
>> > astropy: 24
>> >
>> > And note, these are *not* incorrect/broken usages, this is code that
works and has done so for years.
>>
>> Yes, some of them are incorrect and broken. Failure can be difficult to
detect. This module from keras is particularly problematic:
>>
>>
https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image.py
>
> You have to appreciate that we're not all thinking at lightning speed and
in the same direction. If there is a difficult to detect problem, it may be
useful to give a brief code example (or even line of reasoning) of how this
actually breaks something.

Ahem. Sorry. That wasn't the code I was thinking of. It's merely hazardous,
not broken by itself. However, if you used any of the `seed=` arguments
that are helpfully(?) provided, you are almost certainly writing broken
code. If you must use np.random.seed() to get reproducibility, you need to
call it exactly once at the start of your code (or maybe once for each
process) and let it ride.
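That "seed exactly once, then let it ride" rule, as a sketch (`jitter` is a made-up helper used only for illustration):

```python
import numpy as np

np.random.seed(2018)  # the one and only seed() call, at program start

def jitter(x):
    # Downstream code just draws; it never reseeds, so the single
    # global stream keeps advancing.
    return x + np.random.normal(scale=0.1, size=len(x))

a = jitter(np.zeros(5))
b = jitter(np.zeros(5))
# a and b differ within one run, but every run of the program produces
# the same a and b because of the single seed() at the top.
```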

This is the impossible-to-use-correctly code that I was thinking of, which
got partially fixed after I pointed out the problem.

  https://github.com/keras-team/keras/pull/8325/files

The intention of this code is to shuffle two same-length sequences in the
same way. So now if I write my code well to call np.random.seed() once at
the start of my program, this function comes along and obliterates that
with a fixed seed just so it can reuse the seed again to replicate the
shuffle.
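For the record, the reseed-free way to shuffle two sequences in lockstep is to draw a single permutation and index both arrays with it. A sketch (`shuffle_together` is a made-up name, not the keras API):

```python
import numpy as np

def shuffle_together(x, y, random_state):
    # One permutation, applied to both arrays, keeps the pairing intact
    # without ever touching the global seed.
    assert len(x) == len(y)
    idx = random_state.permutation(len(x))
    return x[idx], y[idx]

rng = np.random.RandomState(0)
images = np.arange(6).reshape(3, 2)  # three "samples"
labels = np.array([10, 20, 30])
xs, ys = shuffle_together(images, labels, rng)
```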

Puzzlingly, the root sin of unconditionally and unavoidably reseeding for
some of these functions is still there even though I showed how and why to
avoid it. This is one reason why I was skeptical that merely documenting
RandomState or StableRandom to only be used for unit tests would work. :-)

>> > Conclusion: the current proposal will cause work for the vast majority
of libraries that depend on numpy. The total amount of that work will
certainly not be counted in person-days/weeks, and more likely in years
than months. So I'm not convinced yet that the current proposal is the best
way forward.
>>
>> The mere usage of np.random.seed() doesn't imply that these packages
actually require stream-compatibility. Some might, for sure, like where
they are used in the unit tests, but that's not what you counted. At best,
these numbers just mean that we can't eliminate np.random.seed() in a new
system right away.
>
> Well, mere usage has been called an antipattern (also on your behalf),
plus for scipy over half of the usages do give test failures (Warren's
quick test). So I'd say that counting usages is a decent proxy for the work
that has to be done.

Sure. But with my new proposal, we don't have to change it (as much as I'd
like to!). I'll draft up a PR to modify my NEP accordingly.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Ralf Gommers
On Sun, Jun 10, 2018 at 10:36 PM, Robert Kern  wrote:

> On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers 
> wrote:
> >
> > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern 
> wrote:
> >>
> >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers 
> wrote:
> >> >
> >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern 
> wrote:
> >> >>
> >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers 
> wrote:
> >> >>>
> >> >>> It may be worth having a look at test suites for scipy,
> statsmodels, scikit-learn, etc. and estimate how much work this NEP causes
> those projects. If the devs of those packages are forced to do large scale
> migrations from RandomState to StableState, then why not instead keep
> RandomState and just add a new API next to it?
> >> >>
> >> >> The problem is that we can't really have an ecosystem with two
> different general purpose systems.
> >> >
> >> > Can't = prefer not to.
> >>
> >> I meant what I wrote. :-)
> >>
> >> > But yes, that's true. That's not what I was saying though. We want
> one generic one, and one meant for unit testing only. You can achieve that
> in two ways:
> >> > 1. Change the current np.random API to new generic, and add a new
> RandomStable for unit tests.
> >> > 2. Add a new generic API, and document the current np.random API as
> being meant for unit tests only, for other usage  should be
> preferred.
> >> >
> >> > (2) has a couple of pros:
> >> > - you're not forcing almost every library and end user out there to
> migrate their unit tests.
> >>
> >> But it has the cons that I talked about. RandomState *is* a fully
> functional general purpose PRNG system. After all, that's its current use.
> Documenting it as intended to be something else will not change that fact.
> Documentation alone provides no real impetus to move to the new system
> outside of the unit tests. And the community does need to move together to
> the new system in their library code, or else we won't be able to combine
> libraries together; these PRNG objects need to thread all the way through
> between code from different authors if we are to write programs with a
> controlled seed. The failure mode when people don't pay attention to the
> documentation is that I can no longer write programs that compose these
> libraries together. That's why I wrote "can't". It's not a mere preference
> for not having two systems to maintain. It has binary Go/No Go implications
> for building reproducible programs.
> >
> > I strongly suspect you are right, but only because you're asserting
> "can't" so heavily. I have trouble formulating what would go wrong in case
> there's two PRNGs used in a single program. It's not described in the NEP,
> nor in the numpy.random docs (those don't even have any recommendations for
> best practices listed as far as I can tell - that needs fixing). All you
> explain in the NEP is that reproducible research isn't helped by the
> current stream-compat guarantee. So a bit of (probably incorrect) devil's
> advocate reasoning:
> > - If there's no stream-compat guarantee, all a user can rely on is the
> properties of drawing from a seeded PRNG.
> > - Any use of a PRNG in library code can also only rely on properties
> > - So now whether in a user's program libraries draw from one or two
> seeded PRNGs doesn't matter for reproducibility, because those properties
> don't change.
>
> Correctly making a stochastic program reproducible while retaining good
> statistical properties is difficult. People don't do it well in the best of
> circumstances. The best way that we've found to manage that difficulty is
> to instantiate a single stream and use it all throughout your code. Every
> new stream requires the management of more seeds (unless if we use the
> fancy new algorithms that have settable stream IDs, but by stipulation, we
> don't have these in this case). And now I have to thread both of these
> objects through my code, and pass the right object to each third-party
> library. These third-party libraries don't know anything about this weird
> 2-stream workaround that you are doing, so we now have libraries that can't
> build on each other unless if they are using the same compatible API, even
> if I can make workarounds to build a program that combines two libraries
> side-to-side.
>
> So yeah, people "can" do this. "It's just a matter of code" as my boss
> likes to say. But it's making an already-difficult task more difficult.
>

Okay, that makes more sense to me now. It would be really useful to
document such best practices and rationales.

Note that scipy.stats distributions allow passing in either a RandomState
instance or an integer as seed (which will be used for seeding a new
instance, not for np.random.seed) [1]. That seems like a fine design
pattern as well, and passing on a seed that way is fairly easy and as good
for reproducibility as passing in a single PRNG.

[1]
https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L612
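The helper behind that scipy pattern is a short normalizer, roughly like the following (paraphrasing scipy's `check_random_state`; details may differ between versions):

```python
import numbers
import numpy as np

def check_random_state(seed):
    # Accept None, an int, or an existing RandomState and always hand
    # back a RandomState instance (roughly what scipy/scikit-learn do).
    if seed is None:
        return np.random.mtrand._rand  # numpy's hidden global instance
    if isinstance(seed, (numbers.Integral, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % seed)
```

A library function can then take `random_state=None` and call this once at the top, so users may pass nothing, a reproducible seed, or their own threaded-through instance.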


> > Also, if there is to be a multi-year t

Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Ralf Gommers
On Sun, Jun 10, 2018 at 11:15 PM, Robert Kern  wrote:

> On Sun, Jun 10, 2018 at 8:11 PM Ralf Gommers 
> wrote:
> >
> > On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern 
> wrote:
> >>
> >> On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers 
> wrote:
> >> >
> >> > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser <
> warren.weckes...@gmail.com> wrote:
> >>
> >> >> I suspect many of the tests will be easy to update, so fixing 300 or
> so tests does not seem like a monumental task.
> >> >
> >> > It's all not monumental, but it adds up quickly. In addition to
> changing tests, one will also need compatibility code when supporting
> multiple numpy versions (e.g. scipy would get a copy of RandomStable in
> scipy/_lib/_numpy_compat.py).
> >> >
> >> > A quick count of just np.random.seed occurrences with ``$ grep -roh
> --include \*.py np.random.seed . | wc -w`` for some packages:
> >> > numpy: 77
> >> > scipy: 462
> >> > matplotlib: 204
> >> > statsmodels: 461
> >> > pymc3: 36
> >> > scikit-image: 63
> >> > scikit-learn: 69
> >> > keras: 46
> >> > pytorch: 0
> >> > tensorflow: 368
> >> > astropy: 24
> >> >
> >> > And note, these are *not* incorrect/broken usages, this is code that
> works and has done so for years.
> >>
> >> Yes, some of them are incorrect and broken. Failure can be difficult to
> detect. This module from keras is particularly problematic:
> >>
> >>   https://github.com/keras-team/keras-preprocessing/blob/
> master/keras_preprocessing/image.py
> >
> > You have to appreciate that we're not all thinking at lightning speed
> and in the same direction. If there is a difficult to detect problem, it
> may be useful to give a brief code example (or even line of reasoning) of
> how this actually breaks something.
>
> Ahem. Sorry. That wasn't the code I was thinking of. It's merely
> hazardous, not broken by itself. However, if you used any of the `seed=`
> arguments that are helpfully(?) provided, you are almost certainly writing
> broken code. If you must use np.random.seed() to get reproducibility, you
> need to call it exactly once at the start of your code (or maybe once for
> each process) and let it ride.
>
> This is the impossible-to-use-correctly code that I was thinking of, which
> got partially fixed after I pointed out the problem.
>
>   https://github.com/keras-team/keras/pull/8325/files
>
> The intention of this code is to shuffle two same-length sequences in the
> same way. So now if I write my code well to call np.random.seed() once at
> the start of my program, this function comes along and obliterates that
> with a fixed seed just so it can reuse the seed again to replicate the
> shuffle.
>

Yes, that's a big no-no. There are situations conceivable where a library
has to set a seed, but I think the right pattern in that case would be
something like

old_state = np.random.get_state()
np.random.seed(some_int)
do_stuff()
np.random.set_state(old_state)
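If a library really must do this, wrapping the save/restore in a context manager also guards against an exception leaving the global state clobbered. A sketch (`temporary_seed` is a made-up helper, not a numpy API):

```python
import contextlib
import numpy as np

@contextlib.contextmanager
def temporary_seed(some_int):
    # Save the global state, reseed for the duration of the block, and
    # restore the old state even if the body raises.
    old_state = np.random.get_state()
    np.random.seed(some_int)
    try:
        yield
    finally:
        np.random.set_state(old_state)

# Usage:
# with temporary_seed(123):
#     do_stuff()
```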


> Puzzlingly, the root sin of unconditionally and unavoidably reseeding for
> some of these functions is still there even though I showed how and why to
> avoid it. This is one reason why I was skeptical that merely documenting
> RandomState or StableRandom to only be used for unit tests would work. :-)
>

Well, no matter what we do, I'm sure that there'll be lots of people who
will still get it wrong:)


> >> > Conclusion: the current proposal will cause work for the vast
> majority of libraries that depend on numpy. The total amount of that work
> will certainly not be counted in person-days/weeks, and more likely in
> years than months. So I'm not convinced yet that the current proposal is
> the best way forward.
> >>
> >> The mere usage of np.random.seed() doesn't imply that these packages
> actually require stream-compatibility. Some might, for sure, like where
> they are used in the unit tests, but that's not what you counted. At best,
> these numbers just mean that we can't eliminate np.random.seed() in a new
> system right away.
> >
> > Well, mere usage has been called an antipattern (also on your behalf),
> plus for scipy over half of the usages do give test failures (Warren's
> quick test). So I'd say that counting usages is a decent proxy for the work
> that has to be done.
>
> Sure. But with my new proposal, we don't have to change it (as much as I'd
> like to!). I'll draft up a PR to modify my NEP accordingly.
>

Sounds good!

Cheers,
Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion