Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs
Frozen dimensions:

> I started with just making every 3-vector and 3x3-matrix structured arrays with the relevant single sub-array entry

I was actually suggesting omitting the structured dtype (i.e., field names) altogether, and just using the subarray dtypes (which exist alone, but not in arrays).

> Another (small?) advantage is that I can use `axis`

This is a fair argument against my proposal - at any rate, I think we’d need a better story for subarray dtypes before trying to add support for them to ufuncs.

Broadcasting dimensions:

> But perhaps a putative weighted_mean … is a decent example

That’s fairly convincing as a non-chained ufunc case. Can you add an example like that to the NEP?

> Also, it has the benefit of being clear what the function can handle by inspection of the signature

Is broadcasting (n),(n)->(),() less clear than (n|1),(n|1)->(),()? Can you come up with an example where only some dimensions make sense to broadcast?

-- Eric

On Mon, 11 Jun 2018 at 08:04 Marten van Kerkwijk wrote: > Nathaniel: >> >> Output shape feels very similar to >> output dtype to me, so maybe the general way to handle this would be >> to make the first callback take the input shapes+dtypes and return the >> desired output shapes+dtypes? >> >> This hits on an interesting alternative to frozen dimensions - np.cross >> could just become a regular ufunc with signature np.dtype((float64, 3)), >> np.dtype((float64, 3)) → np.dtype((float64, 3)) >> > As you note further down, the present proposal of just using numbers has > the advantage of being clear and easy. Another (small?) advantage is that I > can use `axis` to tell where my three coordinates are, rather than be stuck > with having them as the last dimension. > > Indeed, in my trials for wrapping the Standards Of Fundamental Astronomy > routines, I started with just making every 3-vector and 3x3-matrix > structured arrays with the relevant single sub-array entry. That worked, > but I ended up disliking the casting to and fro. 
> > >> Furthermore, the expansion quickly becomes cumbersome. For instance, for >> the all_equal signature of (n|1),(n|1)->() … >> >> I think this is only a good argument when used in conjunction with the >> broadcasting syntax. I don’t think it’s a reason for matmul not to have >> multiple signatures. Having multiple signatures is a disincentive to >> introducing too many overloads of the same function, which seems like a good >> thing to me >> > But implementation for matmul is actually considerably trickier, since the > internal loop now has to check the number of distinct dimensions. > > >> Summarizing my overall opinions: >> >>- I’m +0.5 on frozen dimensions. The use-cases seem reasonable, and >>it seems like an easy-ish way to get them. Allowing ufuncs to natively >>support subarray types might be a tidier solution, but that could come >> down >>the road >> >> Indeed, they are not mutually exclusive. My guess would be that the use > cases would be somewhat different. > > >> >>- I’m -1 on optional dimensions: they seem to legitimize creating >>many overloads of gufuncs. I’m already not a fan of how matmul has special >>cases for lower dimensions that don’t generalize well. To me, the best way >>to handle matmul would be to use the proposed __array_function__ to >>handle the shape-based special-case dispatching, either by: >> - Inserting dimensions, and calling the true gufunc >> np.linalg.matmul_2d (which is a function I’d like direct access to >> anyway). >> - Dispatching to one of four ufuncs >> >> I must admit I wish that `@` was just pure matrix multiplication... But > otherwise agree with Stephan on optional dimensions being the least-bad > solution. > > Aside: do agree we should think about how to expose the `linalg` gufuncs. > >> >>- Broadcasting dimensions: >> - I know you’re not suggesting this but: enabling broadcasting >> unconditionally for all gufuncs would be a bad idea, masking linalg >> bugs. 
>> (although einsum does support broadcasting…) >> >> Indeed, definitely *not* suggesting that! > > >> >>- >> - Does it really need a per-dimension flag, rather than a global >> one? Can you give a case where that’s useful? >> >> Mostly simply that the implementation is easier given the optional > dimensions... Also, it has the benefit of being clear what the function can > handle by inspection of the signature, i.e., it self-documents better (one > of my main arguments in favour of frozen dimensions...). > > >> >>- >> - If we’d already made all_equal a gufunc, I’d be +1 on adding >> broadcasting support to it >> - I’m -0.5 on the all_equal path in the first place. I think we >> either should have a more generic approach to combined ufuncs, or just >> declare them numba's job. >> >> I am working on and off on a way to generically chain ufuncs
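As an aside on Eric's subarray-dtype alternative: his remark that subarray dtypes "exist alone, but not in arrays" is easy to demonstrate with released NumPy, no assumptions beyond the public API:

```python
import numpy as np

# A subarray dtype: each "element" is a length-3 float64 vector.
vec3 = np.dtype((np.float64, 3))
print(vec3.shape, vec3.base)     # (3,) float64 -- the dtype knows its subarray shape

# But as soon as it is used to build an array, the subarray dimensions
# are absorbed into the array's shape and the dtype decays to plain float64:
a = np.zeros(4, dtype=vec3)
print(a.shape, a.dtype)          # (4, 3) float64
```

This absorption is exactly why a gufunc taking `np.dtype((float64, 3))` inputs would need the "better story for subarray dtypes" mentioned above before it could work.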
Re: [Numpy-discussion] 1.14.5 bugfix release
On Mon, Jun 11, 2018 at 11:10 AM, Matti Picus wrote: > If there is a desire to do a bug-fix release 1.14.5 I would like to try my > hand at releasing it, using doc/RELEASE_WALKTHROUGH.rst.txt. There were a > few issues around compiling 1.14.4 on alpine and NetBSD. > Since 1.15 will probably be released soon, do we continue to push these > kinds of bug-fix releases to 1.14.x? > Matti > We only need to make the release to fix the regressions. I was going to do it today/tomorrow as I think we have now covered all paths through the ifs. Usually it takes about 2-4 weeks for bug reports to settle out, but I think we can be a bit sooner here and the next release will be 1.15. If you want to give it a shot, go ahead. We need more people with some experience in the process, not to mention new perspectives on the walkthrough. I expect most of your time will be spent getting set up. I think you will also need commit privileges on `MacPython/numpy-wheels`, ping Matthew Brett for those. If you run into problems, let me know. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] 1.14.5 bugfix release
If there is a desire to do a bug-fix release 1.14.5 I would like to try my hand at releasing it, using doc/RELEASE_WALKTHROUGH.rst.txt. There were a few issues around compiling 1.14.4 on alpine and NetBSD. Since 1.15 will probably be released soon, do we continue to push these kinds of bug-fix releases to 1.14.x? Matti
Re: [Numpy-discussion] NEP: Random Number Generator Policy
On Mon, Jun 11, 2018 at 10:26 AM, wrote: > > > On Mon, Jun 11, 2018 at 2:43 AM, Ralf Gommers > wrote: > >> >> >> On Sun, Jun 10, 2018 at 10:36 PM, Robert Kern >> wrote: >> >>> On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers >>> wrote: >>> > >>> > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern >>> wrote: >>> >> >>> >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers >>> wrote: >>> >> > >>> >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern >>> wrote: >>> >> >> >>> >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers < >>> ralf.gomm...@gmail.com> wrote: >>> >> >>> >>> >> >>> It may be worth having a look at test suites for scipy, >>> statsmodels, scikit-learn, etc. and estimate how much work this NEP causes >>> those projects. If the devs of those packages are forced to do large scale >>> migrations from RandomState to StableState, then why not instead keep >>> RandomState and just add a new API next to it? >>> >> >> >>> >> >> The problem is that we can't really have an ecosystem with two >>> different general purpose systems. >>> >> > >>> >> > Can't = prefer not to. >>> >> >>> >> I meant what I wrote. :-) >>> >> >>> >> > But yes, that's true. That's not what I was saying though. We want >>> one generic one, and one meant for unit testing only. You can achieve that >>> in two ways: >>> >> > 1. Change the current np.random API to new generic, and add a new >>> RandomStable for unit tests. >>> >> > 2. Add a new generic API, and document the current np.random API as >>> being meant for unit tests only, for other usage should be >>> preferred. >>> >> > >>> >> > (2) has a couple of pros: >>> >> > - you're not forcing almost every library and end user out there to >>> migrate their unit tests. >>> >> >>> >> But it has the cons that I talked about. RandomState *is* a fully >>> functional general purpose PRNG system. After all, that's its current use. >>> Documenting it as intended to be something else will not change that fact. 
>>> Documentation alone provides no real impetus to move to the new system >>> outside of the unit tests. And the community does need to move together to >>> the new system in their library code, or else we won't be able to combine >>> libraries together; these PRNG objects need to thread all the way through >>> between code from different authors if we are to write programs with a >>> controlled seed. The failure mode when people don't pay attention to the >>> documentation is that I can no longer write programs that compose these >>> libraries together. That's why I wrote "can't". It's not a mere preference >>> for not having two systems to maintain. It has binary Go/No Go implications >>> for building reproducible programs. >>> > >>> > I strongly suspect you are right, but only because you're asserting >>> "can't" so heavily. I have trouble formulating what would go wrong in case >>> there's two PRNGs used in a single program. It's not described in the NEP, >>> nor in the numpy.random docs (those don't even have any recommendations for >>> best practices listed as far as I can tell - that needs fixing). All you >>> explain in the NEP is that reproducible research isn't helped by the >>> current stream-compat guarantee. So a bit of (probably incorrect) devil's >>> advocate reasoning: >>> > - If there's no stream-compat guarantee, all a user can rely on is the >>> properties of drawing from a seeded PRNG. >>> > - Any use of a PRNG in library code can also only rely on properties >>> > - So now whether in a user's program libraries draw from one or two >>> seeded PRNGs doesn't matter for reproducibility, because those properties >>> don't change. >>> >>> Correctly making a stochastic program reproducible while retaining good >>> statistical properties is difficult. People don't do it well in the best of >>> circumstances. The best way that we've found to manage that difficulty is >>> to instantiate a single stream and use it all throughout your code. 
Every >>> new stream requires the management of more seeds (unless if we use the >>> fancy new algorithms that have settable stream IDs, but by stipulation, we >>> don't have these in this case). And now I have to thread both of these >>> objects through my code, and pass the right object to each third-party >>> library. These third-party libraries don't know anything about this weird >>> 2-stream workaround that you are doing, so we now have libraries that can't >>> build on each other unless if they are using the same compatible API, even >>> if I can make workarounds to build a program that combines two libraries >>> side-to-side. >>> >>> So yeah, people "can" do this. "It's just a matter of code" as my boss >>> likes to say. But it's making an already-difficult task more difficult. >>> >> >> Okay, that makes more sense to me now. It would be really useful to >> document such best practices and rationales. >> >> Note that scipy.stats distributions allow passing in either a RandomState >> instance or an integer as seed (which will be used for seedi
Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs
> > Nathaniel: > > Output shape feels very similar to > output dtype to me, so maybe the general way to handle this would be > to make the first callback take the input shapes+dtypes and return the > desired output shapes+dtypes? > > This hits on an interesting alternative to frozen dimensions - np.cross > could just become a regular ufunc with signature np.dtype((float64, 3)), > np.dtype((float64, 3)) → np.dtype((float64, 3)) > As you note further down, the present proposal of just using numbers has the advantage of being clear and easy. Another (small?) advantage is that I can use `axis` to tell where my three coordinates are, rather than be stuck with having them as the last dimension. Indeed, in my trials for wrapping the Standards Of Fundamental Astronomy routines, I started with just making every 3-vector and 3x3-matrix structured arrays with the relevant single sub-array entry. That worked, but I ended up disliking the casting to and fro. > Furthermore, the expansion quickly becomes cumbersome. For instance, for > the all_equal signature of (n|1),(n|1)->() … > > I think this is only a good argument when used in conjunction with the > broadcasting syntax. I don’t think it’s a reason for matmul not to have > multiple signatures. Having multiple signatures is a disincentive to > introducing too many overloads of the same function, which seems like a good > thing to me > But implementation for matmul is actually considerably trickier, since the internal loop now has to check the number of distinct dimensions. > Summarizing my overall opinions: > >- I’m +0.5 on frozen dimensions. The use-cases seem reasonable, and it >seems like an easy-ish way to get them. Allowing ufuncs to natively support >subarray types might be a tidier solution, but that could come down the > road > > Indeed, they are not mutually exclusive. My guess would be that the use cases would be somewhat different. 
> >- I’m -1 on optional dimensions: they seem to legitimize creating many >overloads of gufuncs. I’m already not a fan of how matmul has special cases >for lower dimensions that don’t generalize well. To me, the best way to >handle matmul would be to use the proposed __array_function__ to >handle the shape-based special-case dispatching, either by: > - Inserting dimensions, and calling the true gufunc > np.linalg.matmul_2d (which is a function I’d like direct access to > anyway). > - Dispatching to one of four ufuncs > > I must admit I wish that `@` was just pure matrix multiplication... But otherwise agree with Stephan on optional dimensions being the least-bad solution. Aside: do agree we should think about how to expose the `linalg` gufuncs. > >- Broadcasting dimensions: > - I know you’re not suggesting this but: enabling broadcasting > unconditionally for all gufuncs would be a bad idea, masking linalg > bugs. > (although einsum does support broadcasting…) > > Indeed, definitely *not* suggesting that! > >- > - Does it really need a per-dimension flag, rather than a global > one? Can you give a case where that’s useful? > > Mostly simply that the implementation is easier given the optional dimensions... Also, it has the benefit of being clear what the function can handle by inspection of the signature, i.e., it self-documents better (one of my main arguments in favour of frozen dimensions...). > >- > - If we’d already made all_equal a gufunc, I’d be +1 on adding > broadcasting support to it > - I’m -0.5 on the all_equal path in the first place. I think we > either should have a more generic approach to combined ufuncs, or just > declare them numba's job. > > I am working on and off on a way to generically chain ufuncs (goal would be to auto-create an inner loop that calls all the chained ufuncs loops in turn). Not sure that short-circuiting will be all that easy. 
I actually quite like the all_equal ufunc, but it is in part because I remember discovering how painfully slow (a==b).all() was (and still have a place where I would use it if it existed). And it does fit in the (admittedly vague) plans to try to make `.reduce` a gufunc. > >- > - Can you come up with a broadcasting use-case that isn’t just > chaining a reduction with a broadcasting ufunc? > > Perhaps the use is that it allows people to write gufuncs that are like such functions... Absent a mechanism to chain ufuncs, more complicated gufuncs are currently the easiest way to get fast more complicated algebra. But perhaps a putative weighted_mean(y, sigma) -> mean, sigma_mean is a decent example? Its signature would be (n),(n)->(),() but then you're forced to give individual sigmas for each point. With (n|1),(n|1)->(),() you are no longer forced to do that (though the case of all y being the same is less than useful here... I did at some point have an implementation that worked by core dimension of each a
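For concreteness, the putative weighted_mean can be sketched in plain NumPy. This is my own illustration, not a real gufunc: the function name, the return of (mean, sigma_mean), and the use of `np.broadcast_arrays` to emulate the proposed (n|1),(n|1)->(),() core-dimension broadcast are all assumptions for the sake of the example:

```python
import numpy as np

def weighted_mean(y, sigma):
    """Inverse-variance weighted mean, emulating signature (n|1),(n|1)->(),()."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(sigma, dtype=float) ** 2
    # The n|1 broadcast: a scalar or length-1 sigma is stretched along the
    # core dimension instead of raising a shape mismatch.
    y, w = np.broadcast_arrays(y, w)
    mean = np.sum(w * y, axis=-1) / np.sum(w, axis=-1)
    sigma_mean = np.sqrt(1.0 / np.sum(w, axis=-1))
    return mean, sigma_mean
```

With the plain (n),(n)->(),() signature the second call below would be a shape error; with n|1 broadcasting, `weighted_mean(y, 1.0)` simply weights every point equally.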
Re: [Numpy-discussion] NEP: Random Number Generator Policy
On Mon, Jun 11, 2018 at 2:43 AM, Ralf Gommers wrote: > > > On Sun, Jun 10, 2018 at 10:36 PM, Robert Kern > wrote: > >> On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers >> wrote: >> > >> > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern >> wrote: >> >> >> >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers >> wrote: >> >> > >> >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern >> wrote: >> >> >> >> >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers >> wrote: >> >> >>> >> >> >>> It may be worth having a look at test suites for scipy, >> statsmodels, scikit-learn, etc. and estimate how much work this NEP causes >> those projects. If the devs of those packages are forced to do large scale >> migrations from RandomState to StableState, then why not instead keep >> RandomState and just add a new API next to it? >> >> >> >> >> >> The problem is that we can't really have an ecosystem with two >> different general purpose systems. >> >> > >> >> > Can't = prefer not to. >> >> >> >> I meant what I wrote. :-) >> >> >> >> > But yes, that's true. That's not what I was saying though. We want >> one generic one, and one meant for unit testing only. You can achieve that >> in two ways: >> >> > 1. Change the current np.random API to new generic, and add a new >> RandomStable for unit tests. >> >> > 2. Add a new generic API, and document the current np.random API as >> being meant for unit tests only, for other usage should be >> preferred. >> >> > >> >> > (2) has a couple of pros: >> >> > - you're not forcing almost every library and end user out there to >> migrate their unit tests. >> >> >> >> But it has the cons that I talked about. RandomState *is* a fully >> functional general purpose PRNG system. After all, that's its current use. >> Documenting it as intended to be something else will not change that fact. >> Documentation alone provides no real impetus to move to the new system >> outside of the unit tests. 
And the community does need to move together to >> the new system in their library code, or else we won't be able to combine >> libraries together; these PRNG objects need to thread all the way through >> between code from different authors if we are to write programs with a >> controlled seed. The failure mode when people don't pay attention to the >> documentation is that I can no longer write programs that compose these >> libraries together. That's why I wrote "can't". It's not a mere preference >> for not having two systems to maintain. It has binary Go/No Go implications >> for building reproducible programs. >> > >> > I strongly suspect you are right, but only because you're asserting >> "can't" so heavily. I have trouble formulating what would go wrong in case >> there's two PRNGs used in a single program. It's not described in the NEP, >> nor in the numpy.random docs (those don't even have any recommendations for >> best practices listed as far as I can tell - that needs fixing). All you >> explain in the NEP is that reproducible research isn't helped by the >> current stream-compat guarantee. So a bit of (probably incorrect) devil's >> advocate reasoning: >> > - If there's no stream-compat guarantee, all a user can rely on is the >> properties of drawing from a seeded PRNG. >> > - Any use of a PRNG in library code can also only rely on properties >> > - So now whether in a user's program libraries draw from one or two >> seeded PRNGs doesn't matter for reproducibility, because those properties >> don't change. >> >> Correctly making a stochastic program reproducible while retaining good >> statistical properties is difficult. People don't do it well in the best of >> circumstances. The best way that we've found to manage that difficulty is >> to instantiate a single stream and use it all throughout your code. 
Every >> new stream requires the management of more seeds (unless if we use the >> fancy new algorithms that have settable stream IDs, but by stipulation, we >> don't have these in this case). And now I have to thread both of these >> objects through my code, and pass the right object to each third-party >> library. These third-party libraries don't know anything about this weird >> 2-stream workaround that you are doing, so we now have libraries that can't >> build on each other unless if they are using the same compatible API, even >> if I can make workarounds to build a program that combines two libraries >> side-to-side. >> >> So yeah, people "can" do this. "It's just a matter of code" as my boss >> likes to say. But it's making an already-difficult task more difficult. >> > > Okay, that makes more sense to me now. It would be really useful to > document such best practices and rationales. > > Note that scipy.stats distributions allow passing in either a RandomState > instance or an integer as seed (which will be used for seeding a new > instance, not for np.random.seed) [1]. That seems like a fine design > pattern as well, and passing on a seed that way is fairly easy and as good > for reproducibil
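The "single stream threaded all the way through" practice Robert describes can be made concrete with a small sketch. The library functions here (`jitter`, `subsample`) are made-up names for illustration; the point is only the calling convention of passing one PRNG object explicitly everywhere:

```python
import numpy as np

def jitter(data, random_state):
    # Library code: uses the caller's PRNG, never touches global state.
    return data + 0.01 * random_state.standard_normal(len(data))

def subsample(data, n, random_state):
    idx = random_state.choice(len(data), size=n, replace=False)
    return data[idx]

# Application code: one stream, one seed, passed to every component.
rng = np.random.RandomState(12345)
x = np.arange(100.0)
y = subsample(jitter(x, rng), 10, rng)
```

Because every draw comes from the same seeded stream, rerunning the whole program with the same seed reproduces `y` exactly, even across library boundaries.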
Re: [Numpy-discussion] NEP: Random Number Generator Policy
On Sun, Jun 10, 2018 at 11:53 PM, Ralf Gommers wrote: > > On Sun, Jun 10, 2018 at 11:15 PM, Robert Kern wrote: >> >> The intention of this code is to shuffle two same-length sequences in the >> same way. So now if I write my code well to call np.random.seed() once at >> the start of my program, this function comes along and obliterates that with >> a fixed seed just so it can reuse the seed again to replicate the shuffle. > > > Yes, that's a big no-no. There are situations conceivable where a library > has to set a seed, but I think the right pattern in that case would be > something like > > old_state = np.random.get_state() > np.random.seed(some_int) > do_stuff() > np.random.set_state(old_state) This will seem to work fine in testing, and then when someone tries to use your library in a multithreaded program everything will break in complicated and subtle ways :-(. I really don't think there's any conceivable situation where a library (as opposed to an application) can correctly use the global random state. -n -- Nathaniel J. Smith -- https://vorpus.org
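The thread-safe alternative to the get_state/seed/set_state dance is for the library to use its own local RandomState and leave the global state alone. A sketch, with the function name chosen to match the shuffle use case discussed above (it is illustrative, not an existing API):

```python
import numpy as np

def shuffle_together(a, b, seed=None):
    # A local stream: reproducible when a seed is given, and it never
    # mutates np.random's global state, so it is safe in threaded code.
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(a))
    return a[order], b[order]
```

Both sequences receive the identical permutation, which was the intent of the original code, without any side effect on callers that seeded the global stream.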
Re: [Numpy-discussion] NEP: Random Number Generator Policy
On Sun, Jun 10, 2018 at 11:54 PM Ralf Gommers wrote: > > On Sun, Jun 10, 2018 at 11:15 PM, Robert Kern wrote: >> Puzzlingly, the root sin of unconditionally and unavoidably reseeding for some of these functions is still there even though I showed how and why to avoid it. This is one reason why I was skeptical that merely documenting RandomState or StableRandom to only be used for unit tests would work. :-) > > Well, no matter what we do, I'm sure that there'll be lots of people who will still get it wrong :) Exactly! This is why I objected to leaving RandomState completely alone and just documenting it for use to generate test data. Inevitably, people will "get it wrong", so we need to design in anticipation of these failure modes and provide ways to work around them. >> Sure. But with my new proposal, we don't have to change it (as much as I'd like to!). I'll draft up a PR to modify my NEP accordingly. > > Sounds good! Thanks! Your and Josef's feedback on these points has been very helpful. -- Robert Kern
Re: [Numpy-discussion] NEP: Random Number Generator Policy
On Sun, Jun 10, 2018 at 11:44 PM Ralf Gommers wrote: > Note that scipy.stats distributions allow passing in either a RandomState instance or an integer as seed (which will be used for seeding a new instance, not for np.random.seed) [1]. That seems like a fine design pattern as well, and passing on a seed that way is fairly easy and as good for reproducibility as passing in a single PRNG. > > [1] https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L612 Well, carefully. You wouldn't want to pass on the same integer seed to multiple functions. Accepting an integer seed is super-convenient at the command line/notebooks, though, or in docstrings or in tests or other situations where your "reproducibility horizon" is small. These utilities are good for scaling from these small use cases up to large ones. scikit-learn is also a good example of good PRNG hygiene: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L715 >> > Also, if there is to be a multi-year transitioning to the new API, would there be two PRNG systems anyway during those years? >> >> Sure, but with a deadline and not-just-documentation to motivate transitioning. >> >> But if we follow my alternative proposal, there'll be no need for deprecation! You've convinced me to not deprecate RandomState. > > That's not how I had read it, but great to hear that! Indeed, I did deprecate the name RandomState in that drafting, but it's not really necessary, and you've convinced me that we shouldn't do it. -- Robert Kern
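The scikit-learn helper linked above boils down to roughly the following. This is a simplified sketch of the pattern, not the exact upstream code:

```python
import numbers

import numpy as np

def check_random_state(seed):
    """Turn None, an int, or a RandomState into a RandomState instance."""
    if seed is None:
        # Fall back to NumPy's module-level global RandomState.
        return np.random.mtrand._rand
    if isinstance(seed, (numbers.Integral, np.integer)):
        # An int seeds a fresh, independent stream.
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        # An existing stream is passed through untouched.
        return seed
    raise ValueError(
        "%r cannot be used to seed a numpy.random.RandomState instance" % seed)
```

A library function that accepts `random_state=None` and calls this once at its entry point supports all three calling conventions (convenience int, shared stream, or nothing) with one line.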
Re: [Numpy-discussion] NEP: Random Number Generator Policy (Robert Kern)
Maybe a good place for a stable, testing-focused generator would be in numpy.random.testing. This could host a default implementation of StableGenerator, although a better name might be TestingGenerator. It would also help users decide that this is not the generator they are looking for (I think many people might think StableGenerator is a good thing; after all, who wants an UnstableGenerator?).