On Wed, 2021-02-17 at 11:15 +0100, Ralf Gommers wrote:
> On Wed, Feb 17, 2021 at 2:14 AM Stephan Hoyer <sho...@gmail.com> wrote:
>
> > On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg
> > <sebast...@sipsolutions.net> wrote:
> >
> > > Hi all,
> > >
> > > In https://github.com/numpy/numpy/issues/18407 it was reported that
> > > there is a regression for `np.array()` and friends in NumPy 1.20 for
> > > code such as:
> > >
> > >     np.array(["1234"], dtype=("U1", 4))
> > >     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
> > >     # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')
> > >
> > >
> > > The Basics
> > > ----------
> > >
> > > This happens when you ask for a rare "subarray" dtype; ways to create
> > > it are:
> > >
> > >     np.dtype(("U1", 4))
> > >     np.dtype("(4)U1,")  # (does not have a field, only a subarray)
> > >
> > > Both of which give the same subarray dtype: a "U1" dtype with shape 4.
> > > One thing to know about these dtypes is that they cannot be attached
> > > to an array:
> > >
> > >     np.zeros(3, dtype="(4)U1,").dtype == "U1"
> > >     np.zeros(3, dtype="(4)U1,").shape == (3, 4)
> > >
> > > I.e. the shape is moved/added into the array itself (instead of
> > > remaining part of the dtype).
> > >
> > > The Change
> > > ----------
> > >
> > > Now what/why did something change? When filling subarray dtypes, NumPy
> > > normally fills every element with the same input. In the above case,
> > > in most cases NumPy will give the 1.20 result because it assigns
> > > "1234" to every subarray element individually; maybe confusingly, this
> > > truncates so that only the "1" is actually assigned. We can prove it
> > > with a structured dtype (same result in 1.19 and 1.20):
> > >
> > >     >>> np.array(["1234"], dtype="(4)U1,i")
> > >     array([(['1', '1', '1', '1'], 1234)],
> > >           dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])
> > >
> > > Another, weirder case which changed (more obviously for the better)
> > > is:
> > >
> > >     >>> np.array("1234", dtype="(4)U1,")
> > >     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
> > >     # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')
> > >
> > > And, to point it out, we can have subarrays that are not 1-D:
> > >
> > >     >>> np.array(["12"], dtype=("(2,2)U1,"))
> > >     array([[['1', '1'],
> > >             ['2', '2']]], dtype='<U1')  # NumPy 1.19, 1.20 all is '1'
> > >
> > >
> > > The Cause
> > > ---------
> > >
> > > The cause of the 1.19 behaviour is two-fold:
> > >
> > > 1. The "subarray" part of the dtype is moved into the array after the
> > >    dimension is found. At this point strings are always considered
> > >    "scalars". In most of the above examples, the new array shape is
> > >    (1,)+(4,).
> > >
> > > 2. When filling the new array with values, it now has an _additional_
> > >    dimension! Because of this, the string is now suddenly considered
> > >    a sequence, so it behaves the same as `list("1234")`, although
> > >    normally NumPy would never consider a string a sequence.
> > >
> > >
> > > The Solution?
> > > -------------
> > >
> > > I honestly don't have one. We can consider strings as sequences in
> > > this weird special case. That will probably create other weird special
> > > cases, but they would be even more hidden (I expect mainly odder
> > > things throwing an error).
> > >
> > > Should we try to document this better in the release notes or can we
> > > think of some better (or at least louder) solution?
> > >
> >
> I was honestly surprised there's even such a thing as a "subarray data
> type", I've never seen it used in the wild. Looking at the release notes
> you already have,
> https://numpy.org/devdocs/release/1.20.0-notes.html#arrays-cannot-be-using-subarray-dtypes ,
> all I'm thinking is that no one should ever be writing code like that.
>

Sure, if you look at the big picture it's arguably weird or even plain
wrong. I guess the spelled-out question here should have been:

Does anyone think there is enough usage of this in the wild to worry
about it?

Based on the current responses, it seems not, and I hope not...

> > There are way too many unsafe assumptions in this example. It's an
> > edge case of an edge case.
> >
> > I don't think we should be beholden to continuing to support this
> > behavior, which was obviously never anticipated. If there was a way to
> > raise a warning or error in potentially ambiguous situations like
> > this, I would support it.
>

We can warn for all subarrays (including a deprecation), but that is
probably too noisy. We can probably flag subarray+strings and warn only
in that case. Just a full undo seems tricky. What I mean is a warning
like:

    Oops, string+subarray can lead to weird things and unfortunately a
    fix in behaviour means 1.20 may have a different result compared to
    1.19.x and earlier. (You are seeing the new behaviour, see release
    notes.)

If that sounds useful, I can do it, but it will lead to an unavoidable
warning.

Cheers,

Sebastian

> +1
>
> Ralf
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
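
To make the "flag subarray+strings and warn" idea concrete, here is a
rough user-level sketch. It is not how NumPy implements (or would
implement) such a warning: the helper name `warn_on_string_subarray` and
the message text are made up for illustration, and it only looks at a
top-level subarray dtype, not at subarrays nested inside structured
fields. It relies only on `np.dtype(...).subdtype`, which is `None` for
ordinary dtypes and a `(base_dtype, shape)` tuple for subarray dtypes.

```python
import warnings

import numpy as np


def warn_on_string_subarray(dtype):
    """Hypothetical helper: warn if `dtype` is a subarray dtype with a string base.

    Returns True if a warning was issued, False otherwise.
    """
    dt = np.dtype(dtype)
    # `subdtype` is None for ordinary dtypes, (base_dtype, shape) for subarrays.
    if dt.subdtype is None:
        return False
    base, shape = dt.subdtype
    # Kind "U" is unicode and "S" is bytes: the string + subarray combination.
    if base.kind not in "US":
        return False
    warnings.warn(
        f"string + subarray dtype ({base}, {shape}): results may differ "
        "between NumPy 1.19.x and 1.20, see the 1.20 release notes",
        UserWarning,
        stacklevel=2,
    )
    return True


if __name__ == "__main__":
    # Warns: string base with a (4,) subarray shape.
    warn_on_string_subarray(("U1", 4))
    # Silent: plain dtype, and a non-string subarray.
    warn_on_string_subarray("f8")
    warn_on_string_subarray(("i4", 2))
    # The subarray shape ends up in the array, not in its dtype.
    arr = np.zeros(3, dtype=np.dtype(("U1", 4)))
    print(arr.shape, arr.dtype)
```

A check like this could be called by downstream code before it passes a
user-supplied dtype to `np.array()`, so the ambiguous case is at least
loud instead of silently version-dependent.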