On Wed, Feb 17, 2021 at 2:14 AM Stephan Hoyer <sho...@gmail.com> wrote:
> On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <sebast...@sipsolutions.net>
> wrote:
>
>> Hi all,
>>
>> In https://github.com/numpy/numpy/issues/18407 it was reported that
>> there is a regression for `np.array()` and friends in NumPy 1.20 for
>> code such as:
>>
>>     np.array(["1234"], dtype=("U1", 4))
>>     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
>>     # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')
>>
>>
>> The Basics
>> ----------
>>
>> This happens when you ask for a rare "subarray" dtype. Ways to create
>> it are:
>>
>>     np.dtype(("U1", 4))
>>     np.dtype("(4)U1,")  # (does not have a field, only a subarray)
>>
>> Both give the same subarray dtype: a "U1" dtype with shape 4.
>> One thing to know about these dtypes is that they cannot be attached to
>> an array:
>>
>>     np.zeros(3, dtype="(4)U1,").dtype == "U1"
>>     np.zeros(3, dtype="(4)U1,").shape == (3, 4)
>>
>> I.e. the shape is moved/added into the array itself (instead of
>> remaining part of the dtype).
>>
>> The Change
>> ----------
>>
>> Now what/why did something change? When filling subarray dtypes, NumPy
>> normally fills every element with the same input.
>> For the example above, NumPy in
>> most cases gives the 1.20 result because it assigns "1234" to
>> every subarray element individually; maybe confusingly, this truncates
>> so that only the "1" is actually assigned. We can prove it with a
>> structured dtype (same result in 1.19 and 1.20):
>>
>>     >>> np.array(["1234"], dtype="(4)U1,i")
>>     array([(['1', '1', '1', '1'], 1234)],
>>           dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])
>>
>> Another, weirder case which changed (more obviously for the better) is:
>>
>>     >>> np.array("1234", dtype="(4)U1,")
>>     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
>>     # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')
>>
>> And, to point it out, we can have subarrays that are not 1-D:
>>
>>     >>> np.array(["12"], dtype=("(2,2)U1,"))
>>     array([[['1', '1'],
>>             ['2', '2']]], dtype='<U1')  # NumPy 1.19; in 1.20 every entry is '1'
>>
>>
>> The Cause
>> ---------
>>
>> The cause of the 1.19 behaviour is two-fold:
>>
>> 1. The "subarray" part of the dtype is moved into the array after the
>>    dimension is found. At this point strings are always considered
>>    "scalars". In most of the above examples, the new array shape is
>>    (1,)+(4,).
>>
>> 2. When filling the new array with values, it now has an _additional_
>>    dimension! Because of this, the string is now suddenly considered a
>>    sequence, so it behaves the same as `list("1234")` would, although
>>    NumPy would normally never consider a string a sequence.
>>
>>
>> The Solution?
>> -------------
>>
>> I honestly don't have one. We can consider strings as sequences in
>> this weird special case. That will probably create other weird special
>> cases, but they would be even more hidden (I expect mainly odder things
>> throwing an error).
>>
>> Should we try to document this better in the release notes or can we
>> think of some better (or at least louder) solution?
>>
>
> I was honestly surprised there's even such a thing as a "subarray data
> type"; I've never seen it used in the wild.
> Looking at the release notes you already have,
> https://numpy.org/devdocs/release/1.20.0-notes.html#arrays-cannot-be-using-subarray-dtypes,
> all I'm thinking is that no one should ever be writing code like that.
> There are way too many unsafe assumptions in this example. It's an edge
> case of an edge case.
>
> I don't think we should be beholden to continuing to support this
> behavior, which was obviously never anticipated. If there was a way to
> raise a warning or error in potentially ambiguous situations like this, I
> would support it.
>

+1

Ralf
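
For anyone following along, here is a minimal sketch of the parts of the
discussion that should not depend on the NumPy version: the subarray shape
moving into the array, and the truncation of a string to "U1". The last
step, spelling the split out with list("1234"), is only an illustrative
assumption about how to sidestep the ambiguity; nobody in the thread
proposed it as the fix, and the names sub, arr, truncated and explicit are
made up for this example.

```python
import numpy as np

# A subarray dtype: base dtype "U1" with subarray shape (4,),
# built with the tuple form quoted in the thread.
sub = np.dtype(("U1", 4))

# The subarray shape is moved into the array itself; the array
# ends up with the base dtype, not the subarray dtype.
arr = np.zeros(3, dtype=sub)
print(arr.dtype, arr.shape)

# Assigning a longer string to a "U1" element truncates it to the
# first character, which is why '1' shows up in the results above.
truncated = np.array(["1234"], dtype="U1")
print(truncated)

# Hypothetical alternative (an assumption, not from the thread):
# make the per-character split explicit so that no subarray dtype
# is needed and the string is never guessed to be a sequence.
explicit = np.array([list("1234")], dtype="U1")
print(explicit)
```

Written this way the result has shape (1, 4) and does not rely on how
`np.array()` treats strings when a subarray dtype adds a dimension.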
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion