On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <sebast...@sipsolutions.net> wrote:
> Hi all,
>
> In https://github.com/numpy/numpy/issues/18407 it was reported that
> there is a regression for `np.array()` and friends in NumPy 1.20 for
> code such as:
>
>     np.array(["1234"], dtype=("U1", 4))
>     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
>     # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')
>
>
> The Basics
> ----------
>
> This happens when you ask for a rare "subarray" dtype. Ways to create
> it are:
>
>     np.dtype(("U1", 4))
>     np.dtype("(4)U1,")  # (does not have a field, only a subarray)
>
> Both give the same subarray dtype: a "U1" dtype with shape 4.
> One thing to know about these dtypes is that they cannot be attached to
> an array:
>
>     np.zeros(3, dtype="(4)U1,").dtype == "U1"
>     np.zeros(3, dtype="(4)U1,").shape == (3, 4)
>
> I.e. the shape is moved/added into the array itself (instead of
> remaining part of the dtype).
>
> The Change
> ----------
>
> So what changed, and why? When filling subarray dtypes, NumPy
> normally fills every element with the same input. For input like the
> above, NumPy will in most cases give the 1.20 result, because it
> assigns "1234" to every subarray element individually. Maybe
> confusingly, this truncates, so that only the "1" is actually assigned.
> We can prove it with a structured dtype (same result in 1.19 and 1.20):
>
>     >>> np.array(["1234"], dtype="(4)U1,i")
>     array([(['1', '1', '1', '1'], 1234)],
>           dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])
>
> Another, weirder case which changed (more obviously for the better) is:
>
>     >>> np.array("1234", dtype="(4)U1,")
>     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
>     # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')
>
> And, to point it out, we can have subarrays that are not 1-D:
>
>     >>> np.array(["12"], dtype=("(2,2)U1,"))
>     array([[['1', '1'],
>             ['2', '2']]], dtype='<U1')  # NumPy 1.19; in 1.20 all are '1'
>
>
> The Cause
> ---------
>
> The cause of the 1.19 behaviour is two-fold:
>
> 1. The "subarray" part of the dtype is moved into the array after the
>    dimension is found. At this point strings are always considered
>    "scalars". In most of the above examples, the new array shape is
>    (1,)+(4,).
>
> 2. When filling the new array with values, it now has an _additional_
>    dimension! Because of this, the string is now suddenly considered a
>    sequence, so it behaves the same as `list("1234")` would, although
>    normally NumPy would never consider a string a sequence.
>
>
> The Solution?
> -------------
>
> I honestly don't have one. We can consider strings as sequences in
> this weird special case. That will probably create other weird special
> cases, but they would be even more hidden (I expect mainly odder things
> throwing an error).
>
> Should we try to document this better in the release notes or can we
> think of some better (or at least louder) solution?
>

There are way too many unsafe assumptions in this example. It's an edge
case of an edge case. I don't think we should be beholden to continuing
to support this behavior, which was obviously never anticipated. If there
were a way to raise a warning or error in potentially ambiguous
situations like this, I would support it.

> Cheers,
>
> Sebastian
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
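
For anyone who needs the per-character split regardless of NumPy version,
here is a minimal sketch. It is not a fix for the regression itself. It
first checks the "shape moves into the array" behaviour described under
The Basics, then builds the 1.19-style result without relying on a
subarray dtype at all. The variable names are mine, and the two
workarounds are only my assumption about what code like the reported
example intends.

```python
import numpy as np

# A subarray dtype: base dtype "U1" with subarray shape (4,).
sub = np.dtype(("U1", 4))
base, sub_shape = sub.subdtype

# The subarray shape is moved into the array; the array's dtype is plain "U1".
arr = np.zeros(3, dtype=sub)

# Assumed workaround 1: make the "string as a sequence" step explicit,
# so the result does not depend on how NumPy fills subarray dtypes.
strings = ["1234"]
chars = np.array([list(s) for s in strings], dtype="U1")

# Assumed workaround 2: reinterpret a fixed-width "U4" array as single
# characters. Strings shorter than 4 characters would give '' entries.
chars_view = np.array(strings, dtype="U4").view("U1").reshape(-1, 4)

print(arr.dtype, arr.shape)
print(chars)
print(chars_view)
```

Both workarounds give an array of shape (1, 4), one row per input string,
which matches the (1,)+(4,) shape mentioned under The Cause.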