[Numpy-discussion] What to do about structured string dtype and string regression?

Sebastian Berg Tue, 16 Feb 2021 15:12:12 -0800

Hi all,

In https://github.com/numpy/numpy/issues/18407 it was reported that
there is a regression for `np.array()` and friends in NumPy 1.20 for
code such as:


    np.array(["1234"], dtype=("U1", 4))
    # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
    # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')


The Basics
----------

This happens when you ask for a rare "subarray" dtype. Ways to create
one are:

    np.dtype(("U1", 4))
    np.dtype("(4)U1,")  # (does not have a field, only a subarray)

Both give the same subarray dtype: a "U1" dtype with shape 4.
One thing to know about these dtypes is that they cannot be attached to
an array:

    np.zeros(3, dtype="(4)U1,").dtype == "U1"
    np.zeros(3, dtype="(4)U1,").shape == (3, 4)

I.e. the shape is moved/added into the array itself (instead of
remaining part of the dtype).
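
As a quick check of the above, here is a minimal sketch (the variable
names are only for illustration, and it uses just the tuple spelling of
the dtype). It shows that the subarray shape is visible on the dtype
object itself but ends up in the array shape once an array is created:

```python
import numpy as np

sub = np.dtype(("U1", 4))      # subarray dtype: base "U1" with shape (4,)
base, shape = sub.subdtype     # the base dtype and the subarray shape

arr = np.zeros(3, dtype=sub)   # the (4,) is appended to the array shape
print(base, shape)
print(arr.dtype, arr.shape)    # plain "U1" dtype, shape (3, 4)
```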

The Change
----------

Now what changed, and why?  When filling subarray dtypes, NumPy
normally fills every element with the same input. For the example above,
NumPy will in most cases give the 1.20 result because it assigns "1234"
to every subarray element individually; maybe confusingly, this truncates
so that only the "1" is actually assigned. We can prove it with a
structured dtype (same result in 1.19 and 1.20):

    >>> np.array(["1234"], dtype="(4)U1,i")
    array([(['1', '1', '1', '1'], 1234)],
          dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])
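
The truncation itself can be seen without any subarray dtype.  The
following is only a sketch that mimics "assign the same input to every
element" using a plain "U1" array; it is not the code path `np.array()`
takes internally, and the name `out` is made up for the example:

```python
import numpy as np

out = np.zeros(4, dtype="U1")   # four slots, one character each
out[...] = "1234"               # the scalar string is broadcast to every slot
print(out.tolist())             # each slot keeps only the first character

single = np.array("1234", dtype="U1")   # same truncation for a 0-d array
print(single)
```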

Another, weirder case which changed (more obviously for the better) is:

    >>> np.array("1234", dtype="(4)U1,")
    # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
    # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')

And, to point it out, we can have subarrays that are not 1-D:

    >>> np.array(["12"], dtype="(2,2)U1,")
    array([[['1', '1'],
            ['2', '2']]], dtype='<U1')  # NumPy 1.19; in 1.20 every element is '1'


The Cause
---------

The cause of the 1.19 behaviour is two-fold:

1. The "subarray" part of the dtype is moved into the array after the
dimensions are found. At this point strings are always considered
"scalars".  In most of the examples above, the new array shape is
(1,)+(4,).

2. When filling the new array with values, it now has an _additional_
dimension!  Because of this, the string is now suddenly considered a
sequence, so it behaves as if it were `list("1234")`, although NumPy
would normally never consider a string a sequence.
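
To make the two points concrete, here is a small sketch that does not
depend on the subarray code path at all.  It shows that a string is
normally a scalar, and that splitting it by hand with `list()` gives
the per-character result that 1.19 produced.  This is only an
illustration of the behaviour described above, not a proposed fix:

```python
import numpy as np

# Normally a string is one scalar element, never a sequence of characters.
as_scalar = np.array(["1234"])
print(as_scalar.shape, as_scalar.dtype)   # one element holding all four characters

# Splitting the string by hand turns it into an ordinary nested sequence.
as_chars = np.array([list("1234")], dtype="U1")
print(as_chars.shape)
print(as_chars.tolist())
```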


The Solution?
-------------

I honestly don't have one.  We can consider strings as sequences in
this weird special case.  That will probably create other weird special
cases, but they would be even more hidden (I expect mainly odder things
throwing an error).

Should we try to document this better in the release notes or can we
think of some better (or at least louder) solution?


Cheers,

Sebastian


_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
