Hi all,

This is a post about strings. For the purposes of this discussion I'll assume Python 2, where "string" means a non-unicode string. However, the discussion applies equally to unicode strings.
For a long time Numpy has had the following behavior: when creating an array with a zero-width string dtype like 'S0', Numpy automatically increases the width of the dtype to fit the longest string in the input, like so:

    >>> np.array(['abc', 'de'], dtype='S0')  # or equivalently dtype=str
    array(['abc', 'de'], dtype='|S3')

But it *always* converts to a string dtype at least one character wide. So even when passing in a list of empty strings:

    >>> np.array(['', '', ''], dtype='S0')
    array(['', '', ''], dtype='|S1')

Or even:

    >>> np.zeros(3, dtype='S0')
    array(['', '', ''], dtype='|S1')

This behavior is encoded in PyArray_NewFromDescr_int [1] and is very old (since 2006) [2]. It certainly made sense at the time, since the logic for handling zero-sized strides was shaky, but most issues with that have long since been worked out.

However, there's an oversight here: it *is* possible to make a structured dtype that has a zero-width string as one of its fields. Since even PyArray_View goes through PyArray_NewFromDescr, viewing such a field produces a non-empty view that contains garbage and allows writing garbage into the structured array. This is documented in several issues, such as #473 [3].

A fix I've proposed in #6430 [4] takes the conservative approach of keeping all the existing behavior *except* in the case of structured arrays, where views with a dtype of 'S0' would be allowed. However, a simpler fix would be to just remove the restriction on creating arrays of dtype 'S0' in general (with my first example above being the one exception: given a list of strings, it would still convert 'S0' to a dtype that can hold the longest string in the list).

I think I would prefer the general fix, but it would be a slight change in behavior for any code using PyArray_NewFromDescr to create string arrays. But would anyone actually be negatively impacted by such a change?
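
For anyone who wants to check what their own install does, here is a minimal sketch of the width promotion described above. It is written for Python 3 (hence the bytes literals, which is an assumption on my part since the examples above are Python 2), and the helper name resulting_itemsize is purely illustrative, not part of Numpy:

```python
import numpy as np

def resulting_itemsize(values, dtype="S0"):
    """Return the itemsize Numpy actually chose for the requested dtype.

    Hypothetical helper for illustration only.
    """
    return np.array(values, dtype=dtype).dtype.itemsize

# Width grows to fit the longest input string.
longest = resulting_itemsize([b"abc", b"de"])

# All-empty input: 1 under the behavior described above,
# 0 if zero-width string arrays were allowed in general.
empty = resulting_itemsize([b"", b"", b""])

# Same question for an array created without any input strings.
zeros = np.zeros(3, dtype="S0").dtype.itemsize

print(longest, empty, zeros)
```

Code that only ever sees the first case should be unaffected by either fix; the second and third values are the ones the general fix could change.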
It seems to me that any code that actually relies on the existing behavior would smell fishy anyway.

Thanks,
Erik

[1] https://github.com/numpy/numpy/blob/8cb3ec6ab804f594daf553e53e7cf7478656bebd/numpy/core/src/multiarray/ctors.c#L940-L956
[2] https://github.com/numpy/numpy/commit/b022765aa487070866663b1707e4a2a0d8ead2e8
[3] https://github.com/numpy/numpy/issues/473
[4] https://github.com/numpy/numpy/pull/6430
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion