On Wed, Jan 22, 2014 at 2:46 AM, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> BTW, as much as the fixed-width 'S' dtype doesn't really work for str in
> Python 3 it's also a poor fit for bytes since it strips trailing nulls:
>
> >>> a = np.array(['a\0s\0', 'qwert'], dtype='S')
> >>> a
> array([b'a\x00s', b'qwert'],
>       dtype='|S5')
> >>> a[0]
> b'a\x00s'

WHOOA! Good catch, Oscar.

This conversation started with me suggesting that 'S' on py3 should mean
"ascii string" (or latin-1 string).

Then it was pointed out that it was already being used for arbitrary bytes,
and thus could not be changed to mean a string without breaking already
working code.

However, if 'S' is assigning meaning to null bytes, and doing something
with that, then it is, indeed, being treated as an ANSI string (or the old
C string "type", anyway). And any code that is expecting it to be arbitrary
bytes is already broken, and in a way that could result in pretty subtle,
hard-to-find bugs in the future.

I think we really need a proper bytes dtype (which could be 'S' with the
null-byte handling removed), and a proper one-byte-per-character string type.

Though I still don't know the use case for a fixed-length bytes type that
can't be satisfied with the other numeric types, maybe:

    In [58]: bytes_15 = np.dtype(('B', 15))

though that doesn't in fact do what I expect:

    In [59]: arr = np.zeros((5,), dtype = bytes_15)

    In [60]: arr
    Out[60]:
    array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

Shouldn't I get a shape (5,) array, with each element a compound dtype with
15 bytes in it? How would I spell that?

By the way, from the docs for dtypes:

http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

"""
The first character specifies the kind of data and the remaining characters
specify how many bytes of data. The supported kinds are

'b'       Boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'S', 'a'  string
'U'       unicode
'V'       raw data (void)
"""

Could we use the 'a' for ascii string? (even though it now maps directly to
'S')

And by the way, the docs clearly say "string" there -- not bytes, so at the
very least we need to update the docs...

-Chris

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
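
For reference, here is a minimal sketch of the behaviour discussed in this message, plus two possible spellings of "a shape (5,) array whose elements are 15 bytes each". It assumes a reasonably recent NumPy on Python 3. The variable names and the field name 'data' are made up for illustration, and the 'V15' and structured-dtype spellings are only candidate answers to the question above, not something settled in this thread.

```python
import numpy as np

# 'S' drops trailing null bytes when an element is read back,
# but keeps embedded ones.
a = np.array([b'a\0s\0', b'qwert'], dtype='S5')
first = a[0]             # trailing null is stripped on access
raw = a.tobytes()[:5]    # the underlying buffer still holds all 5 bytes

# Viewing the same memory as 'V5' (void) exposes every byte of each item.
as_void = a.view('V5')

# A subarray dtype such as ('B', 15) is expanded into an extra array
# dimension when used as the top-level dtype of an array.
expanded = np.zeros(5, dtype=np.dtype(('B', 15)))

# Candidate 1: a void dtype, 15 opaque bytes per element.
void_arr = np.zeros(5, dtype='V15')

# Candidate 2: a structured dtype with one 15-byte subarray field
# (the field name 'data' is hypothetical).
rec = np.zeros(5, dtype=[('data', 'B', (15,))])

print(first, raw)
print(expanded.shape, void_arr.shape, rec.shape, rec['data'].shape)
```

Both candidates keep the array one-dimensional with a 15-byte itemsize; the structured version still lets the bytes be read as uint8 values through rec['data'].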
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion