On Wed, Jan 22, 2014 at 2:46 AM, Oscar Benjamin
<oscar.j.benja...@gmail.com> wrote:

> BTW, as much as the fixed-width 'S' dtype doesn't really work for str in
>  Python 3 it's also a poor fit for bytes since it strips trailing nulls:
>
> >>> a = np.array(['a\0s\0', 'qwert'], dtype='S')
> >>> a
> array([b'a\x00s', b'qwert'],
>       dtype='|S5')
> >>> a[0]
> b'a\x00s'


WHOOA!  Good catch, Oscar.

This conversation started with me suggesting that 'S' on py3 should mean
"ascii string" (or latin-1 string).

Then it was pointed out that it was already being used for arbitrary bytes,
and thus could not be changed to mean a string without breaking already
working code.

However, if 'S' is assigning meaning to null bytes, and doing something
with that, then it is, indeed, being treated as an ANSI string (or the old C
string "type", anyway). And any code that is expecting it to be arbitrary
bytes is already broken, and in a way that could result in pretty subtle,
hard-to-find bugs in the future.
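
To make the failure mode concrete, here is a minimal sketch (assuming Python 3
and a reasonably recent NumPy; the variable names are only illustrative) of
bytes that do not survive a round trip through an 'S' array, even though the
underlying buffer still holds them:

```python
import numpy as np

raw = b'a\x00s\x00'                      # 4 bytes, ending in a null
arr = np.array([raw, b'qwert'], dtype='S5')

item = arr[0]                            # element access strips trailing nulls
round_trip_ok = (item == raw)            # False: the final b'\x00' is gone

# The array's memory is untouched: each element is padded to 5 bytes.
buffer_bytes = arr.tobytes()

print(item, round_trip_ok, buffer_bytes)
```

So the data is not lost in memory, only on element access, which is exactly
the kind of thing that is easy to miss.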

I think we really need a proper bytes dtype (which could be 'S' with the
null byte thing removed), and a proper one-byte-per-character string type.

Though I still don't know the use case for the fixed-length bytes type that
can't be satisfied with the other numeric types, maybe:

In [58]: bytes_15 = np.dtype(('B', 15))


though that doesn't in fact do what I expect:

In [59]: arr = np.zeros((5,), dtype = bytes_15)

In [60]: arr
Out[60]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

Shouldn't I get a shape (5,) array, with each element a compound dtype with
15 bytes in it?

How would I spell that?
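
Two possible spellings, as a sketch rather than a settled answer: wrap the
sub-array in a single-field structured dtype (the field name 'data' is made up
for this example), or use the void kind 'V15' for opaque raw bytes. Both give
a shape (5,) array with 15 bytes per element:

```python
import numpy as np

# Option 1: structured dtype with one field holding 15 unsigned bytes.
# The field name 'data' is hypothetical, chosen only for illustration.
bytes_15 = np.dtype([('data', 'B', (15,))])
arr = np.zeros((5,), dtype=bytes_15)
print(arr.shape, arr.dtype.itemsize)     # shape (5,), 15 bytes per element
print(arr[0]['data'].shape)              # each element exposes a (15,) uint8 view

# Option 2: void dtype, i.e. 15 raw bytes with no interpretation attached.
raw = np.zeros((5,), dtype='V15')
print(raw.shape, raw.dtype.itemsize)

# Viewing an existing buffer as void keeps trailing nulls on element access.
buf = b'a\x00s\x00\x00qwert'
void_view = np.frombuffer(buf, dtype='V5')
first = void_view[0].tobytes()
print(first)
```

The bare `('B', 15)` sub-array dtype gets folded into the array's shape, which
is why the plain `np.zeros` call comes back as a 2-d uint8 array.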

By the way, from the docs for dtypes:

http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

"""
The first character specifies the kind of data and the remaining characters
specify how many bytes of data. The supported kinds are

'b'  Boolean
'i'  (signed) integer
'u'  unsigned integer
'f'  floating-point
'c'  complex-floating point
'S', 'a',  string
'U'  unicode
'V'  raw data (void)
"""
Could we use the 'a' for ascii string? (even though it now maps directly
to 'S')

And by the way, the docs clearly say "string" there -- not bytes, so at the
very least we need to update the docs...
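
For reference, the kind codes quoted from the docs can be inspected directly
on a dtype. This is only a small sketch, assuming a reasonably recent NumPy;
the 'a' alias is left out on purpose because its availability differs between
NumPy versions:

```python
import numpy as np

# Map a few type strings to (kind, itemsize in bytes).
codes = ['i4', 'u2', 'f8', 'S5', 'U5', 'V5']
info = {code: (np.dtype(code).kind, np.dtype(code).itemsize) for code in codes}

for code, (kind, size) in info.items():
    print(code, kind, size)
```

Note that for 'U' the number counts characters, not bytes, so 'U5' occupies
20 bytes, while 'S5' and 'V5' occupy 5.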

-Chris


Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
