On 22/02/15 19:21, Aldcroft, Thomas wrote:
> Problems like this are now showing up in the wild [3]. Workarounds are
> also showing up, like a way to easily convert from 'S' to 'U' within
> astropy Tables [4], but this is really not a desirable way to go.
> Gigabyte-sized string data arrays are not uncommon, so converting to
> UCS-4 is a real memory and performance hit.

Why UCS-4? Python's internal "flexible string representation" will use
ASCII for ASCII text. Under PEP 393 an application should not assume an
internal string representation at all:

https://www.python.org/dev/peps/pep-0393/

If the problem is a PEP 393 violation in the NumPy string or unicode
dtype, we shouldn't violate it even further by adding a latin-1 encoded
ascii string. We should let Python represent strings as it wants, and it
will not bloat.

I am -1 on adding latin-1 and +1 on making the unicode dtype PEP 393
compliant if it is not. And on Python 3 'U' and 'S' should just be
synonyms.

You can also store an array of bytes with uint8. Then you can decode it
however you like to make a Python string. If it is encoded as latin-1,
then decode it as latin-1:

    In [1]: import numpy as np

    In [2]: ascii_bytestr = "The quick brown fox jumps over the lazy dog".encode('latin-1')

    In [3]: numpy_bytestr = np.array(memoryview(ascii_bytestr))

    In [4]: numpy_bytestr.dtype, numpy_bytestr.shape
    Out[4]: (dtype('uint8'), (43,))

    In [5]: bytes(numpy_bytestr).decode('latin-1')
    Out[5]: 'The quick brown fox jumps over the lazy dog'

Sturla
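
To illustrate the PEP 393 point about ASCII text not bloating, here is a minimal sketch, assuming CPython 3.3 or later. The helper name bytes_per_char is hypothetical, and the measurement relies on sys.getsizeof, which reports a CPython implementation detail and not a language guarantee. It estimates how many bytes a Python str spends per character, depending on the widest code point it contains:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage of a str made only of `ch`.

    Hypothetical helper for illustration. Differencing the sizes of two
    strings of different lengths cancels the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),
    ("BMP", "\u03a9"),
    ("non-BMP", "\U0001F600"),
]

for label, ch in samples:
    # On CPython 3.3+ this is expected to print 1, 1, 2 and 4 respectively.
    print(label, bytes_per_char(ch))
```

Under these assumptions, a str holding only ASCII or Latin-1 characters costs one byte per character, and the four-byte representation is used only when a string contains a code point outside the Basic Multilingual Plane.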