On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker wrote:

> 1) Use with/from Python -- both creating and working with numpy arrays.
>
> In this case, we want something compatible with Python's string (i.e. full
> Unicode supporting) and I think should be as transparent as possible.
> Python's string has made the decision to present a character oriented API
> to users (despite what the manifesto says...).

Yes, but NumPy doesn't really implement string operations, so fortunately
this is pretty irrelevant to us -- except for our API for specifying dtype
size.

We already have strong precedent for dtypes reflecting the number of bytes
used for storage even when Python doesn't: consider numeric types like int64
and float32 compared to the Python equivalents. It's an intrinsic aspect of
NumPy that users need to think about how their data is actually stored.

> However, there is a challenge here: numpy requires fixed-number-of-bytes
> dtypes. And full unicode support with fixed number of bytes matching fixed
> number of characters is only possible with UCS-4 -- hence the current
> implementation. And this is actually just fine! I know we all want to be
> efficient with data storage, but really -- in the early days of Unicode,
> when folks thought 16 bits were enough, doubling the memory usage for
> western language storage was considered fine -- how long in computer life
> time does it take to double your memory? But now, when memory, disk space,
> bandwidth, etc, are all literally orders of magnitude larger, we can't
> handle a factor of 4 increase in "wasted" space?

Storage cost is always going to be a concern. Arguably, it's even more of a
concern today than it used to be, because compute has been improving faster
than storage.

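To make the storage point concrete, here is a minimal sketch comparing the
per-element size of the existing unicode (UCS-4) and bytes dtypes. The width
of 10 is an arbitrary choice for illustration, not something from this thread:

```python
import numpy as np

n_chars = 10  # arbitrary width, chosen only for illustration

# Existing unicode dtype: UCS-4, so 4 bytes per character.
ucs4 = np.dtype(f"U{n_chars}")
# Existing bytes dtype: 1 byte per element.
one_byte = np.dtype(f"S{n_chars}")

print(ucs4.itemsize)      # 40
print(one_byte.itemsize)  # 10

# Numeric dtypes are likewise named after their storage size in bits.
print(np.dtype(np.int64).itemsize)    # 8
print(np.dtype(np.float32).itemsize)  # 4
```

For ASCII-only data the unicode dtype therefore uses four times the storage
of a one-byte-per-character layout, which is the factor of 4 under discussion.
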
> But as scientific text data often is 1-byte compatible, a
> one-byte-per-char dtype is a fine idea, too -- and we pretty much have that
> already with the existing string type -- that could simply be enhanced by
> enforcing the encoding to be latin-9 (or latin-1, if you don't want the
> Euro symbol). This would get us what scientists expect from strings in a
> way that is properly compatible with Python's string type. You'd get
> encoding errors if you tried to stuff anything else in there, and that's
> that.

I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

> So -- I think we should address the use-cases separately -- one for
> "normal" python use and simple interoperability with python strings, and
> one for interoperability at the binary level. And an easy way to convert
> between the two.
>
> For Python use -- a pointer to a Python string would be nice.

Yes, absolutely. If we want to be really fancy, we could consider a
parametric object dtype that allows for object arrays of *any* homogeneous
Python type. Even if NumPy itself doesn't do anything with that information,
there are lots of use cases for that information.

> Then use a native flexible-encoding dtype for everything else.

No opposition here from me. Though again, I think utf-8 alone would also be
enough.

> Thinking out loud -- another option would be to set defaults for the
> multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility
> with the python string type -- and make folks make an effort to get
> anything else.

The np.unicode_ type is already UCS-4 and the default for dtype=str on
Python 3. We probably shouldn't change that, but if we set any default
encoding for the new text type, I strongly believe it should be utf-8.

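As a rough illustration of the encoding trade-off, the sketch below measures
how many bytes a few sample strings need in UTF-8, Latin-1 and Latin-9. The
helper name encoded_size and the sample strings are made up for this example;
"iso8859-15" is the Python codec name for Latin-9:

```python
import numpy as np

def encoded_size(text, encoding):
    """Bytes needed to store `text` in `encoding`, or None if not encodable."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

# Sample strings: pure ASCII, "cafe" with e-acute, and the Euro sign.
samples = ["temperature_K", "caf\u00e9", "\u20ac"]
encodings = ["utf-8", "latin-1", "iso8859-15"]  # iso8859-15 is Latin-9

for text in samples:
    sizes = {enc: encoded_size(text, enc) for enc in encodings}
    print(repr(text), "chars:", len(text), "bytes:", sizes)

# dtype=str on Python 3 gives the UCS-4 unicode dtype: 4 bytes per character.
default_str = np.array(["abc"], dtype=str)
print(default_str.dtype.kind, default_str.dtype.itemsize)
```

ASCII text takes one byte per character in all three encodings. The
differences only show up for non-ASCII characters, where UTF-8 needs more
than one byte and Latin-1 cannot represent the Euro sign at all.
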
> One more note: if a user tries to assign a value to a numpy string array
> that doesn't fit, they should get an error:
>
> EncodingError if it can't be encoded into the defined encoding.
>
> ValueError if it is too long -- it should not be silently truncated.

I think we all agree here.

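For reference, a sketch of the difference between today's behaviour and the
behaviour proposed above. The checked_assign helper is hypothetical and not
part of NumPy; it only emulates the proposed checks on top of the existing
bytes dtype. Python has no built-in EncodingError, so this sketch lets
str.encode raise UnicodeEncodeError instead:

```python
import numpy as np

# Today: assigning an over-long string to a fixed-width array truncates silently.
current = np.zeros(1, dtype="U2")
current[0] = "abcdef"
print(current[0])  # ab

def checked_assign(arr, index, value, encoding="utf-8"):
    """Hypothetical helper: store `value` in a bytes array without truncation.

    Raises UnicodeEncodeError if `value` cannot be encoded in `encoding`,
    and ValueError if the encoded bytes do not fit in one array element.
    """
    encoded = value.encode(encoding)
    if len(encoded) > arr.dtype.itemsize:
        raise ValueError(
            f"{value!r} needs {len(encoded)} bytes, "
            f"but the dtype holds only {arr.dtype.itemsize}"
        )
    arr[index] = encoded

names = np.zeros(2, dtype="S4")
checked_assign(names, 0, "abcd")  # fits exactly
print(names[0])  # b'abcd'

try:
    checked_assign(names, 1, "abcde")  # one byte too long
except ValueError as exc:
    print("rejected:", exc)
```

Note that the length check is done in bytes, not characters, so with a
variable-width encoding such as UTF-8 a string can be rejected even when its
character count looks small enough.
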
_______________________________________________
NumPy-Discussion mailing list
https://mail.python.org/mailman/listinfo/numpy-discussion
