On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <sho...@gmail.com> wrote:
>> In this case, we want something compatible with Python's string (i.e. full
>> Unicode supporting) and I think should be as transparent as possible.
>> Python's string has made the decision to present a character oriented API
>> to users (despite what the manifesto says...).
>
> Yes, but NumPy doesn't really implement string operations, so fortunately
> this is pretty irrelevant to us -- except for our API for specifying dtype
> size.

Exactly -- the character-orientation of Python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be cognitive dissonance if
someone does:

    arr[i] = a_string

which then raises a ValueError, something like:

    String too long for a string[12] dtype array.

when len(a_string) <= 12.

AND that will only occur if there are non-ascii characters in the string,
and maybe only if there are more than N non-ascii characters. i.e. it is
very likely to be a run-time error that may not have shown up in tests.

So folks need to do something like:

    len(a_string.encode('utf-8'))

to see if their string will fit. If not, they need to truncate it, and
THAT is non-obvious how to do, too -- you don't want to truncate the
encoded bytes naively, because you could end up with an invalid bytestring,
but you don't know how many characters to truncate, either.

> We already have strong precedent for dtypes reflecting number of bytes
> used for storage even when Python doesn't: consider numeric types like
> int64 and float32 compared to the Python equivalents. It's an intrinsic
> aspect of NumPy that users need to think about how their data is actually
> stored.

sure, but a float64 is 64 bits forever and always, and the defaults
perfectly match what Python is doing under its hood -- even if users don't
think about it. So the default behaviour of numpy matches Python's built-in
types.

>> Storage cost is always going to be a concern.
>> Arguably, it's even more of a concern today than it used to be, because
>> compute has been improving faster than storage.

sure -- but again, what is the use-case for numpy arrays with a s#$)load of
text in them? common? I don't think so. And as you pointed out, numpy
doesn't do text processing anyway, so cache performance and all that are
not important. So having UCS-4 as the default, but allowing folks to select
a more compact format if they really need it, is a good way to go. Just
like numpy generally defaults to float64 and int64 (or 32, depending on
platform) -- users can select a smaller size if they have a reason to.

I guess that's my summary -- just like with numeric values, numpy should
default to Python-like behavior as much as possible for strings, too --
with an option for a knowledgeable user to do something more performant.

> I still don't understand why a latin encoding makes sense as a preferred
> one-byte-per-char dtype. The world, including Python 3, has standardized on
> UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

utf-8 is NOT a one-byte per char encoding. IF you want to assure that your
data are one-byte per char, then you could use ASCII, and it would be
binary compatible with utf-8, but I'm not sure what the point of that is in
this context.

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure, it only covers the latin
  languages, but it does cover those much better.

- A handful of other characters, including scientifically useful ones
  (a few greek characters, the degree symbol, etc...).

- Round-tripping of binary data (at least with Python's latin-1
  encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
  re-encoded to get the same bytes back. You may get garbage, but you
  won't get a UnicodeDecodeError.

>> For Python use -- a pointer to a Python string would be nice.
>
> Yes, absolutely.
> If we want to be really fancy, we could consider a
> parametric object dtype that allows for object arrays of *any* homogeneous
> Python type. Even if NumPy itself doesn't do anything with that
> information, there are lots of use cases for that information.

hmm -- that's a nifty idea -- though I think strings could/should be
special cased.

>> Then use a native flexible-encoding dtype for everything else.
>
> No opposition here from me. Though again, I think utf-8 alone would also
> be enough.

maybe so -- the major reason for supporting others is binary data exchange
with other libraries -- but maybe most of them have gone to utf-8 anyway.

>> One more note: if a user tries to assign a value to a numpy string array
>> that doesn't fit, they should get an error:
>>
>> EncodingError if it can't be encoded into the defined encoding.
>>
>> ValueError if it is too long -- it should not be silently truncated.
>
> I think we all agree here.

I'm actually having second thoughts -- see above -- if the encoding is
utf-8, then truncating is non-trivial -- maybe it would be better for numpy
to do it for you. Or set a flag as to which you want?

The current 'S' dtype truncates silently already:

In [6]: arr
Out[6]: array(['this', 'that'], dtype='|S4')

In [7]: arr[0] = "a longer string"

In [8]: arr
Out[8]: array(['a lo', 'that'], dtype='|S4')

(similarly for the unicode type)

So at least we are used to that.

BTW -- maybe we should keep the pathological use-case in mind: really short
strings. I think we are all thinking in terms of longer strings, maybe a
name field, where you might assign 32 bytes or so -- then someone has an
accented character in their name, and they get 30 or 31 characters -- no
big deal.

But what if you have a simple label or something with one or two
characters? Then you have 2 bytes to store the name in, and someone tries
to put an "odd" character in there, and you get an empty string. Not good.
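To make the short-string case concrete, here is a minimal sketch in plain Python (no NumPy involved) of truncating to a byte budget without splitting a character. The helper name truncate_utf8 is hypothetical; it only illustrates what "truncate for you" could mean for a utf-8 dtype, and is not an existing or proposed NumPy function.

```python
def truncate_utf8(text, max_bytes):
    """Longest prefix of `text` whose UTF-8 encoding fits in `max_bytes`.

    Hypothetical helper for illustration only; not part of NumPy.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting valid UTF-8 at an arbitrary byte can only leave one incomplete
    # sequence at the very end; errors="ignore" drops those trailing bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


name = "na\u00efve"   # "naive" with a diaeresis on the i
euro = "\u20ac"       # euro sign, 3 bytes in UTF-8

print(len(name), len(name.encode("utf-8")))  # 5 characters, 6 bytes
print(repr(truncate_utf8(name, 3)))          # 'na' -- the 2-byte character does not fit
print(repr(truncate_utf8(euro, 2)))          # ''   -- nothing fits in a 2-byte field
```

The last call is the pathological case: a field sized for two ASCII characters cannot hold even one character that needs three bytes.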
Also -- if utf-8 is the default -- what do you get when you create an array
from a Python string sequence? Currently with the 'S' and 'U' dtypes, the
dtype is set to the longest string passed in. Are we going to pad it a bit?
Stick with the exact number of bytes?

It all comes down to this:

Python3 has made a very deliberate (and I think Good) choice to treat text
as a string of characters, where the user does not need to know or care
about encoding issues. Numpy's defaults should do the same thing.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov