On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker <chris.bar...@noaa.gov> wrote:
> On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <sho...@gmail.com> wrote:
>
>>> In this case, we want something compatible with Python's string (i.e.
>>> fully Unicode-supporting) and I think it should be as transparent as
>>> possible. Python's string has made the decision to present a
>>> character-oriented API to users (despite what the manifesto says...).
>>
>> Yes, but NumPy doesn't really implement string operations, so fortunately
>> this is pretty irrelevant to us -- except for our API for specifying
>> dtype size.
>
> Exactly -- the character orientation of Python strings means that people
> are used to thinking that strings have a length that is the number of
> characters in the string. I think there will be cognitive dissonance if
> someone does:
>
>     arr[i] = a_string
>
> which then raises a ValueError, something like:
>
>     String too long for a string[12] dtype array.
>
> when len(a_string) <= 12.
>
> AND that will only occur if there are non-ASCII characters in the string,
> and maybe only if there are more than N non-ASCII characters -- i.e. it is
> very likely to be a run-time error that may not have shown up in tests.
>
> So folks need to do something like len(a_string.encode('utf-8')) to see
> if their string will fit. If not, they need to truncate it, and THAT is
> non-obvious how to do, too -- you don't want to truncate the encoded
> bytes naively, or you could end up with an invalid bytestring, but you
> don't know how many characters to truncate, either.
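To make the problem concrete: byte length and character length diverge as
soon as non-ASCII characters appear, and a safe truncation has to land on a
character boundary. A minimal sketch of the helper folks would end up
writing themselves (truncate_utf8 is a hypothetical name here, not an
existing NumPy API):

    def truncate_utf8(s, max_bytes):
        """Longest prefix of s whose UTF-8 encoding fits in max_bytes."""
        encoded = s.encode('utf-8')
        if len(encoded) <= max_bytes:
            return s
        # Slicing the bytes naively can split a multi-byte character;
        # decoding with errors='ignore' drops a trailing partial sequence.
        return encoded[:max_bytes].decode('utf-8', errors='ignore')

    s = 'caf\xe9'                    # 4 characters...
    print(len(s.encode('utf-8')))    # ...but 5 bytes in UTF-8
    print(truncate_utf8(s, 4))       # 'caf' -- the accented char can't fit whole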
>> We already have a strong precedent for dtypes reflecting the number of
>> bytes used for storage even when Python doesn't: consider numeric types
>> like int64 and float32 compared to the Python equivalents. It's an
>> intrinsic aspect of NumPy that users need to think about how their data
>> is actually stored.
>
> Sure, but a float64 is 64 bits forever and always, and the defaults
> perfectly match what Python is doing under its hood -- even if users
> don't think about it. So the default behaviour of numpy matches Python's
> built-in types.
>
>>> Storage cost is always going to be a concern. Arguably, it's even more
>>> of a concern today than it used to be, because compute has been
>>> improving faster than storage.
>
> Sure -- but again, what is the use-case for numpy arrays with a s#$)load
> of text in them? Common? I don't think so. And as you pointed out, numpy
> doesn't do text processing anyway, so cache performance and all that are
> not important. So having UCS-4 as the default, but allowing folks to
> select a more compact format if they really need it, is a good way to go.
> Just like numpy generally defaults to float64 and int64 (or 32, depending
> on platform) -- users can select a smaller size if they have a reason to.
>
> I guess that's my summary -- just like with numeric values, numpy should
> default to Python-like behavior as much as possible for strings, too --
> with an option for a knowledgeable user to do something more performant.
>
>> I still don't understand why a latin encoding makes sense as a preferred
>> one-byte-per-char dtype. The world, including Python 3, has standardized
>> on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>
> utf-8 is NOT a one-byte-per-char encoding. IF you want to assure that
> your data are one byte per char, then you could use ASCII, and it would
> be binary compatible with utf-8, but I'm not sure what the point of that
> is in this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure, it only covers the Latin
> languages, but it does cover those much better.
>
> - A handful of other characters, including scientifically useful ones (a
> few Greek characters, the degree symbol, etc.).
>
> - Round-tripping of binary data (at least with Python's encoding/decoding)
> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get
> the same bytes back. You may get garbage, but you won't get an
> EncodingError.

+1. The key point is that there is a HUGE amount of legacy science data in
the form of FITS (an astronomy-specific binary file format that has been
the primary file format for 20+ years) and HDF5, both of which use a
character data type to store data which can be bytes 0-255. Getting a
decoding/encoding error when trying to deal with these datasets is a
non-starter from my perspective.
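A quick illustration of that round-trip property in plain Python (nothing
NumPy-specific): every one of the 256 possible byte values decodes under
latin-1, while utf-8 rejects arbitrary binary data outright.

    raw = bytes(range(256))   # every possible byte value, 0-255
    # latin-1 decodes anything and re-encodes to the identical bytes:
    assert raw.decode('latin-1').encode('latin-1') == raw
    # utf-8 does not -- arbitrary binary data raises an error:
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        print(exc)   # 'utf-8' codec can't decode byte 0x80 ...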
>>> For Python use -- a pointer to a Python string would be nice.
>>
>> Yes, absolutely. If we want to be really fancy, we could consider a
>> parametric object dtype that allows for object arrays of *any*
>> homogeneous Python type. Even if NumPy itself doesn't do anything with
>> that information, there are lots of use cases for that information.
>
> Hmm -- that's a nifty idea -- though I think strings could/should be
> special-cased.
>
>>> Then use a native flexible-encoding dtype for everything else.
>>
>> No opposition here from me. Though again, I think utf-8 alone would
>> also be enough.
>
> Maybe so -- the major reason for supporting others is binary data
> exchange with other libraries -- but maybe most of them have gone to
> utf-8 anyway.
>
>>> One more note: if a user tries to assign a value to a numpy string
>>> array that doesn't fit, they should get an error:
>>>
>>> EncodingError if it can't be encoded into the defined encoding.
>>>
>>> ValueError if it is too long -- it should not be silently truncated.
>>
>> I think we all agree here.
>
> I'm actually having second thoughts -- see above -- if the encoding is
> utf-8, then truncating is non-trivial -- maybe it would be better for
> numpy to do it for you. Or set a flag as to which you want?
>
> The current 'S' dtype truncates silently already:
>
>     In [6]: arr
>     Out[6]:
>     array(['this', 'that'],
>           dtype='|S4')
>
>     In [7]: arr[0] = "a longer string"
>
>     In [8]: arr
>     Out[8]:
>     array(['a lo', 'that'],
>           dtype='|S4')
>
> (similarly for the unicode type)
>
> So at least we are used to that.
>
> BTW -- maybe we should keep the pathological use-case in mind: really
> short strings. I think we are all thinking in terms of longer strings,
> maybe a name field, where you might assign 32 bytes or so -- then
> someone has an accented character in their name, and they get 30 or 31
> characters -- no big deal.

I wouldn't call it a pathological use case; it doesn't seem so uncommon to
have large datasets of short strings. I personally deal with a database of
hundreds of billions of 2 to 5 character ASCII strings. This has been a
significant blocker to Python 3 adoption in my world.

BTW, for those new to the list or with a short memory, this topic has been
discussed fairly extensively at least 3 times before. Hopefully the
*fourth* time will be the charm!

https://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
https://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
https://mail.scipy.org/pipermail/numpy-discussion/2015-February/072311.html

- Tom

> But what if you have a simple label or something with 1 or 2 characters?
> Then you have 2 bytes to store the name in, and someone tries to put an
> "odd" character in there, and you get an empty string. Not good.
>
> Also -- if utf-8 is the default -- what do you get when you create an
> array from a Python string sequence? Currently with the 'S' and 'U'
> dtypes, the dtype is set to the longest string passed in. Are we going
> to pad it a bit? Stick with the exact number of bytes?
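For reference, the sizing and truncation behavior being discussed is easy
to demonstrate with the current fixed-width dtypes (NumPy 1.x behavior as
of this thread):

    import numpy as np

    arr = np.array(['this', 'that'])  # itemsize is set by the longest input
    print(arr.dtype)                  # <U4
    arr[0] = 'a longer string'        # silently truncated, no error raised
    print(arr[0])                     # 'a lo'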
> It all comes down to this:
>
> Python3 has made a very deliberate (and I think Good) choice to treat
> text as a string of characters, where the user does not need to know or
> care about encoding issues. Numpy's defaults should do the same thing.
>
> -CHB
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> chris.bar...@noaa.gov