On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker <chris.bar...@noaa.gov> wrote:
> latin-1 or latin-9 buys you (over ASCII):
>
> ...
>
> - round-tripping of binary data (at least with Python's encoding/decoding)
> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
> same bytes back. You may get garbage, but you won't get an EncodingError.

For a new application, it's a good thing if a text type breaks when you try to stuff arbitrary bytes into it (see Python 2 vs. Python 3 strings). Certainly, I would argue that nobody should write data in latin-1 unless they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be used by default by loaders for legacy file formats/applications (e.g., netCDF3) that support unspecified "one-byte strings." Then you're a few short calls away from viewing (i.e., array.view('text[my_real_encoding]'), if we support arbitrary encodings) or decoding (i.e., np.char.decode(array.view(bytes), 'my_real_encoding')) the data in the proper encoding. It's not realistic to expect users to know the true encoding for strings from a file before they even look at the data.

On the other hand, if this is the use case, perhaps we really want an encoding closer to "Python 2" strings, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.

>>> Then use a native flexible-encoding dtype for everything else.
>>
>> No opposition here from me. Though again, I think utf-8 alone would also
>> be enough.
>
> maybe so -- the major reason for supporting others is binary data exchange
> with other libraries -- but maybe most of them have gone to utf-8 anyway.

Indeed, it would be helpful for this discussion to know what other encodings are actually currently used by scientific applications.
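For anyone following along, the round-tripping property mentioned above is easy to demonstrate in plain Python (a quick sketch, nothing numpy-specific):

```python
# Any sequence of bytes decodes as latin-1, because latin-1 maps each of
# the 256 possible byte values to exactly one code point (U+0000..U+00FF).
raw = bytes(range(256))                # every possible byte value
text = raw.decode('latin-1')           # never raises UnicodeDecodeError
assert text.encode('latin-1') == raw   # re-encoding restores the bytes

# The same bytes are NOT valid utf-8, so a utf-8 text type would refuse them:
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('utf-8 rejects arbitrary bytes')
```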
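The decoding route sketched above already works today if the existing 'S' dtype stands in for the hypothetical one-byte text type (latin-1 below is just a placeholder for whatever 'my_real_encoding' turns out to be for a given file):

```python
import numpy as np

# Bytes as a loader might hand them back from a legacy file; latin-1 is
# a stand-in for whatever encoding the file actually used.
raw = np.array([b'caf\xe9', b'na\xefve'], dtype='S5')

# Once the user identifies the real encoding, decode element-wise
# into a proper unicode ('U') array:
decoded = np.char.decode(raw, 'latin-1')
print(decoded)   # ['café' 'naïve']
```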
So far, we have real use cases for at least UTF-8, UTF-32, ASCII, and "unknown".

The current 'S' dtype truncates silently already. One advantage of a new (non-default) dtype is that we can change this behavior.

> Also -- if utf-8 is the default -- what do you get when you create an
> array from a python string sequence? Currently with the 'S' and 'U' dtypes,
> the dtype is set to the longest string passed in. Are we going to pad it a
> bit? stick with the exact number of bytes?

It might be better to avoid this for now, and force users to be explicit about encoding if they use the dtype for encoded text. We can keep bytes/str mapped to the current choices.
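Both current behaviors referenced above are easy to check:

```python
import numpy as np

# Silent truncation with the existing 'S' dtype: the value is cut to fit
# the itemsize, with no warning or error.
a = np.array([b'abcde'], dtype='S3')
print(a[0])       # b'abc' -- trailing bytes dropped silently

# Itemsize inference: with no explicit size, 'S'/'U' use the longest input.
b = np.array(['a', 'abcd'])
print(b.dtype)    # <U4
```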
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion