On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.face...@gmail.com> wrote:
> Is there any reason not to support all Unicode encodings that python
> does, with the same names and semantics? This would surely be the
> simplest to understand.

I think it should support all fixed-length encodings, but not the variable-length ones -- they just don't fit well into the numpy data model.

> Also, if latin1 is to going to be the only practical 8-bit encoding, maybe
> check with some non-Western users to make sure it's not going to wreck
> their lives? I'd have selected ASCII as an encoding to treat specially, if
> any, because Unicode already does that and the consequences are familiar.
> (I'm used to writing and reading French without accents because it's passed
> through ASCII, for example.)

latin-1 (or latin-9) only makes things better than ASCII -- it buys most of the accented characters for the European languages, plus some symbols that are nice to have (I use the degree symbol a lot...). And it is ASCII compatible -- so there is NO reason to choose ASCII over latin-*.

That does no good for non-Latin languages, though -- so we need to hear from the community: is there substantial demand for a non-Latin one-byte-per-character encoding?

> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoded length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type.

We could do that, yes, but an improperly truncated "string" becomes invalid -- it just seems like a recipe for bugs that won't be found in testing. Memory is cheap and compression is fast -- we really shouldn't get hung up on this!

Note: if you are storing a LOT of text (though I have no idea why you would use numpy for that anyway), then the memory size might matter -- but then semi-arbitrary truncation would probably matter, too.
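To make the truncation point concrete, here is a small sketch in plain Python (not a proposed numpy API) of why cutting UTF-8 at a fixed byte width can corrupt the data, while a latin-1 field of the same width cannot:

```python
# Truncating UTF-8 at a fixed byte width can land in the middle of a
# multi-byte character, leaving bytes that no longer decode at all.

text = "résumé"                 # 6 characters, 8 bytes in UTF-8
encoded = text.encode("utf-8")  # b'r\xc3\xa9sum\xc3\xa9'

truncated = encoded[:7]         # naive cut to a 7-byte field

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8 after truncation:", exc)

# A latin-1 field has no such failure mode: every byte is a complete,
# valid character, so a cut can only shorten the text, never break it.
latin = text.encode("latin-1")      # 6 bytes, one per character
print(latin[:5].decode("latin-1"))  # 'résum'
```

The data loss from latin-1 truncation is still data loss, of course -- the point is only that the result remains a valid (if shortened) string, whereas the truncated UTF-8 bytes are garbage.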
I expect most text storage in numpy arrays is things like names of datasets, ids, etc. -- not massive amounts of text -- so storage space really isn't critical. But having an id or something unexpectedly truncated could be bad.

I think practical experience has shown us that people do not handle "mostly fixed length but once in a while not" text well -- see the nightmare of UTF-16 on Windows. Granted, UTF-8 is multi-byte far more often, so errors are far more likely to be found in tests (why would you use UTF-8 if all your data are ASCII???). But still -- why invite hard-to-test-for errors?

Final point -- as Julian suggests, one reason to support UTF-8 is interoperability with other systems -- but that makes errors more of an issue: if the data doesn't pass through the numpy truncation machinery, invalid data could easily get put in a numpy array.

-CHB

> it would allow UTF-8 to be used just the way it usually is - as an
> encoding that's almost 8-bit.

Ouch! That perception is the route to way too many errors! It is by no means almost 8-bit, unless your data are almost all ASCII -- in which case, use latin-1 for pity's sake! This highlights my point, though -- if we support UTF-8, people WILL use it, test it only with mostly-ASCII text, and not find the bugs that will crop up later.

> All this said, it seems to me that the important use cases for string
> arrays involve interaction with existing binary formats, so people who have
> to deal with such data should have the final say. (My own closest approach
> to this is the FITS format, which is restricted by the standard to ASCII.)

Yup -- not sure we'll get much guidance here, though -- netcdf does not solve this problem well, either. But if you are pulling, say, a UTF-8 encoded string out of a netcdf file, it's probably better to pull it out as bytes and pass it through the python decoding/encoding machinery than to paste the bytes straight into a numpy array and hope that the encoding and truncation are correct.
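As a hedged illustration of that last point (the byte string here stands in for data read from an external file; no netcdf API is involved), compare decoding through Python's codec machinery with stuffing raw bytes into a fixed-width numpy field:

```python
import numpy as np

raw = "café".encode("utf-8")    # 5 bytes for 4 characters

# Explicit route: decode first, then store. A malformed byte stream
# raises here, where it can be caught and handled.
arr = np.array([raw.decode("utf-8")])   # numpy's UCS-4 'U' dtype

# Risky route: raw bytes pasted into a fixed-width byte field.
# numpy silently truncates to 4 bytes, cutting the 'é' in half.
barr = np.array([raw], dtype="S4")
print(barr[0])                  # b'caf\xc3' -- no longer valid UTF-8

try:
    barr[0].decode("utf-8")
except UnicodeDecodeError:
    print("truncated bytes are not decodable")
```

The truncation in the second route happens with no warning, so the corruption only surfaces later, when someone finally tries to decode the bytes.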
-CHB

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion