> if you truncate a utf-8 bytestring, you may get invalid data

Note that in general truncating unicode codepoints is not a safe operation either, as combining characters are a thing. So I don't think this is a good argument against UTF-8.
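To make that concrete, a quick illustration in plain Python (nothing numpy-specific about it):

    s = "e\u0301"              # "é" spelled as base letter + combining acute accent
    print(len(s))              # 2 -- two codepoints, one visible character
    print(s[:1])               # "e" -- codepoint truncation silently drops the accent

    b = s.encode("utf-8")      # b'e\xcc\x81' -- three bytes
    try:
        b[:2].decode("utf-8")  # cut in the middle of the accent's two-byte sequence
    except UnicodeDecodeError as exc:
        print("invalid utf-8:", exc)

So a codepoint-level cut can corrupt text just as silently as a byte-level one -- and the byte-level cut at least fails loudly on decode.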
Also, is silent truncation a thing that we want to allow to happen anyway? That sounds like something the user ought to be alerted to with an exception.

> if you wanted to specify that a numpy element would be able to hold, say, N characters
> ...
> It simply is not the right way to handle text if [...] you need fixed-length storage

It seems to me that counting code points is pretty futile in unicode, due to combining characters. The only two meaningful things to count are:

* Graphemes, as that's what the user sees visually. These can span multiple code points.
* Bytes of encoded data, as that's the space needed to store them.

So I would argue that the approach of fixed-codepoint-length storage is itself a flawed design, and so should not be used as a constraint on numpy. Counting graphemes is hard, so that leaves the only sensible option as a byte count.

I don't foresee variable-length encodings being a problem implementation-wise - they only become one if numpy were to acquire a vectorized substring function that is intended to return a view.

I think I'd be in favor of supporting all encodings, and falling back on python to handle encoding/decoding them.
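For illustration, roughly the behavior I'd want from a fixed-width dtype -- null-pad short strings, raise on long ones rather than silently truncate (the helper name is made up; this is a sketch, not proposed API):

    def encode_fixed(s, nbytes, encoding="utf-8"):
        # Encode s for a fixed-width field of nbytes bytes, null-padded.
        # Raise instead of silently truncating.
        data = s.encode(encoding)
        if len(data) > nbytes:
            raise ValueError("encoded string needs %d bytes, field holds %d"
                             % (len(data), nbytes))
        return data.ljust(nbytes, b"\x00")

    encode_fixed("héllo", 8)   # b'h\xc3\xa9llo\x00\x00'
    encode_fixed("héllo", 5)   # raises ValueError -- 6 encoded bytes don't fit in 5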
On Thu, 20 Apr 2017 at 18:44 Chris Barker <chris.bar...@noaa.gov> wrote:

> On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <sho...@gmail.com> wrote:
>
>> I agree with Anne here. Variable-length encoding would be great to have,
>> but even fixed-length UTF-8 (in terms of memory usage, not characters)
>> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
>> fixed size per array element, but that doesn't mean we need a fixed size
>> per character. Each element in a UTF-8 array would be a string with a
>> fixed number of codepoints, not characters.
>
> Ah, yes -- the nightmare of Unicode!
>
> No, it would not be a fixed number of codepoints -- it would be a fixed
> number of bytes (or "code units") and an unknown number of characters.
>
> As Julian pointed out, if you wanted to specify that a numpy element would
> be able to hold, say, N characters (actually code points; combining
> characters make this even more confusing), then you would need to allocate
> N*4 bytes to make sure you could hold any string that long. Which would be
> pretty pointless -- better to use UCS-4.
>
> So Anne's suggestion that numpy truncate as needed would make sense --
> you'd specify, say, N characters, numpy would arbitrarily (or
> user-specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if
> someone passed in a string that didn't fit. Then you'd need to make sure
> you truncated correctly, so as not to create an invalid string (that's
> just code, it could be made correct).
>
> But how much to over-allocate? For English text, with an occasional
> scientific symbol, only a little. For, say, Japanese text, you'd need a
> factor of 2, maybe?
>
> Anyway, the idea that "just use utf-8" solves your problems is really
> dangerous. It simply is not the right way to handle text if:
>
> * you need fixed-length storage
> * you care about compactness
>
>> In fact, we already have this sort of distinction between element size
>> and memory usage: np.string_ uses null padding to store shorter strings
>> in a larger dtype.
>
> Sure -- but it is clear to the user that the dtype can hold "up to this
> many" characters.
>
>> The only reason I see for supporting encodings other than UTF-8 is for
>> memory-mapping arrays stored with those encodings, but that seems like a
>> lot of extra trouble for little gain.
>
> I see it the other way around -- the only reason TO support utf-8 is for
> memory mapping with other systems that use it :-)
>
> On the other hand, if we ARE going to support utf-8 -- maybe use it for
> all unicode support, rather than messing around with all the multiple
> encoding options.
>
> I think a 1-byte-per-char latin-* encoded string is a good idea though --
> scientific use tends to be latin-only and space-constrained.
>
> All that being said, if the truncation code were carefully written, it
> would mostly "just work"
>
> -CHB
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959 voice
> 7600 Sand Point Way NE   (206) 526-6329 fax
> Seattle, WA 98115        (206) 526-6317 main reception
>
> chris.bar...@noaa.gov
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
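To put some rough numbers on the over-allocation question and the latin-1 point above (the sample strings are just made-up examples):

    for name, text in [("english", "numpy arrays"),
                       ("latin", "café résumé"),
                       ("japanese", "数値計算ライブラリ")]:
        print(name, len(text), "codepoints ->",
              len(text.encode("utf-8")), "bytes utf-8 vs",
              4 * len(text), "bytes ucs-4")
    # english:  12 codepoints -> 12 bytes utf-8 vs 48 bytes ucs-4
    # latin:    11 codepoints -> 14 bytes utf-8 vs 44 bytes ucs-4
    # japanese:  9 codepoints -> 27 bytes utf-8 vs 36 bytes ucs-4

    # and latin-1 really is one byte per character where it applies:
    print(len("café résumé".encode("latin-1")))   # 11

So the "right" over-allocation factor really does depend on the script -- roughly 1x for English, up to 3x per codepoint for CJK text.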