On Thu, Jan 23, 2014 at 11:49 AM, Chris Barker <chris.bar...@noaa.gov> wrote:

> Thanks for poking into this all. I've lost track a bit, but I think:
>
> The 'S' type is clearly broken on py3 (at least). I think that gives us
> room to change it, and backward compatibility is less of an issue because
> it's broken already -- do we need to preserve bug-for-bug compatibility?
> Maybe, but I suspect in this case, not -- the code that "works fine" on
> py3 with the 'S' type is probably only lucky that it hasn't encountered
> the issues yet.
>
> And no matter how you slice it, code being ported to py3 needs to deal
> with text handling issues.
>
> But here is where we stand:
>
> The 'S' dtype:
>
>  - was designed for one-byte-per-char text data.
>  - was mapped to the py2 string type.
>  - used the classic C null-terminated approach.
>  - can be used for arbitrary bytes (as the py2 string type can), but not
>    quite, as it truncates null bytes -- so it is really a bad idea to use
>    it that way.
>
> Under py3:
> The 'S' type maps to the py3 bytes type, because that's the closest to
> the py2 string type. But it also does some inconsistent things with
> encoding, and does treat a lot of other things as text. But the py3 bytes
> type does not have the same text handling as the py2 string type, so
> things like:
>
>     s = 'a string'
>     np.array((s,), dtype='S')[0] == s
>
> give you False, rather than True as on py2. This is because a py3 string
> is translated to the 'S' type (presumably with the default encoding,
> another maybe-not-a-good idea), but indexing returns a bytes object,
> which does not compare equal to a py3 string. You can work around this
> with various calls to encode() and decode(), and/or using b'a string',
> but that is ugly, kludgy, and doesn't work well with the py3 text model.
>
>
> The py2 => py3 transition separated bytes and strings: strings are
> unicode, and bytes are not to be used for text (directly).
> While there is some text-related functionality still in bytes, the core
> devs are quite clear that that is for special cases only, and not for
> general text processing.
>
> I don't think numpy should fight this, but rather embrace the py3 text
> model. The most natural way to do that is to use the existing 'U' dtype
> for text. Really the best solution for most cases. (Like the above case)
>
> However, there is a use case for a more efficient way to deal with text.
> There are a couple ways to go about that that have been brought up here:
>
> 1: Have a more efficient unicode dtype: variable length,
>    multiple encoding options, etc....
>    - This is a fine idea that would support better text handling in
>      numpy, and _maybe_ better interaction with external libraries
>      (HDF, etc...)
>
> 2: Have a one-byte-per-char text dtype:
>    - This would be much easier to implement and fit into the current
>      numpy model, and satisfy a lot of common use cases for scientific
>      data sets.
>
> We could certainly do both, but I'd like to see (2) get done sooner than
> later....
>

This is pretty much my sense of things at the moment. I think 1) is needed
in the long term but that 2) is a quick fix that solves most problems in
the short term.

> A related issue is whether numpy needs a dtype analogous to py3 bytes --
> I'm still not sure of the use-case there, so can't comment -- would it
> need to be fixed length (fitting into the numpy data model better) or
> variable length, or ??? Some folks are (apparently) using the current
> 'S' type in this way, but I think that's ripe for errors, due to the
> null bytes issue. Though maybe there is a null-bytes-are-special binary
> format that isn't text -- I have no idea.
>
> So what do we do with 'S'?
> It really is pretty broken, so we have a couple choices:
>
> (1) deprecate it, so that it stays around for backward compatibility
>     but encourage people to either use 'U' for text, or one of the new
>     dtypes that are yet to be implemented (maybe 's' for a
>     one-byte-per-char dtype), and use either uint8 or the new bytes dtype
>     that is yet to be implemented.
>
> (2) fix it -- in this case, I think we need to be clear what it is:
>     -- A one-byte-char-text type? If so, it should map to a py3 string,
>        and have a defined encoding (ascii or latin-1, probably), or even
>        better a settable encoding (but only for one-byte-per-char
>        encodings -- I don't think utf-8 is a good idea here, as a utf-8
>        encoded string is of unknown length). (There is some room for
>        debate here, as the 'S' type is fixed length and truncates anyway,
>        maybe it's fine for it to truncate utf-8 -- as long as it doesn't
>        partially truncate in the middle of a character.)
>

I think we should make it a one character encoded type compatible with str
in python 2, and maybe latin-1 in python 3. I'm thinking latin-1 because of
pep 393 where it is effectively a UCS-1, but ascii might be a bit more
flexible because it is a subset of utf-8 and might serve better in python 2.

>     -- a bytes type? in which case, we should clean out all the
>        automatic conversions to-from text that are in it now.
>

I'm not sure what to do about a bytes type.

> I vote for it being our one-byte text type -- it almost is already, and
> it would make the easiest transition for folks from py2 to py3. But
> backward compatibility is backward compatibility.
>

Not sure what to do here. It would be nice if S was a string type of given
encoding. Might be worth an experiment to see how much breaks.

>>> numpy arrays need a decode and encode method
>>
>> I'm not sure that they do.
>> Rather there needs to be a text dtype that
>> knows what encoding to use in order to have a binary interface as
>> exposed by .tostring() and friends but produce unicode strings
>> when indexed from Python code. Having both a text and a binary
>> interface to the same data implies having an encoding.
>
> I agree with Oscar here -- let's not conflate encoded and decoded data --
> the py3 text model is a fine one, we should work with it as much
> as practical.
>
> UNLESS: if we do add a bytes dtype, then it would be a reasonable use case
> to use it to store encoded text (just like the py3 bytes type), in which
> case it would be good to have encode() and decode() methods or ufuncs --
> probably ufuncs. But that should be for special-purpose,
> at-the-I/O-interface kind of stuff.
>

Chuck
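
For anyone who wants to see the behaviour discussed above concretely, here
is a minimal sketch, assuming Python 3 and a reasonably recent numpy. The
variable names are made up for illustration, and np.char.encode /
np.char.decode are the existing element-wise helpers, not the proposed
ufuncs or any new dtype.

```python
import numpy as np

s = 'a string'

# 'S' stores bytes: on py3 the text is encoded on the way in, and what
# comes back out is a bytes object.
as_bytes = np.array((s,), dtype='S')
elem_s = as_bytes.tolist()[0]      # plain Python bytes
s_equal = (elem_s == s)            # bytes never compare equal to str on py3

# 'U' stores unicode, so the round trip gives back a str.
as_text = np.array((s,), dtype='U')
elem_u = as_text.tolist()[0]
u_equal = (elem_u == s)

# Trailing null bytes cannot be told apart from padding in 'S'.
padded = np.array([b'ab\x00\x00'], dtype='S4')
elem_padded = padded.tolist()[0]

# Explicit, element-wise conversion between the two representations.
# latin-1 is just one possible one-byte-per-char choice (an assumption).
decoded = np.char.decode(as_bytes, 'latin-1')   # 'S' -> 'U'
encoded = np.char.encode(as_text, 'latin-1')    # 'U' -> 'S'
```

On py3 this should leave s_equal False and u_equal True, and elem_padded
comes back as b'ab', which is the null-bytes problem with using 'S' for
arbitrary binary data.
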
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion