On Fri, Jan 17, 2014 at 12:36 PM, <josef.p...@gmail.com> wrote:

> > ('S' ?) -- which is probably not what you want particularly if you specify
> > an encoding. Though I can't figure out at the moment why the previous one
> > failed -- where did the bytes object come from when the encoding was
> > specified?
>
> Yes, it's a utf-8 file with nonascii.
>
> I don't know what I **should** want.

well, you **should** want:

The numbers parsed out for you (otherwise, why use recfromtxt), and the
text as properly decoded unicode strings.

Python does very well with unicode -- and you are MUCH happier if you do
the encoding/decoding as close to I/O as possible. recfromtxt is, in a way,
decoding already, converting the ascii representation of numbers to an
internal binary representation -- why not handle the text at the same time?

There certainly are use cases for keeping the text as encoded bytes, but
I'd say those fall into the categories of:

1) Special case
2) You should know what you are doing.

So having recfromtxt auto-determine that for you makes little sense.

Note that if you don't know the file encoding, this is tricky. My thoughts:

1) don't use the system default encoding!!! (see my other note on that!)

2) Either:
  a) open as a binary file and use bytes for anything that doesn't parse
     as text -- this means that the user will need to do the conversion
     to text themselves

  b) decode as latin-1: this would work well for ascii and _some_
     non-ascii text, and would be recoverable for ALL text.

I prefer (b). The point here is that if the user gets bytes, then they
will either have to assume ascii, or need to hand-decode it. And if they
just want to assume ascii, they have a bytes object with limited text
functionality, so they will probably need to decode it anyway (unless they
are just passing it through).

If the user gets unicode objects that may not be properly decoded, they
can either assume it was ascii, and if they only do ascii-compatible
things with it, it will work. Or they can encode/decode it and get the
proper stuff back, but only if they know the encoding, and if that's the
case, why did they not specify that in the first place?

> For now I do want bytes, because that's how I changed statsmodels in
> the py3 conversion.
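
For what it's worth, wanting bytes and getting latin-1-decoded text are not
really in conflict, because the bytes can always be recovered. Here is a
minimal sketch of the round trip behind option (b) above, using only the
standard library. The sample field and the variable names are made up for
illustration; this is not what recfromtxt does today.

```python
# Hypothetical sample: one line of a UTF-8 file with a non-ASCII text field
# ("Z\u00fcrich" is "Zurich" with a u-umlaut) and a numeric field.
raw = "Z\u00fcrich,47.37\n".encode("utf-8")

# Option (b): decode as latin-1. Every byte value 0-255 maps to a code point,
# so this never raises, though the non-ASCII part comes out as mojibake.
text = raw.decode("latin-1")

# The ASCII-only parts are usable right away, e.g. the number.
name, value = text.strip().split(",")
number = float(value)

# Encoding back to latin-1 restores the original bytes exactly ...
recovered = text.encode("latin-1")

# ... so the properly decoded text is still available once the real
# encoding is known.
fixed_name = name.encode("latin-1").decode("utf-8")
```

The catch is that nothing flags the mojibake in `name`, so this is only a
fallback for when the encoding really is unknown.
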
> This was just based on the fact that recfromtxt doesn't work with
> strings on python 3, so I switched to using bytes following the lead
> of numpy.

Well, that's really too bad -- it doesn't sound like you wanted bytes, it
sounds like you wanted something that didn't crash -- fair enough. But the
"proper" solution is for recfromtxt to support text....

> I'm mainly worried about backwards compatibility, since we have been
> using this for 2 or 3 years. It would be easy to change in statsmodels
> when gen/recfromtxt is fixed, but I assume there is lots of other code
> using similar interpretation of S/bytes in numpy.

well, it does sound like enough folks are using 'S' to mean bytes -- too
bad, but what can we do now about that?

I'd like an 's' for ascii-strings though.

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion