On 19 January 2014 06:19, Nick Coghlan <ncogh...@gmail.com> wrote:
>
> While I agree it's not relevant to the PEP 460/461 discussions, so
> long as numpy.loadtxt is explicitly documented as only working with
> latin-1 encoded files (it currently isn't), there's no problem.
Actually there is a problem. If loadtxt explicitly specified latin-1
when opening the file then it could document the fact that it works
for latin-1 encoded files. However it actually uses the system default
encoding to read the file and then converts the strings to bytes with
the as_bytes function, which is hard-coded to use latin-1:

https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

So it only works if the system default encoding is latin-1 and the
file's content is whitespace- and newline-compatible with latin-1.
Regardless of whether the file itself is in utf-8 or latin-1, it will
only work if the system default encoding is latin-1. I've never used
a system that had latin-1 as the default encoding (unless you count
cp1252 as latin-1).

> If it's supposed to work with other encodings (but the entire file is
> still required to use a consistent encoding), then it just needs
> encoding and errors arguments to fit the Python 3 text model (with
> "latin-1" documented as the default encoding).

This is the right solution. Have an encoding argument, document the
fact that it will use the system default encoding if none is
specified, and re-encode using the same encoding to fit any dtype='S'
bytes column. This will then work for any encoding, including the
ones that aren't ASCII-compatible (e.g. utf-16). Then instead of
having a compat module with an as_bytes helper to get rid of all the
unicode strings on Python 3, you can have a compat module with an
open_unicode helper to do the right thing on Python 2. The as_bytes
function is just a way of fighting the Python 3 text model: "I don't
care about mojibake, just do whatever it takes to shut up the
interpreter and its error messages and make sure it works for ASCII
data."

> If it is intended to
> allow S columns to contain text in arbitrary encodings, then that
> should also be supported by the current API with an adjustment to the
> default behaviour, since passing something like
> codecs.getdecoder("utf-8") as a column converter should do the right
> thing. However, if you're currently decoding S columns with latin-1
> *before* passing the value to the converter, then you'll need to use a
> WSGI style decoding dance instead:
>
>     def fix_encoding(text):
>         return text.encode("latin-1").decode("utf-8")  # For example

That's just getting silly IMO. If the file uses mixed encodings then
I don't consider it a valid "text file" and see no reason for loadtxt
to support reading it.

> That's more wasteful than just passing the raw bytes through for
> decoding, but is the simplest backwards compatible option if you're
> doing latin-1 decoding already.
>
> If different rows in the *same* column are allowed to have different
> encodings, then that's not a valid use of the operation (since the
> column converter has no access to the rest of the row to determine
> what encoding should be used for the decode operation).

Ditto.

Oscar
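
P.S. To make the failure mode concrete, here is a paraphrase (from
memory, so check the link above for the real thing) of what as_bytes
does, and why the result only round-trips when the system default
encoding is latin-1:

    def as_bytes(s):
        # Paraphrase of numpy.compat.py3k's helper: the encoding here
        # is hard-coded, regardless of the codec used to read the file.
        if isinstance(s, bytes):
            return s
        return s.encode('latin-1')

    # loadtxt reads the file as text with the *system default* encoding
    # and then feeds the decoded strings through as_bytes. If the
    # default is utf-8 and the file contains u'\u20ac' (the euro sign),
    # the decode succeeds but as_bytes raises UnicodeEncodeError,
    # because U+20AC has no latin-1 encoding. Only when the default
    # encoding is latin-1 does decode-then-encode hand back the
    # original bytes unchanged for every input.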
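
And, equally as an untested sketch (the names open_unicode and
encode_bytes_column are invented for illustration, not an actual
numpy API), the shape of the fix is just to keep the decode and
encode codecs in sync:

    import io

    def open_unicode(path, encoding=None):
        # Open the file as text on both Python 2 and 3. io.open
        # resolves encoding=None to the system default, and the codec
        # actually used is then available as f.encoding.
        return io.open(path, 'r', encoding=encoding)

    def encode_bytes_column(fields, encoding):
        # Re-encode the decoded text with the same codec used for
        # reading, instead of hard-coded latin-1. This round-trips
        # for any consistent encoding, including non-ASCII-compatible
        # ones like utf-16.
        return [field.encode(encoding) for field in fields]

    # Usage:
    # f = open_unicode('data.txt', encoding='utf-16')
    # ... parse rows as text ...
    # raw_fields = encode_bytes_column(text_fields, f.encoding)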