On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
>
> > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> > > [clip]
> > >
> > > > For backward compatibility we *cannot* change S.
> >
> > Do you mean to say that loadtxt cannot be changed from decoding using
> > system default, splitting on newlines and whitespace and then encoding
> > the substrings as latin-1?
>
> unicode dtypes have nothing to do with the loadtxt issue. They are not
> related.

I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
if the file is not encoded as ascii or latin-1 then the byte strings are
corrupted (see below).

This is because loadtxt opens the file with the default system encoding (by
not explicitly specifying an encoding):
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732

It then processes each line with asbytes(), which encodes it as latin-1:
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

Being an English speaker I don't normally use non-ascii characters in
filenames, but my system (Ubuntu Linux) still uses utf-8 rather than latin-1
(and rightly so!).

> > An obvious improvement would be along the lines of what Chris Barker
> > suggested: decode as latin-1, do the processing and then reencode as
> > latin-1.
>
> no, the right solution is to add an encoding argument.
> Its a 4 line patch for python2 and a 2 line patch for python3 and the issue
> is solved, I'll file a PR later.

What is the encoding argument for? Is it to be used to decode, process the
text and then re-encode it for an array with dtype='S'?

Note that there are two encodings: one for reading from the file and one for
storing in the array. The former describes the content of the file and the
latter will be used if I extract a byte-string from the array and pass it to
any Python API.

> No latin1 de/encoding is required for anything, I don't know why you would
> want do to that in this context.
> Does opening latin1 files even work with current loadtxt?

It's the only encoding that works for dtype='S'.

> It currently uses UTF-8 which is to my knowledge not compatible with latin1.

It uses utf-8 (on my system) to read and latin-1 (on any system) to encode and
store in the array, corrupting any non-ascii characters.
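
To make the read/store mismatch concrete, here is a minimal sketch of those two steps in plain Python, without numpy. It assumes the file holds utf-8 text and that the system default encoding is also utf-8 (as on my machine); the variable names are only for illustration and are not taken from the loadtxt source.

```python
# Sketch only: mimics "decode with the system default, then encode as latin-1".
# Assumption: both the file content and the default encoding are utf-8.
name = '\u00d5scar.txt'  # 'Õscar.txt'

on_disk = name.encode('utf-8')       # the bytes actually stored in the file
decoded = on_disk.decode('utf-8')    # step 1: text read using the default encoding
stored = decoded.encode('latin-1')   # step 2: re-encoded as latin-1 for dtype='S'

# Passing the stored bytes to a filename API on a utf-8 system decodes them
# with the 'surrogateescape' error handler, so the lone 0xd5 byte is escaped.
as_filename = stored.decode('utf-8', 'surrogateescape')

print(on_disk)            # b'\xc3\x95scar.txt'
print(stored)             # b'\xd5scar.txt'
print(stored == on_disk)  # False
print(ascii(as_filename)) # '\udcd5scar.txt'
```

The stored bytes only match the bytes in the file when every character is ascii, or when the file itself was latin-1 and was also read as latin-1.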

Here's a demonstration:

$ ipython3
Python 3.2.3 (default, Sep 25 2013, 18:22:43)
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: with open('Õscar.txt', 'w') as fout: pass

In [2]: import os

In [3]: os.listdir('.')
Out[3]: ['Õscar.txt']

In [4]: with open('filenames.txt', 'w') as fout:
   ...:     fout.writelines([f + '\n' for f in os.listdir('.')])
   ...:

In [5]: with open('filenames.txt') as fin:
   ...:     print(fin.read())
   ...:
filenames.txt
Õscar.txt

In [6]: import numpy

In [7]: filenames = numpy.loadtxt('filenames.txt')
<snip>
ValueError: could not convert string to float: b'filenames.txt'

In [8]: filenames = numpy.loadtxt('filenames.txt', dtype='S')

In [9]: filenames
Out[9]:
array([b'filenames.txt', b'\xd5scar.txt'],
      dtype='|S13')

In [10]: open(filenames[1])
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
/users/enojb/.rcs/tmp/<ipython-input-10-3bf2418688a2> in <module>()
----> 1 open(filenames[1])

IOError: [Errno 2] No such file or directory: '\udcd5scar.txt'

In [11]: open('Õscar.txt'.encode('utf-8'))
Out[11]: <_io.TextIOWrapper name=b'\xc3\x95scar.txt' mode='r' encoding='UTF-8'>


Oscar