On Wed, Jan 15, 2014 at 11:40:58AM -0800, Chris Barker wrote:
> On Wed, Jan 15, 2014 at 9:57 AM, Charles R Harris <charlesr.har...@gmail.com>
> wrote:
> >
> > There was a discussion of this long ago and UCS-4 was chosen as the numpy
> > standard. There are just too many complications that arise in supporting
> > both.
>
> fair enough -- but loadtxt appears to be broken just the same. Any
> proposals for that?
>
> My proposal:
>
> loadtxt accepts an encoding argument.
>
> default is ascii -- that's what it's doing now, anyway, yes?

No, it's loading the file, reading a line, encoding the line with latin-1, and
then putting the repr of the resulting byte-string as a unicode string into a
UCS-4 array (dtype='<Ux'). I can't see any good reason for that behaviour.
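Roughly, the effect is this (plain Python for illustration -- not loadtxt's
actual code; the example line is borrowed from the OP's pathlist.txt):

    import numpy as np

    line = 'C:\\Users\\Documents\\Project\\mytextfile1.txt'  # line already decoded from the file
    field = line.encode('latin-1')        # re-encoded as latin-1 bytes
    arr = np.array([str(field)])          # str() of bytes is its repr on Python 3
    print(arr[0])      # "b'C:\\Users\\..." -- the repr has leaked into the array
    print(arr.dtype)   # a '<U...' (UCS-4) dtype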
> If the file is encoded ascii, then a one-byte-per character dtype is used
> for text data, unless the user specifies otherwise (do they need to specify
> anyway?)
>
> If the file has another encoding, the default dtype for text is unicode.

That's a silly idea. There's already the dtype='S' for ascii that will give
one byte per character. However, numpy.loadtxt(dtype='S') doesn't actually use
ascii, IIUC. It loads the file as text with the default system encoding,
encodes the text with latin-1, and stores the resulting bytes into a dtype='S'
array. I think it should just open the file in binary, read the bytes, and
store them in the dtype='S' array. The current behaviour strikes me as a
hangover from the Python 2.x 8-bit text model.

> Not sure about other one-byte per character encodings (e.g. latin-1)
>
> The defaults may be moot, if the loadtxt doesn't have auto-detection of
> text in a file anyway.
>
> This all required that there be an obvious way for the user to spell the
> one-byte-per character dtype -- I think 'S' will do it.

They should use 'S' and not encoding='ascii'. If the user provides an
encoding, then it should be used to open the file and decode it to unicode,
resulting in a dtype='U' array. (Python 3 handles this all for you.)

> Note to OP: what happens if you specify 'S' for your dtype, rather than str
> - it works for me on py2:
>
> In [16]: np.loadtxt('pathlist.txt', dtype='S')
> Out[16]:
> array(['C:\\Users\\Documents\\Project\\mytextfile1.txt',
>        'C:\\Users\\Documents\\Project\\mytextfile2.txt',
>        'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
>       dtype='|S42')

It only seems to work because you're using ascii data. On Py3 you'll have byte
strings corresponding to the text in the file encoded as latin-1 (regardless
of the encoding used in the file). loadtxt doesn't open the file in binary or
specify an encoding, so the file will be opened with the system default
encoding, as determined by the standard builtins.open. The resulting text is
decoded according to that encoding and then re-encoded as latin-1, which will
corrupt the binary form of the data if the system encoding is not compatible
with latin-1 (e.g. ascii and latin-1 will work but utf-8 will not).
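To make the corruption concrete, the chain of conversions is roughly this
(plain Python, not loadtxt itself; assume a machine whose default encoding is
utf-8):

    raw = 'Åå'.encode('utf-8')       # the bytes actually in the file: b'\xc3\x85\xc3\xa5'
    text = raw.decode('utf-8')       # loadtxt gets *text*, decoded with the default encoding
    field = text.encode('latin-1')   # ...which it then re-encodes as latin-1
    print(field)                     # b'\xc5\xe5' -- not the bytes that were in the file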
> Note: this leaves us with what to pass back to the user when they index
> into an array of type 'S*' -- a bytes object or a unicode object (decoded
> as ascii). I think a unicode object, in keeping with proper py3 behavior.
> This would be like we currently do with, say floating point numbers:
>
> We can store/operate with 32 bit floats, but when you pass it back as a
> python type, you get the native python float -- 64bit.
>
> NOTE: another option is to use latin-1 all around, rather than ascii -- you
> may get garbage from the higher value bytes, but it won't barf on you.

I guess you're alluding to the idea that reading/writing files as latin-1 will
pretend to seamlessly decode/encode any bytes, preserving binary data in any
round-trip. This concept is already broken if you intend to do any processing,
indexing or slicing of the array. Additionally, the current loadtxt behaviour
fails to achieve this round-trip even for the 'S' dtype, even if you don't do
any processing:

$ ipython3
Python 3.2.3 (default, Sep 25 2013, 18:22:43)
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: with open('tmp.py', 'w') as fout:  # Implicitly utf-8 here
   ...:     fout.write('Åå\n' * 3)
   ...:

In [2]: import numpy

In [3]: a = numpy.loadtxt('tmp.py')
<snip>
ValueError: could not convert string to float: b'\xc5\xe5'

In [4]: a = numpy.loadtxt('tmp.py', dtype='S')

In [5]: a
Out[5]:
array([b'\xc5\xe5', b'\xc5\xe5', b'\xc5\xe5'],
      dtype='|S2')

In [6]: a.tostring()
Out[6]: b'\xc5\xe5\xc5\xe5\xc5\xe5'

In [7]: with open('tmp.py', 'rb') as fin:
   ...:     text = fin.read()
   ...:

In [8]: text
Out[8]: b'\xc3\x85\xc3\xa5\n\xc3\x85\xc3\xa5\n\xc3\x85\xc3\xa5\n'

This is a mess. I don't know how to handle backwards compatibility, but the
sensible way to handle this in *both* Python 2 and 3 is that dtype='S' opens
the file in binary, reads byte strings, and stores them in an array with
dtype='S'. dtype='U' should open the file as text with an encoding argument
(or the system default if not supplied), decode the bytes, and create an array
with dtype='U'. The only reasonable difference between Python 2 and 3 is which
of these two behaviours dtype=str should do.

Oscar
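P.S. To be concrete, here is a rough sketch of the semantics I'm arguing for
(the helper name and signature are made up for illustration -- this is not a
proposed patch to loadtxt):

    import numpy as np

    def read_lines(path, dtype='U', encoding=None):
        """Hypothetical reader showing the 'S' vs 'U' split described above."""
        if np.dtype(dtype).kind == 'S':
            # bytes in, bytes out: no decoding or encoding anywhere
            with open(path, 'rb') as f:
                return np.array(f.read().splitlines(), dtype=dtype)
        else:
            # text in, text out: the caller (or the platform default) names the encoding
            with open(path, 'r', encoding=encoding) as f:
                return np.array(f.read().splitlines(), dtype=dtype)

With that split, dtype='S' round-trips the bytes in the file exactly, and
dtype='U' gives you real text decoded with a known encoding; neither path ever
touches latin-1 behind the user's back.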