On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
>
> > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> > >
> > > no, the right solution is to add an encoding argument.
> > > Its a 4 line patch for python2 and a 2 line patch for python3 and the
> > > issue is solved, I'll file a PR later.
> >
> > What is the encoding argument for? Is it to be used to decode, process the
> > text and then re-encode it for an array with dtype='S'?
> >
>
> it is only used to decode the file into text, nothing more.
> loadtxt is supposed to load text files, it should never have to deal with
> bytes ever.
> But I haven't looked into the function deeply yet, there might be ugly
> surprises.
>
> The output of the array is determined by the dtype argument and not by the
> encoding argument.

If the dtype is 'S' then the output should be bytes and you therefore need to
encode the text; there's no such thing as storing text in bytes without an
encoding.

Strictly speaking the 'U' dtype uses the encoding 'ucs-4' or 'utf-32' which
just happens to be as simple as expressing the corresponding unicode code
points as int32 so it's reasonable to think of it as "not encoded" in some
sense (although endianness becomes an issue in utf-32).

On 17 January 2014 14:11, <josef.p...@gmail.com> wrote:
> Windows seems to use consistent en/decoding throughout (example run in IDLE)

The reason for the Py3k bytes/text overhaul is that there were lots of
situations where things *seemed* to work until someone happens to use a
character you didn't try. "Seems to" doesn't cut it! :)

> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
>
> >>> filenames = numpy.loadtxt('filenames.txt', dtype='S')
> >>> filenames
> array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py',
>        b'\xd5scar.txt'],
>       dtype='|S18')
> >>> fn = open(filenames[-1])
> >>> fn.read()
> '1,2,3,hello\n5,6,7,Õscar\n'
> >>> fn
> <_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'>

You don't show how you created the file. I think that in your case the content
of 'filenames.txt' is correctly encoded latin-1. My guess is that you did the
same as me and opened it in text mode and wrote the unicode string allowing
Python to encode it for you. Judging by the encoding on fn above I'd say that
it wrote the file with cp1252 which is mostly compatible with latin-1.

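To make "mostly compatible" a bit more concrete, here is a rough sketch that
walks over all 256 single-byte values and reports the ones that cp1252 and
latin-1 decode differently. It assumes Python 3 and its bundled codecs; the
helper name differing_bytes is made up for illustration and is not part of
numpy or the standard library. A byte that one codec rejects outright is
counted as a difference.

```python
def differing_bytes(enc_a="cp1252", enc_b="latin-1"):
    """Return the byte values (0-255) that the two codecs decode differently.

    Hypothetical helper for illustration only. A byte that is undefined in
    one codec (decoding raises UnicodeDecodeError) counts as a difference.
    """
    def try_decode(raw, encoding):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            return None  # byte has no mapping in this codec

    diff = []
    for value in range(256):
        raw = bytes([value])
        if try_decode(raw, enc_a) != try_decode(raw, enc_b):
            diff.append(value)
    return diff


if __name__ == "__main__":
    diff = differing_bytes()
    print(len(diff), "bytes differ:", " ".join(format(v, "02x") for v in diff))
```

With the codecs shipped in CPython I'd expect this to report only bytes in the
0x80-0x9f range, which is why a name containing b'\xd5' round-trips fine with
either encoding while other characters would not.
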
Try it with a byte that is incompatible between cp1252 and latin-1 e.g.:

In [3]: b'\x80'.decode('cp1252')
Out[3]: '€'

In [4]: b'\x80'.decode('latin-1')
Out[4]: '\x80'

In [5]: b'\x80'.decode('cp1252').encode('latin-1')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/users/enojb/<ipython-input-5-cfd8b16d6d9f> in <module>()
----> 1 b'\x80'.decode('cp1252').encode('latin-1')

UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)


Oscar
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion