On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> [clip]
> > - inconvenience in dealing with strings in python 3.
> >
> > bytes are not strings in python3 which means ascii data is either a
> > byte array which can be inconvenient to deal with or 4 byte unicode
> > which wastes space.
It doesn't waste that much space in practice. People have been happily
using Python 2's 4-byte-per-char unicode strings on wide builds (e.g. on
Linux) for years in all kinds of text-heavy applications.

$ python2
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(u'a' * 1000)
4052

> > For backward compatibility we *cannot* change S.

Do you mean to say that loadtxt cannot be changed from decoding using
the system default encoding, splitting on newlines and whitespace, and
then encoding the substrings as latin-1?

An obvious improvement would be along the lines of what Chris Barker
suggested: decode as latin-1, do the processing, and then re-encode as
latin-1. Or just open the file in binary and use the bytes string
methods. Either of these has the advantage that it won't corrupt the
binary representation of the data - assuming ascii-compatible whitespace
and newlines (e.g. utf-8 and most currently used 8-bit encodings). In
the situations where the current behaviour differs from this, the user
*definitely* has mojibake. Can anyone possibly be relying on that
(except in the sense of having implemented a workaround that would break
if it was fixed)?

> > Maybe we could change the meaning of 'a' but it would be safer to
> > add a new dtype, possibly 'S' can be deprecated in favor of 'B' when
> > we have a specific encoding dtype.
> >
> > The main issue is probably: is it worth it and who does the work?
>
> I don't think this is a good idea: the bytes vs. unicode separation in
> Python 3 exists for a good reason. If unicode is not needed, why not
> just use the bytes data type throughout the program?

Or, on the other hand, why try to use bytes when you're clearly dealing
with text data? If you're concerned about memory usage, why not use
Python strings? As of CPython 3.3, strings consisting only of latin-1
characters are stored with 1 byte per character.
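To illustrate the compact-string behaviour (PEP 393, CPython >= 3.3): a
minimal check, with the caveat that the exact byte counts depend on the
CPython version and platform, so treat the numbers as indicative only:

```python
import sys

# Under PEP 393, a str is stored with 1, 2 or 4 bytes per character,
# chosen by the widest code point the string contains.
latin1_s = 'a' * 1000            # latin-1 range: 1 byte per char
bmp_s = '\u0394' * 1000          # Greek Delta: needs 2 bytes per char
astral_s = '\U0001F600' * 1000   # emoji: needs 4 bytes per char

for s in (latin1_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))

# The latin-1-only string takes roughly a quarter of the storage of the
# 4-byte-per-char one, plus a small fixed header.
assert sys.getsizeof(astral_s) > 2 * sys.getsizeof(latin1_s)
```

Compare that with the 4052 bytes the same 1000-character string costs on
a Python 2 wide build above.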
This is only really sensible for immutable strings with an opaque memory
representation, though, so numpy shouldn't try to copy it.

> (Also, assuming that ASCII is in general good for text-format data is
> quite US-centric.)

Indeed. The original use case in this thread was a text file containing
file paths. In most of the world there's a reasonable chance that file
paths can contain non-ascii characters. The current behaviour of
decoding using one codec and encoding with latin-1 would, in many cases,
break if the user tried to e.g. open() a file using a byte-string from
the array.

Oscar

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion