On Wed, Apr 26, 2017 at 3:15 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
> On 26.04.2017 03:55, josef.p...@gmail.com wrote:
> > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris
> > <charlesr.har...@gmail.com> wrote:
> >>
> >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern <robert.k...@gmail.com> wrote:
> >>>
> >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal
> >>> <chris.bar...@noaa.gov> wrote:
> >>>
> >>>>> Presumably you're getting byte strings (with unknown encoding).
> >>>>
> >>>> No -- this is for creating and using mostly ascii string data with
> >>>> python and numpy.
> >>>>
> >>>> Unknown encoding bytes belong in byte arrays -- they are not text.
> >>>
> >>> You are welcome to try to convince Thomas of that. That is the status quo
> >>> for him, but he is finding that difficult to work with.
> >>>
> >>>> I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii,
> >>>> with a few extra characters" data. With all the sloppiness over the years,
> >>>> there are way too many files like that.
> >>>
> >>> That sloppiness that you mention is precisely the "unknown encoding"
> >>> problem. Your previous advocacy has also touched on using latin-1 to decode
> >>> existing files with unknown encodings as well. If you want to advocate for
> >>> using latin-1 only for the creation of new data, maybe stop talking about
> >>> existing files? :-)
> >>>
> >>>> Note: the primary use-case I have in mind is working with ascii text in
> >>>> numpy arrays efficiently -- folks have called for that. All I'm saying is use
> >>>> Latin-1 instead of ascii -- that buys you some useful extra characters.
> >>>
> >>> For that use case, the alternative in play isn't ASCII, it's UTF-8, which
> >>> buys you a whole bunch of useful extra characters. ;-)
> >>>
> >>> There are several use cases being brought forth here. Some involve file
> >>> reading, some involve file writing, and some involve in-memory manipulation.
> >>> Whatever change we make is going to impinge somehow on all of the use cases.
> >>> If all we do is add a latin-1 dtype for people to use to create new
> >>> in-memory data, then someone is going to use it to read existing data in
> >>> unknown or ambiguous encodings.
> >>
> >> The maximum length of an UTF-8 character is 4 bytes, so we could use that to
> >> size arrays by character length. The advantage over UTF-32 is that it is
> >> easily compressible, probably by a factor of 4 in many cases. That doesn't
> >> solve the in memory problem, but does have some advantages on disk as well
> >> as making for easy display. We could compress it ourselves after encoding by
> >> truncation.
> >>
> >> Note that for terminal display we will want something supported by the
> >> system, which is another problem altogether. Let me break the problem down
> >> into four categories
> >>
> >> Storage -- hdf5, .npy, fits, etc.
> >> Display -- ?
> >> Modification -- editing
> >> Parsing -- fits, etc.
> >>
> >> There is probably no one solution that is optimal for all of those.
> >>
> >> Chuck
> >>
> >> _______________________________________________
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion@python.org
> >> https://mail.python.org/mailman/listinfo/numpy-discussion
> >>
> >
> > quoting Julian
> >
> > '''
> > I probably have formulated my goal with the proposal a bit better, I am
> > not very interested in a repetition of which encoding to use debate.
> > In the end what will be done allows any encoding via a dtype with
> > metadata like datetime.
> > This allows any codec (including truncated utf8) to be added easily (if
> > python supports it) and allows sidestepping the debate.
> >
> > My main concern is whether it should be a new dtype or modifying the
> > unicode dtype. Though the backward compatibility argument is strongly in
> > favour of adding a new dtype that makes the np.unicode type redundant.
> > '''
> >
> > I don't quite understand why this discussion goes in a direction of an
> > either one XOR the other dtype.
> >
> > I thought the parameterized 1-byte encoding that Julian mentioned
> > initially sounds useful to me.
> >
> > (I'm not sure I will use it much, but I also don't use float16 )
> >
> > Josef
>
> Indeed,
> Most of this discussion is irrelevant to numpy.
> Numpy only really deals with the in memory storage of strings. And in
> that it is limited to fixed length strings (in bytes/codepoints).
> How you get your messy strings into numpy arrays is not very relevant to
> the discussion of a smaller representation of strings.
> You couldn't get messy strings into numpy without first sorting it out
> yourself before, you won't be able to afterwards.
> Numpy will offer a set of encodings, the user chooses which one is best
> for the use case and if the user screws it up, it is not numpy's problem.
>
> You currently only have a few ways to even construct string arrays:
> - array construction and loops
> - genfromtxt (which is again just a loop)
> - memory mapping which I seriously doubt anyone actually does for the S
>   and U dtype
>
> Having a new dtype changes nothing here. You still need to create numpy
> arrays from python strings which are well defined and clean.
> If you put something in that doesn't encode you get an encoding error.
> No oddities like surrogate escapes are needed, numpy arrays are not
> interfaces to operating systems nor does numpy need to _add_ support for
> historical oddities beyond what it already has.
> If you want to represent bytes exactly as they came in don't use a text
> dtype (which includes the S dtype, use i1).
>
> Concerning variable sized strings, this is simply not going to happen.
> Nobody is going to rewrite numpy to support it, especially not just for
> something as unimportant as strings.
> Best you are going to get (or better already have) is object arrays.
> It makes no sense to discuss it unless someone comes up with an actual
> proposal and the willingness to code it.
>
> What is a relevant discussion is whether we really need a more compact
> but limited representation of text than 4-byte utf32 at all.
> Its use case is for the most part just for python3 porting and saving
> some memory in some ascii heavy cases, e.g. astronomy.
> It is not that significant anymore as porting to python3 has mostly
> already happened via the ugly byte workaround and memory saving is
> probably not as significant in the context of numpy which is already
> heavy on memory usage.
>
> My initial approach was to not add a new dtype but to make unicode
> parametrizable which would have meant almost no cluttering of numpy's
> internals and keeping the api more or less consistent which would make
> this a relatively simple addition of minor functionality for people that
> want it.
> But adding a completely new partially redundant dtype for this use case
> may be too large a change to the api. Having two partially redundant
> string types may confuse users more than our current status quo of our
> single string type (U).
>
> Discussing whether we want to support truncated utf8 has some merit as
> it is a decision whether to give the users an even larger gun to shoot
> themselves in the foot with.
> But I'd like to focus first on the 1 byte type to add a symmetric API
> for python2 and python3.
> utf8 can always be added later should we deem it a good idea.
>

I think we can implement viewers for strings as ndarray subclasses. Then one
could do `my_string_array.view(latin_1)`, and so on. Essentially that just
changes the default encoding of the 'S' array. That could also work for uint8
arrays if needed.

Chuck
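
P.S. A rough sketch of the viewer idea, just to make it concrete. The subclass name `latin_1` comes from the example above; the `encoding` attribute and the `decode()` method are hypothetical names invented for this illustration and are not existing NumPy API. It only covers 'S' arrays; the uint8 case is left out.

```python
import numpy as np


class latin_1(np.ndarray):
    """Hypothetical viewer: an 'S' array whose bytes are read as Latin-1."""

    # Assumed attribute name; each viewer subclass would carry its own codec.
    encoding = "latin-1"

    def decode(self):
        # Drop back to a plain ndarray, then decode element-wise.
        # The result is a new 'U' array (4 bytes per code point).
        return np.char.decode(np.asarray(self), self.encoding)


raw = np.array([b"caf\xe9", b"na\xefve"], dtype="S5")

# view() makes no copy: same buffer, different Python type.
text = raw.view(latin_1)

print(text.decode().tolist())   # ['café', 'naïve']
print(raw.itemsize, text.decode().itemsize)  # bytes per element: S vs U
```

The point is that the stored bytes are untouched and only the interpretation changes, so a different codec would just be another small subclass.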