On 04/06/17 20:04, Mikhail V wrote:
> Initialize array from a string currently looks like:
>
> s= "012 abc"
> A= fromstring(s,"u1")
> print A ->
> [48 49 50 32 97 98 99]
>
> Perfect.
> Now when writing values it will not work
> as IMO it should, namely consider this example:
>
> B= zeros(7,"u1")
> B[0]=s[1]
> print B ->
> [1 0 0 0 0 0 0]
>
> Ugh? It tries to parse the s[1] character "1" as integer and writes 1 to B[0].
> First thing I would expect is a value error and I'd never expect it does
> that high-level manipulations with parsing.
> IMO ideally it would do the following instead:
>
> B[0]=s[1]
> print B ->
> [49 0 0 0 0 0 0]
>
> So it should just write ord(s[1]) to B.
> Sounds logical? For me very much.
> Further, one could write like this:
>
> B[:] = s
> print B->
> [48 49 50 32 97 98 99]
>
> Namely cast the string into byte array. IMO this would be
> the logical expected behavior.

I disagree. If numpy treated bytestrings as sequences of uint8s (which
would, granted, be perfectly reasonable, at least in py3), you wouldn't
have needed the fromstring function in the first place. Personally, I
think I would prefer this, actually. However, numpy normally treats
strings as objects that can sometimes be cast to numbers, so this
behaviour is perfectly logical.
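If you want the byte values regardless of how numpy decides to cast
strings, you can spell the conversion out yourself. A minimal sketch,
assuming ASCII-only text; the explicit encode() and the use of
np.frombuffer are my choices here, not something taken from your example:

```python
import numpy as np

s = "012 abc"

# Encode the text explicitly, then view the resulting bytes as uint8.
# (np.frombuffer returns a read-only view onto the bytes object.)
a = np.frombuffer(s.encode("ascii"), dtype="u1")

# Writing one character's code into an array: convert it with ord().
b = np.zeros(7, dtype="u1")
b[0] = ord(s[1])

# Filling a whole array from the string, again via an explicit encoding.
c = np.zeros(len(s), dtype="u1")
c[:] = np.frombuffer(s.encode("ascii"), dtype="u1")

print(a)
print(b)
print(c)
```

Nothing is parsed here: the encoding step decides which bytes a character
becomes, and numpy only ever sees integers.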
For what it's worth, in Python 3 (which you should probably be using),
everything behaves as you'd expect, because indexing a bytes object
yields an int:

>>> import numpy as np
>>> s = b'012 abc'
>>> a = np.fromstring(s, 'u1')
>>> a
array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
>>> b = np.zeros(7, 'u1')
>>> b[0] = s[1]
>>> b
array([49, 0, 0, 0, 0, 0, 0], dtype=uint8)
>>>

> Currently it just throws the value error if met non-digits in a string,
> so IMO current casting hardly can be of practical use.
>
> Furthermore, I think this code:
>
> A= array(s,"u1")
>
> Could act exactly same as:
>
> A= fromstring(s,"u1")
>
> But this is just a side-idea for spelling simplicity/generality.
> Not really necessary.

There is also something to be said for the current behaviour:

>>> np.array('100', 'u1')
array(100, dtype=uint8)

However, the fact that this works for bytestrings on Python 3 is, in my
humble opinion, ridiculous:

>>> np.array(b'100', 'u1') # b'100' IS NOT TEXT
array(100, dtype=uint8)

This is of course consistent with the fact that you can cast a
bytestring to builtin python int or float (but not complex).
Interestingly enough, numpy complex behaves differently from python
complex:

>>> complex(b'1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: complex() argument must be a string or a number, not 'bytes'
>>> complex('1')
(1+0j)
>>> np.complex128('1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a float is required
>>>

> Further thoughts:
> If trying to create "u1" array from a Python 3 string, question is,
> whether it should throw an error, I think yes, and in this case
> "u4" type should be explicitly specified by initialisation, I suppose.
> And e.g. translation from unicode to extended ascii (Latin1) or whatever
> should be done on Python side or with explicit translation.

If you ask me, passing a unicode string to fromstring with sep='' (i.e.
to parse binary data) should ALWAYS raise an error: the semantics only
make sense for strings of bytes.

Currently, there appears to be some UTF-8 conversion going on, which
creates potentially unexpected results:

>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u1')
>>> a
array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
>>> assert len(a) * a.dtype.itemsize == len(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>>

This is, apparently (https://github.com/numpy/numpy/issues/2152), due to
how the internals of Python deal with unicode strings in C code, and not
due to anything numpy is doing.

Speaking of unexpected results, I'm not sure you realize what fromstring
does when you give it a multi-byte dtype:

>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u4')
>>> a
array([2999890382, 3033445326], dtype=uint32)
>>>

(The eight UTF-8 bytes are simply reinterpreted as two 32-bit integers.)

Give fromstring() a numpy unicode string, and all is right with the
world:

>>> s = np.array('αβγδ')
>>> s
array('αβγδ', dtype='<U4')
>>> np.fromstring(s, 'u4')
array([945, 946, 947, 948], dtype=uint32)
>>>

IMHO calling fromstring(..., sep='') with a unicode string should be
deprecated and perhaps eventually forbidden. (Or fixed, but that would
break backwards compatibility.)

> Python3 assumes 4-byte strings but in reality most of the time
> we deal with 1-byte strings, so there is huge waste of resources
> when dealing with 4-bytes. For many serious projects it is just not needed.

That's quite enough anglo-centrism, thank you. For when you need byte
strings, Python 3 has a type for that. For when your strings contain
text, bytes with no information on encoding are not enough.

> Furthermore I think some of the methods from "chararray" submodule
> should be possible to use directly on normal integer arrays without
> conversions to other array types.
> So I personally don't really get why the need of additional chararray type,
> It's all numbers anyway and it's up to the programmer to
> decide what size of translation tables/value ranges he wants to use.

chararray is deprecated.

> There can be some convenience methods for ascii operations,
> like eg char.toupper(), but currently they don't seem to work with integer
> arrays so why not make those potentially useful methods usable
> and make them work on normal integer arrays?

I don't know what you're doing, but I don't think numpy is normally the
right tool for text manipulation...

> [snip]
>
> as a side-note, I don't think that encoding should be assumed much for
> creating new array types, it is up to the programmer
> to decide what 'meanings' the bytes have.

Agreed!

-- Thomas