On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V <mikhail...@gmail.com> wrote:
> -- classify by "forward/backward" conversion:
> For this time consider only forward, i.e. I copy data from string
> to numpy array
>
> -- classify by "bytes vs ordinals":
>
> a) bytes: If I need raw bytes - in this case e.g.
>
> B = bytes(s.encode())

no need to call "bytes" -- encode() returns a bytes object:

    In [1]: s = "this is a simple ascii-only string"

    In [2]: b = s.encode()

    In [3]: type(b)
    Out[3]: bytes

    In [4]: b
    Out[4]: b'this is a simple ascii-only string'

> will do it. then I can copy data to array. So currently there are methods
> coverings this. If I understand correctly the data extracted corresponds
> to utf-?? byte feed, i.e. non-constant byte-length of chars (1 up to
> 4 bytes per char for
> the 'wide' unicode, correct me if I am wrong).

    In [5]: s.encode?
    Docstring:
    S.encode(encoding='utf-8', errors='strict') -> bytes

So the default is utf-8, but you can set any encoding you want (that python supports):

    In [6]: s.encode('utf-16')
    Out[6]: b'\xff\xfet\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00i\x00m\x00p\x00l\x00e\x00 \x00a\x00s\x00c\x00i\x00i\x00-\x00o\x00n\x00l\x00y\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'

> b): I need *ordinals*
> Yes, I need ordinals, so for the bytes() method, if a Python 3
> string contains only
> basic ascii, I can so or so convert to bytes then to integer array
> and the length will
> be the same 1byte for each char.
> Although syntactically seen, and with slicing, this will look e.g.
> like:
>
> s= "012 abc"
> B = bytes(s.encode()) # convert to bytes
> k = len(s)
> arr = np.zeros(k,"u1") # init empty array length k
> arr[0:2] = list(B[0:2])
> print ("my array: ", arr)
> ->
> my array: [48 49 0 0 0 0 0]

This can be done more cleanly:

    In [15]: s= "012 abc"

    In [16]: b = s.encode('ascii')

    # you want to use the ascii encoding so you don't get utf-8 cruft if
    # there are non-ascii characters
    # you could use latin-1 too (or any other one-byte-per-char encoding)

    In [17]: arr = np.fromstring(b, np.uint8)

    # this is using fromstring() in the old Python sense of "string" --
    # treat the contents as bytes
    # -- it really should be called "frombytes()"
    # you could also use:

    In [22]: np.frombuffer(b, dtype=np.uint8)
    Out[22]: array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)

    In [19]: print(arr)
    [48 49 50 32 97 98 99]

    # you got the ordinals

    In [20]: "".join([chr(i) for i in arr])
    Out[20]: '012 abc'

    # yes, they are the right ones...

> Result seems correct. Note that I also need to use list(B), otherwise
> the slicing does not work (fills both values with 1, no idea where 1
> comes from).

That is odd -- I can't explain it right now either...

> Or I can write e.g.:
> arr[0:2] = np.fromstring(B[0:2], "u1")
>
> But looks indeed like a 'hack' and not so simple.

Is the above OK?

> -- classify "what is maximal ordinal value in the string"
> Well, say, I don't know what is maximal ordinal, e.g. here I take
> 3 Cyrillic letters instead of 'abc':
>
> s= "012 АБВ"
> k = len(s)
> arr = np.zeros(k,"u4") # init empty 32 bit array length k
> arr[:] = np.fromstring(np.array(s),"u4")
> ->
> [ 48 49 50 32 1040 1041 1042]

So this is making a numpy string, which is UCS-4 encoded unicode -- i.e. 4 bytes per character. Then you are converting that to a 4-byte unsigned int.
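The 4-bytes-per-character point can be checked directly. This is only a minimal sketch, assuming Python 3 and a reasonably recent NumPy; the variable names are illustrative and not taken from the thread:

```python
import numpy as np

s = "012 АБВ"
s_arr = np.array(s)  # 0-d array holding one fixed-width unicode string

# kind 'U' marks NumPy's unicode string dtype
print(s_arr.dtype.kind)

# total bytes stored for the string, and bytes per character
print(s_arr.dtype.itemsize)
print(s_arr.dtype.itemsize // len(s))
```

Because every character occupies the same number of bytes, the buffer can be reinterpreted as 32-bit unsigned integers without any per-character decoding.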
But there is no need to do it with fromstring (here s_arr is the numpy string array, i.e. np.array(s)):

    In [52]: s
    Out[52]: '012 АБВ'

    In [53]: s_arr.reshape((1,)).view(np.uint32)
    Out[53]: array([ 48, 49, 50, 32, 1040, 1041, 1042], dtype=uint32)

We need the reshape() because .view does not work with array scalars -- not sure why not?

> This gives correct results indeed. So I get my ordinals as expected.
> So this is better/preferred way, right?

I would maybe do it more "directly" -- i.e. use python's string to do the encoding:

    In [64]: s
    Out[64]: '012 АБВ'

    In [67]: np.fromstring(s.encode('U32'), dtype=np.uint32)
    Out[67]: array([65279, 48, 49, 50, 32, 1040, 1041, 1042], dtype=uint32)

That first value is the byte-order mark (I think...); you can strip it off with:

    In [68]: np.fromstring(s.encode('U32')[4:], dtype=np.uint32)
    Out[68]: array([ 48, 49, 50, 32, 1040, 1041, 1042], dtype=uint32)

or, probably better, simply specify the byte order in the encoding:

    In [69]: np.fromstring(s.encode('UTF-32LE'), dtype=np.uint32)
    Out[69]: array([ 48, 49, 50, 32, 1040, 1041, 1042], dtype=uint32)

> arr = np.ordinals(s)
> arr[0:2] = np.ordinals(s[0:2]) # with slicing
>
> or, e.g. in such format:
>
> arr = np.copystr(s)
> arr[0:2] = np.copystr(s[0:2])

I don't think any of this is necessary -- the UCS4 (or UTF-32) "encoding" is pretty much the ordinals anyway.

As you noticed, if you make a numpy unicode string array, and change the dtype to unsigned int32, you get what you want.

You really don't want to mess with any of this unless you understand unicode and encodings anyway....

Though it is a bit awkward -- what is your actual use-case for working with ordinals???

BTW, you can use regular python to get the ordinals first:

    In [71]: np.array([ord(c) for c in s])
    Out[71]: array([ 48, 49, 50, 32, 1040, 1041, 1042])

> Though for Python 2 could raise questions why need casting to "u4".

This would all work the same with python 2 if you used unicode objects instead of strings. Maybe good to put:

    from __future__ import unicode_literals

in your source....
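For reference, the two routes above (encode with an explicit byte order, or call ord() per character) could be wrapped up roughly as follows. This is a sketch only, assuming Python 3; the helper names ordinals_via_utf32 and ordinals_via_ord are hypothetical and not part of NumPy. It uses np.frombuffer instead of np.fromstring, since the binary mode of fromstring was deprecated in later NumPy releases:

```python
import numpy as np


def ordinals_via_utf32(s):
    """Code points of s as 32-bit unsigned ints (hypothetical helper)."""
    # 'utf-32-le' fixes the byte order, so no byte-order mark is emitted;
    # '<u4' reads those bytes as little-endian uint32 on any platform.
    # Note: frombuffer returns a read-only view of the bytes object.
    return np.frombuffer(s.encode("utf-32-le"), dtype="<u4")


def ordinals_via_ord(s):
    """Same result using plain Python ord() (hypothetical helper)."""
    return np.array([ord(c) for c in s], dtype=np.uint32)


s = "012 АБВ"
a = ordinals_via_utf32(s)
b = ordinals_via_ord(s)

# Both routes agree, and the ordinals map back to the original string.
assert np.array_equal(a, b)
assert "".join(chr(int(i)) for i in a) == s

# To fill part of an existing array, assign the slice directly.
arr = np.zeros(len(s), dtype=np.uint32)
arr[0:2] = ordinals_via_utf32(s[0:2])
print(arr)
```

The frombuffer route avoids a Python-level loop, which may matter for long strings; the ord() route is easier to read and returns a writable array.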
> So approximately are my ideas.
> For me it would cover many application cases.

I'm still curious as to your use-cases -- when do you have a bunch of ordinal values??

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

chris.bar...@noaa.gov