On 23 January 2014 17:42, <josef.p...@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 12:13 PM, <josef.p...@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:58 AM, <josef.p...@gmail.com> wrote:
>>>
>>> No, a view doesn't change the memory, it just changes the
>>> interpretation and there shouldn't be any conversion involved.
>>> astype does type conversion, but it goes through ascii encoding, which fails.
>>>
>>> >>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>> >>> b.tostring()
>>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> >>> b.view('S12')
>>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>>       dtype='|S12')
>>>
>>> The conversion happens somewhere in the array creation, but I have no
>>> idea about the memory encoding for ucs2 and the low-level layouts.
>
> >>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
> >>> b[0].tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
> >>> 'Õsc'.encode('utf-32LE')
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>
> Is that the encoding for 'U'?
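The observation above can be checked directly. This is a minimal sketch assuming NumPy is installed; it uses tobytes(), the modern spelling of the tostring() shown in the quoted session:

```python
import numpy as np

# '<U3' forces little-endian storage, so the raw buffer should be
# UTF-32LE regardless of the machine's native byte order.
b = np.array(['Õsc', 'zxc'], dtype='<U3')

# The whole buffer matches an explicit UTF-32LE encoding of the
# concatenated strings (both are 3 characters, so no padding nulls).
assert b.tobytes() == 'Õsczxc'.encode('utf-32-le')

# A slice holding just the first element likewise matches:
assert b[:1].tobytes() == 'Õsc'.encode('utf-32-le')
```

On a big-endian machine '>U3' would give UTF-32BE instead, which is why the answer depends on the byte order.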
On a little-endian system, yes. I realise what's happening now. 'U' represents
each unicode character as a 32-bit unsigned integer giving the code point of
the character. The first 256 code points are exactly the 256 characters
representable with latin-1, in the same order. So 'Õ' has the code point 0xd5
and is encoded as the single byte 0xd5 in latin-1. As a 32-bit integer the
code point is 0x000000d5, but in little-endian format that becomes the 4 bytes
0xd5, 0x00, 0x00, 0x00. So when you reinterpret that as 'S4' it strips the
trailing nulls to get the byte string b'\xd5', which is the latin-1 encoding
for the character. The same will happen for any string of latin-1 characters.
However, if you do have a code point of 256 or greater then you'll get a byte
string of length 2 or more. On a big-endian system I think you'd get
b'\x00\x00\x00\xd5'.

> another side effect of null truncation: cannot decode truncated data
>
> >>> b.view('S4').tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
> >>> b.view('S4')[0]
> b'\xd5'
> >>> b.view('S4')[0].tostring()
> b'\xd5'
> >>> b.view('S4')[:1].tostring()
> b'\xd5\x00\x00\x00'
>
> >>> b.view('S4')[0].decode('utf-32LE')
> Traceback (most recent call last):
>   File "<pyshell#101>", line 1, in <module>
>     b.view('S4')[0].decode('utf-32LE')
>   File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
>     return codecs.utf_32_le_decode(input, errors, True)
> UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
> 0: truncated data
>
> >>> b.view('S4')[:1].tostring().decode('utf-32LE')
> 'Õ'
>
> numpy arrays need a decode and encode method

I'm not sure that they do. Rather, there needs to be a text dtype that knows
what encoding to use, so that it can expose a binary interface via .tostring()
and friends but produce unicode strings when indexed from Python code. Having
both a text and a binary interface to the same data implies having an encoding.
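The null-truncation problem quoted above can be reproduced end to end. A sketch, again assuming NumPy and using tobytes() in place of the older tostring():

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')
s = b.view('S4')  # reinterpret each 4-byte code point as a byte string

# Scalar indexing strips trailing NUL bytes, leaving only the low byte,
# which for code points below 256 is the character's latin-1 encoding:
assert s[0] == b'\xd5'
assert s[0].decode('latin-1') == 'Õ'

# The stripped bytes are no longer valid UTF-32LE...
try:
    s[0].decode('utf-32-le')
    raise AssertionError('expected UnicodeDecodeError')
except UnicodeDecodeError:
    pass

# ...but slicing preserves the full 4-byte item, so the round trip works:
assert s[:1].tobytes().decode('utf-32-le') == 'Õ'
```

The latin-1 assertion shows why the truncated result still "looks right" for small code points even though it is useless as UTF-32 data.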
Oscar
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion