On Thu, Jan 23, 2014 at 1:36 PM, Oscar Benjamin
<oscar.j.benja...@gmail.com> wrote:
> On 23 January 2014 17:42, <josef.p...@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 12:13 PM, <josef.p...@gmail.com> wrote:
>>> On Thu, Jan 23, 2014 at 11:58 AM, <josef.p...@gmail.com> wrote:
>>>>
>>>> No, a view doesn't change the memory, it just changes the
>>>> interpretation and there shouldn't be any conversion involved.
>>>> astype does type conversion, but it goes through ascii encoding, which
>>>> fails.
>>>>
>>>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>>>> b.tostring()
>>>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>>>> b.view('S12')
>>>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>>>       dtype='|S12')
>>>>
>>>> The conversion happens somewhere in the array creation, but I have no
>>>> idea about the memory encoding for ucs2 and the low-level layouts.
>>
>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>> b[0].tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>>>> 'Õsc'.encode('utf-32LE')
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>
>> Is that the encoding for 'U'?
>
> On a little-endian system, yes. I realise what's happening now. 'U'
> represents unicode characters as a 32-bit unsigned integer giving the
> code point of the character. The first 256 code points are exactly the
> 256 characters representable with latin-1, in the same order.
>
> So 'Õ' has the code point 0xd5 and is encoded as the byte 0xd5 in
> latin-1. As a 32-bit integer the code point is 0x000000d5, but in
> little-endian format that becomes the 4 bytes 0xd5,0x00,0x00,0x00. So
> when you reinterpret that as 'S4' it strips the remaining nulls to get
> the byte string b'\xd5', which is the latin-1 encoding for the
> character. The same will happen for any string of latin-1 characters.
> However, if you do have a code point of 256 or greater then you'll get
> a byte string of length 2 or more.
>
> On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.
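The byte layouts described above can be checked directly by pinning the byte order in the dtype ('<U1' vs '>U1'), so the result does not depend on the host machine. This is a small sketch using the modern tobytes() spelling of tostring():

```python
import numpy as np

# 'Õ' is code point U+00D5; 'U' stores each character as a 4-byte
# (UCS-4) code point, and '<'/'>' pin the byte order explicitly.
le = np.array(['Õ'], dtype='<U1')
be = np.array(['Õ'], dtype='>U1')

print(le.tobytes())  # b'\xd5\x00\x00\x00'  (little-endian)
print(be.tobytes())  # b'\x00\x00\x00\xd5'  (big-endian)

# Reinterpreting the little-endian bytes as 'S4' strips the trailing
# NULs, leaving b'\xd5' -- exactly the latin-1 encoding of 'Õ'.
print(le.view('S4')[0])  # b'\xd5'
```

The big-endian layout keeps its leading NULs on 'S4' access, since 'S' only strips trailing NULs, which is why the two orders behave so differently here.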
A curious consequence of this, if we have only 1-character elements:

>>> a = np.array([si.encode('utf-16LE') for si in ['Õ', 'z']], dtype='S')
>>> a32 = np.array([si.encode('utf-32LE') for si in ['Õ', 'z']], dtype='S')
>>> a[0], a32[0]
(b'\xd5', b'\xd5')
>>> a[0] == a32[0]
True

>>> a32 = np.array([si.encode('utf-32BE') for si in ['Õ', 'z']], dtype='S')
>>> a = np.array([si.encode('utf-16BE') for si in ['Õ', 'z']], dtype='S')
>>> a[0], a32[0]
(b'\x00\xd5', b'\x00\x00\x00\xd5')
>>> a[0] == a32[0]
False

Josef

>
>> another side effect of null truncation: cannot decode truncated data
>>
>>>>> b.view('S4').tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>> b.view('S4')[0]
>> b'\xd5'
>>>>> b.view('S4')[0].tostring()
>> b'\xd5'
>>>>> b.view('S4')[:1].tostring()
>> b'\xd5\x00\x00\x00'
>>
>>>>> b.view('S4')[0].decode('utf-32LE')
>> Traceback (most recent call last):
>>   File "<pyshell#101>", line 1, in <module>
>>     b.view('S4')[0].decode('utf-32LE')
>>   File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
>>     return codecs.utf_32_le_decode(input, errors, True)
>> UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
>> 0: truncated data
>>
>>>>> b.view('S4')[:1].tostring().decode('utf-32LE')
>> 'Õ'
>>
>> numpy arrays need a decode and encode method
>
> I'm not sure that they do. Rather there needs to be a text dtype that
> knows what encoding to use in order to have a binary interface, as
> exposed by .tostring() and friends, but produce unicode strings
> when indexed from Python code. Having both a text and a binary
> interface to the same data implies having an encoding.
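The null-truncation problem in the quoted session can be reproduced end to end. This is a sketch of the same behaviour using the modern tobytes() spelling: scalar indexing of an 'S' view strips the trailing NULs and leaves bytes that are no longer valid UTF-32, while slicing keeps the full element width and round-trips cleanly.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')
v = b.view('S4')  # one 4-byte code-point chunk per character

# Scalar access strips trailing NULs -> 1 byte, invalid as UTF-32...
print(v[0])  # b'\xd5'
try:
    v[0].decode('utf-32LE')
except UnicodeDecodeError as exc:
    print(exc)  # 'utf32' codec can't decode ... truncated data

# ...while a slice keeps all 4 bytes per element, so decoding works.
print(v[:1].tobytes())  # b'\xd5\x00\x00\x00'
print(v[:1].tobytes().decode('utf-32LE'))  # 'Õ'
```

The asymmetry comes from 'S' scalars being NUL-stripped on access, whereas tobytes() on an array (or slice) dumps the raw buffer unchanged.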
>
>
> Oscar
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion