On 23 January 2014 17:42,  <josef.p...@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 12:13 PM,  <josef.p...@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:58 AM,  <josef.p...@gmail.com> wrote:
>>>
>>> No, a view doesn't change the memory, it just changes the
>>> interpretation, and there shouldn't be any conversion involved.
>>> astype does type conversion, but it goes through ASCII encoding, which fails.
>>>
>>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>>> b.tostring()
>>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>>> b.view('S12')
>>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>>       dtype='|S12')
>>>
>>> The conversion happens somewhere in the array creation, but I have no
>>> idea about the memory encoding for UCS-2 builds or the low-level layouts.
>
>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>> b[0].tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>>> 'Õsc'.encode('utf-32LE')
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>
> Is that the encoding for 'U'?

On a little-endian system, yes. I realise what's happening now. The 'U'
dtype represents each unicode character as a 32-bit unsigned integer
giving the code point of the character. The first 256 code points are
exactly the 256 characters representable in latin-1, in the same order.
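
You can see this directly by viewing the array as integers (a quick
sketch, assuming a little-endian build so that '<U3' is the native
layout):

>>> import numpy as np
>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>> b.view(np.uint32)
array([213, 115,  99, 122, 120,  99], dtype=uint32)
>>> hex(ord('Õ'))
'0xd5'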

So 'Õ' has the code point 0xd5 and is encoded as the single byte 0xd5
in latin-1. As a 32-bit integer the code point is 0x000000d5, but in
little-endian format that becomes the four bytes 0xd5,0x00,0x00,0x00.
So when you reinterpret that as 'S4', indexing strips the trailing
nulls and you get the byte string b'\xd5', which is exactly the latin-1
encoding of the character. The same will happen for any string of
latin-1 characters. However, if you have a code point of 256 or greater
then you'll get a byte string of length 2 or more.
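
For example (a sketch; 'Ā' is U+0100, the first code point outside
latin-1):

>>> b.view('S4')[0].decode('latin-1')
'Õ'
>>> c = np.array(['Ā'], dtype='<U1')
>>> c.view('S4')[0]
b'\x00\x01'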

On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.
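
You don't need big-endian hardware to check this; a byte-swapped dtype
shows the same layout (a sketch):

>>> nb = np.array(['Õsc'], dtype='>U3')
>>> nb.tostring()
b'\x00\x00\x00\xd5\x00\x00\x00s\x00\x00\x00c'

Note that reinterpreting this layout as 'S4' keeps all four bytes of
the first character (the nulls are now leading rather than trailing),
so the latin-1 trick above only works for the little-endian layout.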

> another side effect of null truncation: you cannot decode the truncated data
>
>>>> b.view('S4').tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>> b.view('S4')[0]
> b'\xd5'
>>>> b.view('S4')[0].tostring()
> b'\xd5'
>>>> b.view('S4')[:1].tostring()
> b'\xd5\x00\x00\x00'
>
>>>> b.view('S4')[0].decode('utf-32LE')
> Traceback (most recent call last):
>   File "<pyshell#101>", line 1, in <module>
>     b.view('S4')[0].decode('utf-32LE')
>   File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
>     return codecs.utf_32_le_decode(input, errors, True)
> UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
> 0: truncated data
>
>>>> b.view('S4')[:1].tostring().decode('utf-32LE')
> 'Õ'
>
> numpy arrays need a decode and encode method

I'm not sure that they do. Rather, there needs to be a text dtype that
knows what encoding to use, so that it can expose a binary interface
via .tostring() and friends but produce unicode strings when indexed
from Python code. Having both a text interface and a binary interface
to the same data implies having an encoding.
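
To illustrate, here is roughly the round-trip such a dtype would have
to do internally (a rough sketch with hypothetical helper names, not a
proposal for the actual API):

import numpy as np

def encode_text(strings, encoding='utf-32-le', itemsize=12):
    # hypothetical helper: unicode text -> fixed-width binary
    raw = [s.encode(encoding).ljust(itemsize, b'\x00') for s in strings]
    return np.array(raw, dtype='S%d' % itemsize)

def decode_item(item, encoding='utf-32-le', itemsize=12):
    # hypothetical helper: re-pad the trailing nulls that 'S' indexing
    # strips, then decode back to unicode
    return item.ljust(itemsize, b'\x00').decode(encoding).rstrip('\x00')

a = encode_text(['Õsc', 'zxc'])
print(decode_item(a[0]))   # Õsc

With an encoding attached to the dtype, the truncated-data error above
would go away, because the dtype knows how many bytes each element
really occupies.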


Oscar