On Thu, Jan 23, 2014 at 11:23:09AM -0500, [email protected] wrote:
>
> another curious example, encode utf-8 to latin-1 bytes
>
> >>> b
> array(['Õsc', 'zxc'],
>       dtype='<U3')
> >>> b[0].encode('utf8')
> b'\xc3\x95sc'
> >>> b[0].encode('latin1')
> b'\xd5sc'
> >>> b.astype('S')
> Traceback (most recent call last):
>   File "<pyshell#40>", line 1, in <module>
>     b.astype('S')
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
> >>> c = b.view('S4').astype('S1').view('S3')
> >>> c
> array([b'\xd5sc', b'zxc'],
>       dtype='|S3')
> >>> c[0].decode('latin1')
> 'Õsc'
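Okay, so it seems that .view() effectively gives you latin-1 bytes (it just
reinterprets the raw UCS-4 buffer, so on a little-endian array every code
point below 256 shows up as its low byte) whereas .astype() encodes with
ascii: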
>>> np.array(['Õsc']).astype('S4')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>> np.array(['Õsc']).view('S4')
array([b'\xd5', b's', b'c'],
      dtype='|S4')
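
The .view('S4') output can be reproduced without NumPy, which suggests it is
a reinterpretation of the buffer rather than a real encoding step. A minimal
sketch, assuming the '<U3' buffer is little-endian UCS-4 (4 bytes per
character) and that an 'S4' item drops trailing NUL bytes; view_as_s4 is a
hypothetical helper written for this illustration, not a NumPy function:

```python
def view_as_s4(text):
    """Emulate arr.view('S4') on a '<U' buffer (assumed little-endian UCS-4)."""
    raw = text.encode('utf-32-le')  # 4 bytes per character, low byte first
    chunks = [raw[i:i + 4] for i in range(0, len(raw), 4)]
    # An 'S4' item drops trailing NUL bytes, leaving only the low byte(s).
    return [chunk.rstrip(b'\x00') for chunk in chunks]


low_bytes = view_as_s4('Õsc')
print(low_bytes)                                       # [b'\xd5', b's', b'c']
print(b''.join(low_bytes) == 'Õsc'.encode('latin1'))   # True

# Above U+00FF the result is no longer a text encoding at all:
# U+20AC is stored as AC 20 00 00, and 0x20 happens to be a space.
print(view_as_s4('€'))                                 # [b'\xac ']
```

So the match with latin-1 only holds because the first 256 code points
coincide with latin-1 byte values.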
> --------
> The original numpy py3 conversion used latin-1 as default
> (It's still used in statsmodels, and I haven't looked at the structure
> under the common py2-3 codebase)
>
> if sys.version_info[0] >= 3:
>     import io
>     bytes = bytes
>     unicode = str
>     asunicode = str
These two functions are an abomination:
>     def asbytes(s):
>         if isinstance(s, bytes):
>             return s
>         return s.encode('latin1')
>     def asstr(s):
>         if isinstance(s, str):
>             return s
>         return s.decode('latin1')
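
To illustrate one way helpers like these can go wrong, here is a small
sketch. The two functions are copied from the quote above; the inputs are
my own choices for illustration and are not taken from statsmodels or
NumPy:

```python
def asbytes(s):
    if isinstance(s, bytes):
        return s
    return s.encode('latin1')


def asstr(s):
    if isinstance(s, str):
        return s
    return s.decode('latin1')


# The round trip works for text that fits in latin-1.
assert asstr(asbytes('Õsc')) == 'Õsc'

# Bytes that are really UTF-8 decode without any error, giving mojibake.
utf8_bytes = 'Õsc'.encode('utf8')   # b'\xc3\x95sc'
garbled = asstr(utf8_bytes)
print(repr(garbled))                # 'Ã\x95sc'

# Text outside latin-1 only fails at the point of conversion.
try:
    asbytes('€')
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)                       # True
```

Since every byte value is valid latin-1, asstr() can never signal that it
was handed data in a different encoding.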
Oscar