Terry J. Reedy <tjre...@udel.edu> added the comment:

Reading http://tools.ietf.org/html/rfc1843 suggests that the reason that there 
is no HZ pair in cjkencodings.py is that it is not a cjkencoding. Instead it is 
a formatter or meta-encoding for intermixing ascii codes and GB2312(-80) codes. 
(I assume the '-80' suffix means the 1980 version.)

In a bytes environment, I believe a strict HZ decoder would simply separate the 
input bytes into alternating ascii and GB bytes by splitting on the shift 
chars, changing '~~' to '~', and deleting '~\n' (2 chars). So it would need a 
special-case test. Python shifts between ascii and GB2312 decoders to produce a 
unicode stream. Because of the deletion of line-continuation markers, the codec 
is not 1 to 1. A test sentence should contain both that and an encoded ~.

>>> hz=b'''\
This ASCII sentence has a tilde: ~~.
The next sentence is in GB.~{<:Ky2;S{#,~}~
~{NpJ)l6HK!#~}Bye.'''
>>> hz
b'This ASCII sentence has a tilde: ~~.\nThe next sentence is in GB.~{<:Ky2;S{#,~}~\n~{NpJ)l6HK!#~}Bye.'
>>> HZ = hz.decode('HZ')
>>> HZ
'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.己所不欲,勿施於人。Bye.'
# second '\n' deleted
>>> HZ.encode('HZ')
b'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.'
# no '~}~\n~{' in the middle of GB codes.
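
The strict splitting described above can be sketched roughly as follows. This 
is only an illustration of the idea, not how the real hz codec is implemented; 
split_hz and decode_hz are hypothetical names, and the GB step assumes that 
setting the high bit of each byte yields bytes the gb2312 codec accepts.

```python
def split_hz(data):
    """Split HZ bytes into ('ascii' | 'gb', chunk) pairs.

    Hypothetical helper: handles '~~', '~\\n', '~{' and '~}' only.
    """
    segments = []
    mode = 'ascii'
    buf = bytearray()
    i = 0
    while i < len(data):
        if mode == 'ascii':
            if data[i] == 0x7E:                  # '~' starts an escape
                nxt = data[i + 1:i + 2]
                if nxt == b'~':                  # '~~' is a literal tilde
                    buf.append(0x7E)
                elif nxt == b'\n':               # '~\n' is deleted
                    pass
                elif nxt == b'{':                # shift to GB mode
                    if buf:
                        segments.append(('ascii', bytes(buf)))
                    buf = bytearray()
                    mode = 'gb'
                else:
                    raise ValueError('bad escape at offset %d' % i)
                i += 2
            else:
                buf.append(data[i])
                i += 1
        else:
            pair = data[i:i + 2]                 # GB mode is read in pairs
            if pair == b'~}':                    # shift back to ASCII
                if buf:
                    segments.append(('gb', bytes(buf)))
                buf = bytearray()
                mode = 'ascii'
            else:
                buf += pair
            i += 2
    if buf:
        segments.append((mode, bytes(buf)))
    return segments


def decode_hz(data):
    """Decode by handing each chunk to the ascii or gb2312 codec."""
    out = []
    for mode, chunk in split_hz(data):
        if mode == 'ascii':
            out.append(chunk.decode('ascii'))
        else:
            # Assumption: set the high bit, then use the 8-bit gb2312 codec.
            out.append(bytes(b | 0x80 for b in chunk).decode('gb2312'))
    return ''.join(out)
```

Calling split_hz(hz) on the test bytes above should give four chunks, with the 
two GB chunks kept apart because the '~\n' between them is dropped in ASCII 
mode.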

I believe hz and u8=HZ.encode() should work as a test pair for the working of 
the hz parser itself:
>>> u8 = HZ.encode()
>>> u8
b'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.\xe5\xb7\xb1\xe6\x89\x80\xe4\xb8\x8d\xe6\xac\xb2\xef\xbc\x8c\xe5\x8b\xbf\xe6\x96\xbd\xe6\x96\xbc\xe4\xba\xba\xe3\x80\x82Bye.'
>>> u8.decode() == hz.decode('HZ')
True

However, I have no idea what the hz codec is doing with the shifted byte pairs 
between '~{' and '~}'. All the gb codecs decode b'<:Ky2;S{#,NpJ)l6HK!#' to 
'<:Ky2;S{#,NpJ)l6HK!#' (i.e., ascii chars to the same unicode chars). And they 
encode '己所不欲,勿施於人。' to bytes with the high bit set.

I figured it out. The 1995 rfc says "A GB (GB1 and GB2) code is a two byte 
code, where the first byte is in the range $21-$77 (hexadecimal), and the 
second byte is in the range $21-$7E." This was in the days of 7-bit bytes, 
at least for safe transmission. Now that we use 8-bit bytes nearly everywhere, 
the gb specs have probably been updated since 1980. This makes hz rather 
obsolete, since high-bit unset ascii codes and high-bit set gb codes can be 
mixed without the hz wrapping. In any case, Python's gb codecs act this way. So 
the hz codec is setting and unsetting the high bit when passing bytes to and 
from the gb codec (assuming it does not use a modified version internally).
>>> hhz = [c - 128 for c in '己所不欲,勿施於人。'.encode('GB2312')]
>>> bytes(hhz)
b'<:Ky2;S{#,NpJ)l6HK!#'

Perhaps there should be a separate test like the above to be sure that hz 
really uses GB2312-80, as specified.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12057>
_______________________________________