Re: nntplib encoding problem

Thomas L. Shinnick Sun, 27 Feb 2011 19:31:19 -0800

At 08:12 PM 2/27/2011, you wrote:

On 28/02/2011 01:31, Laurent Duchesne wrote:

Hi,


I'm using python 3.2 and got the following error:

nntpClient = nntplib.NNTP_SSL(...)
nntpClient.group("alt.binaries.cd.lossless")
nntpClient.over((534157,534157))

... 'subject': 'Myl\udce8ne Farmer - Anamorphosee (Japan Edition) 1995
[02/41] "Back.jpg" yEnc (1/3)' ...

overview = nntpClient.over((534157,534157))
print(overview[1][0][1]['subject'])

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
position 3: surrogates not allowed

I'm not sure if I should report this as a bug in nntplib or if I'm doing
something wrong.

Note that I get the same error if I try to write this data to a file:

h = open("output.txt", "a")
h.write(overview[1][0][1]['subject'])

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
position 3: surrogates not allowed

It looks like the subject was originally encoded as Latin-1 (or
similar) (b'Myl\xe8ne Farmer - Anamorphosee (Japan Edition) 1995
[02/41] "Back.jpg" yEnc (1/3)') but has been decoded as UTF-8 with
"surrogateescape" passed as the "errors" parameter.

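For illustration, here is a minimal sketch of how such a string can arise. It assumes the raw header bytes were Latin-1 and were decoded as UTF-8 with errors="surrogateescape"; the byte string is a shortened stand-in for the subject above, and this is not taken from the nntplib source.

```python
# Shortened stand-in for the raw subject bytes (assumed to be Latin-1).
raw = b'Myl\xe8ne Farmer'

# 0xE8 is not valid UTF-8 here, so surrogateescape maps it to U+DCE8
# instead of raising UnicodeDecodeError.
decoded = raw.decode("utf-8", "surrogateescape")
print(ascii(decoded))

# A strict UTF-8 encode (what print() or file.write() ends up doing on a
# UTF-8 terminal or file) rejects the lone surrogate.
try:
    decoded.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError as exc:
    strict_encode_failed = True
    print(exc.reason)
```

The lone surrogate is only a placeholder for the undecodable byte, which is why any strict encoder refuses to write it out.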

3.2 Docs
  6.6. codecs  Codec registry and base classes
    Possible values for errors are
      'surrogateescape': replace with surrogate U+DCxx, see PEP 383

Yes, it would have been 0xE8 -  Mylène

Googling on surrogateescape I can see lots of argument about unintended outcomes.... yikes!

You can get the "correct" Unicode by encoding as UTF-8 with
"surrogateescape" and then decoding as Latin-1:
overview[1][0][1]['subject'].encode("utf-8","surrogateescape").decode("latin-1")
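
A minimal sketch of that workaround as a reusable function. The name fix_subject and its encoding parameter are hypothetical, not part of nntplib, and Latin-1 is only a guess at what the poster's client used:

```python
def fix_subject(subject, encoding="latin-1"):
    """Recover the original header bytes, then decode them again.

    Hypothetical helper: `encoding` is a guess at the sender's charset.
    """
    # surrogateescape turns each U+DCxx placeholder back into byte 0xxx.
    raw = subject.encode("utf-8", "surrogateescape")
    return raw.decode(encoding)


# Shortened stand-in for overview[1][0][1]['subject'].
broken = 'Myl\udce8ne Farmer'
fixed = fix_subject(broken)

# The repaired string can now be encoded strictly, so print() and
# file.write() no longer raise.
encoded = fixed.encode("utf-8")
```

Note that this is only safe when the whole header really was Latin-1. A subject that already contained valid UTF-8 non-ASCII text would come out as mojibake, because its multi-byte sequences would be decoded one byte at a time.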

-- 
http://mail.python.org/mailman/listinfo/python-list
