On Jun 4, 2:38 am, Daniel Mahoney <[EMAIL PROTECTED]> wrote:
> I'm working on an app that's processing Usenet messages. I'm making a
> connection to my NNTP feed and grabbing the headers for the groups I'm
> interested in, saving the info to disk, and doing some post-processing.
> I'm finding a few bizarre characters and I'm not sure how to handle them
> pythonically.
>
> One of the lines I'm finding this problem with contains:
> 137050 Cleo and I have an anouncement! "Mlle. =?iso-8859-1?Q?Ana=EFs?="
> <[EMAIL PROTECTED]> Sun, 21 Nov 2004 16:21:50 -0500
> <[EMAIL PROTECTED]> 4478 69 Xref:
> sn-us rec.pets.cats.community:137050
>
> The interesting patch is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
> An HTML rendering of what this string should look would be "Anaïs".
>
> What I'm doing now is a brute-force substitution from the version in the
> file to the HTML version. That's ugly. What's a better way to translate
> that string? Or is my problem that I'm grabbing the headers from the NNTP
> server incorrectly?

>>> from email.Header import decode_header
>>> decode_header("=?iso-8859-1?Q?Ana=EFs?=")
[('Ana\xefs', 'iso-8859-1')]
>>> (s, e), = decode_header("=?iso-8859-1?Q?Ana=EFs?=")
>>> s
'Ana\xefs'
>>> e
'iso-8859-1'
>>> s.decode(e)
u'Ana\xefs'
>>> import unicodedata
>>> import htmlentitydefs
>>> for c in s.decode(e):
...     print ord(c), unicodedata.name(c)
...
65 LATIN CAPITAL LETTER A
110 LATIN SMALL LETTER N
97 LATIN SMALL LETTER A
239 LATIN SMALL LETTER I WITH DIAERESIS
115 LATIN SMALL LETTER S
>>> htmlentitydefs.codepoint2name[239]
'iuml'
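
The session above could be wrapped up roughly as follows. This is only a sketch, and it assumes Python 3, where the modules are spelled email.header and html.entities rather than email.Header and htmlentitydefs. The helper names decode_mime_header and to_html_entities are made up for illustration; they are not part of any library.

```python
from email.header import decode_header
from html.entities import codepoint2name


def decode_mime_header(raw):
    """Decode RFC 2047 encoded words in a header value (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Unlabelled byte chunks are assumed to be ASCII; an unknown
            # charset name would raise LookupError and is not handled here.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # With no encoded words present, decode_header returns str as-is.
            parts.append(chunk)
    return "".join(parts)


def to_html_entities(text):
    """Replace characters with HTML entities where needed (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in codepoint2name:
            # Named entity, e.g. 239 -> &iuml; (also covers &, <, > and ").
            out.append("&%s;" % codepoint2name[cp])
        elif cp < 128:
            out.append(ch)
        else:
            # No named entity known: fall back to a numeric reference.
            out.append("&#%d;" % cp)
    return "".join(out)


name = decode_mime_header("=?iso-8859-1?Q?Ana=EFs?=")
html_name = to_html_entities(name)
```

Here name should come out as the five-character string spelled out by the unicodedata loop above, and html_name as "Ana&iuml;s". If only the decoded text is needed, str(email.header.make_header(decode_header(raw))) is probably a shorter route to the same string.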