Re: How do I decode unicode characters in the subject using email.message_from_string()?
John Machin sjmac...@lexicon.net wrote:
> On Feb 25, 11:07 am, Roy H. Han starsareblueandfara...@gmail.com wrote:
> > Dear python-list,
> >
> > I'm having some trouble decoding an email header using the standard
> > imaplib.IMAP4 class and email.message_from_string method.
> >
> > In particular, email.message_from_string() does not seem to properly
> > decode unicode characters in the subject.
> >
> > How do I decode unicode characters in the subject?
>
> You don't. You can't. You decode str objects into unicode objects. You
> encode unicode objects into str objects. If your input is not a str
> object, you have a problem.

I can't speak for the OP, but I had a similar (and possibly
identical-in-intent) question. Suppose you have a Subject line that
looks like this:

    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

How do you get the email module to decode that into unicode? The same
question applies to the other header lines, and the answer is that it
isn't easy: I had to read and reread the docs and experiment for a
while to figure it out. I understand there's going to be a sprint on
the email module at pycon; maybe some of this will get improved then.

Here's the final version of my test program. The third to last line is
one I thought ought to work, given that Header has a __unicode__ method.
The final line is the one that did work (note the kludge to turn None
into 'ascii'...IMO 'ascii' is what decode_header _should_ be returning,
and this code shows why!)

---
from email import message_from_string
from email.header import Header, decode_header

x = message_from_string("""\
To: test
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

this is a test.
""")

print x
print
for key, header in x.items():
    print key, 'type', type(header)
    print key+":", unicode(Header(header)).decode('utf-8')
    print key+":", decode_header(header)
    print key+":", ''.join([s.decode(t or 'ascii') for (s, t) in decode_header(header)]).encode('utf-8')
---

From nobody Wed Feb 25 08:35:29 2009
To: test
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

this is a test.

To type <type 'str'>
To: test
To: [('test', None)]
To: test
Subject type <type 'str'>
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
Subject: [("'u' Obselete type", None), ("-- it is identical to 'd'. (7)", 'iso-8859-1')]
Subject: 'u' Obselete type-- it is identical to 'd'. (7)

--RDM
--
http://mail.python.org/mailman/listinfo/python-list
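For readers on Python 3, the same idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the email package's own method: decode_mime_header and default_charset are hypothetical names introduced here, and the fallback to 'ascii' when the charset is None simply mirrors the kludge in the program above. On Python 3, decode_header() may return either bytes or str chunks, so both cases are handled.

```python
from email.header import decode_header


def decode_mime_header(raw, default_charset="ascii"):
    """Join the chunks returned by decode_header() into one text string.

    Hypothetical helper: chunks with no declared charset are decoded
    with default_charset, and undecodable bytes are replaced.
    """
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Encoded words (and their unencoded neighbours) arrive as bytes.
            parts.append(chunk.decode(charset or default_charset, errors="replace"))
        else:
            # A header with no encoded words comes back as plain str.
            parts.append(chunk)
    return "".join(parts)


subject = ("'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= "
           "=?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=")
print(decode_mime_header(subject))
```

An unknown charset name would still raise LookupError here; a fuller version would need to decide what to do in that case.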
Re: How do I decode unicode characters in the subject using email.message_from_string()?
Thanks for writing back, RDM and John Machin. Tomorrow I'll try the
code you suggested, RDM. It looks quite helpful and I'll report the
results.

In the meantime, John asked for more data. The sender's email client is
Microsoft Outlook 11. The recipient email client is Lotus Notes.

Actual Subject
=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=

Expected Subject
Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records

X-Mailer
Microsoft Office Outlook 11

X-MimeOLE
Produced By Microsoft MimeOLE V6.00.2900.5579

RHH

On Wed, Feb 25, 2009 at 8:39 AM, rdmur...@bitdance.com wrote:
> [...]
Re: How do I decode unicode characters in the subject using email.message_from_string()?
Roy H. Han wrote:
> On Wed, Feb 25, 2009 at 8:39 AM, rdmur...@bitdance.com wrote:
[Top-posting corrected]
> > [...]
> Thanks for writing back, RDM and John Machin. Tomorrow I'll try the
> code you suggested, RDM. It looks quite helpful and I'll report the
> results.
>
> In the meantime, John asked for more data. The sender's email client is
> Microsoft Outlook 11. The recipient email client is Lotus Notes.
>
> Actual Subject
> =?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=
>
> Expected Subject
> Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records
> [...]

>>> from email.header import decode_header
>>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
[('Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]

regards
 Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC      http://www.holdenweb.com/
Re: How do I decode unicode characters in the subject using email.message_from_string()?
Steve Holden st...@holdenweb.com wrote:
> >>> from email.header import decode_header
> >>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
> [('Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]

It is interesting that decode_header does what I would consider to be
the right thing (from a pragmatic standpoint) with that particular bit
of Microsoft not-quite-standards-compliant brain-damage; but removing
the tab is not in fact standards compliant, if I'm reading the RFC
correctly.

--RDM
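The behaviour discussed here can also be checked on Python 3 by feeding the decode_header() result to make_header() and converting the Header to str. This is a sketch for observation only; the variable names are illustrative, the subject string is the one posted in this thread, and nothing here settles which behaviour the RFCs require.

```python
from email.header import decode_header, make_header

# The folded, two-part encoded subject reported by the OP (CRLF + TAB fold).
raw_subject = ("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?="
               "\r\n\t=?us-ascii?Q?m_C/SR_Records?=")

# decode_header() yields (chunk, charset) pairs; make_header() turns them
# back into a Header object whose str() is the decoded text.
subject = str(make_header(decode_header(raw_subject)))

print(repr(subject))
print("tab survived:", "\t" in subject)
```

Printing the repr makes it easy to see whether the CRLF and the tab between the two encoded words are still present after decoding.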
Re: How do I decode unicode characters in the subject using email.message_from_string()?
Cool, it works!

Thanks, RDM, for stating the right approach.
Thanks, Steve, for teaching by example.

I wonder why the email.message_from_string() method doesn't call
email.header.decode_header() automatically.

On Wed, Feb 25, 2009 at 9:50 AM, rdmur...@bitdance.com wrote:
> [...]
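As a side note for readers on later Python 3 releases: the parser accepts a policy argument, and with email.policy.default the header values come back already decoded, while the default (compat32) policy keeps the raw encoded words as seen in this thread. The sketch below assumes Python 3.6 or later and reuses the test message from earlier in the thread; the variable names are illustrative only.

```python
import email
from email import policy

raw_message = (
    "To: test\n"
    "Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= "
    "=?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=\n"
    "\n"
    "this is a test.\n"
)

# Default (compat32) policy: the Subject is returned as the raw header text.
legacy_subject = str(email.message_from_string(raw_message)["Subject"])

# policy.default: the Subject is parsed and the encoded words are decoded.
modern_subject = str(
    email.message_from_string(raw_message, policy=policy.default)["Subject"]
)

print(legacy_subject)
print(modern_subject)
```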
Re: How do I decode unicode characters in the subject using email.message_from_string()?
rdmur...@bitdance.com wrote:
> Steve Holden st...@holdenweb.com wrote:
> > [...]
>
> It is interesting that decode_header does what I would consider to be
> the right thing (from a pragmatic standpoint) with that particular bit
> of Microsoft not-quite-standards-compliant brain-damage; but, removing
> the tab is not in fact standards compliant if I'm reading the RFC
> correctly.

You'd need to quote me chapter and verse on that. I understood that the
tab simply indicated continuation, but it's a *long* time since I read
the RFCs.

regards
 Steve
Re: How do I decode unicode characters in the subject using email.message_from_string()?
* Roy H. Han (Wed, 25 Feb 2009 10:17:22 -0500)
> Thanks, RDM, for stating the right approach.
> Thanks, Steve, for teaching by example.
>
> I wonder why the email.message_from_string() method doesn't call
> email.header.decode_header() automatically.

And I wonder why you would think the header contains Unicode characters
when it says "us-ascii" ("=?us-ascii?Q?"). I think there is a tendency
to label everything "Unicode" someone does not understand.

Thorsten
Re: How do I decode unicode characters in the subject using email.message_from_string()?
On Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe thors...@thorstenkampe.de wrote:
> * Roy H. Han (Wed, 25 Feb 2009 10:17:22 -0500)
> > [...]
> > I wonder why the email.message_from_string() method doesn't call
> > email.header.decode_header() automatically.
>
> And I wonder why you would think the header contains Unicode characters
> when it says "us-ascii" ("=?us-ascii?Q?"). I think there is a tendency
> to label everything "Unicode" someone does not understand.

And I wonder why you would think the header does *not* contain Unicode
characters when it says "us-ascii"? I think there is a tendency here
too...

--
Gabriel Genellina
Re: How do I decode unicode characters in the subject using email.message_from_string()?
* Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
> On Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe thors...@thorstenkampe.de wrote:
> > [...]
> > And I wonder why you would think the header contains Unicode characters
> > when it says "us-ascii" ("=?us-ascii?Q?"). I think there is a tendency
> > to label everything "Unicode" someone does not understand.
>
> And I wonder why you would think the header does *not* contain Unicode
> characters when it says "us-ascii"?

Basically because it didn't contain any Unicode characters (anything
outside the ASCII range).

Thorsten
Re: How do I decode unicode characters in the subject using email.message_from_string()?
Thorsten Kampe wrote:
> * Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
> > [...]
> > And I wonder why you would think the header does *not* contain Unicode
> > characters when it says "us-ascii"?
>
> Basically because it didn't contain any Unicode characters (anything
> outside the ASCII range).

And I imagine that Gabriel's point was -- and my point certainly is --
that Unicode includes all the characters *inside* the ASCII range.

TJG
Re: How do I decode unicode characters in the subject using email.message_from_string()?
On Wed, 25 Feb 2009 15:01:08 -0200, Thorsten Kampe thors...@thorstenkampe.de wrote:
> * Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
> > [...]
> > And I wonder why you would think the header does *not* contain Unicode
> > characters when it says "us-ascii"?
>
> Basically because it didn't contain any Unicode characters (anything
> outside the ASCII range).

I think you have to revise your definition of Unicode.

--
Gabriel Genellina
Re: How do I decode unicode characters in the subject using email.message_from_string()?
Steve Holden st...@holdenweb.com wrote:
> rdmur...@bitdance.com wrote:
> > [...]
> > It is interesting that decode_header does what I would consider to be
> > the right thing (from a pragmatic standpoint) with that particular bit
> > of Microsoft not-quite-standards-compliant brain-damage; but, removing
> > the tab is not in fact standards compliant if I'm reading the RFC
> > correctly.
>
> You'd need to quote me chapter and verse on that. I understood that the
> tab simply indicated continuation, but it's a *long* time since I read
> the RFCs.

Tab is not mentioned in RFC 2822 except to say that it is a valid
whitespace character. Header folding (insertion of CRLF) can occur in
most places where whitespace appears, and is defined in section 2.2.3
thusly:

   Each header field is logically a single line of characters comprising
   the field name, the colon, and the field body. For convenience
   however, and to deal with the 998/78 character limitations per line,
   the field body portion of a header field can be split into a multiple
   line representation; this is called "folding". The general rule is
   that wherever this standard allows for folding white space (not
   simply WSP characters), a CRLF may be inserted before any WSP. For
   example, the header field:

           Subject: This is a test

   can be represented as:

           Subject: This
            is a test

   [irrelevant note elided]

   The process of moving from this folded multiple-line representation
   of a header field to its single line representation is called
   "unfolding". Unfolding is accomplished by simply removing any CRLF
   that is immediately followed by WSP. Each header field should be
   treated in its unfolded form for further syntactic and semantic
   evaluation.

So, the whitespace characters are supposed to be left unchanged after
unfolding.

--David
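The unfolding rule quoted above (remove any CRLF that is immediately followed by WSP, and leave the whitespace itself alone) can be sketched in a few lines of Python 3. This is an illustration of the quoted rule only, not what the email package does internally; unfold is a hypothetical helper name.

```python
import re

# A CRLF counts as a fold only when the next character is a space or a tab.
# The lookahead keeps that whitespace character in the result.
_FOLD = re.compile(r"\r\n(?=[ \t])")


def unfold(header_text):
    """Return header_text with RFC 2822 folds removed, whitespace preserved."""
    return _FOLD.sub("", header_text)


print(repr(unfold("Subject: This\r\n is a test")))
print(repr(unfold("Subject: part one\r\n\tpart two")))
```

Under this rule the tab in the second example stays in the unfolded value, which is the point being made about the Outlook header.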
Re: How do I decode unicode characters in the subject using email.message_from_string()?
rdmur...@bitdance.com wrote:
[...]
>    The process of moving from this folded multiple-line representation
>    of a header field to its single line representation is called
>    "unfolding". Unfolding is accomplished by simply removing any CRLF
>    that is immediately followed by WSP. Each header field should be
>    treated in its unfolded form for further syntactic and semantic
>    evaluation.
>
> So, the whitespace characters are supposed to be left unchanged after
> unfolding.

That would certainly appear to be the case. Thanks.

regards
 Steve
Re: How do I decode unicode characters in the subject using email.message_from_string()?
* Tim Golden (Wed, 25 Feb 2009 17:27:07 +)
> Thorsten Kampe wrote:
> > * Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
> > > On Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe
> > > [...]
> > > > And I wonder why you would think the header contains Unicode
> > > > characters when it says "us-ascii" ("=?us-ascii?Q?"). I think
> > > > there is a tendency to label everything "Unicode" someone does
> > > > not understand.
> > >
> > > And I wonder why you would think the header does *not* contain
> > > Unicode characters when it says "us-ascii"?
> >
> > Basically because it didn't contain any Unicode characters (anything
> > outside the ASCII range).
>
> And I imagine that Gabriel's point was -- and my point certainly is --
> that Unicode includes all the characters *inside* the ASCII range.

I know that this was Gabriel's point. And my point was that Gabriel's
point was pointless. If you call any text (or character) "Unicode" then
the word "Unicode" is generalized to an extent where it doesn't mean
anything at all anymore and becomes a buzz word.

With the same reason you could call ASCII a Unicode encoding (which it
isn't) because all ASCII characters are Unicode characters (code
points). Only encodings that cover the full Unicode range can reasonably
be called Unicode encodings.

The OP just saw some "weird characters" in the email subject and thought
"I know. It looks weird. Must be Unicode." But it wasn't. It was good
ole ASCII - only Quoted Printable encoded.

Thorsten
Re: How do I decode unicode characters in the subject using email.message_from_string()?
On Wed, 25 Feb 2009 15:44:18 -0200, rdmur...@bitdance.com wrote:
> Tab is not mentioned in RFC 2822 except to say that it is a valid
> whitespace character. Header folding (insertion of crlf) can occur
> most places whitespace appears, and is defined in section 2.2.3 thusly:
> [...]
> So, the whitespace characters are supposed to be left unchanged after
> unfolding.

Yep, there is an old bug report sleeping in the tracker about this...

--
Gabriel Genellina
Re: How do I decode unicode characters in the subject using email.message_from_string()?
On Wed, 25 Feb 2009 16:19:35 -0200, Thorsten Kampe thors...@thorstenkampe.de wrote:
> * Tim Golden (Wed, 25 Feb 2009 17:27:07 +)
> > [...]
> > And I imagine that Gabriel's point was -- and my point certainly is --
> > that Unicode includes all the characters *inside* the ASCII range.
>
> I know that this was Gabriel's point. And my point was that Gabriel's
> point was pointless. If you call any text (or character) "Unicode" then
> the word "Unicode" is generalized to an extent where it doesn't mean
> anything at all anymore and becomes a buzz word.

If it's text, it should use Unicode. Maybe not now, but in a few years,
it will be totally unacceptable not to properly use Unicode to process
textual data.

> With the same reason you could call ASCII a Unicode encoding (which it
> isn't) because all ASCII characters are Unicode characters (code
> points). Only encodings that cover the full Unicode range can reasonably
> be called Unicode encodings.

Not at all. ASCII is as valid a character encoding ("coded character
set", as the Unicode guys like to say) as ISO 10646 (which covers the
whole range).

> The OP just saw some "weird characters" in the email subject and thought
> "I know. It looks weird. Must be Unicode." But it wasn't. It was good
> ole ASCII - only Quoted Printable encoded.

Good f*cked ASCII is Unicode too.

--
Gabriel Genellina
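The one uncontested fact in this exchange, that every ASCII character is also a Unicode code point with the same number, is easy to check in Python 3. The snippet below is a small sketch with illustrative variable names; it shows only that pure-ASCII bytes decode to the same text under several codecs, and takes no side on what deserves to be called a "Unicode encoding".

```python
# All 128 ASCII byte values, 0x00 through 0x7F.
ascii_bytes = bytes(range(128))

# Decoding the same bytes with three different codecs.
as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
as_latin1 = ascii_bytes.decode("iso-8859-1")

# The resulting text is identical, and each character's code point
# equals the original byte value.
print(as_ascii == as_utf8 == as_latin1)
print([ord(c) for c in as_ascii] == list(range(128)))
```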
Re: How do I decode unicode characters in the subject using email.message_from_string()?
On Feb 25, 11:07 am, Roy H. Han starsareblueandfara...@gmail.com wrote:
> Dear python-list,
>
> I'm having some trouble decoding an email header using the standard
> imaplib.IMAP4 class and email.message_from_string method.
>
> In particular, email.message_from_string() does not seem to properly
> decode unicode characters in the subject.
>
> How do I decode unicode characters in the subject?

You don't. You can't. You decode str objects into unicode objects. You
encode unicode objects into str objects. If your input is not a str
object, you have a problem.

I'm no expert on the email package, but experts don't have crystal
balls, so let's gather some data for them while we're waiting for their
timezones to align:

Presumably your code is doing something like:

    msg = email.message_from_string(a_string)

Please report the results of

    print repr(a_string)

and

    print type(msg)
    print msg.items()

and tell us what you expected.

Cheers,
John
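The decode/encode direction described above carries over to Python 3 with the names shifted: the Python 2 str corresponds roughly to bytes, and the Python 2 unicode corresponds to str. A minimal sketch follows; the byte string is a made-up ISO-8859-1 example, not data from this thread.

```python
# Hypothetical input: the word "café" as ISO-8859-1 bytes.
raw = b"caf\xe9"

# Decoding goes from bytes to text, and needs the right charset.
text = raw.decode("iso-8859-1")

# Encoding goes from text back to bytes, here in a different charset.
utf8_bytes = text.encode("utf-8")

print(repr(raw), type(raw).__name__)
print(repr(text), type(text).__name__)
print(repr(utf8_bytes), type(utf8_bytes).__name__)
```

The repr() calls are in the same spirit as the debugging advice above: they show exactly which bytes or characters are present, without any terminal rendering getting in the way.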