New submission from Martijn Pieters <[email protected]>:
When encountering identifier headers such as Message-ID containing a msg-id
token longer than 77 characters (including the <...> angle brackets), the email
package folds that header using RFC 2047 encoded words, e.g.
Message-ID:
<154810422972.4.16142961424846318...@aaf39fce-569e-473a-9453-6862595bd8da.prvt.dyno.rt.heroku.com>
becomes
Message-ID: =?utf-8?q?=3C154810422972=2E4=2E16142961424846318784=40aaf39fce-?=
=?utf-8?q?569e-473a-9453-6862595bd8da=2Eprvt=2Edyno=2Ert=2Eheroku=2Ecom=3E?=
The msg-id token here is this long because Heroku Dyno machines use a UUID in
the FQDN, but Heroku is hardly the only source of such long msg-id tokens.
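A minimal sketch for checking this behaviour on a given interpreter follows. The helper names and the msg-id value are hypothetical, and the result depends on the Python version in use, so the script only reports what it finds:

```python
import re
from email import policy
from email.message import EmailMessage

# Loose pattern for an RFC 2047 encoded word: =?charset?encoding?text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bq]\?[^?\s]*\?=", re.IGNORECASE)


def serialize_message_id(msg_id, pol=policy.default):
    """Return a message carrying only a Message-ID header, as text."""
    msg = EmailMessage(policy=pol)
    msg["Message-ID"] = msg_id
    return msg.as_string()


def uses_encoded_words(serialized):
    """Heuristic check for encoded words in serialized output."""
    return ENCODED_WORD.search(serialized) is not None


# Hypothetical msg-id, well past the 78-character soft limit.
long_id = "<" + "1" * 40 + "@" + "a" * 50 + ".example.com>"
text = serialize_message_id(long_id)
print(text)
print("encoded words present:", uses_encoded_words(text))
```

If the last line prints True, the interpreter folds the msg-id with encoded words as described above.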
Microsoft's Outlook.com / Office365 email servers balk at the RFC2047 encoded
word use here and attempt to wrap the email in a TNEF winmail.dat attachment,
then may fail at this under some conditions that I haven't quite worked out yet
and deliver an error message to the recipient with the helpful message "554
5.6.0 Corrupt message content", or just deliver the ever unhelpful winmail.dat
attachment to the unsuspecting recipient (I'm only noting these symptoms here
for future searches).
I encountered this issue with long Message-ID values generated by
email.utils.make_msgid(), but this applies to all RFC 5322 section 3.6.4
Identification Fields headers, as well as the corresponding headers from RFC
822 section 4.6 (covered by section 4.5.4 in 5322).
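To illustrate how easily such values arise, here is a small sketch that generates a msg-id for the Heroku-style domain from the example above. The exact length varies per call, because the generated local part includes a timestamp, the process id and a random number:

```python
from email.utils import make_msgid

# Domain taken from the example header above.
domain = "aaf39fce-569e-473a-9453-6862595bd8da.prvt.dyno.rt.heroku.com"

msg_id = make_msgid(domain=domain)
print(msg_id)
print("length including angle brackets:", len(msg_id))
```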
What is happening here is that the email._header_value_parser module has no
handling for the msg-id tokens *at all*, and email.headerregistry has no
dedicated header class for identifier headers. So these headers are parsed as
unstructured, and folded at will.
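One way to see which header class the registry picks for a given header name is sketched below. The helper name is hypothetical, and the classes reported depend on the Python version in use:

```python
from email.headerregistry import HeaderRegistry


def header_class_names(name):
    """Return the class names in the MRO of the registry's class for `name`."""
    cls = HeaderRegistry()[name]
    return [base.__name__ for base in cls.__mro__]


for header in ("message-id", "references", "to"):
    print(header, header_class_names(header))
```

A header that falls back to unstructured parsing should show an unstructured header class in its list, while "to" shows an address header class.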
RFC2047 section 5 on the other hand states that the msg-id token is strictly
off-limits, and no RFC2047 encoding should be used to encode such elements.
Because headers *can* exceed 78 characters (RFC 5322 section 2.1.1 states that
"Each line of characters MUST be no more than 998 characters, and SHOULD be no
more than 78 characters[.]") I think that RFC5322 msg-id tokens should simply
not be folded, at all. The obsoleted RFC822 syntax for msg-id makes them equal
to the addr-spec token, where the local-part (before the @) contains word
tokens; those would be fair game, but even then the RFC2047 encoded word
replacement should be applied only to those word tokens.
For now, I worked around the issue by using a custom policy that uses 998 as
the maximum line length for identifier headers:
    from email.policy import EmailPolicy

    # Headers that contain msg-id values, RFC5322
    MSG_ID_HEADERS = {'message-id', 'in-reply-to', 'references', 'resent-msg-id'}

    class MsgIdExcemptPolicy(EmailPolicy):
        def _fold(self, name, value, *args, **kwargs):
            if (name.lower() in MSG_ID_HEADERS and
                    self.max_line_length - len(name) - 2 < len(value)):
                # RFC 5322, section 2.1.1: "Each line of characters MUST be no
                # more than 998 characters, and SHOULD be no more than 78
                # characters, excluding the CRLF.". To avoid msg-id tokens from
                # being folded by means of RFC2047, fold identifier lines to the
                # max length instead.
                return self.clone(max_line_length=998)._fold(
                    name, value, *args, **kwargs)
            return super()._fold(name, value, *args, **kwargs)
This ignores the fact that In-Reply-To and References contain foldable
whitespace in between each msg-id, but it at least let us send email through
smtp.office365.com again without confusing recipients.
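For the multi-value headers, folding only at the whitespace between msg-id tokens could look roughly like the sketch below. This is a standalone illustration, not part of the email package; the function name and parameters are hypothetical, and it does not handle comments or the obsolete RFC 822 syntax:

```python
def fold_msg_ids(name, value, max_line_length=78, linesep="\r\n"):
    """Fold a header of whitespace-separated msg-ids without splitting any id.

    A msg-id longer than the limit ends up on a line of its own and is left
    over-length rather than encoded.
    """
    lines = [name + ":"]
    for token in value.split():
        if len(lines[-1]) + 1 + len(token) <= max_line_length:
            # The token still fits on the current line.
            lines[-1] += " " + token
        else:
            # Start a continuation line; the leading space marks the fold.
            lines.append(" " + token)
    return linesep.join(lines) + linesep


# Hypothetical msg-ids, 40 characters each.
ids = " ".join("<%s@example.com>" % (c * 26) for c in "abc")
print(fold_msg_ids("References", ids))
```

Unfolding the result (removing each line separator that precedes a space) gives back the original tokens separated by single spaces.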
----------
components: email
messages: 334210
nosy: barry, mjpieters, r.david.murray
priority: normal
severity: normal
status: open
title: email package folds msg-id identifiers using RFC2047 encoded words where
it must not
versions: Python 3.7, Python 3.8
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue35805>
_______________________________________