New submission from Anders Kaseorg <ande...@mit.edu>:

We ran into a UnicodeEncodeError exception using email.parser to parse this email <https://lists.cam.ac.uk/pipermail/cl-isabelle-users/2021-February/msg00135.html>, with full headers available in the raw archive <https://lists.cam.ac.uk/pipermail/cl-isabelle-users/2021-February.txt>. The offending header is hilariously invalid:

Content-Type: text/plain; charset*=utf-8”''utf-8%E2%80%9D

but I’m filing an issue since the parser is intended to be robust against invalid input. Minimal reproduction:

>>> import email, email.policy
>>> email.message_from_bytes(b"Content-Type: text/plain; charset*=utf-8\xE2\x80\x9D''utf-8%E2%80%9D", policy=email.policy.default)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/email/__init__.py", line 46, in message_from_bytes
    return BytesParser(*args, **kws).parsebytes(s)
  File "/usr/local/lib/python3.10/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/local/lib/python3.10/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/local/lib/python3.10/email/parser.py", line 57, in parse
    return feedparser.close()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 187, in close
    self._call_parse()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/local/lib/python3.10/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/local/lib/python3.10/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/local/lib/python3.10/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 608, in __call__
    return self[name](name, value)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 453, in parse
    kwds['decoded'] = str(parse_tree)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in __str__
    return ''.join(str(x) for x in self)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in <genexpr>
    return ''.join(str(x) for x in self)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 798, in __str__
    for name, value in self.params:
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 783, in params
    value = value.decode(charset, 'surrogateescape')
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 5-7: surrogates not allowed

----------
components: email
messages: 387685
nosy: andersk, barry, r.david.murray
priority: normal
severity: normal
status: open
title: UnicodeEncodeError: surrogates not allowed when parsing invalid charset
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43323>
_______________________________________
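
For callers that hit this on an affected version, one possible interim guard is to retry the parse with the legacy policy when the default policy raises. This is only a sketch and not part of the original report: `parse_leniently` is a hypothetical helper name, and the fallback assumes that `email.policy.compat32` does not run the structured header parser when a header is fetched, so it should avoid the failing code path.

```python
import email
import email.message
import email.policy

# Header from the report: the charset name itself contains a non-ASCII quote.
RAW = b"Content-Type: text/plain; charset*=utf-8\xE2\x80\x9D''utf-8%E2%80%9D"


def parse_leniently(raw: bytes) -> email.message.Message:
    """Hypothetical helper: try the default policy, fall back to compat32."""
    try:
        return email.message_from_bytes(raw, policy=email.policy.default)
    except UnicodeError:
        # UnicodeEncodeError is a subclass of UnicodeError, so the exception
        # from the traceback in the report is caught here.
        return email.message_from_bytes(raw, policy=email.policy.compat32)


msg = parse_leniently(RAW)
print(type(msg).__name__)
```

Note the trade-off: when the fallback is taken, the result is a legacy `email.message.Message` rather than an `EmailMessage`, so code that relies on the newer API (for example `get_content()`) would need its own handling.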