New submission from Zirak <ziraker...@gmail.com>: Example:
In [52]: import email.message In [53]: m = email.message.Message() In [54]: m.set_payload('abc', 'utf8') In [55]: m.get_payload() # correctly encoded Out[55]: 'YWJj\n' In [56]: m.set_payload('abc', 'utf8') In [57]: m.get_payload() # no more encoding? Out[57]: 'abc' In [58]: m.get_payload(decode=True) # wut? Out[58]: b'i\xb7' In [59]: print(str(m)) MIME-Version: 1.0 Content-Type: text/plain; charset="utf8" Content-Transfer-Encoding: base64 abc While the first `set_payload` correctly encodes and sets the message's Content-Transfer-Encoding, the second call doesn't properly encode the payload according to its existing Content-Transfer-Encoding. Tested on 3.6, 3.5 and 2.7. `email.message.set_payload` does not directly encode the payload, instead `email.message.set_charset` does, around line 353: https://github.com/python/cpython/blob/b067c8fdd1e205bd0411417b6d5e4b832c3773fc/Lib/email/message.py#L353-L368 In both invocations of `set_payload`, the payload is not encoded according to the encoding. On the first invocation, the `CTE` header is correctly set according to `charset.get_body_encoding` (#354 and #368) and the payload is encoded (#356 or #367, the latter in this case). On the second invocation, the `CTE` header is already set, so the payload is never encoded. This is especially dangerous when passing `decode=True` to `get_payload` after the 2nd `set_payload`, as that may throw an error in some cases (trying to base64 decode a string which makes no sense to it. that's how I arrived on this bug, but I can't for the life of me replicate an exception). This is a bit finicky to fix. If we change `set_charset` to always encode the current payload, we risk double-encoding when `set_charset` is not called through `set_payload`. However if `set_charset` tries to decode the payload, it will produce incorrect results when *it is* called through `set_payload`. urgh. We can move the actual encoding code away from `set_charset`, either into `set_payload` or a third function, but that'll break any code calling `set_payload` without a charset and then calling `set_charset`. urgh. One possible solution is for both `set_charset` and `set_payload` to call a third function, e.g. `_encode_payload`. Perhaps something like (pseudocode): def set_payload(self, payload, charset): # ... if 'Content-Transfer-Encoding' in self: self._payload = self._encode_payload(payload) self.set_charset(charset) # ... def set_charset(self, charset): # ... if 'Content-Transfer-Encoding' not in self: self._payload = self._encode_payload() self.add_header(...) def _encode_payload(self, payload): # code in lines 353-366 This way, `set_charset` handles the cases where CTE was never defined, and `set_payload` makes sure to encode the payload when a CTE is present. It may work, but something about this gives me unrest. For example, if you `set_charset` once and it encodes your payload as base64, and you `set_charset` again with a charset whose `get_body_encoding` returns qp, the payload would still be base64 even though it *should be* qp. urgh. Is this a big enough concern? Is there a superior approach I'm missing? Thanks in advance! ---------- components: email messages: 304628 nosy: Zirak Ertan, barry, r.david.murray priority: normal severity: normal status: open title: Calling email.message.set_payload twice produces an invalid eml type: behavior versions: Python 2.7, Python 3.5, Python 3.6 _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue31820> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com