[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset
New submission from Marko Lalic:

When the message's Content-Transfer-Encoding is set to 8bit, the get_payload(decode=True) method returns the payload encoded using raw-unicode-escape. This means that it is impossible to decode the returned bytes using the content charset obtained by the get_content_charset method.

It seems this should be fixed so that get_payload returns the bytes as found in the payload when Content-Transfer-Encoding is 8bit, exactly like Python2.7 handles it.

>>> from email import message_from_string
>>> message = message_from_string("""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
...
... ünicöde data..""")
>>> message.get_content_charset()
'utf-8'
>>> message.get_payload(decode=True)
b'\xfcnic\xf6de data..'
>>> message.get_payload(decode=True).decode(message.get_content_charset())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte
>>> message.get_payload(decode=True).decode('raw-unicode-escape')
'ünicöde data..'

----------
components: email
messages: 191526
nosy: barry, mlalic, r.david.murray
priority: normal
severity: normal
status: open
title: get_payload method returns bytes which cannot be decoded using the message's charset
type: behavior
versions: Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18271
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
Serhiy Storchaka added the comment:

>>> message.get_payload(decode=True).decode('latin1')
'ünicöde data..'

----------
nosy: +serhiy.storchaka
versions: +Python 3.4
Marko Lalic added the comment:

That will work fine as long as the characters are actually in the Latin-1 range. We cannot forget the rest of the Unicode character planes. Consider::

>>> message = message_from_string("""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
...
... 한글ᥡ╥ສए""")
>>> message.get_payload(decode=True).decode('latin1')
'\\ud55c\\uae00\\u1961\\u2565\\u0eaa\\u090f'
>>> message.get_payload(decode=True).decode('raw-unicode-escape')
'한글ᥡ╥ສए'

However, even if latin1 did work, the main point is that a different encoding than the one the message specifies must be used in order to decode the bytes to a unicode string.
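The difference between the two sessions comes down to how the raw-unicode-escape codec behaves: code points below 256 are written as single Latin-1 bytes, while everything else becomes a literal \uXXXX escape. The following is a minimal sketch that uses only the built-in codecs, with no email parsing involved; the variable names are illustrative.

```python
# Illustrative only: shows why decoding with latin1 appears to work for
# Latin-1 text but not for characters outside that range.
latin_text = "ünicöde data.."
other_text = "한글"

latin_bytes = latin_text.encode("raw-unicode-escape")
other_bytes = other_text.encode("raw-unicode-escape")

print(latin_bytes)   # b'\xfcnic\xf6de data..'
print(other_bytes)   # b'\\ud55c\\uae00'

# latin1 maps each byte to the code point of the same value, so it
# recovers Latin-1 text but leaves the escapes as literal characters.
print(latin_bytes.decode("latin1"))   # ünicöde data..
print(other_bytes.decode("latin1"))   # \ud55c\uae00

# Only raw-unicode-escape round-trips both strings.
print(latin_bytes.decode("raw-unicode-escape") == latin_text)  # True
print(other_bytes.decode("raw-unicode-escape") == other_text)  # True
```

In both cases the bytes are not the UTF-8 encoding of the text, which is why decoding with the declared charset cannot give the original string back.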
R. David Murray added the comment:

The Python 3 email package's handling of 8bit definitely has quirks. (So did the Python 2 email package's, but they were different quirks. :)

You can't correctly handle 8bit unless you use message_from_bytes and take the input from a byte string. It is a good question what should be done with a unicode string that claims its payload is 8bit...since that situation can't arise on the wire (or in a disk file), perhaps it should produce an exception ("message must be parsed as binary data"?). The problem with that idea is that the email parser promises to never raise errors, but always produce *some* sort of model from the input, possibly with defects attached.

All that aside, here is what you want to be doing:

>>> from email import message_from_bytes
>>> message = message_from_bytes(b"""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
...
... \xc3\xbcnic\xc3\xb6de data..""")
>>> message.get_content_charset()
'utf-8'
>>> message.get_payload(decode=True)
b'\xc3\xbcnic\xc3\xb6de data..'
>>> message.get_payload(decode=True).decode('utf-8')
'ünicöde data..'
>>> message.get_payload()
'ünicöde data..'

You will note that get_payload without the decode automatically does the charset decode. I know this is counter-intuitive, but we are dealing with a legacy API that I had to retrofit. Think of decode=True as "produce binary from the wire content transfer encoding", and decode=False as "produce the string representation of the payload". For ASCII content-transfer-encodings, this is more intuitive (the raw quoted printable, for example), but for 8bit we can only produce a Python string if we do the unicode decode...so that's what we do.

You will also note that the payload in this case really *is* utf-8, whereas in your example it was unicode...and what the Python 3 email package does with a unicode payload is not well defined and is definitely buggy.

I'm going to close this issue, because dealing with the vagaries of 8bit with string input is on my master list of things to tackle this summer, and will be dealt with in the context of other changes.

----------
resolution:  -> invalid
stage:  -> committed/rejected
status: open -> closed
versions:  -Python 3.3
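The two views of the payload described above can be put side by side in a short script. This is a sketch under the assumption that the message arrives as bytes and is not multipart; the helper name payload_views and the 'ascii' fallback charset are illustrative and not part of the email package.

```python
from email import message_from_bytes

RAW = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Disposition: inline\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xc3\xbcnic\xc3\xb6de data.."
)


def payload_views(raw):
    """Return (wire_bytes, text) for a non-multipart message given as bytes.

    Hypothetical helper, written only to illustrate the thread.
    """
    msg = message_from_bytes(raw)
    charset = msg.get_content_charset() or "ascii"  # fallback is an assumption
    wire_bytes = msg.get_payload(decode=True)       # CTE undone, still bytes
    text = wire_bytes.decode(charset, "replace")    # charset applied by hand
    return wire_bytes, text


wire_bytes, text = payload_views(RAW)
print(wire_bytes)  # b'\xc3\xbcnic\xc3\xb6de data..'
print(text)        # ünicöde data..

# Without decode=True the package performs the charset decode itself.
print(message_from_bytes(RAW).get_payload())  # ünicöde data..
```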
Marko Lalic added the comment:

Thank you for your reply.

Unfortunately, I have a use case where message_from_bytes has a significant disadvantage. I have to parse the received message and then forward it completely unchanged, apart from possibly adding a few new headers. The problem with message_from_bytes is that it changes the Content-Transfer-Encoding header to base64 (and consequently base64 encodes the content).

Do you have a suggestion for how to solve this problem at present? A possible solution I can spot from your answer is to check the Content-Transfer-Encoding before getting the payload and use the version without decode=True when it is 8bit. Maybe there is something more elegant?

Thank you in advance.
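The workaround proposed in this message (check the Content-Transfer-Encoding first, and skip decode=True when it is 8bit) might look roughly like the sketch below. It assumes a non-multipart message parsed with message_from_bytes; the helper name payload_as_text and the 'ascii' fallback charset are hypothetical.

```python
from email import message_from_bytes


def payload_as_text(msg):
    """Return the text of a non-multipart message (hypothetical helper)."""
    cte = str(msg.get("Content-Transfer-Encoding", "")).strip().lower()
    if cte == "8bit":
        # Without decode=True the package applies the charset itself.
        return msg.get_payload()
    charset = msg.get_content_charset() or "ascii"  # fallback is an assumption
    # For ASCII transfer encodings, undo the CTE and then apply the charset.
    return msg.get_payload(decode=True).decode(charset, "replace")


eight_bit = message_from_bytes(
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xc3\xbcnic\xc3\xb6de data.."
)
quoted = message_from_bytes(
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"=C3=BCnic=C3=B6de data.."
)

print(payload_as_text(eight_bit))  # ünicöde data..
print(payload_as_text(quoted))     # ünicöde data..
```

The helper only affects how the payload is read; it does not change how the message is serialized when it is forwarded.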
R. David Murray added the comment:

If all you are changing is headers (and you don't change the CTE), then when you use BytesGenerator to re-serialize the message, it is supposed to preserve the existing CTE/payload. (Whether or not you call get_payload, regardless of arguments, does not matter; get_payload does not modify the Message object...though set_payload does, of course.) If you have a case where the payload is being re-encoded even though you have not changed the content-type or content-transfer-encoding headers or the payload, then that is a bug.

Of course, if you use just Generator (which is what str uses), the output message must be in ASCII, so in that case it does indeed transcode 8bit payloads to base64.
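The forwarding case described above can be sketched as follows: parse from bytes, add a header, and re-serialize with BytesGenerator. The header name X-Forwarded-By and its value are made up for illustration, and the sample message is an assumption rather than one taken from the report.

```python
from email import message_from_bytes
from email.generator import BytesGenerator
from io import BytesIO

RAW = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xc3\xbcnic\xc3\xb6de data..\n"
)

msg = message_from_bytes(RAW)
msg["X-Forwarded-By"] = "relay.example"  # hypothetical header, for illustration

# Binary serialization: the 8bit payload and CTE header are written back as-is.
buf = BytesIO()
BytesGenerator(buf).flatten(msg)
out = buf.getvalue()
print(out.decode("utf-8"))

# Text serialization (what str(msg) uses) has to be ASCII, so the 8bit
# payload is transcoded to base64 in the output.
text_form = msg.as_string()
print(text_form)
```

Comparing the two outputs shows the behaviour discussed in the thread: the bytes form keeps the 8bit body, while the string form carries a base64 Content-Transfer-Encoding.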