[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread Marko Lalic

New submission from Marko Lalic:

When the message's Content-Transfer-Encoding is set to 8bit, the 
get_payload(decode=True) method returns the payload encoded using 
raw-unicode-escape. This means that it is impossible to decode the returned 
bytes using the content charset obtained by the get_content_charset method.

It seems this should be fixed so that get_payload returns the bytes as found in 
the payload when Content-Transfer-Encoding is 8bit, exactly like Python2.7 
handles it.

>>> from email import message_from_string
>>> message = message_from_string("""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
... 
... ünicöde data..""")
>>> message.get_content_charset()
'utf-8'
>>> message.get_payload(decode=True)
b'\xfcnic\xf6de data..'
>>> message.get_payload(decode=True).decode(message.get_content_charset())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte
>>> message.get_payload(decode=True).decode('raw-unicode-escape')
'ünicöde data..'

--
components: email
messages: 191526
nosy: barry, mlalic, r.david.murray
priority: normal
severity: normal
status: open
title: get_payload method returns bytes which cannot be decoded using the 
message's charset
type: behavior
versions: Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18271
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

>>> message.get_payload(decode=True).decode('latin1')
'ünicöde data..'

--
nosy: +serhiy.storchaka
versions: +Python 3.4




[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread Marko Lalic

Marko Lalic added the comment:

That will work fine as long as the characters are actually latin. We cannot 
forget the rest of the unicode character planes. Consider::

>>> message = message_from_string("""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
... 
... 한글ᥡ╥ສए""")
>>> message.get_payload(decode=True).decode('latin1')
'\\ud55c\\uae00\\u1961\\u2565\\u0eaa\\u090f'
>>> message.get_payload(decode=True).decode('raw-unicode-escape')
'한글ᥡ╥ສए'

However, even if latin1 did work, the main point is that a different encoding 
than the one the message specifies must be used in order to decode the bytes to 
a unicode string.
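
The byte values in the sessions above can be checked without the email package. The following is a minimal sketch (the variable names are illustrative only) comparing the raw-unicode-escape and UTF-8 encodings of the same text. It suggests why latin1 happens to work for code points below 256 and stops working beyond them.

```python
# Illustrative only: compare the encodings discussed in this thread.
latin_text = "ünicöde data.."
other_text = "한글"

# raw-unicode-escape writes code points below 256 as single Latin-1 bytes...
escaped = latin_text.encode("raw-unicode-escape")
# ...and everything above that as literal \uXXXX escape sequences.
escaped_other = other_text.encode("raw-unicode-escape")

# UTF-8, the charset the message declares, yields different bytes.
utf8 = latin_text.encode("utf-8")

print(escaped)        # b'\xfcnic\xf6de data..'
print(escaped_other)  # b'\\ud55c\\uae00'
print(utf8)           # b'\xc3\xbcnic\xc3\xb6de data..'

# Decoding the escaped bytes as latin1 leaves the backslash escapes in place;
# only raw-unicode-escape turns them back into the original characters.
print(escaped_other.decode("latin1") == other_text)              # False
print(escaped_other.decode("raw-unicode-escape") == other_text)  # True
```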

--




[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread R. David Murray

R. David Murray added the comment:

The python3 email package's handling of 8bit definitely has quirks.  (So did 
the python2 email package's, but they were different quirks. :)

You can't correctly handle 8bit unless you use message_from_bytes and take the 
input from a byte string.  It is a good question what should be done with a 
unicode string that claims its payload is 8bit...since that situation can't 
arise on the wire (or in a disk file), perhaps it should produce an exception 
("message must be parsed as binary data"?)  The problem with that idea is that 
the email parser promises to never raise errors, but always produce *some* sort 
of model from the input, possibly with defects attached.

All that aside, here is what you want to be doing:

>>> from email import message_from_bytes
>>> message = message_from_bytes(b"""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
... 
... \xc3\xbcnic\xc3\xb6de data..""")
>>> message.get_content_charset()
'utf-8'
>>> message.get_payload(decode=True)
b'\xc3\xbcnic\xc3\xb6de data..'
>>> message.get_payload(decode=True).decode('utf-8')
'ünicöde data..'
>>> message.get_payload()
'ünicöde data..'

You will note that get_payload without the decode automatically does the 
charset decode.  I know this is counter-intuitive, but we are dealing with a 
legacy API that I had to retrofit.  Think of decode=True as "produce binary 
from the wire content transfer encoding", and decode=False as "produce the 
string representation of the payload".  For ASCII content-transfer-encodings, 
this is more intuitive (the raw quoted printable, for example), but for 8bit we 
can only produce a python string if we do the unicode decode...so that's what 
we do.
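
To make the decode=True / decode=False distinction concrete, here is a small sketch that parses the same text once as quoted-printable and once as 8bit. The message contents and variable names are made up for illustration; only message_from_bytes and get_payload come from the email package.

```python
from email import message_from_bytes

HEADERS = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
)

# Same text, once with an ASCII CTE (quoted-printable) and once as 8bit.
qp_msg = message_from_bytes(
    HEADERS + b"Content-Transfer-Encoding: quoted-printable\n\n"
    b"=C3=BCnic=C3=B6de data.."
)
raw_msg = message_from_bytes(
    HEADERS + b"Content-Transfer-Encoding: 8bit\n\n"
    b"\xc3\xbcnic\xc3\xb6de data.."
)

# decode=True: undo the content transfer encoding and return bytes.
qp_bytes = qp_msg.get_payload(decode=True)
raw_bytes = raw_msg.get_payload(decode=True)

# decode=False (the default): return a str.
qp_text = qp_msg.get_payload()    # the raw quoted-printable text
raw_text = raw_msg.get_payload()  # for 8bit, the charset-decoded text

print(qp_bytes == raw_bytes)  # True: both are the UTF-8 bytes
print(qp_text)                # =C3=BCnic=C3=B6de data..
print(raw_text)               # ünicöde data..
```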

You will also note that the payload in this case really *is* utf-8, whereas in 
your example it was unicode...and what the python3 email package does with a 
unicode payload is not well defined and is definitely buggy.

I'm going to close this issue, because dealing with the vagaries of 8bit with 
string input is on my master list of things to tackle this summer, and will be 
dealt with in the context of other changes.

--
resolution:  -> invalid
stage:  -> committed/rejected
status: open -> closed
versions:  -Python 3.3




[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread Marko Lalic

Marko Lalic added the comment:

Thank you for your reply.

Unfortunately, I have a use case where message_from_bytes has a significant 
disadvantage. I have to parse the received message and then forward it 
completely unchanged, apart from possibly adding a few new headers. The problem 
with message_from_bytes is that it changes the Content-Transfer-Encoding header 
to base64 (and consequently base64 encodes the content).

Do you possibly have a suggestion how to currently go about solving this 
problem? A possible solution I can spot from your answer is to check the 
Content-Transfer-Encoding before getting the payload and use the version 
without decode=True when it is 8bit. Maybe there is something more elegant?
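
One possible reading of that workaround, as a rough sketch only: payload_as_text is a hypothetical helper (not part of the email package), it assumes a non-multipart message parsed with message_from_string, and it has not been checked against every Python 3 release.

```python
from email import message_from_string


def payload_as_text(msg):
    """Return the text of a non-multipart message parsed from a str.

    Hypothetical helper for illustration; not part of the email package.
    """
    cte = str(msg.get("Content-Transfer-Encoding", "")).lower()
    if cte == "8bit":
        # Skip decode=True: for 8bit, get_payload() already returns a str.
        return msg.get_payload()
    # For other encodings, undo the CTE and decode with the declared charset.
    charset = msg.get_content_charset("ascii")
    return msg.get_payload(decode=True).decode(charset, "replace")


msg = message_from_string(
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "한글"
)
text = payload_as_text(msg)
print(text == "한글")
```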

Thank you in advance.

--




[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread R. David Murray

R. David Murray added the comment:

If all you are changing is headers (and you don't change the CTE), then when 
you use BytesGenerator to re-serialize the message, it is supposed to preserve 
the existing CTE/payload.  (Whether or not you call get_payload, regardless of 
arguments, does not matter; get_payload does not modify the Message 
object...though set_payload does, of course).

If you have a case where the payload is being re-encoded even though you have 
not changed the content-type or content-transfer-encoding headers or the 
payload, then that is a bug.

Of course, if you use just Generator (which is what str uses), the output 
message must be in ASCII, so in that case it does indeed transcode 8bit 
payloads to base64.
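
The difference between the two generators can be sketched as follows. The message bytes and the X-Example-Header name are invented for illustration; BytesGenerator, Generator and flatten are the email.generator API referred to above.

```python
from email import message_from_bytes
from email.generator import BytesGenerator, Generator
from io import BytesIO, StringIO

original = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"\xc3\xbcnic\xc3\xb6de data.."
)
msg = message_from_bytes(original)
msg["X-Example-Header"] = "added"  # hypothetical header, for illustration

# Binary serialization: the CTE header and payload bytes are written back as-is.
bytes_out = BytesIO()
BytesGenerator(bytes_out).flatten(msg)
as_bytes = bytes_out.getvalue()

# Text serialization: the output must be ASCII, so the 8bit body is transcoded.
text_out = StringIO()
Generator(text_out).flatten(msg)
as_text = text_out.getvalue()

print(b"Content-Transfer-Encoding: 8bit" in as_bytes)   # True
print("Content-Transfer-Encoding: base64" in as_text)   # True
```

For the forwarding use case described earlier in the thread, this would suggest parsing with message_from_bytes and writing with BytesGenerator, rather than going through str().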

--
