R. David Murray <rdmur...@bitdance.com> added the comment:

OK, I'm not entirely sure I want to post this, but....

Antoine and I were having a conversation about nntplib and email and I noted 
that unicode as an email transmission channel acts as if it required 7bit clean 
data.  That is, there's no way to use unicode as an 8bit data transmission 
channel.  Antoine pointed out that there is PEP 383, and that he is using that 
in his nntplib update to tunnel 8bit data (if there is any) from and back to 
the nntp server.  I said I couldn't do that with email because I not only 
needed to transmit the data, I also needed to *parse* it.

Antoine pointed out that you can in fact parse a header even if it has 
surrogateescape code points in it.

So I started thinking about that.  In point of fact, from the point of view of 
an email parser, non-ASCII bytes are pretty much opaque.  They don't affect the 
semantics of the parsing.  Either they are invalid data (in headers), or they 
are opaque content data (8bit Content-Transfer-Encoding).

So...I came up with a horrible little hack, which is attached here as a patch.  
This is horrible because it is a perversion of the Python3 desire to make a 
clean separation between bytes and strings.  The only thing it really has to 
recommend it is that it works: it allows email5 (the version of email currently 
in Python3) to read wire-format messages and parse them into valid message 
structures.

The patch is a proof of concept and is far from complete.  It handles only 
message bodies (but those are the most important) and has no doc updates and 
only one test.  If this approach is deemed worth considering, I will flesh out 
the tests and make sure the corner cases are handled correctly, and write docs 
with lots of notes about why this is perverse and email6 will make it all 
better :)

I feel bad about posting this both because it is an ugly hack and because it 
will likely slow down email6 development (because it will make email5 mostly 
work).  But making email5 mostly work in 3.2 seems like a case where 
practicality beats purity.

The essence of the hack is as follows: Given binary data we decode it as ASCII 
using the surrogateescape error handler.  Then, when a message body is 
retrieved we check to see if there are any surrogates in it, and if there are 
we encode it back to ASCII using surrogateescape, thereby recovering the 
original bytes.  For "Content-Transfer-Encoding: 8bit" parts we can then try to 
decode it using the declared charset, or ASCII with the replace error handler 
if the charset isn't known.  But in any case the original binary data is 
accessible by using 'decode=True' in the call to get_payload.  (NB for those 
not familiar with the API: decode=True refers to decoding the 
Content-Transfer-Encoding, *not* decoding to unicode...which means after CTE 
decoding you end up with a byte string).
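
To make the round trip concrete, here is a minimal sketch of the mechanism 
described above.  It is not code from the attached patch: the helper names 
(has_surrogates, recover_bytes, body_as_text) are hypothetical, and using the 
replace error handler with a known charset is an assumption.

```python
# Illustrative sketch only; the helper names are hypothetical, not from the patch.

raw = b"caf\xe9 au lait"  # 8bit body bytes as they might arrive on the wire

# bytes -> str: every non-ASCII byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("ascii", "surrogateescape")


def has_surrogates(s):
    """Return True if s contains code points produced by surrogateescape."""
    return any("\udc80" <= ch <= "\udcff" for ch in s)


def recover_bytes(s):
    """str -> bytes: turn the escaped surrogates back into the original bytes."""
    return s.encode("ascii", "surrogateescape")


def body_as_text(s, charset=None):
    """Decode an 8bit body with the declared charset, else ASCII/replace.

    Assumption: the replace handler is also used when the charset is known.
    """
    if not has_surrogates(s):
        return s
    data = recover_bytes(s)
    try:
        return data.decode(charset or "ascii", "replace")
    except LookupError:
        # Unknown charset name: fall back to ASCII with replacement characters.
        return data.decode("ascii", "replace")
```

With these definitions, recover_bytes(text) gives back raw unchanged, while 
body_as_text(text, "latin-1") yields readable text and an unknown charset 
degrades to U+FFFD replacement characters.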

For headers, which are not supposed to have 8bit data in them, the best we can 
do is re-decode them with ASCII/replace, but at least it will be possible to 
parse the messages.  (The current patch doesn't do this.)
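
A rough sketch of what re-decoding a header with ASCII/replace could look like 
follows.  Since the current patch does not do this, everything here is an 
assumption: the sanitize_header name is hypothetical and the header split is 
deliberately naive.

```python
# Illustrative sketch only; sanitize_header is a hypothetical helper.

raw_header = b"Subject: caf\xe9"  # invalid: raw 8bit data in a header

# The header still parses structurally even with an escaped byte in it.
header = raw_header.decode("ascii", "surrogateescape")
name, _, value = header.partition(":")
value = value.strip()


def sanitize_header(value):
    """Re-decode a header value so escaped bytes become U+FFFD."""
    return value.encode("ascii", "surrogateescape").decode("ascii", "replace")


clean_value = sanitize_header(value)
```

The header name and the ASCII part of the value survive intact; only the 
offending byte is replaced.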

Another thing missing from the current patch is the generator side.  But since 
the binary data for the message content is now available, it should be possible 
to have a generator that outputs binary.

Note that in this patch I've introduced new functions/methods for getting 
binary string data in, but for file input one needs to open the file as text 
using ASCII encoding and the surrogateescape error handler.
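
For file input, the idea can be sketched like this.  The sample message and 
temporary file are made up for illustration, and message_from_file is the 
existing text-based entry point, not one of the new functions the patch 
introduces.

```python
# Illustrative sketch only; the sample message is invented for the example.
import os
import tempfile
from email import message_from_file

raw = (b"From: a@example.com\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"caf\xe9\n")

# Write the wire-format bytes to a temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

try:
    # Open as text, smuggling non-ASCII bytes through as lone surrogates.
    with open(path, encoding="ascii", errors="surrogateescape",
              newline="") as fp:
        text = fp.read()
        fp.seek(0)
        msg = message_from_file(fp)
finally:
    os.remove(path)

# The text read from the file still encodes back to the original bytes.
round_tripped = text.encode("ascii", "surrogateescape")
```

Nothing is lost by reading the file as text this way: round_tripped equals the 
bytes originally written, and the headers parse as usual.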

I've only done minimal testing on this (obviously), and so I may find a 
showstopper somewhere along the way, but so far it seems to work, and logically 
it seems like it should work.

I don't know if that makes me happy or sad :)

----------
keywords: +patch
versions:  -Python 3.1
Added file: http://bugs.python.org/file18962/email_parse_bytes.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4661>
_______________________________________
