On Tue, 05 Oct 2010 22:05:33 +1000, Nick Coghlan wrote:
> On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull <step...@xemacs.org>
> wrote:
> > R. David Murray writes:
> > > Only if the email package contains a coding error would the
> > > surrogates escape and cause problems for user code.
> >
> > I don't think it is reasonable to internalize surrogates that way;
> > some applications *will* want to look at them and do something useful
> > with them (delete them or replace them with U+FFFD or ...). However,
> > I argue below that the presence of surrogates already means the user
> > code is under fire, and this puts the problem in a canonical form so
> > the user code can prepare for it (if that is desirable).
>
> Hang on here, this objection doesn't seem to quite mesh with what RDM
> is proposing (and the similar trick I am considering for
> urllib.parse).

[snip Nick's clear explanation of the issue and using surrogates to
allow string-based algorithms to work]

> My understanding is that email6 in 3.3 will essentially follow that
> same model. What I believe RDM is suggesting is an in-between approach
> for the 3.2 email module:
>
> - if you pass in bytes data that isn't 7-bit clean and naively use the
> str APIs to access the headers, then it will complain loudly if it is
> about to return escaped data (but will decode the body in accordance
> with the Content Transfer Encoding)

Almost correct. What it will do when it does not have the information
needed to decode the bytes correctly (i.e., the message is not RFC
compliant) is replace the unknown bytes with '?' characters. This
means that you can render a "dirty" email to the terminal, for example,
and the invalid bytes will show as '?'s.[*]

> - if you pass in bytes data and know what you are doing, then you can
> access that raw bytes data and do your own decoding

With the current patch this is a true statement for message bodies, but
not for message headers. There is no easy way to add access to the
bytes version of headers to the email5 API, but since any such data
would be non-RFC compliant anyway, that will just have to be good
enough for now.

> I've probably grossly oversimplified what RDM is suggesting, but it
> sounds plausible as a useful interim stepping stone to the more
> comprehensive type separation in email6.

The more I look at the patch, the more I think this can be an internal
implementation detail in email5, just like you might do for urllib. So
the email5 API will have a way to put bytes in, a way to get decoded
data out, and a way to get bytes out (except for individual header
values). The model object will be the same no matter what you put in
or take out.
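
To make the "bytes in, bytes out" shape concrete, here is a minimal
sketch. It uses message_from_bytes and BytesGenerator as they exist in
the current Python standard library, which is an assumption: the
behaviour of the patch under discussion may differ in detail. The
sample message and its addresses are made up for illustration.

```python
import io
from email import message_from_bytes
from email.generator import BytesGenerator

# A hypothetical "dirty" message: the Subject header carries a raw 0xE9
# byte with no RFC 2047 encoding, so its charset is unknown.
raw = (
    b"From: sender@example.com\n"
    b"Subject: caf\xe9\n"
    b"\n"
    b"Hello\n"
)

# Bytes in: parse straight from bytes.
msg = message_from_bytes(raw)

# Bytes out (body): decode=True returns the payload as bytes.
body = msg.get_payload(decode=True)

# Bytes out (whole message): serialize the same model object to bytes.
buf = io.BytesIO()
BytesGenerator(buf).flatten(msg)
flattened = buf.getvalue()
```

In this sketch the flattened output reproduces the input bytes,
including the undecodable byte in the header, even though the model
object never exposes that header value as bytes.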

The additional methods added to the email5 API to make this possible
will be:

    message_from_bytes (and Parser.parsebytes)
    message_from_binary_file
    Feedparser.feedbytes
    BytesGenerator

message_from_bytes and message_from_binary_file are currently part of
the proposed email6 API, and I was thinking about some version of
Feedparser.feedbytes[**]. BytesGenerator wasn't, but now perhaps it
will be (and certainly will be in the backward compatibility
interface).

--
R. David Murray                                      www.bitdance.com

[*] Why '?' and not the Unicode "invalid character" character? Well,
the email5 Generator.flatten can be used to generate data for
transmission over the wire *if* the source is RFC compliant and
7bit-only, and this would be a normal email5 usage pattern (that is,
smtplib.SMTP.sendmail expects ASCII-only strings as input!). So the
data generated by Generator.flatten should not include Unicode... which
raises a problem for CTE 8bit sections that the patch doesn't currently
address.

[**] Benjamin asked how the patch would affect backward compatibility
support in email6, and I said it wouldn't make it harder. However, if
feedbytes calls can be mixed with feed calls, which in the simplest
implementation they could be, then if email6 does *not* use surrogates
internally, its feedparser algorithm would need to be considerably more
complicated to be backward compatible with this. So when I add
Feedparser.feedbytes to my patch, I am at least initially going to
disallow mixing calls to feed and feedbytes. Which is another reason
to add that method: it keeps the use of surrogateescape an
implementation detail.
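
As a rough illustration of the surrogateescape mechanism and the '?'
substitution mentioned in the footnotes, here is a small sketch. It is
not the patch's code: the helper name sanitize and the sample header
are hypothetical, and only the standard "surrogateescape" codec error
handler is assumed.

```python
# Hypothetical header line with one byte (0xE9) that is not valid ASCII.
raw = b"Subject: caf\xe9\n"

# Decoding with surrogateescape maps each undecodable byte 0x80..0xFF
# to a lone surrogate in the range U+DC80..U+DCFF, so string-based
# algorithms can operate on the result.
escaped = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = escaped.encode("ascii", errors="surrogateescape")


def sanitize(text):
    """Replace surrogate-escaped bytes with '?' so the result is plain ASCII-safe text."""
    return "".join("?" if "\udc80" <= ch <= "\udcff" else ch for ch in text)


printable = sanitize(escaped)
```

The str form stays usable for display once sanitized, while the escaped
form still carries enough information to regenerate the original bytes.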