On Thu, 16 Sep 2010 16:51:58 -0400
"R. David Murray"<rdmur...@bitdance.com> wrote:
> What do we store in the model? We could say that the model is always
> text. But then we lose information about the original bytes message,
> and we can't reproduce it. For various reasons (mailman being a big one),
> this is not acceptable. So we could say that the model is always bytes.
> But we want access to (for example) the header values as text, so header
> lookup should take string keys and return string values[2].
Why can't you have both in a single class? If you create the class
using a bytes source (a raw message sent by SMTP, for example), the
class automatically parses and decodes it to unicode strings; if you
create the class using a unicode source (the text body of the e-mail
message and the list of recipients, for example), the class
automatically creates the bytes representation.
(Of course, all processing can be done lazily for performance reasons.)
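
To make the single-class idea concrete, here is a minimal sketch. The
LazyMessage class and its method names are hypothetical (they are not
part of the email package); it only leans on the standard library's
email.message_from_bytes and email.message.EmailMessage to do the
actual parsing and generation, and defers both until they are needed.

```python
import email
from email.message import EmailMessage


class LazyMessage:
    """Hypothetical single class built from either bytes or text."""

    def __init__(self, raw=None, parsed=None):
        self._raw = raw        # whole wire-format message as bytes, if known
        self._parsed = parsed  # parsed message object, if known

    @classmethod
    def from_bytes(cls, raw):
        # Raw message (e.g. received over SMTP); parsing is deferred.
        return cls(raw=raw)

    @classmethod
    def from_text(cls, body, sender, recipients):
        # Text body plus addressing; the bytes form is deferred.
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = ", ".join(recipients)
        msg.set_content(body)
        return cls(parsed=msg)

    @property
    def parsed(self):
        # Parse the raw bytes only on first access.
        if self._parsed is None:
            self._parsed = email.message_from_bytes(self._raw)
        return self._parsed

    def header(self, name):
        # Header lookup takes a string key and returns a string value.
        return self.parsed[name]

    def as_bytes(self):
        # Generate the wire format only on first access.
        if self._raw is None:
            self._raw = self._parsed.as_bytes()
        return self._raw


received = LazyMessage.from_bytes(
    b"From: a@example.com\r\nSubject: hi\r\n\r\nbody\r\n"
)
composed = LazyMessage.from_text("hello", "a@example.com", ["b@example.com"])
```

Callers never have to care which constructor was used: header() and
as_bytes() behave the same way on both objects.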
> What about email files on disk? They could be bytes, or they could be,
> effectively, text (for example, utf-8 encoded).
Such a file can be one of two things:
- the raw encoding of a whole message (including headers, etc.), in
  which case it should be fed as a bytes object
- the single text body of a hypothetical message, in which case it
  should be fed as a unicode object
I don't see any possible middle ground.
> On disk, using utf-8,
> one might store the text representation of the message, rather than
> the wire-format (ASCII encoded) version. We might want to write such
> messages from scratch.
But then the user knows the encoding (by "user" I mean what/whoever
calls the email API) and passes it to the email package.
What I'm having an issue with is that you are talking about a bytes
representation and a unicode representation of a message. But they
aren't representations of the same thing:
- if it's a bytes representation, it will be the whole, raw message
  including envelope / headers (also, MIME sections etc.)
- if it's a unicode representation, it will only be a section of the
  message decodable as such (a text/plain MIME section, for example;
  or a decoded header value; or even a single e-mail address part of a
  decoded header)
So, there doesn't seem to be any reason for having both a BytesMessage
and a UnicodeMessage at the same abstraction level. They represent
different things at different abstraction levels. I don't
see any potential for confusion: raw assembled e-mail message = bytes;
decoded text section of a message = unicode.
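
The two levels can be illustrated with the standard library's email
package as it exists today. This is only a sketch: the sample message
and its addresses are made up for the example.

```python
import email

# Whole assembled message, as it would travel on the wire: bytes.
raw = (
    b"From: sender@example.com\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xc3\xa9\r\n"
)

msg = email.message_from_bytes(raw)

# One text/plain section of that message, decoded: unicode.
payload = msg.get_payload(decode=True)       # still bytes at this point
charset = msg.get_content_charset("utf-8")   # charset declared by the section
text = payload.decode(charset)
```

The bytes object describes the whole message; the str object only ever
describes one decoded section of it.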
As for the problem of potential "bogus" raw e-mail data
(e.g., undecodable headers), well, I guess the library has to make a
choice between purity and practicality, or perhaps let the user choose
for themselves, for example through a `strict` flag. If `strict` is true,
raise an error as soon as a non-decodable byte appears in a header; if
`strict` is false, decode it through a default (encoding, errors)
convention which can be overridden by the user (a sensible possibility
being "utf-8, surrogateescape" to allow for lossless round-tripping).
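
A rough sketch of what such a flag could look like follows. The
function name decode_header_bytes and its signature are hypothetical,
and treating "strict" as "must be pure ASCII" is an assumption made for
the example, not something the email package defines.

```python
def decode_header_bytes(raw, strict=False,
                        encoding="utf-8", errors="surrogateescape"):
    """Decode one raw header line to str (hypothetical helper)."""
    if strict:
        # Assumption: strict mode accepts ASCII only, so any other byte
        # raises UnicodeDecodeError immediately.
        return raw.decode("ascii")
    # Lenient mode: caller-overridable (encoding, errors) convention.
    # With "surrogateescape", undecodable bytes survive a round trip.
    return raw.decode(encoding, errors)


bogus = b"From: bogus\xffname"
lenient = decode_header_bytes(bogus)
round_tripped = lenient.encode("utf-8", "surrogateescape")

try:
    decode_header_bytes(bogus, strict=True)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

In lenient mode the bogus byte is carried as a lone surrogate and
re-encoding gives back the original bytes; in strict mode the same
input is rejected.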
> As I said above, we could insist that files on
> disk be in wire-format, and for many applications that would work fine,
> but I think people would get mad at us if we didn't support text files[3].
Again, these simply seem to be two different abstraction levels:
pre-generated raw e-mail messages including headers, or a single piece
of text waiting to be embedded in an actual e-mail.
> Anyway, what polymorphism means in email is that if you put in bytes,
> you get a BytesMessage, if you put in strings you get a StringMessage,
> and if you want the other one you convert.
And then you have two separate worlds, even though the same concepts
ultimately underlie both. A library accepting BytesMessage will crash
when a program wants to give it a StringMessage, and vice versa. That
doesn't sound very practical.
> [1] Now that surrogateescape exists, one might suppose that strings
> could be used as an 8bit channel, but that only works if you don't need
> to *parse* the non-ASCII data, just transmit it.
Well, you can parse it, precisely. Not only that, but it round-trips if
you unparse it again:
>>> header_bytes = b"From: bogus\xFFname<some...@python.com>"
>>> name, value = header_bytes.decode("utf-8", "surrogateescape").split(":")
>>> name
'From'
>>> value
' bogus\udcffname<some...@python.com>'
>>> "{0}:{1}".format(name, value).encode("utf-8", "surrogateescape")
b'From: bogus\xffname<some...@python.com>'
In the end, what I would call a polymorphic best practice is "try to
avoid bytes/str polymorphism if your domain is well-defined
enough" (which I admit URLs aren't necessarily; but there's no
question a single text/XXX e-mail section is text, and a whole
assembled e-mail message is bytes).
Regards
Antoine.