Re: [Python-Dev] Patch making the current email package (mostly) support bytes
Steven D'Aprano writes: I don't think anyone has ever suggested change for change's sake. If they have, I'd love to read the PEP for it. Not to mention the BDFL's pronouncement message! <wink> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
ba...@python.org wrote in the full post below: I'm reminded of a survey Guido conducted at some long past Python conference. He asked (paraphrasing): raise your hand if you think Python is changing too fast. Lots of hands went up. Then he asked, raise your hand if you have a feature you want to get in the next version. Lots of hands went up. When? I doubt that you'd get the same reaction today given the schism that 3.X has created. Regardless, this underscores much of what I'm trying to get across here. Python conference attendees are hardly representative of the user base at large. Even today, they are probably just 0.1% of the whole. This list's readership is an order of magnitude smaller still. Open doesn't mean all that much to those outside the 0.01% whose preferences set the agenda. I appreciate that some people here do indeed weigh compatibility carefully, and realize that there are multiple valid viewpoints on this issue. And regrettably, I have neither solutions nor time to give this thread the further attention it deserves. So my point is just this: Change for change's sake is truly not what most Python users want. If Python core developers want 3.X to become as popular as 2.X, they should be less concerned with posts on this list or hands at a conference, than with the feet of the masses whose votes will ultimately decide 3.X's fate. --Mark Lutz (http://learning-python.com, http://rmi.net/~lutz) Date: Fri, 8 Oct 2010 14:20:32 -0400 From: Barry Warsaw ba...@python.org To: python-dev@python.org Subject: Re: [Python-Dev] Patch making the current email package (mostly) support bytes On Oct 08, 2010, at 03:44 PM, l...@rmi.net wrote: Ultimately, development in the open source world is driven by the very few with time to show up, rather than by the very many who depend on it. This can unfortunately lead to the perception of thrashing by end users. Some even come to see the net effect as not that much different from closed models. 
I have no solution to offer, except to underscore again that changes made here affect very many people who are too busy using Python to participate here. Especially given the still tentative state of 3.X, stability matters. I'm reminded of a survey Guido conducted at some long past Python conference. He asked (paraphrasing): raise your hand if you think Python is changing too fast. Lots of hands went up. Then he asked, raise your hand if you have a feature you want to get in the next version. Lots of hands went up. I'm sympathetic to the view that changes in Python can be disruptive to end users. The Python community itself takes this seriously too, as evidenced by the language moratorium[*]. But OTOH, Python cannot stagnate and even fixing things means changing things. The reality too is that Python releases come out approximately every 18 months, and a year and a half can either seem like an excruciatingly long time, or blink of the eye depending on which side of the fence you stand on. Yes, stability matters, but Python 3 is still a new snakeling and I suspect that as the pace of porting picks up, more changes will be necessary. Adding new modules named like distutils2 or unittest2 is less than satisfying but useful for keeping older APIs around. I'm sad to hear that some people think that our development model differs little from closed source development. To me, nothing could be further from the truth. But the adage does go (s)he who does the work, decides, and this is the forum for those who are doing the work. I think everyone here welcomes advocates for under-represented Python communities, and their concerns should be taken in consideration when changes are discussed. But ultimately, Python must evolve to stay relevant or it will die. This is where competing design trade-offs must be discussed. If not here, by us, then where and by whom? 
-Barry [*] Mostly instituted to allow alternative implementations to catch up, it does necessarily slow the pace of changes visible to end users.
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Wed, 13 Oct 2010 03:01:57 am l...@rmi.net wrote: So my point is just this: Change for change's sake is truly not what most Python users want. If Python core developers want 3.X to become as popular as 2.X, they should be less concerned with posts on this list or hands at a conference, than with the feet of the masses whose votes will ultimately decide 3.X's fate. I don't think anyone has ever suggested change for change's sake. If they have, I'd love to read the PEP for it. -- Steven D'Aprano
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Oct 08, 2010, at 12:37 PM, Stephen J. Turnbull wrote: Ouch. RFC 822 line wrapping is a bytes-bytes transformation, and the client shouldn't see it at all unless it inspects the wire format. Header wrapping sucks even more because it's supposed to take the semantic context into account, which means that a generic Header wrapping algorithm cannot work for everything. E.g. Received: headers are supposed to wrap after the semicolon. The current email package does a pretty poor job of emulating this requirement, though it often gets it right enough. David has plans for addressing this problem. -Barry
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
Barry Warsaw writes: Header wrapping sucks even more because it's supposed to take the semantic context into account, which means that a generic Header wrapping algorithm cannot work for everything. E.g. Received: headers are supposed to wrap after the semicolon. Received headers are an easy special case: An Internet mail program MUST NOT change or delete a Received: line that was previously added to the message header section. (RFC 5321, sec. 4.4) So you save them as bytes and Barry's your FLUFL, as they say. If email wants to *produce* them (as a service to say smtplib), then it wants to comply with the detailed recommendations in RFC 5321, sec. 4.4 anyway; I don't think there's a good reason to treat Received headers as text since they're conceptually part of the wire protocol. (Except for the information of curious users, but then getting it exactly right is best done by just passing the whole thing, folds and all, to .decode('ascii'), I should think.) I should think you *want* addresses and suchlike structured headers (Content-Type with several RFC 2231 parameters, anyone?) to line up nicely, too. So generic folding algorithms are really only applicable to unstructured text fields like Subject and Summary anyway. You can call that sucky if you like, I prefer to call it tasteful. <wink>
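The "save them as bytes" idea for Received headers can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `received_lines` is hypothetical, the input is assumed to be a raw header block with CRLF line endings, and nothing here is part of the email package.

```python
def received_lines(raw_headers):
    """Return each Received: field verbatim as bytes, folds included.

    Hypothetical helper for illustration; assumes CRLF line endings.
    """
    fields = []
    for line in raw_headers.split(b"\r\n"):
        if line[:1] in (b" ", b"\t") and fields:
            # Folded continuation line: glue it back onto the previous
            # field without touching the original whitespace.
            fields[-1] += b"\r\n" + line
        elif line:
            fields.append(line)
    return [f for f in fields if f.lower().startswith(b"received:")]


raw = (b"Received: from a.example by b.example;\r\n"
       b"\tFri, 8 Oct 2010 14:20:32 -0400\r\n"
       b"Subject: hello\r\n")

kept = received_lines(raw)        # bytes, never re-wrapped
shown = kept[0].decode("ascii")   # text only for a curious user
```

Because the stored value is the untouched byte string, writing it back out preserves the field exactly, which is the point of the RFC 5321 requirement quoted above.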
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
Thanks for both your reply and work, David. I'm going to have to test my email clients under the 3.2 patch when it gels. It's good to hear that email5 API support remains a goal. I don't mean to single out this change unfairly, of course. My real concern is not as much with the specific technical aspects of this proposal, as with the generally low priority that backward compatibility sometimes receives on this list. The bytecode file model change in 3.2 comes to mind as another example; sound as it may be, I'm not sure this list has any idea how many users, systems, or docs may be impacted by this. Though not always true, the work here does sometimes appear to be conducted in a vacuum. Ultimately, development in the open source world is driven by the very few with time to show up, rather than by the very many who depend on it. This can unfortunately lead to the perception of thrashing by end users. Some even come to see the net effect as not that much different from closed models. I have no solution to offer, except to underscore again that changes made here affect very many people who are too busy using Python to participate here. Especially given the still tentative state of 3.X, stability matters. --Mark Lutz (http://learning-python.com, http://rmi.net/~lutz) -Original Message- From: R. David Murray rdmur...@bitdance.com To: l...@rmi.net Subject: Re: [Python-Dev] Patch making the current email package (mostly) support bytes Date: Thu, 07 Oct 2010 13:46:02 -0400 On Thu, 07 Oct 2010 16:03:18 -, l...@rmi.net wrote: I'm forwarding a link to the code of these clients to David by private email in case they might be useful as a test case (O'Reilly has already posted them ahead of the book, but they may be a bit too heavy for use in formal testing). Thanks very much. I will take a look, and expect they will be helpful. The email package is obviously less than ideal today, and there are many other clients for it besides my own, of course. 
But making it backward incompatible at this point is likely to be seen as a big negative to newcomers evaluating 3.X viability. And as I tried to make clear in June, this list should carefully weigh the PR cost of pulling the rug out from under those brave souls who have already taken the time to accommodate the 3.X world you've mandated. Well, as I have said before the plan is to provide backward compatibility in email6, so that you only need to change your code if you want to take advantage of improved or new functionality. If this turns out not to be possible for some reason, then we aren't going to suddenly stop supporting email5. That's not the Python Way :) (Example: we added ArgParse post-3.0, and lots of people wanted to deprecate OptParse, but we aren't planning on removing OptParse.) Do you see any issues with the patch I'm proposing? My goal is to make things work that didn't work before, but nothing that worked before should stop working, if I do my job right. The one *potentially* backward-incompatible change that I'm consciously considering (that is, any other backward incompatibilities will be bugs) is having DecodedGenerator fully decode headers and emit full unicode, rather than the ASCII-only unicode that Generator emits. Can you think of any problem that that would cause? A quick grep indicates your own code does not use that generator (possibly because currently it does not do that decoding). I could, of course, only enable header decoding if a flag is passed requesting it, and as I write this I realize that that is indeed what I should do. Even though I haven't been able to think of a case where DecodedGenerator producing non-ASCII unicode would be an issue, that doesn't mean there isn't one :) To put that more strongly, the Python user base is much larger than this list's readership. If I'm using 3.1 email, so are many others. 
People will accept the 3.X world you make up to a point, but it's impossible to code to a moving target, much less base a product on it. At some point, they'll simply stop trying to keep up; in fact, some already have. Fixes are a Good Thing, of course, and this particular change's scope remains to be seen; but to channel most of the users I meet out there in the real world today: Enough with the 3.X changes already, eh? Now that Python3 is out, the backward compatibility policy for it is the same as it always was for Python2. Only the transition from 2 to 3 broke backward compatibility in a significant way. From here on, we are as conservative as we always have been at making backward incompatible changes (that is, we don't do it intentionally without a good reason and a deprecation cycle, and if we do it unintentionally it is a regression and treated as such). -- R. David Murray www.bitdance.com
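The opt-in header decoding David describes above (decode only when a flag asks for it, so existing callers see no change) might look something like this sketch. The function `render_header` and its `decode` flag are hypothetical and are not DecodedGenerator's actual interface; only `decode_header` and `make_header` are real `email.header` calls.

```python
from email.header import decode_header, make_header


def render_header(value, decode=False):
    """Return a header value, decoding RFC 2047 words only on request.

    The `decode` flag is hypothetical; it stands in for the opt-in
    switch discussed in the message and is not an email package API.
    """
    if not decode:
        return value  # default path: behaviour unchanged for old callers
    return str(make_header(decode_header(value)))


encoded = "=?utf-8?b?Q2Fmw6k=?="
unchanged = render_header(encoded)             # still the ASCII-only form
decoded = render_header(encoded, decode=True)  # full unicode text
```

Keeping the default at `False` is the design choice that makes the change backward compatible: code written against the old output keeps working, and only callers that pass the flag see non-ASCII text.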
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
step...@xemacs.org wrote in the full message below: If having 1 *and* 2 is so important to particular users, but they come into conflict because of proposed changes in Python, then they're going to have to give up 3, come here, and articulate their needs. But I _did_ come here and articulate my needs, and received this antagonistic response for my efforts. If you really value user input, you may want to explore the nature of your reaction to it. Trust me: criticism goes with the territory any time your actions impact a large group of people. This seems inherent here. Frankly, your view of the roles of developers and users seems so upside down to me that I doubt anything I could say here would matter. You're more than welcome to ignore an interjection of reality and adopt a closed group mindset, of course, but you do so at the peril of the system you're working on. For my part, one week from now I'll be standing up again in front of a group of 20 Python beginners, and basically apologizing for both the present and ongoing 3.X changes they must conform to in the near future. Python may not be Perl 6 yet, but its image is already tarnished in the real world where people make technology choices, due to its rapid pace of change. It's a genuine problem. In the end, I suppose I'm just one of those lazy end users you mentioned who are too busy to spend 24/7 hanging out on this list in order to head off changes that will break their code. (Yes, sarcasm intended.) --Mark Lutz (http://learning-python.com, http://rmi.net/~lutz) -Original Message- From: Stephen J. Turnbull step...@xemacs.org To: l...@rmi.net Subject: Re: [Python-Dev] Patch making the current email package (mostly)support bytes Date: Fri, 08 Oct 2010 14:33:22 +0900 l...@rmi.net writes: To put that more strongly, the Python user base is much larger than this list's readership. Agreed. 
Nevertheless, this is *the* channel (not *a* channel) that the developers listen on, and substantial effort is made to let Python users know that. I think they do know it, too. If I'm using 3.1 email, so are many others. That's not obvious. 3.1 email is unusable for several applications. In fact, for human factors reasons (humans are very likely to communicate with other humans who use the same encodings, and to accept occasional glitches they must deal with manually), MUAs are likely to port relatively easily as good enough software. But I doubt very much that folks writing MTAs or spam filters that must run unattended, often in long-lived, very active processes, are producing production versions using Python 3 email yet. People will accept the 3.X world you make up to a point, but it's impossible to code to a moving target, much less base a product on it. Impossible is nothing. It's a decision that each individual developer makes for herself. I haven't heard Mailman devs complain about the impossibility of dealing with the proposed changes, for example. Quite the reverse, in fact. At some point, they'll simply stop trying to keep up; in fact, some already have. Predictable and predicted. Where's the balance? I don't know, but channeling the users is not a lot of help. There are three worthy goals here: 1. Taking advantage of improvements in to-be-released Pythons. 2. Not changing one's own working code. 3. Not participating in python-dev/email-sig. Take any two; one can't have all three. More specifically, it's interesting that most of the users you talk to care enough to actually say they don't want more incompatible changes. But what are we supposed to take from that? Some fixes have to be incompatible; do the users want the fix or the compatibility? 
You waffle (as a good representative often must): Fixes are a Good Thing, of course, and this particular change's scope remains to be seen; but to channel most of the users I meet out there in the real world today: Enough with the 3.X changes already, eh? But that's also a decision each developer *can* make for himself. Python does not withdraw products, or even withdraw support, just because the core developers release something they consider better. If having 1 *and* 2 is so important to particular users, but they come into conflict because of proposed changes in Python, then they're going to have to give up 3, come here, and articulate their needs. As you are doing -- but to have real influence, you're going to have to do the review of David's patch that he requests. I really don't see how the process can work any other way.
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
Barry Warsaw writes: On Oct 07, 2010, at 04:40 AM, Stephen J. Turnbull wrote: I'm fairly certain that most of the modern causes of [Unicode errors in Mailman] are post-parse modifications of the message. IOW, in Mailman's architecture, we try to parse the raw data into a Message object tree very early in the pipeline, and then a pickled version of that gets passed between the queue runners. Where we've gotten into trouble before has been things like adding the Subject prefixes and such. Not to mention those wonderful unremovable addresses containing TAB etc. But I'm pretty sure I've seen reports at least in 2.1.9, and probably more recently than that, where there was 8-bit content in a header of the incoming message and Mailman blew up on that. This is stuff that should have been shunted explicitly, but instead managed to get out of the parser and then blow up. I don't think the errors I'm thinking about were due to Mailman manipulations, but rather insufficient paranoia in handling incoming hazmat. That seems like application logic that the email package can't really get involved with, and indeed Mailman has built up a raft of defense for failures of this kind. But adding Subject prefixes and the like shouldn't be a problem as long as the internal representation of each message object (bytes vs str) is fixed and the representation is opaque, so that the module can do appropriate conversions when necessary. The problem that you face in Python 2 is that that separation is not properly made, and the same values in the message object can often serve as text and as wire format, and it's hard to tell which is which. The Unicode handling is tacked on as an afterthought. That mess is entirely unnecessary in Python 3. 
Text and wire format can be easily distinguished with three different representations of email: Unicode for the conceptual RFC 822 layer (of course this is an extension, because RFC 822 itself is strictly limited to the ASCII subset), bytes for wire format, and Message objects for modern structured mail (including MIME, etc). *If* email6 is reengineered with that kind of structure, then you should be able to dispense with almost all of the raft of defense, because the email module will give you well-behaved Message objects, whose text components (including the header) are well-behaved character strings that mix seamlessly with other character strings. Maybe even in email5
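The three representations named above (bytes for wire format, a Message object for structure, str for text) can be seen side by side with the email package as it later shipped. This is a small sketch, not the email6 design under discussion: `email.message_from_bytes` arrived in Python 3.2 and `Message.as_bytes` in 3.4, and the sample message is made up.

```python
import email
from email.message import Message

# Wire format: bytes, exactly as they would arrive from a socket or file.
wire = b"Subject: Hi\r\n\r\nbody text\r\n"

# Structured form: a Message object tree built by the parser.
msg = email.message_from_bytes(wire)

# Text form: header values come back as ordinary str objects that mix
# with other strings, e.g. when adding a list prefix.
subject = msg["Subject"]
prefixed = "[list] " + subject

# Back to wire format when the message has to leave the program.
wire_again = msg.as_bytes()
```

The application only ever concatenates str with str, and bytes appear only at the two edges, which is the separation the message argues Python 2 made hard to keep.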
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Fri, 08 Oct 2010 15:51:45 -, l...@rmi.net wrote: For my part, one week from now I'll be standing up again in front of a group of 20 Python beginners, and basically apologizing for both the present and ongoing 3.X changes they must conform to in the near future. Python may not be Perl 6 yet, but its image is already tarnished in the real world where people make technology choices, due to its rapid pace of change. It's a genuine problem. In the end, I suppose I'm just one of those lazy end users you mentioned who are too busy to spend 24/7 hanging out on this list in order to head off changes that will break their code. (Yes, sarcasm intended.) What would be helpful would be to know what changes it is that we have made between 3.1 and 3.2 that are raising backward compatibility concerns. What are we doing that is perceived as ongoing 3.X changes? Generalities will not help; only by looking at specifics can we re-evaluate our actions. In a private message you mentioned the bytecode file model change, by which I presume you mean PEP 3147. Our view is that this is a backward compatible change: any Python program that was working should continue to work. Barry's original idea was that the new behavior would only be turned on by a flag, but Guido (and others) wanted it to be the default because in his view it is a superior arrangement for normal use. Perhaps we did not fully consider the effect on third party tools (and, as you point out, documentation) that expect .pyc files alongside the .py files. Yet this change is nowhere near the level of change that makes typical Python programs fail. We feel like it is a worthwhile trade-off (and Debian and Ubuntu at least may well backport it to earlier Python versions). But apparently you disagree. So, engage us in dialog about it, please. And *please* mention any other specific changes you think are disruptive between 3.1 and 3.2. 
We need to know about them, preferably *before* we release 3.2 beta (currently targeted for the end of this month). Because I assure you that it is not our policy to be changing things any more rapidly than we did between python 2.x versions[*]. If you feel like you are apologizing to your groups of beginners, it would be wonderful if you could act as their advocate here. Obviously the issues directly affect you, so hopefully it is worth your time to engage us on this topic. And thank you for the messages you have sent. I know they have made me even more careful than I was already trying to be. -- R. David Murray www.bitdance.com [*] There may be a few exceptions to this where the 3.x library code fails to work in real-world applications, so that a more radical change is made but is, in reality, a bug fix. But even there we try to be conservative.
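For third-party tools affected by the PEP 3147 change mentioned above, the usual adjustment is to ask Python where the bytecode file lives instead of assuming it sits next to the source. A brief sketch, assuming a recent CPython where these helpers live in `importlib.util` (in 3.2 itself they were in the `imp` module); the path `pkg/mod.py` is made up:

```python
import importlib.util
import os

source = os.path.join("pkg", "mod.py")

# Under PEP 3147 the .pyc goes into a __pycache__ directory and carries
# an interpreter tag in its name, rather than sitting beside the .py.
cached = importlib.util.cache_from_source(source)

# The mapping is reversible, so a tool can get from a .pyc to its source.
round_trip = importlib.util.source_from_cache(cached)
```

The exact file name depends on the running interpreter's cache tag, so tools should compute it this way and not hard-code it.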
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Sat, 09 Oct 2010 01:06:29 +0900, Stephen J. Turnbull step...@xemacs.org wrote: That mess is entirely unnecessary in Python 3. Text and wire format can be easily distinguished with three different representations of email: Unicode for the conceptual RFC 822 layer (of course this is an extension, because RFC 822 itself is strictly limited to the ASCII subset), bytes for wire format, and Message objects for modern structured mail (including MIME, etc). *If* email6 is reengineered with that kind of structure, then you should be able to dispense with almost all of the raft of defense, because the email module will give you well-behaved Message objects, whose text components (including the header) are well-behaved character strings that mix seamlessly with other character strings. That engineering is pretty much what we are looking at, although in practice I think you have to hang wire-format and text-format bits off of appropriate places in the model in order to keep everything properly coordinated. Maybe even in email5 I suspect that's pushing it. Patches happily accepted, though :) -- R. David Murray www.bitdance.com
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Fri, 08 Oct 2010 15:44:45 -, l...@rmi.net wrote: Thanks for both your reply and work, David. I'm going to have to test my email clients under the 3.2 patch when it gels. It's good to hear that email5 API support remains a goal.
I just landed the patch (though without the MIME encoding of unknown header bytes or the 'yes-I-really-want-the-escaped-bytes' flags that Stephen and I have been discussing). So it will be present in alpha3. I would greatly appreciate your testing it and making sure it doesn't break any of your code.
I don't mean to single out this change unfairly, of course. My real concern is not as much with the specific technical aspects of this proposal, as with the generally low priority that backward compatibility sometimes receives on this list. The bytecode file
I don't perceive that lack of priority myself. Certainly I don't see a lack of priority on backward compatibility in the bug tracker, quite the reverse[*]. As I said in my public email, specific examples would be most helpful.
model change in 3.2 comes to mind as another example; sound as it may be, I'm not sure this list has any idea how many users, systems, or docs may be impacted by this. Though not always true, the work here does sometimes appear to be conducted in a vacuum.
Well, we can only react to the input we find out about. Developers *do* read blogs and such about what's going on in the wider community and bring that info back to python-dev, but as is inherent with projects structured as volunteer efforts, what we get is only what someone decides to put in time on. Specific suggestions on how to improve the feedback loop are always welcome; volunteer efforts to improve our fundamental procedures are just as or perhaps more valuable than volunteer code writing (though they probably involve even more politicking effort :).
Ultimately, development in the open source world is driven by the very few with time to show up, rather than by the very many who depend on it. 
This can unfortunately lead to the perception of thrashing by end users. Some even come to see the net effect as not that much different from closed models. I have no solution
Well, the Python community takes it as a principle to avoid thrashing. So if you see examples where we are failing in that goal, call us on it (with specifics).
to offer, except to underscore again that changes made here affect very many people who are too busy using Python to participate here. Especially given the still tentative state of 3.X, stability matters.
We do try to remain aware of that. When we fail, someone needs to let us know. -- R. David Murray www.bitdance.com [*] I'm currently aware of one exception to this, the nntplib module. It was pretty much unusable as it stood (I tried, as did Antoine; it had no unit tests so massive breakage is not that surprising), so we broke backward compatibility with 3.1 in order to fix that.
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Fri, 08 Oct 2010 23:55:37 +0900, Stephen J. Turnbull step...@xemacs.org wrote: I should think you *want* addresses and suchlike structured headers (Content-Type with several RFC 2231 parameters, anyone?) to line up nicely, too. So generic folding algorithms are really only applicable to unstructured text fields like Subject and Summary anyway. You can call that sucky if you like, I prefer to call it tasteful. No, what's sucky is that email4/5 doesn't support that. It only folds headers as unstructured blobs, with a nod in the direction of structure by breaking lines at obvious places like ';'s. (That line breaking algorithm is the subject of at least one bug report.) I'd like to fix that in email6 by adding full support for structured headers. -- R. David Murray www.bitdance.com
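To make the "breaking lines at obvious places like ';'" approach concrete, here is a deliberately naive sketch of folding a parameterised header at its semicolons. The function `fold_at_semicolons` is hypothetical and is not the email package's algorithm; it also assumes no parameter value contains a quoted ';', which real structured-header support would have to handle by parsing the header properly.

```python
def fold_at_semicolons(name, value, maxlen=78):
    """Fold 'name: value' so that line breaks fall only after ';'.

    Hypothetical illustration; assumes ';' never appears inside a
    quoted parameter value.
    """
    parts = [p.strip() for p in value.split(";")]
    lines = []
    current = name + ": " + parts[0]
    for part in parts[1:]:
        candidate = current + "; " + part
        # Keep one column free for the ';' a later fold would append.
        if len(candidate) + 1 > maxlen:
            lines.append(current + ";")
            current = " " + part  # continuation lines start with a space
        else:
            current = candidate
    lines.append(current)
    return "\r\n".join(lines)


folded = fold_at_semicolons(
    "Content-Type",
    'multipart/mixed; boundary="abc"; charset="utf-8"',
    maxlen=40,
)
```

With `maxlen=40` the example folds once, after `multipart/mixed;`, and keeps both parameters together on the continuation line. Splitting on the delimiter is exactly the "nod in the direction of structure" criticised above, which is why a parser that understands each header's grammar is the proposed fix.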
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Fri, 08 Oct 2010 12:37:38 +0900, Stephen J. Turnbull step...@xemacs.org wrote: *If* you have an 8-bit value of unknown encoding on input, this will appear in the Header's value as a surrogate. Hm, OK, I see the problem ... as usual, it's that the only efficient thing to do is encode using surrogate-escape which loses the information that these are invalid bytes. Would it really be that bad to add an O(length) component where you examine the string for surrogates (and too-long words, for that matter), and chop off those pieces for MIME encoding?
Nope, and that's more or less what I think I'm going to do. But I haven't started writing the code yet.
Presumably you are suggesting that email5 be smart enough to turn my example into properly UTF-8/CTE encoded text. No, in general that's undecidable without asking the originator, although humans can often make a good guess. I was talking about unicode input, though, where you do know (modulo the language differences that unicode hasn't yet sorted out). I don't understand why this is difficult. As far as what Unicode has
It isn't difficult in principle. It's just difficult in email5. -- R. David Murray www.bitdance.com
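The O(length) scan for surrogates proposed above can be sketched as follows. The function `split_escaped` is hypothetical; only the `surrogateescape` error handler (PEP 383) is a real Python feature. How the raw pieces are then MIME-encoded is left open here, since the thread had not settled that.

```python
import re

# surrogateescape maps each undecodable byte 0x80..0xFF to a lone
# surrogate in the range U+DC80..U+DCFF, so this finds exactly those.
_ESCAPED = re.compile(r"[\udc80-\udcff]+")


def split_escaped(text):
    """Split text into ('text', str) and ('raw', bytes) pieces.

    Hypothetical helper: one linear pass over the string.
    """
    pieces = []
    pos = 0
    for m in _ESCAPED.finditer(text):
        if m.start() > pos:
            pieces.append(("text", text[pos:m.start()]))
        # Recover the original bytes that the surrogates stand for.
        pieces.append(("raw", m.group().encode("ascii", "surrogateescape")))
        pos = m.end()
    if pos < len(text):
        pieces.append(("text", text[pos:]))
    return pieces


wire = b"Subject: caf\xe9 menu"               # 8-bit byte, encoding unknown
value = wire.decode("ascii", "surrogateescape")
pieces = split_escaped(value)
restored = value.encode("ascii", "surrogateescape")
```

The "raw" pieces are the ones that would be chopped off for special handling, while the "text" pieces can be treated as ordinary header text. The round trip through `restored` shows that no information about the original bytes is lost along the way.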
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray writes: On Sat, 09 Oct 2010 01:06:29 +0900, Stephen J. Turnbull step...@xemacs.org wrote: That mess is entirely unnecessary in Python 3. Text and wire format can be easily distinguished with three different representations of email: Unicode for the conceptual RFC 822 layer (of course this is an extension, because RFC 822 itself is strictly limited to the ASCII subset), bytes for wire format, and Message objects for modern structured mail (including MIME, etc). That engineering is pretty much what we are looking at, although in practice I think you have to hang wire-format and text-format bits off of appropriate places in the model in order to keep everything properly coordinated. Right. That's where I was going with my comment to Barry about the Received headers. Even if email isn't going to serve clients working with wire format, it needs to deal with those headers. But where I think the headers defined by RFC 822 should be stored as str in email6, I am leaning toward storing Received headers verbatim as bytes (including any RFC 822 folding whitespace) because of the RFC 5321 requirement that they be preserved exactly.
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Oct 08, 2010, at 03:44 PM, l...@rmi.net wrote: Ultimately, development in the open source world is driven by the very few with time to show up, rather than by the very many who depend on it. This can unfortunately lead to the perception of thrashing by end users. Some even come to see the net effect as not that much different from closed models. I have no solution to offer, except to underscore again that changes made here affect very many people who are too busy using Python to participate here. Especially given the still tentative state of 3.X, stability matters. I'm reminded of a survey Guido conducted at some long past Python conference. He asked (paraphrasing): raise your hand if you think Python is changing too fast. Lots of hands went up. Then he asked, raise your hand if you have a feature you want to get in the next version. Lots of hands went up. I'm sympathetic to the view that changes in Python can be disruptive to end users. The Python community itself takes this seriously too, as evidenced by the language moratorium[*]. But OTOH, Python cannot stagnate and even fixing things means changing things. The reality too is that Python releases come out approximately every 18 months, and a year and a half can either seem like an excruciatingly long time, or blink of the eye depending on which side of the fence you stand on. Yes, stability matters, but Python 3 is still a new snakeling and I suspect that as the pace of porting picks up, more changes will be necessary. Adding new modules named like distutils2 or unittest2 is less than satisfying but useful for keeping older APIs around. I'm sad to hear that some people think that our development model differs little from closed source development. To me, nothing could be further from the truth. But the adage does go (s)he who does the work, decides, and this is the forum for those who are doing the work. 
I think everyone here welcomes advocates for under-represented Python communities, and their concerns should be taken into consideration when changes are discussed. But ultimately, Python must evolve to stay relevant or it will die. This is where competing design trade-offs must be discussed. If not here, by us, then where and by whom? -Barry [*] Mostly instituted to allow alternative implementations to catch up, it does necessarily slow the pace of changes visible to end users.
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Sat, 09 Oct 2010 02:48:23 +0900, Stephen J. Turnbull step...@xemacs.org wrote: R. David Murray writes: On Sat, 09 Oct 2010 01:06:29 +0900, Stephen J. Turnbull step...@xemacs.org wrote: That mess is entirely unnecessary in Python 3. Text and wire format can be easily distinguished with three different representations of email: Unicode for the conceptual RFC 822 layer (of course this is an extension, because RFC 822 itself is strictly limited to the ASCII subset), bytes for wire format, and Message objects for modern structured mail (including MIME, etc). That engineering is pretty much what we are looking at, although in practice I think you have to hang wire-format and text-format bits off of appropriate places in the model in order to keep everything properly coordinated. Right. That's where I was going with my comment to Barry about the Received headers. Even if email isn't going to serve clients working with wire format, it needs to deal with those headers. But where I think the headers defined by RFC 822 should be stored as str in email6, I am leaning toward storing Received headers verbatim as bytes (including any RFC 822 folding whitespace) because of the RFC 5321 requirement that they be preserved exactly. Well, the plan for email6 is to *allow* clients to work with wire format, though it will probably be a bit more awkward than working with the text interface. And my current strategy is in general to preserve the input bytes and, as long as the header in question hasn't been modified, emit those bytes when serialization back to bytes is done. My current plan is that conversion to text is only done at the point where text is requested, at which point the conversion is cached for later use. And if the header is modified, the source bytes version is discarded. Conversely if the source of the header was text input (msg['Subject'] = 'Hi'), then the conversion to bytes is only done when serialization to bytes is requested. None of this is implemented yet. 
-- R. David Murray www.bitdance.com
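The caching strategy described in the message above is, by its author's own account, not implemented yet. A minimal sketch of the idea follows, assuming a hypothetical LazyHeader class; the class name, its methods, and the choice of ASCII with the surrogateescape error handler are illustrative assumptions, not email6 API.

```python
class LazyHeader:
    """Hypothetical header value that keeps whichever form it was created from."""

    def __init__(self, *, source_bytes=None, text=None):
        if (source_bytes is None) == (text is None):
            raise ValueError("pass exactly one of source_bytes or text")
        self._bytes = source_bytes
        self._text = text

    @property
    def text(self):
        # Convert to text only when text is requested, then cache the result.
        if self._text is None:
            self._text = self._bytes.decode("ascii", "surrogateescape")
        return self._text

    @text.setter
    def text(self, value):
        # Modifying the header discards the original source bytes.
        self._text = value
        self._bytes = None

    def as_bytes(self):
        # Unmodified input is emitted verbatim.
        if self._bytes is not None:
            return self._bytes
        # Text-origin headers are encoded only at serialization time.
        # Assumption: ASCII-only text; a real implementation would apply
        # RFC 2047 encoding to non-ASCII text here.
        return self._text.encode("ascii", "surrogateescape")
```

Reading the text of a header parsed from bytes does not disturb the stored bytes, so an untouched header serializes exactly as it arrived; only assignment drops them.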
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Thu, 07 Oct 2010 03:31:34 +0900, Stephen J. Turnbull step...@xemacs.org wrote: R. David Murray writes: 5. Return the content, with non-ASCII bytes replaced with ? characters. That hadn't occurred to me (and it makes me sick to contemplate it). That said, this is probably good enough for Mailman-like apps to limp along for most users. It's certainly good enough for the might kick your wife and elope with your dog alpha ports of Mailman to Python 3 (well, as certain as I can be; of course in the end Barry decides). Assuming reasonable backward compatibility of the API, of course! Yeah, good enough is pretty much the goal here. In other words, my proposed patch only makes email5 1/8 to 1/4 broken, instead of half broken as it is now. But not un-broken enough for Mailman, it sounds like. IMO, not in the long run. But realistically, in the applications I know of, most desired traffic is conformant, and since there aren't any Python 3 email apps yet, this isn't even a regression. :-/ I do think that it's important that the parsed object be able to tell you what fields are there (except if the field name itself is invalid) and return field bodies parsed as far as possible. Well, email doesn't currently parse the bodies any further by itself. You have to call parsing routines to get further parsing. So maybe what I should do is work on finalizing the patch without addressing the 'give me the escaped bytes issue', and then prepare a follow on patch that adds that keyword and adjusts the header parsing helpers accordingly. If we go this route (as opposed to only handling headers with 8bit data by sanitizing them), then we need to think about the email5 header parsers as well (decode_header and parseaddr). They are of course going to have the same problems as the rest of the email package with parsing bytes, and you are suggesting that access to those header 8bit bytes is needed. Yes, that would be preferable to replacing them with ASCII junk. 
But I don't see any problem with parsing them; they're syntactically insignificant by definition. The problem is purely on output: do I get verbatim escaped bytes, a sanitized str, or an exception? Right, the needed changes should be sanitizing by default, and providing the keyword to get the escaped bytes. Mostly it'll be writing tests :) Does my proposal make sense? But note, it raises exactly the backward compatibility concerns you mention in your next email (that I will reply to next). It is an open question whether it is worth opening that door in order to be able to do extended handling on non-RFC conforming email (as opposed to just sanitizing it and soldiering on). Well, maybe not. However, it is not obvious to me that you won't run into these issues again in Email6. Applications that think of email as textual objects are going to want to make their own choices about handling of non-conforming email, and it's likely to be massively inconvenient to say OK, but you have to use bytes interfaces exclusively, because the str interfaces don't handle that. The strategy in email6 so far is for the application program to be able to access *any piece* of the parsed data as either text or bytes, and for the header parsers to record defects when there are non-ASCII bytes where there aren't supposed to be. So the application can check for defects and retrieve, say, the comment field that has the non-ASCII *as bytes* and decode it. Or, if it doesn't care about parsing them, it just modifies the fields it wants to modify that *are* valid, and the invalid non-ASCII comment gets carried along and emitted when the message is serialized as bytes. This is more or less what we are talking about enabling in email5 with the 'escape_bytes=True' keyword, it's just a less structured and more error-prone approach to it than what we have planned for email6. -- R. 
David Murray www.bitdance.com
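To make the two output options in this exchange concrete (a sanitized str by default, escaped bytes on request), here is a small sketch. The function names are hypothetical and this is not the patch itself; the only name taken from the thread is the escape_bytes keyword. It assumes the escaping is done with Python's surrogateescape error handler, which the thread elsewhere describes as an implementation detail.

```python
def decode_header_bytes(raw, escape_bytes=False):
    """Decode raw header bytes as ASCII (hypothetical helper).

    escape_bytes=False: every non-ASCII byte is replaced with '?'.
    escape_bytes=True:  non-ASCII bytes are kept as surrogate escapes.
    """
    text = raw.decode("ascii", "surrogateescape")
    if escape_bytes:
        return text
    # surrogateescape maps bytes 0x80-0xFF to U+DC80-U+DCFF.
    return "".join("?" if "\udc80" <= ch <= "\udcff" else ch for ch in text)


def unescape_to_bytes(text):
    """Recover the original bytes from an escaped string (hypothetical helper)."""
    return text.encode("ascii", "surrogateescape")


raw = b"p\xc3\xb6stal"
sanitized = decode_header_bytes(raw)
escaped = decode_header_bytes(raw, escape_bytes=True)
```

The sanitized form loses information for good, while the escaped form can be turned back into the exact input bytes, which is what makes verbatim pass-through possible.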
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
Stephen J. Turnbull stephen at xemacs.org writes: R. David Murray writes: We're (in the current patch) not punting on handling non-conforming email, we're punting on handling non-conforming bytes *if the headers that contain them need to be modified*. The headers can still be modified, you just (currently) lose the non-ASCII bytes in the process. Modified *or examined*. I can't think of any important applications offhand that *need* to examine the non-ASCII bytes (in particular, Mailman doesn't need to do that). Verbatim copying of the bytes themselves is almost always the desired usage. Mmm. Yes, or examined. If we allow escaped bytes to be returned, perhaps we also should provide a helper that unescapes the bytes and returns the byte string (yes, this is just a call to encode, but by wrapping it we continue to hide the surrogateescape implementation detail.) And robustness is not the issue, only extended-beyond-the-RFCs handling of non-conforming bytes would be an issue. And with that, I'm certain that Jon Postel is really dead. A goal for email6 is to be *at least* as Postel compliant as email4. The goal for my patch is to make email5.1 more Postel compliant than email5.0 is :) (Surely you are not saying that Generator.flatten can't DTRT with non-ASCII content *at all*?) Yes, that is *exactly* what I am saying: m = email.message_from_string(\ ... From: pöstal ... ... ) str(m) Traceback (most recent call last): UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128) But that's not interesting; you did that with Python 3. We want to Of course I did it with Python3. It's the Python3 email codebase I'm working with (and have to work *around*). know what people porting from Python 2 will expect. So, in 2.5.5 or 2.6.6 on Mac, with email v4.0.2, it *doesn't* raise, it returns wideload:~ 4:14$ python Python 2.5.5 (r255:77872, Jul 13 2010, 03:03:57) [GCC 4.0.1 (Apple Inc. 
build 5490)] on darwin Type help, copyright, credits or license for more information. import email m=email.message_from_string('From: pöstal\n\n') str(m) 'From nobody Thu Oct 7 04:18:25 2010\nFrom: p\xc3\xb6stal\n\n' m['From'] 'p\xc3\xb6stal' That's hardly helpful! Surely we can and should do better than that now, especially since UTF-8 (with a proper CTE) is now almost universally acceptable to MUAs. When would it be a problem for that to return 'From nobody Thu Oct 7 04:18:25 2010\nFrom: =?UTF-8?Q?p=C3=B6stal?=\n\n' What's wrong with that is that when we parse the bytes of the message we don't know that b'\xc3\xb6' == '=?UTF-8?Q?=C3=B6?='. It isn't even all that likely to be true, since I would guess that latin1 is still more common than utf-8 (but you might know better). Remember, email5 is a direct translation of email4, and email4 only handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the- -ride-fine-we'll-pass-then-along. So if you want to put non-ASCII data into a message you have to encode it properly to ASCII in exactly the same way that you did in email4: But if you do it right, then it will still work in a version that just encodes non-ASCII characters in UTF-8 with the appropriate CTE. Since you'll never be passing it non-ASCII characters, it's already ASCII and UTF-8, and no CTE will be needed. So you are suggesting that I should use U+FFFD encoded as UTF-8 rather than '?' as the substitution character? But earlier you said that people would probably rather not be forced to deal with Unicode just because there are invalid bytes in the message. So that's probably not what you meant. Presumably you are suggesting that email5 be smart enough to turn my example into properly UTF-8/CTE encoded text. But *that* problem is what email6 is trying to address. It just doesn't look practical to address it directly in the email5 code base, because the email4 codebase that email5 inherits does not provide the correct distinction between bytes and text. 
email5 is parsing the input stream *as if* it were ASCII-only CTE text. I'm trying to extend it to also handle non-ASCII bytes gracefully. Extending it to actually handle unicode input is a whole different kettle of sushi[*]. Yes, exactly. I need to fix the patch to recode using, say, quoted-printable in that case. It really should check for proportions of non-ASCII. QP would be horrible for Japanese or Chinese. Noted. DecodedGenerator could still produce the unicode, though, which is what I believe we want. (Although that raises the question of whether DecodedGenerator should also decode the RFC2047 encoded headersbut that raises a backward compatibility issue). Can't really help you there. While I would want the RFC 2047 headers decoded if I were writing new code (which is generally the case for me), I haven't really wrapped my
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Thu, 07 Oct 2010 15:00:04 +0900, Stephen J. Turnbull step...@xemacs.org wrote: R. David Murray writes: But that's not interesting; you did that with Python 3. We want to Of course I did it with Python3. It's the Python3 email codebase I'm working with (and have to work *around*). Sure. My point is that it has nothing to do with the expectations of people trying to upgrade their apps to Python 3, and meeting those expectations is an important requirement of the specification of email5, right? Well, not necessarily, no. Python3 broke backward compatibility. *Some* changes are going to have to be made in user code to make it work with email5. Where we can minimize those changes we should, but it isn't a requirement, no. With my patch, the minimization will be message_from_string -> message_from_bytes, message_from_file -> message_from_binary_file, and in some cases Generator -> BytesGenerator, for those programs that need to deal with wire format data that is not 7bit clean. Programs that only *generate* emails should need few if any changes, but that is already true (that's the half of email that is working :). Actually, in context we were not talking about a random character that came in from outside, we were talking about U+FFFD that *we* generated, and *know* that it's the only non-ASCII character in the string because we replaced all the others with it. Ah, so that *was* what you were suggesting. Of course the best we can do with 'From: =?UNKNOWN?Q?p=C3=B6stal' or 'From: p\xc3\xb6stal' on input is to save the encoded or raw bytes representation and spit it back out on output. Yes. And I haven't actually dealt with what to do with non-ascii characters or RFC2047 unknown-8bit characters when decoding headers in email6. In issue 6302 we are talking about adding a decode_header_to_string method for email5 where the same issue arises, and so we'll need to make a decision soon. Presumably we'll use U+FFFD to replace them (along with registering defects in email6). 
The MIME-charset = UNKNOWN dodge might be a better way of handling this. The str is all ASCII, so won't raise exceptions unless the app itself objects to MIME encoded-words for some reason. OTOH, the presence of encoded words will be a red flag to any human viewer, and after processing with .flatten(), the receiver is likely to DTRT (from the receiving human's point of view, per that human's configuration). That is a very interesting idea. It is the *right* thing to do, since it would mean that a message parsed as bytes could be generated via Generator and passed to, say, smtplib without losing any information. However, it's not exactly trivial to implement, since issues of runs of characters and line re-wrapping need to be dealt with. Perhaps Header can be made to handle bytes in order to do this; I'll have to look into it. So you are suggesting that I should use U+FFFD encoded as UTF-8 rather than '?' as the substitution character? But earlier you said that people would probably rather not be forced to deal with Unicode just because there are invalid bytes in the message. So that's probably not what you meant. Suggest != recommend. Talking to a wider base of users and developers, you might or might not find that to be a good idea. I don't think the 800 million or so Chinese coming online in the next decade will much care whether you use U+FFFD or '?'. The Japanese would prefer U+2639 WHITE FROWNING FACE or U+270C VICTORY HAND, no doubt (crassly cute is much beloved here). Americans will likely prefer '?', as they probably have correspondents with legacy systems that won't like UTF-8 or perhaps don't have a font to display U+FFFD. For the moment I think I'll stick with '?', with the idea of fixing that bug by using the unknown charset trick at a later stage. Presumably you are suggesting that email5 be smart enough to turn my example into properly UTF-8/CTE encoded text. 
No, in general that's undecidable without asking the originator, although humans can often make a good guess. But not always: Japanese are fond of four-character compound words, and I once found an 8-byte sequence (four 2-byte characters) that is idiomatic in both Shift JIS and EUC-JP. Even a dictionary lookup can't determine the intended encoding for that sequence. I was talking about unicode input, though, where you do know (modulo the language differences that unicode hasn't yet sorted out). I'm only saying that any Unicode email-N generates itself can be properly encoded. Agreed. But *that* problem is what email6 is trying to address. It just doesn't look practical to address it directly in the email5 code base, because the email4 codebase that email5 inherits does not provide the correct distinction between bytes and text. email5 is parsing the input stream *as if* it were ASCII-only CTE text. I don't see how this is different from email6.
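The renames mentioned in this message (message_from_string to message_from_bytes, Generator to BytesGenerator) correspond to functions that exist in the standard library email package from Python 3.2 onward. The sketch below shows the bytes-in, bytes-out path with a header that is not 7bit clean; the sample message and the X-Seen header are made up for illustration, and the default (compat32) policy is assumed.

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# Wire-format input with raw 8-bit bytes in a header (not RFC conformant).
raw = b"From: p\xc3\xb6stal\nSubject: test\n\nbody\n"

# Parse from bytes rather than from str.
msg = email.message_from_bytes(raw)

# Touch only a header that is valid; the From header is left alone.
msg["X-Seen"] = "yes"

# Serialize back to wire format with the bytes-aware generator.
buf = BytesIO()
BytesGenerator(buf).flatten(msg)
wire = buf.getvalue()
```

The unmodified From header comes back out with its original bytes, so a program that only adds or edits conforming headers does not need to know what encoding the sender intended.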
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
Stephen J. Turnbull wrote (giving me an opening to jump in here): R. David Murray writes: In other words, my proposed patch only makes email5 1/8 to 1/4 broken, instead of half broken as it is now. But not un-broken enough for Mailman, it sounds like. IMO, not in the long run. But realistically, in the applications I know of, most desired traffic is conformant, and since there aren't any Python 3 email apps yet, this isn't even a regression. :-/ Well, yes there are, and yes it is. As I pointed out in a thread on this list back in June, there are multiple large Python 3 email apps in the new Programming Python, a book which is about to be released, and which will be read by at least tens of thousands of people, many of whom will be evaluating the stability of Python 3. These apps include both a simple webmail site, as well as a more sophisticated 5k-line tkinter email client -- one which I've been using for all my personal and business email over the last 6 months, and which works well with the email package as it is in 3.1 (albeit with a bit of workaround code). This includes support for Unicode, MIME, headers, attachments, and the lot. I'm forwarding a link to the code of these clients to David by private email in case they might be useful as a test case (O'Reilly has already posted them ahead of the book, but they may be a bit too heavy for use in formal testing). The email package is obviously less than ideal today, and there are many other clients for it besides my own, of course. But making it backward incompatible at this point is likely to be seen as a big negative to newcomers evaluating 3.X viability. And as I tried to make clear in June, this list should carefully weigh the PR cost of pulling the rug out from under those brave souls who have already taken the time to accommodate the 3.X world you've mandated. To put that more strongly, the Python user base is much larger than this list's readership. If I'm using 3.1 email, so are many others. 
People will accept the 3.X world you make up to a point, but it's impossible to code to a moving target, much less base a product on it. At some point, they'll simply stop trying to keep up; in fact, some already have. Fixes are a Good Thing, of course, and this particular change's scope remains to be seen; but to channel most of the users I meet out there in the real world today: Enough with the 3.X changes already, eh? --Mark Lutz (http://learning-python.com, http://rmi.net/~lutz)
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Thu, 07 Oct 2010 16:03:18 -, l...@rmi.net wrote: I'm forwarding a link to the code of these clients to David by private email in case they might be useful as a test case (O'Reilly has already posted them ahead of the book, but they may be a bit too heavy for use in formal testing). Thanks very much. I will take a look, and expect they will be helpful. The email package is obviously less than ideal today, and there are many other clients for it besides my own, of course. But making it backward incompatible at this point is likely to be seen as a big negative to newcomers evaluating 3.X viability. And as I tried to make clear in June, this list should carefully weigh the PR cost of pulling the rug out from under those brave souls who have already taken the time to accommodate the 3.X world you've mandated. Well, as I have said before the plan is to provide backward compatibility in email6, so that you only need to change your code if you want to take advantage of improved or new functionality. If this turns out not to be possible for some reason, then we aren't going to suddenly stop supporting email5. That's not the Python Way :) (Example: we added ArgParse post-3.0, and lots of people wanted to deprecate OptParse, but we aren't planning on removing OptParse.) Do you see any issues with the patch I'm proposing? My goal is to make things work that didn't work before, but nothing that worked before should stop working, if I do my job right. The one *potentially* backward-incompatible change that I'm consciously considering (that is, any other backward incompatibilities will be bugs) is having DecodedGenerator fully decode headers and emit full unicode, rather than the ASCII-only unicode that Generator emits. Can you think of any problem that that would cause? A quick grep indicates your own code does not use that generator (possibly because currently it does not do that decoding). 
I could, of course, only enable header decoding if a flag is passed requesting it, and as I write this I realize that that is indeed what I should do. Even though I haven't been able to think of a case where DecodedGenerator producing non-ASCII unicode would be an issue, that doesn't mean there isn't one :) To put that more strongly, the Python user base is much larger than this list's readership. If I'm using 3.1 email, so are many others. People will accept the 3.X world you make up to a point, but it's impossible to code to a moving target, much less base a product on it. At some point, they'll simply stop trying to keep up; in fact, some already have. Fixes are a Good Thing, of course, and this particular change's scope remains to be seen; but to channel most of the users I meet out there in the real world today: Enough with the 3.X changes already, eh? Now that Python3 is out, the backward compatibility policy for it is the same as it always was for Python2. Only the transition from 2 to 3 broke backward compatibility in a significant way. From here on, we are as conservative as we always have been at making backward incompatible changes (that is, we don't do it intentionally without a good reason and a deprecation cycle, and if we do it unintentionally it is a regression and treated as such). -- R. David Murray www.bitdance.com
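The opt-in decoding idea from this message can be illustrated outside DecodedGenerator. In the sketch below, header_text and its decode parameter are hypothetical names chosen for illustration; decode_header and make_header are existing functions in email.header. The default keeps the old behaviour (ASCII-only, encoded-words left as they are) and decoding to full unicode happens only when asked for.

```python
from email.header import decode_header, make_header


def header_text(value, decode=False):
    """Return a header value, decoding RFC 2047 encoded-words only on request.

    Hypothetical helper: decode=False preserves existing behaviour, so
    callers that never pass the flag see no change.
    """
    if not decode:
        return value
    return str(make_header(decode_header(value)))


encoded = "=?UTF-8?Q?p=C3=B6stal?="
unchanged = header_text(encoded)
decoded = header_text(encoded, decode=True)
```

Putting the new behaviour behind a flag that defaults to off is what keeps the change backward compatible: existing callers get the same ASCII-only output as before.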
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Oct 07, 2010, at 04:40 AM, Stephen J. Turnbull wrote: And the email API currently promises not to raise during parsing, which is a contract my patch does not change. Which is a contract that has historically been broken frequently. Unhandled UnicodeErrors have been one of the most common causes of queue stoppage in Mailman (exceeded only by configuration errors AFAICS). I haven't seen any reports for a while, but with the email package being reengineered from the ground up, the possibility of regression can't be ignored. I'm fairly certain that most of the modern causes of this are post-parse modifications of the message. IOW, in Mailman's architecture, we try to parse the raw data into a Message object tree very early in the pipeline, and then a pickled version of that gets passed between the queue runners. If the initial parse fails, there's almost literally nothing Mailman can do with the original data other than delete it. Where we've gotten into trouble before has been things like adding the Subject prefixes and such. That seems like application logic that the email package can't really get involved with, and indeed Mailman has built up a raft of defense for failures of this kind. -Barry
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray writes: The MIME-charset = UNKNOWN dodge might be a better way of handling this. That is a very interesting idea. It is the *right* thing to do, since it would mean that a message parsed as bytes could be generated via Generator and passed to, say, smtplib without losing any information. However, it's not exactly trivial to implement, since issues of runs of characters and line re-wrapping need to be dealt with. Perhaps Header can be made to handle bytes in order to do this; I'll have to look into it. Ouch. RFC 822 line wrapping is a bytes->bytes transformation, and the client shouldn't see it at all unless it inspects the wire format. MIME-encoding is a text->bytes transformation, again an internal matter. The constraints on the wire format mean that the MIME-encoder needs to be careful about encoded-word length. ISTM that all you need to know, assuming that this is a method on a Header, and it's normally invoked just before conversion to bytes, is the codec and the CTE, and both can be optional (default to 'utf-8' and a value depending on the proportion of encodable characters). You take the header, encode according to the codec, then start MIME-encoding according to the CTE. The maximum size of encoded words is chosen to fit on a line within 78 bytes. The number of bytes encoded in each word depends only on the size of metadata associated with the word. (Sure you could make it prettier for those reading it with an MUA like less, but I don't think that's really worth anybody's time.) *If* you have an 8-bit value of unknown encoding on input, this will appear in the Header's value as a surrogate. Hm, OK, I see the problem ... as usual, it's that the only efficient thing to do is encode using surrogate-escape which loses the information that these are invalid bytes. 
Would it really be that bad to add an O(length) component where you examine the string for surrogates (and too-long words, for that matter), and chop off those pieces for MIME encoding? Presumably you are suggesting that email5 be smart enough to turn my example into properly UTF-8/CTE encoded text. No, in general that's undecidable without asking the originator, although humans can often make a good guess. I was talking about unicode input, though, where you do know (modulo the language differences that unicode hasn't yet sorted out). I don't understand why this is difficult. As far as what Unicode has and hasn't sorted out, that's not your job AFAICS. If clients want a specific codec or other language-based style, they'd better specify it themselves. Else, you just stuff the Unicode into a UTF-8-encoded bytes, and go from there. This is *why* Unicode was designed, so that software could do something standard and sane with text which needs to be readable but not exquisitely crafted literary works. No? If you want beauty, then use a markup language. Right, but I was talking about my python3 example, where I was using the email5 parser to (unsuccessfully) parse unicode. *That's* the thing email5 can't really handle, but email6 will be able to. For email5 it would be an extension, yes, but I don't see why it would be hard to handle Unicode input, assuming it's *really* Unicode, unless you want to cater to legacy systems that might not understand Unicode (or at least would prefer an alternative encoding). Since it's an extension, I don't think that's your problem, and the people who would really like this extension (eg, the Japanese) are used to dealing with mojibake issues. (Of course, as an extension, you don't need to do it at all. This is just speculation.) 
The problem would be with careless clients of email5 that find a way to hand it bogus Unicode (eg, by inappropriately using the latin-1 codec to get a binary representation of their bytes in Unicode), but I'm not sure how big a problem that would be. Thank you very much for this piece of perspective. I hadn't thought about it that clearly before, but what you say makes perfect sense to me, and is in fact the implicit perspective I've been working from when working on the email6 stuff. You're welcome, of course, and it makes me feel much better about email6. (Not that I had any real worries, but here we are about halfway up a 100m cliff, and the trail just widened from 20cm to 2m. :-)
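The encoding procedure outlined in this message (encode with the codec, then cut the result into encoded-words whose payload size depends only on the fixed metadata) can be sketched as follows. This is an illustration under stated assumptions, not the email package's algorithm: the function name encode_words is hypothetical, only the base64 ("b") encoding is shown, the 75-character cap is the encoded-word limit from RFC 2047, and the codec is assumed to be stateless (such as UTF-8) so that characters can be encoded one at a time. Choosing between base64 and quoted-printable by the proportion of non-ASCII characters, and handling surrogate-escaped bytes, are left out.

```python
import base64


def encode_words(text, charset="utf-8", max_word_len=75):
    """Split text into RFC 2047 base64 encoded-words of bounded length."""
    prefix, suffix = "=?%s?b?" % charset, "?="
    # Payload bytes that fit once the fixed metadata is subtracted;
    # base64 turns every 3 bytes into 4 characters.
    room = (max_word_len - len(prefix) - len(suffix)) // 4 * 3
    words, chunk = [], b""
    for ch in text:
        encoded = ch.encode(charset)
        # Start a new word rather than split a multibyte character.
        if chunk and len(chunk) + len(encoded) > room:
            words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
            chunk = b""
        chunk += encoded
    if chunk:
        words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
    return words


sample = "日本語" * 20
words = encode_words(sample)
```

Each word can then be placed on its own folded line; because the split happens on character boundaries, every encoded-word decodes independently.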
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
l...@rmi.net writes: To put that more strongly, the Python user base is much larger than this list's readership. Agreed. Nevertheless, this is the channel (not channel) that the developers listen on, and substantial effort is made to let Python users know that. I think they do know it, too. If I'm using 3.1 email, so are many others. That's not obvious. 3.1 email is unusable for several applications. In fact, for human factors reasons (humans are very likely to communicate with other humans who use the same encodings, and to accept occasional glitches they must deal with manually), MUAs are likely to port relatively easily as good enough software. But I doubt very much that folks writing MTAs or spam filters that must run unattended, often in long-lived, very active processes, are producing production versions using Python 3 email yet. People will accept the 3.X world you make up to a point, but it's impossible to code to a moving target, much less base a product on it. Impossible is nothing. It's a decision that each individual developer makes for herself. I haven't heard Mailman devs complain about the impossibility of dealing with the proposed changes, for example. Quite the reverse, in fact. At some point, they'll simply stop trying to keep up; in fact, some already have. Predictable and predicted. Where's the balance? I don't know, but channeling the users is not a lot of help. There are three worthy goals here: 1. Taking advantage of improvements in to-be-released Pythons. 2. Not changing one's own working code. 3. Not participating in python-dev/email-sig. Take any two; one can't have all three. More specifically, it's interesting that most of the users you talk to care enough to actually say they don't want more incompatible changes. But what are we supposed to take from that? Some fixes have to be incompatible; do the users want the fix or the compatibility? 
You waffle (as a good representative often must): Fixes are a Good Thing, of course, and this particular change's scope remains to be seen; but to channel most of the users I meet out there in the real world today: Enough with the 3.X changes already, eh? But that's also a decision each developer *can* make for himself. Python does not withdraw products, or even withdraw support, just because the core developers release something they consider better. If having 1 *and* 2 is so important to particular users, but they come into conflict because of proposed changes in Python, then they're going to have to give up 3, come here, and articulate their needs. As you are doing -- but to have real influence, you're going to have to do the review of David's patch that he requests. I really don't see how the process can work any other way.
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray writes: version of headers to the email5 API, but since any such data would be non-RFC compliant anyway, [access to non-conforming headers by reparsing the bytes] will just have to be good enough for now. But that's potentially unpleasant for, say, Mailman. AFAICS, what you're saying is that Mailman will have to implement a full header parser and repair module, or shunt (and wait for administrator intervention on) any mail that happens to contain even one byte of non-RFC-conforming content in a header it cares about. (Note that we're not talking about moderator-level admins here; we're talking about the Big Cheese with access to the command line on the list host.) That's substantially worse than the current system, where (in theory, and in actual practice where it distributes its own version of email) it can trap the Unicode exception on a per-header basis. I also worry about the implications for backwards compatibility. Eventually email-N needs to handle non-conforming mail in a sensible way, or anybody who gets spam (ie, everybody) and wants a reliable email system will need to implement their own. If you punt completely on handling non-conforming mail now, when is it going to be done? And when it is done, will the backward-compatible interface be able to access the robust implementation, or will people who want robust APIs have to use rather different ones? The way you're going right now, I have to worry about the answer to the second question, at least. [*] Why '?' and not the unicode invalid character character? Well, the email5 Generate.flatten can be used to generate data for transmission over the wire *if* the source is RFC compliant and 7bit-only, and this would be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects ASCII-only strings as input!). So the data generated by Generator.flatten should not include unicode... I don't understand this at all. 
Of course the byte stream generated by Generator.flatten won't contain Unicode (in the headers, anyway); it will contain only ASCII (that happens to conform to QP or Base64 encoding of Unicode in some appropriate UTF in many cases). Why is U+FFFD REPLACEMENT CHARACTER any different from any other non-ASCII character in this respect? (Surely you are not saying that Generator.flatten can't DTRT with non-ASCII content *at all*?) The only thing I can think of is that you might not want to introduce non-ASCII characters into a string that looks like it might simply be corrupted in transmission (eg, it contains only one non-ASCII byte). That's reasonable; there are a lot of people who don't have to deal with anything but ASCII and occasionally Latin-1, and they don't like having Unicode crammed down their throats. which raises a problem for CTE 8bit sections that the patch doesn't currently address. AFAIK, there's no requirement, implied or otherwise, that a conforming implementation *produce* CTE 8bit. So just don't do that; that will keep smtplib happy, no? ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Wed, 06 Oct 2010 12:22:18 +0900, Stephen J. Turnbull step...@xemacs.org wrote: Nick Coghlan writes: - if you pass in bytes data and know what you are doing, then you can access that raw bytes data and do your own decoding At what level, though? To take an interesting example I used to see frequently: From: t...@tokyo.jp (Taro Yamada in 8-bit Shift JIS) So I guess you are suggesting that the email module can RFC 822 parse that, and 1. Refuse to return the unwrapped (ie, single line) form of the whole field, except as bytes. 2. Refuse to return the content of the From field, except as bytes. 3. Return the email address parsed from the From field. 4. Refuse to return the comment, except as bytes. 5. Return the content, with non-ASCII bytes replaced with ? characters. In other words, my proposed patch only makes email5 1/8 to 1/4 broken, instead of half broken as it is now. But not un-broken enough for Mailman, it sounds like. That's fine. But suppose I have a private or newly defined header that is structured? Now I have two choices: 1. Write a version of my private parser for both str (the normal case) and bytes (if accessing the value as str raises) 2. Always get the bytes and convert them to str (probably using the same .decode('ascii','surrogate-escape') call that email uses but won't let me have the value of!), then use a common str parser. Yes, this is exactly the dilemma faced by the entire email package. The current email6 code attempts to do a variation on (1) by having a common parser that handles both strings and bytes using a dual subclass approach. This patch is trying out (2). If you have a private header parser, you would ideally like to be able to use the same mechanism as the email package to solve the problem. For email6 you'd be able to register your header parser and get handed the input like the built in parser and be able to use the tools provided by the built in parser to do your work. 
In email5 there is no way that I know of for you to register a private parser, so you need access to the raw input for the header in one form or another. If we go this route (as opposed to only handling headers with 8bit data by sanitizing them), then we need to think about the email5 header parsers as well (decode_header and parseaddr). They are of course going to have the same problems as the rest of the email package with parsing bytes, and you are suggesting that access to those header 8bit bytes is needed. One option would be to add a keyword to the get and get_all methods that instructs it to return the string with the surrogate-escaped bytes, which can then be passed onward to decode_header, parseaddr, or a custom decoder. Then I need to look at what needs to be added to those methods to handle the escaped bytes, and from what you say they too need a keyword telling them to preserve the escaped bytes on output (a yes I know what I'm doing flag...'preserve_escaped_bytes=True'?). Note that this is more problematic than it looks, since the appropriate base codec may require information from higher-level structures (eg, qp codec tags or a Content-Type header's charset field). You'll have to give me an example of where this is a problem but is not already a problem in email4. Why should I reproduce email's logic here? I don't care if the default or concise API raises on surrogates in the str value. But I'm pretty sure that I will want to use str values containing surrogates in these contexts (for the same reasons that email module does, for example), rather than work with bytes sometimes and strs sometimes. Please provide a way to return strs-with-surrogates if I ask for them. Does my proposal make sense? But note, it raises exactly the backward compatibility concerns you mention in your next email (that I will reply to next). 
It is an open question whether it is worth opening that door in order to be able to do extended handling on non-RFC conforming email (as opposed to just sanitizing it and soldiering on). -- R. David Murray www.bitdance.com
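The workflow David describes (fetch the header as a str that still carries surrogate-escaped bytes, then hand it to parseaddr) can be sketched with plain stdlib calls. This is only an illustration under stated assumptions: the 'preserve_escaped_bytes' keyword floated above is a proposal, so the sketch does the surrogateescape decoding by hand, and the address and the Shift JIS sample name are made up.

```python
from email.utils import parseaddr

# Hypothetical raw header body: an ASCII address followed by a comment
# holding 8-bit Shift JIS bytes (the two characters U+5C71 U+7530).
raw_value = b"taro@example.jp (" + "\u5c71\u7530".encode("shift_jis") + b")"

# Carry the non-ASCII bytes through str as lone surrogates.
escaped = raw_value.decode("ascii", "surrogateescape")

# The str-based parser only looks at the ASCII syntax characters.
realname, addr = parseaddr(escaped)

# A caller who knows (or guesses) the charset can recover the real text
# by turning the surrogates back into the original bytes first.
name_bytes = realname.encode("ascii", "surrogateescape")
name = name_bytes.decode("shift_jis")

print(addr, ascii(name))
```

The point of the sketch is that the str parser never needs to understand the 8-bit bytes; only the final decode step needs charset knowledge, and that step stays in the application's hands.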
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Wed, 06 Oct 2010 22:55:00 +0900, Stephen J. Turnbull step...@xemacs.org wrote: R. David Murray writes: version of headers to the email5 API, but since any such data would be non-RFC compliant anyway, [access to non-conforming headers by reparsing the bytes] will just have to be good enough for now. But that's potentially unpleasant for, say, Mailman. AFAICS, what you're saying is that Mailman will have to implement a full header parser and repair module, or shunt (and wait for administrator intervention on) any mail that happens to contain even one byte of non-RFC-conforming content in a header it cares about. (Note that No, it just means that such bytes would not be preserved for presentation in the web UI. They'd show up as '?'s (Or U+FFFDs, perhaps, if I change DeocdedGenerator to use U+FFFD instead of ?s for the unknown bytes). As long as BytesGenerator is used on the output side to send the messages, the bytes will be preserved and presented to the moderator in their email. So the only parsing issue is if Mailman cares about *the non-ASCII bytes* in the headers it cares about. If it has to modify headers that contain non-ASCII bytes (for example, addresses and Subject) and cares about preserving the non-ASCII bytes, then there is indeed an issue; see previous email for a possible way around that. we're not talking about moderator-level admins here; we're talking about the Big Cheese with access to the command line on the list host.) That's substantially worse than the current system, where (in theory, and in actual practice where it distributes its own version of email) it can trap the Unicode exception on a per-header basis. I thought mailman no longer distributed its own version of email? And the email API currently promises not to raise during parsing, which is a contract my patch does not change. I also worry about the implications for backwards compatibility. 
Eventually email-N needs to handle non-conforming mail in a sensible way, or anybody who gets spam (ie, everybody) and wants a reliable email system will need to implement their own. If you punt completely on handling non-conforming mail now, when is it going to be done? And We're (in the current patch) not punting on handling non-conforming email, we're punting on handling non-conforming bytes *if the headers that contain them need to be modified*. The headers can still be modified, you just (currently) lose the non-ASCII bytes in the process. when it is done, will the backward-compatible interface be able to access the robust implementation, or will people who want robust APIs have to use rather different ones? The way you're going right now, I have to worry about the answer to the second question, at least. Well, this is still theory given the current state of the email6 code, but I *think* that working email5 code, even after this patch, will continue to work using email6's backward compatibility interface. And robustness is not the issue, only extended-beyond-the-RFCs handling of non-conforming bytes would be an issue. *But*, as I implied in my previous email, if we allow the surrogates out so that custom header parsers can use them, then making *that* code continue to work may require an extra layer in the compatibility interface to produce the surrogateescaped strings. Still, at the moment I can't see any theoretical reason why that would not be possible, so it may be worth the risk. [*] Why '?' and not the unicode invalid character character? Well, the email5 Generate.flatten can be used to generate data for transmission over the wire *if* the source is RFC compliant and 7bit-only, and this would be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects ASCII-only strings as input!). So the data generated by Generator.flatten should not include unicode... I don't understand this at all. 
Of course the byte stream generated by Generator.flatten won't contain Unicode (in the headers, anyway); it will contain only ASCII (that happens to conform to QP or Base64 encoding of Unicode in some appropriate UTF in many cases). Why is U+FFFD REPLACEMENT CHARACTER any different from any other non-ASCII character in this respect? (Surely you are not saying that Generator.flatten can't DTRT with non-ASCII content *at all*?) Yes, that is *exactly* what I am saying: m = email.message_from_string(\ ... From: pöstal ... ... ) str(m) Traceback (most recent call last): UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128) Remember, email5 is a direct translation of email4, and email4 only handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the- -ride-fine-we'll-pass-then-along. So if you want to put non-ASCII data into a message you have to encode it properly to ASCII in exactly the same way that you did in email4: m = email.message.Message() m['From'] =
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray writes: 5. Return the content, with non-ASCII bytes replaced with ? characters. That hadn't occurred to me (and it makes me sick to contemplate it). That said, this is probably good enough for Mailman-like apps to limp along for most users. It's certainly good enough for the might kick your wife and elope with your dog alpha ports of Mailman to Python 3 (well, as certain as I can be; of course in the end Barry decides). Assuming reasonable backward compatibility of the API, of course! In other words, my proposed patch only makes email5 1/8 to 1/4 broken, instead of half broken as it is now. But not un-broken enough for Mailman, it sounds like. IMO, not in the long run. But realistically, in the applications I know of, most desired traffic is conformant, and since there aren't any Python 3 email apps yet, this isn't even a regression. :-/ I do think that it's important that the parsed object be able to tell you what fields are there (except if the field name itself is invalid) and return field bodies parsed as far as possible. If we go this route (as opposed to only handling headers with 8bit data by sanitizing them), then we need to think about the email5 header parsers as well (decode_header and parseaddr). They are of course going to have the same problems as the rest of the email package with parsing bytes, and you are suggesting that access to those header 8bit bytes is needed. Yes, that would be preferable to replacing them with ASCII junk. But I don't see any problem with parsing them; they're syntactically insignificant by definition. The problem is purely on output: do I get verbatim escaped bytes, a sanitized str, or an exception? One option would be to add a keyword to the get and get_all methods that instructs it to return the string with the surrogate-escaped bytes, which can then be passed onward to decode_header, parseaddr, or a custom decoder. 
Then I need to look at what needs to be added to those methods to handle the escaped bytes, and from what you say they too need a keyword telling them to preserve the escaped bytes on output (a "yes I know what I'm doing" flag... 'preserve_escaped_bytes=True'?).

The need is not absolute, but I would have a strong preference for being able to get at those bytes.

Does my proposal make sense? But note, it raises exactly the backward compatibility concerns you mention in your next email (that I will reply to next). It is an open question whether it is worth opening that door in order to be able to do extended handling on non-RFC conforming email (as opposed to just sanitizing it and soldiering on).

Well, maybe not. However, it is not obvious to me that you won't run into these issues again in Email6. Applications that think of email as textual objects are going to want to make their own choices about handling of non-conforming email, and it's likely to be massively inconvenient to say "OK, but you have to use bytes interfaces exclusively, because the str interfaces don't handle that."
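The three output choices named above (verbatim escaped bytes, a sanitized str, or an exception) can be put side by side in a small sketch. The helper name render_header and its mode argument are hypothetical and are not part of the email package or of the patch under discussion.

```python
def render_header(raw, mode="sanitize"):
    """Hypothetical helper showing three policies for 8-bit header bytes."""
    # Non-ASCII bytes become lone surrogates in the range U+DC80..U+DCFF.
    text = raw.decode("ascii", "surrogateescape")
    has_escapes = any("\udc80" <= ch <= "\udcff" for ch in text)
    if not has_escapes or mode == "escaped":
        # Clean input, or the caller asked for the escaped form verbatim.
        return text
    if mode == "sanitize":
        # Replace each escaped byte with '?', as the patch does for display.
        return "".join("?" if "\udc80" <= ch <= "\udcff" else ch for ch in text)
    if mode == "strict":
        raise ValueError("header contains undecodable 8-bit bytes")
    raise ValueError("unknown mode: %r" % (mode,))


print(render_header(b"caf\xe9"))                    # sanitized
print(ascii(render_header(b"caf\xe9", "escaped")))  # surrogate kept
```

Only the "escaped" form lets the application re-encode to the original bytes later; the sanitized form is lossy by design.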
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray writes: So the only parsing issue is if Mailman cares about *the non-ASCII bytes* in the headers it cares about. If it has to modify headers that contain non-ASCII bytes (for example, addresses and Subject) and cares about preserving the non-ASCII bytes, then there is indeed an issue; see previous email for a possible way around that. OK. I thought mailman no longer distributed its own version of email? I believe so; the point is that it could do so again. And the email API currently promises not to raise during parsing, which is a contract my patch does not change. Which is a contract that has historically been broken frequently. Unhandled UnicodeErrors have been one of the most common causes of queue stoppage in Mailman (exceeded only by configuration errors AFAICS). I haven't seen any reports for a while, but with the email package being reengineered from the ground up, the possibility of regression can't be ignored. Granted, there should be no regression problem in the current model for Email5, AIUI. We're (in the current patch) not punting on handling non-conforming email, we're punting on handling non-conforming bytes *if the headers that contain them need to be modified*. The headers can still be modified, you just (currently) lose the non-ASCII bytes in the process. Modified *or examined*. I can't think of any important applications offhand that *need* to examine the non-ASCII bytes (in particular, Mailman doesn't need to do that). Verbatim copying of the bytes themselves is almost always the desired usage. And robustness is not the issue, only extended-beyond-the-RFCs handling of non-conforming bytes would be an issue. And with that, I'm certain that Jon Postel is really dead. :-( (Surely you are not saying that Generator.flatten can't DTRT with non-ASCII content *at all*?) Yes, that is *exactly* what I am saying: m = email.message_from_string(\ ... From: pöstal ... ... 
) str(m) Traceback (most recent call last): UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128) But that's not interesting; you did that with Python 3. We want to know what people porting from Python 2 will expect. So, in 2.5.5 or 2.6.6 on Mac, with email v4.0.2, it *doesn't* raise, it returns wideload:~ 4:14$ python Python 2.5.5 (r255:77872, Jul 13 2010, 03:03:57) [GCC 4.0.1 (Apple Inc. build 5490)] on darwin Type help, copyright, credits or license for more information. import email m=email.message_from_string('From: pöstal\n\n') str(m) 'From nobody Thu Oct 7 04:18:25 2010\nFrom: p\xc3\xb6stal\n\n' m['From'] 'p\xc3\xb6stal' That's hardly helpful! Surely we can and should do better than that now, especially since UTF-8 (with a proper CTE) is now almost universally acceptable to MUAs. When would it be a problem for that to return 'From nobody Thu Oct 7 04:18:25 2010\nFrom: =?UTF-8?Q?p=C3=B6stal?=\n\n' Remember, email5 is a direct translation of email4, and email4 only handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the- -ride-fine-we'll-pass-then-along. So if you want to put non-ASCII data into a message you have to encode it properly to ASCII in exactly the same way that you did in email4: But if you do it right, then it will still work in a version that just encodes non-ASCII characters in UTF-8 with the appropriate CTE. Since you'll never be passing it non-ASCII characters, it's already ASCII and UTF-8, and no CTE will be needed. Yes, exactly. I need to fix the patch to recode using, say, quoted-printable in that case. It really should check for proportions of non-ASCII. QP would be horrible for Japanese or Chinese. DecodedGenerator could still produce the unicode, though, which is what I believe we want. (Although that raises the question of whether DecodedGenerator should also decode the RFC2047 encoded headersbut that raises a backward compatibility issue). Can't really help you there. 
While I would want the RFC 2047 headers decoded if I were writing new code (which is generally the case for me), I haven't really wrapped my head around the issues of porting old code using Python2 str to Python3 str here. My intuition says no problem (there won't be any MIME-words so the app won't try to decode them), but I'm not real sure of that. ;-)
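For reference, the "encode it properly to ASCII" step discussed in this exchange can be sketched with email.header, which exists in both email4 and email5. This is a minimal sketch of producing and reading back an RFC 2047 encoded word; it does not show what the patch itself does with unencoded input.

```python
from email.header import Header, decode_header, make_header

# Encode a non-ASCII header value as an RFC 2047 encoded word.
encoded = Header("p\u00f6stal", charset="utf-8").encode()

# The wire form is pure ASCII, so it is safe to pass through
# interfaces that expect ASCII-only strings.
is_ascii = all(ord(ch) < 128 for ch in encoded)

# Decoding the encoded word recovers the original text.
decoded = str(make_header(decode_header(encoded)))

print(encoded)
print(is_ascii, decoded == "p\u00f6stal")
```

In current Python the utf-8 charset picks whichever of the Q and B encodings is shorter for a given header, which speaks to the worry above about quoted-printable being a poor fit for Japanese or Chinese text.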
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull step...@xemacs.org wrote: R. David Murray writes: Only if the email package contains a coding error would the surrogates escape and cause problems for user code. I don't think it is reasonable to internalize surrogates that way; some applications *will* want to look at them and do something useful with them (delete them or replace them with U+FFFD or ...). However, I argue below that the presence of surrogates already means the user code is under fire, and this puts the problem in a canonical form so the user code can prepare for it (if that is desirable). Hang on here, this objection doesn't seem to quite mesh with what RDM is proposing (and the similar trick I am considering for urllib.parse). The basic issue is having an algorithm that is designed to operate on character data and depends on multiple ASCII constants stored as str objects. In Python 2.x, those algorithms could innately operate on str objects in any ASCII compatible encoding, as well as on unicode objects (due to the implicit promotion of the ASCII constants to unicode when unicode input was encountered). In Py3k, that trick broke. Now those algorithms only operate on str objects, and bytes input fails, even when it uses an ASCII compatible encoding. For urllib.parse, the external API will be str in - str out, bytes in - bytes out. Whether that is internally implemented by duplicating all the ASCII constants with both bytes and str flavours (as my current patch does), or implicitly (and temporarily) decoding the bytes values using ascii+surrogateescape or latin-1 (a pair of alternative approaches I plan to explore soon) should be completely transparent to the user of the API. If a user can easily tell which of these I am doing just through the external behaviour of the documented API, then I'll have made a mistake somewhere. My understanding is that email6 in 3.3 will essentially follow that same model. 
What I believe RDM is suggesting is an in-between approach for the 3.2 email module:

- if you pass in bytes data that isn't 7-bit clean and naively use the str APIs to access the headers, then it will complain loudly if it is about to return escaped data (but will decode the body in accordance with the Content Transfer Encoding)
- if you pass in bytes data and know what you are doing, then you can access that raw bytes data and do your own decoding

I've probably grossly oversimplified what RDM is suggesting, but it sounds plausible as a useful interim stepping stone to the more comprehensive type separation in email6.

Cheers, Nick.

-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
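The "str in, str out; bytes in, bytes out" model Nick describes, implemented by temporarily decoding with ascii plus surrogateescape, can be sketched as follows. The function split_field is a made-up example and is not taken from urllib.parse or from the email patch.

```python
def split_field(line):
    """Split 'Name: value' using one str-based algorithm for both types."""
    is_bytes = isinstance(line, (bytes, bytearray))
    # Bytes input is decoded so the str constants (':' here) still apply;
    # any non-ASCII byte rides along as a lone surrogate.
    text = line.decode("ascii", "surrogateescape") if is_bytes else line

    name, _, value = text.partition(":")
    result = (name.strip(), value.strip())

    if is_bytes:
        # Re-encode on the way out so the caller never sees the surrogates.
        return tuple(part.encode("ascii", "surrogateescape") for part in result)
    return result


print(split_field("Subject: hello"))
print(split_field(b"Subject: caf\xe9"))
```

From the outside the caller cannot tell whether the implementation duplicated its constants for bytes or decoded internally, which is the transparency property described above.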
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Tue, 05 Oct 2010 22:05:33 +1000, Nick Coghlan wrote: On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull step...@xemacs.org wrote: R. David Murray writes: Only if the email package contains a coding error would the surrogates escape and cause problems for user code. I don't think it is reasonable to internalize surrogates that way; some applications *will* want to look at them and do something useful with them (delete them or replace them with U+FFFD or ...). However, I argue below that the presence of surrogates already means the user code is under fire, and this puts the problem in a canonical form so the user code can prepare for it (if that is desirable). Hang on here, this objection doesn't seem to quite mesh with what RDM is proposing (and the similar trick I am considering for urllib.parse). [snip Nick's clear explanation of the issue and using surrogates to allow string-based algorithms to work] My understanding is that email6 in 3.3 will essentially follow that same model. What I believe RDM is suggesting is an in-between approach for the 3.2 email module: - if you pass in bytes data that isn't 7-bit clean and naively use the str APIs to access the headers, then it will complain loudly if it is about to return escaped data (but will decode the body in accordance with the Content Transfer Encoding) Almost correct. What it will do when it does not have the information needed to decode the bytes correctly (ie: the message is not RFC compliant) is to replace the unknown bytes with '?' characters. This means that you can render a dirty email to the terminal, for example, and the invalid bytes will show as '?'s.[*] - if you pass in bytes data and know what you are doing, then you can access that raw bytes data and do your own decoding With the current patch this is a true statement for message bodies, but not for message headers. 
There is no easy way to add access to the bytes version of headers to the email5 API, but since any such data would be non-RFC compliant anyway, that will just have to be good enough for now.

I've probably grossly oversimplified what RDM is suggesting, but it sounds plausible as a useful interim stepping stone to the more comprehensive type separation in email6.

The more I look at the patch the more I think this can be an internal implementation detail in email5 just like you might do for urllib. So the email5 API will have a way to put bytes in, a way to get decoded data out, and a way to get bytes out (except for individual header values). The model object will be the same no matter what you put in or take out.

The additional methods added to the email5 API to make this possible will be:

    message_from_bytes (and Parser.parsebytes)
    message_from_binary_file
    Feedparser.feedbytes
    BytesGenerator

message_from_bytes and message_from_binary_file are currently part of the proposed email6 API, and I was thinking about some version of Feedparser.feedbytes[**]. BytesGenerator wasn't, but now perhaps it will be (and certainly will be in the backward compatibility interface).

-- R. David Murray www.bitdance.com

[*] Why '?' and not the unicode invalid character character? Well, the email5 Generator.flatten can be used to generate data for transmission over the wire *if* the source is RFC compliant and 7bit-only, and this would be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects ASCII-only strings as input!). So the data generated by Generator.flatten should not include unicode...which raises a problem for CTE 8bit sections that the patch doesn't currently address.

[**] Benjamin asked how the patch would affect backward compatibility support in email6, and I said it wouldn't make it harder.
However, if feedbytes calls can be mixed with feed calls, which in the simplest implementation they could be, then if email6 does *not* use surrogates internally its feedparser algorithm would need to be considerably more complicated to be backward compatible with this. So when I add Feedparser.feedbytes to my patch, I am at least initially going to disallow mixing calls to feed and feedbytes. Which is another reason to add that method so as to keep the use of the surrogateescape an implementation detail.
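As a rough illustration of the "bytes in, bytes out" path listed above, here is a sketch using message_from_bytes and BytesGenerator as they exist in the standard library from Python 3.2 on. The sample message is made up, and the sketch says nothing about how the patch stores the data internally.

```python
import email
from io import BytesIO
from email.generator import BytesGenerator

# A non-conforming message: raw UTF-8 bytes in a header, no RFC 2047 encoding.
raw = b"From: p\xc3\xb6stal\nSubject: test\n\nbody\n"

# Bytes go in through the bytes-specific entry point.
msg = email.message_from_bytes(raw)

# Conforming headers are available as ordinary str values.
subject = msg["Subject"]

# Bytes come back out through BytesGenerator, with the
# non-ASCII header bytes preserved rather than replaced.
buf = BytesIO()
BytesGenerator(buf).flatten(msg)
out = buf.getvalue()

print(subject)
print(out)
```

This is the property the Mailman discussion relies on: an application that only copies or lightly edits a message can pass dirty headers through untouched as long as it uses the bytes generator on output.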
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
Nick Coghlan writes:

- if you pass in bytes data and know what you are doing, then you can access that raw bytes data and do your own decoding

At what level, though? To take an interesting example I used to see frequently:

From: t...@tokyo.jp (Taro Yamada in 8-bit Shift JIS)

So I guess you are suggesting that the email module can RFC 822 parse that, and

1. Refuse to return the unwrapped (ie, single line) form of the whole field, except as bytes.
2. Refuse to return the content of the From field, except as bytes.
3. Return the email address parsed from the From field.
4. Refuse to return the comment, except as bytes.

That's fine. But suppose I have a private or newly defined header that is structured? Now I have two choices:

1. Write a version of my private parser for both str (the normal case) and bytes (if accessing the value as str raises)
2. Always get the bytes and convert them to str (probably using the same .decode('ascii','surrogateescape') call that email uses but won't let me have the value of!), then use a common str parser. Note that this is more problematic than it looks, since the appropriate base codec may require information from higher-level structures (eg, qp codec tags or a Content-Type header's charset field).

Why should I reproduce email's logic here? I don't care if the default or concise API raises on surrogates in the str value. But I'm pretty sure that I will want to use str values containing surrogates in these contexts (for the same reasons that email module does, for example), rather than work with bytes sometimes and strs sometimes. Please provide a way to return strs-with-surrogates if I ask for them.
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On 10/2/2010 7:00 PM, R. David Murray wrote:
The clever hack (thanks ultimately to Martin) is to accept 8bit data by encoding it using the ASCII codec and the surrogateescape error handler.

I've seen this idea pop up in a number of threads. I worry that you are all inventing a new kind of dual that is a direct parallel to Python 2.x strings. That is to say,

3.x b = b'\xc2\xa1'
3.x s = b.decode('utf8')
3.x v = b.decode('ascii', 'surrogateescape')

where s and v should be the same thing in 3.x but they are not due to an encoding trick. I believe this trick generates more-or-less the same issues as strings did in 2.x:

2.x b = '\xc2\xa1'
2.x s = b.decode('utf8')
2.x v = b

Any reasonable 2.x code has to guard on str/unicode and it would seem in 3.x, if this idiom spreads, reasonable code will have to guard on surrogate escapes (which actually seems like a more expensive test). As in,

3.x print(v)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in position 0: surrogates not allowed

It seems like this hack is about making the 3.x unicode type more like the 2.x string type, and I thought we decided that was a bad idea. How will developers not have to ask themselves whether a given string is a real string or a byte sequence masquerading as a string? Am I missing something here?

-- Scott Dial sc...@scottdial.com scod...@cs.indiana.edu
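The difference between s and v in Scott's example can be reproduced directly. The sketch below only restates his three lines and adds the round trip back to bytes; nothing in it depends on the email package.

```python
b = b"\xc2\xa1"

# A real decode: the two bytes are one character, U+00A1.
s = b.decode("utf-8")

# The surrogateescape trick: each undecodable byte becomes one lone surrogate.
v = b.decode("ascii", "surrogateescape")

print(ascii(s))  # '\xa1'
print(ascii(v))  # '\udcc2\udca1'

# The escaped form is lossless: it encodes back to the original bytes.
roundtrip = v.encode("ascii", "surrogateescape")

# But it is not valid text, so a strict encoder refuses it.
try:
    v.encode("utf-8")
    refused = False
except UnicodeEncodeError as exc:
    refused = True
    print("refused:", exc.reason)
```

So v is best read as "bytes in disguise" rather than as text, which is the distinction the rest of the thread argues about.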
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Mon, 04 Oct 2010 12:32:26 -0400, Scott Dial scott+python-...@scottdial.com wrote:
> On 10/2/2010 7:00 PM, R. David Murray wrote:
>> The clever hack (thanks ultimately to Martin) is to accept 8bit data by encoding it using the ASCII codec and the surrogateescape error handler.
>
> I've seen this idea pop up in a number of threads. I worry that you are all inventing a new kind of dual that is a direct parallel to Python 2.x strings.

Yes, that is exactly my worry.

> That is to say,
>
>     3.x b = b'\xc2\xa1'
>     3.x s = b.decode('utf8')
>     3.x v = b.decode('ascii', 'surrogateescape')
>
> where s and v should be the same thing in 3.x but they are not due to an encoding trick.

Why should they be the same thing in 3.x? One is an ASCII string with some escaped bytes in an unknown encoding, the other is a valid unicode string. The surrogateescape trick is used only when we don't *know* the encoding (a priori) of the bytes in question.

> I believe this trick generates more-or-less the same issues as strings did in 2.x:
>
>     2.x b = '\xc2\xa1'
>     2.x s = b.decode('utf8')
>     2.x v = b

The difference is that in 2.x people could and would operate on strings as if they knew the encoding, and get in trouble. In 3.x you can't do that. If you've got escaped bytes you *know* that you don't know the encoding, and the program can't get around that except by re-encoding to bytes and properly decoding them.

> Any reasonable 2.x code has to guard on str/unicode and it would seem in 3.x, if this idiom spreads, reasonable code will have to guard on surrogate escapes (which actually seems like a more expensive test). As in,
>
>     3.x print(v)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in position 0: surrogates not allowed

Right, I mentioned that concern in my post. In this case at least, however, the *goal* is that the surrogates are never seen outside the email internals.

In reflection of this, my latest thought is that I should add a 'message_from_binary_file' helper method and a 'feedbytes' method to feedparser, making the surrogates a 100% internal implementation detail[*]. Only if the email package contains a coding error would the surrogates escape and cause problems for user code.

> It seems like this hack is about making the 3.x unicode type more like the 2.x string type, and I thought we decided that was a bad idea. How will developers not have to ask themselves whether a given string is a real string or a byte sequence masquerading as a string? Am I missing something here?

I think this question is something that needs to be considered any time using surrogates is proposed. I hope that in the email package proposal I've addressed it. What do you think?

--David

[*] And you are right that there is a performance concern as a result of needing to detect surrogates at various points in the code.
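The behaviour being argued over in this message can be reproduced in a few lines. The following is only a sketch for readers of the archive, not code from the patch; the helper name `has_escaped_bytes` is hypothetical, and the U+DC80..U+DCFF range is the one the surrogateescape handler maps non-ASCII bytes into.

```python
raw = b'\xc2\xa1'                            # UTF-8 encoding of U+00A1

s = raw.decode('utf-8')                      # a real one-character string
v = raw.decode('ascii', 'surrogateescape')   # two lone surrogates, one per byte


def has_escaped_bytes(text):
    """Hypothetical guard: True if text carries surrogateescape'd bytes."""
    return any('\udc80' <= ch <= '\udcff' for ch in text)


# A strict encode refuses the lone surrogates...
try:
    v.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# ...but encoding with the same handler recovers the original bytes.
recovered = v.encode('ascii', 'surrogateescape')
```

Here `s` and `v` differ, as Scott observes, but `v` is recognisably "bytes of unknown encoding" and converts back to `raw` without loss, which is the property David relies on.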
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Oct 02, 2010, at 07:00 PM, R. David Murray wrote:
> The advantage of this patch is that it means Python3.2 can have an email module that is capable of handling a significant proportion of the applications where the ability to process binary email data is required.

Like others, I'm concerned that we're perpetuating the Python 2 problems with bytes vs. strings. OTOH, I went down a similar road (though much more hacky and less successful) in one of my failed branches, so I sympathize with this nod to practicality that actually works. If the choice is the current brokenness staying in Python 3.2 or this hack being added for now, I'd go with the latter. email6 will make it all better, right? :) In the meantime, I do think it would be good to give our users something that's practical.

> I've uploaded the patch to issue 4661 (http://bugs.python.org/issue4661). I uploaded it to rietveld as well just before Martin's announcement. After the announcement I uploaded the svn patch to the tracker, so hopefully there will be an automated review button as well. Here is your chance to exercise the new review tools :)

I see no automatically generated link to the review, but I did add some comments to the Rietveld issue you linked to in one of your comments.

-Barry
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray writes:
> On Mon, 04 Oct 2010 12:32:26 -0400, Scott Dial scott+python-...@scottdial.com wrote:
>> On 10/2/2010 7:00 PM, R. David Murray wrote:
>>> The clever hack (thanks ultimately to Martin) is to accept 8bit data by encoding it using the ASCII codec and the surrogateescape error handler.
>>
>> I've seen this idea pop up in a number of threads. I worry that you are all inventing a new kind of dual that is a direct parallel to Python 2.x strings.
>
> Yes, that is exactly my worry.

I don't worry about this. Strings generated by decoding with surrogate-escape are *different* from other strings: they contain invalid code units (the naked surrogates). These cannot be encoded except with a surrogate-escape flag to .encode(), and a sane developer won't do that unless she knows precisely what she's doing. This is not true with Python 2 strings, where all bytes are valid.

>> Any reasonable 2.x code has to guard on str/unicode and it would seem in 3.x, if this idiom spreads, reasonable code will have to guard on surrogate escapes (which actually seems like a more expensive test).
>
> Right, I mentioned that concern in my post.

Again, I don't worry about this. It is *not* an *extra* cost. Those messages are *already broken*, they *will* crash the email module if you fail to guard against them. Decoding them to surrogates actually makes it easier to guard, because you know that even if broken encodings are present, the parser will still work. Broken encodings can no longer crash the parser. That is a Very Good Thing IMHO.

> Only if the email package contains a coding error would the surrogates escape and cause problems for user code.

I don't think it is reasonable to internalize surrogates that way; some applications *will* want to look at them and do something useful with them (delete them or replace them with U+FFFD or ...). However, I argue below that the presence of surrogates already means the user code is under fire, and this puts the problem in a canonical form so the user code can prepare for it (if that is desirable).

>> It seems like this hack is about making the 3.x unicode type more like the 2.x string type,

Not at all. It's about letting the parser be a parser, and letting the application handle broken content, or discard it, or whatever. Modularity is improved. This has been a major PITA for Mailman support over the years: every time the spammers and virus writers come up with a new idea, there's a chance it will leak out and the email parser will explode, stopping the show. These kinds of errors are a FAQ on the Mailman lists (although much less so in recent years).

>> How will developers not have to ask themselves whether a given string is a real string or a byte sequence masquerading as a string? Am I missing something here?

There are two things to say, actually. First, you're in a war zone. *All* email is byte sequences masquerading as text, and if you're not wearing armor, you're going to get burned. The idea here is to have the email package provide the armor and enough instrumentation so you can do bomb detection yourself (or perhaps just let it blow, if you're hacking up a quick and dirty script).

Second, there are developers who will not care whether strings are real or byte sequences in drag, because they're writing MTAs and the like. Those people get really upset, and rightly so, when the parser pukes on broken headers; it is not their app's job at all to deal with that breakage.

> I think this question is something that needs to be considered any time using surrogates is proposed.

I don't agree. The presence of naked surrogates is *always* (assuming sane programmers) an indication of invalid input. The question is, should the parser signal invalidity, or should it allow the application to decide? The email module *doesn't have enough information to decide* whether the invalid input is a real problem, or how to handle it (cf the example of a MTA app).

Note that a completely naive app doesn't care -- it will crash either way because it doesn't handle the exception, whether it's raised by the parser or by a codec when the app tries to do I/O. A robust app *does* care: if the parser raises, then the app must provide an alternative parser good enough to find and fix the invalid bytes. Clearly it's much better to pass invalid (but fully parsed) text back to the app in this case.

Note that if the app really wants the parser to raise rather than pass on the input, that should be easy to implement at fairly low cost; you just provide a variable rather than hardcoding the surrogate-escape flag.
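Two of the points in this message lend themselves to a short illustration: an application replacing naked surrogates with U+FFFD, and a decoding step whose error handler is a parameter rather than hardcoded. This is a sketch under assumptions, not the email package's code; `scrub` and `decode_wire_text` are hypothetical names invented here.

```python
import re

# Lone surrogates produced by the surrogateescape handler (U+DC80..U+DCFF).
_ESCAPED = re.compile('[\udc80-\udcff]')


def scrub(text, replacement='\ufffd'):
    """Hypothetical app-side policy: replace escaped bytes with U+FFFD."""
    return _ESCAPED.sub(replacement, text)


def decode_wire_text(raw, errors='surrogateescape'):
    """Hypothetical parser step: the error handler is a variable, so an
    application that prefers an exception can pass errors='strict'."""
    return raw.decode('ascii', errors)


broken = b'Subject: caf\xe9'

lenient = decode_wire_text(broken)           # parses, keeps the odd byte
cleaned = scrub(lenient)                     # application decides what to do

try:
    decode_wire_text(broken, errors='strict')
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

With the lenient default the parsing step never fails on 8-bit input, and the decision to delete, replace, or pass through the invalid data stays with the caller.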
[Python-Dev] Patch making the current email package (mostly) support bytes
A while back on some issue or another I remember telling someone that if there was any sort of clever hack that would allow the current email package (email5) to work with bytes we would have implemented it. Well, I've come up with a clever hack.

The idea came out of a conversation with Antoine. I was saying that it was ironic that Unicode could only be used as a 7bit-clean data transmission channel for email, and he remarked that by using surrogate escape you *could* use unicode as a transmission channel for 8bit data. At first I dismissed this observation as irrelevant to email, since email has to transform the 8bit data at some point. But I started thinking. And then I started experimenting. And it turns out that it works.

The clever hack (thanks ultimately to Martin) is to accept 8bit data by encoding it using the ASCII codec and the surrogateescape error handler. Then, inside the email module at any point where bytes might be meaningful or might be about to escape, it can check to see if there are any surrogates and act accordingly.

The API additions are few, and in fact for most programs (he says bravely, not really knowing) there are really only two changes you need to make when converting a program that handles bytes data to py3k. The first is the encoding of binary input data as mentioned. The second is that when you want to get the bytes back out, you use the new BytesGenerator instead of Generator. BytesGenerator is just like Generator except that it writes bytes to its file argument instead of strings, and it recovers any bytes that were in the original input. So given this sequence:

    msg = email.message_from_file(open('myfile', encoding='ascii', errors='surrogateescape'))
    email.generator.BytesGenerator(open('myfile2', 'wb')).flatten(msg)

myfile and myfile2 will theoretically be identical (modulo universal newline and _mangle_from issues). I've additionally added a 'message_from_bytes' convenience function.
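The round trip described above can be tried in memory with the names that ship in the standard library today (`email.message_from_bytes` and `email.generator.BytesGenerator`). The sample message below is invented for illustration, and passing `mangle_from_=False` is a choice made here to sidestep the _mangle_from caveat; treat this as a sketch rather than as the patch's own test.

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# An invented message with an 8bit body in a declared charset.
raw = (b"From: sender@example.com\n"
       b"Subject: test\n"
       b"Content-Type: text/plain; charset=latin-1\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"caf\xe9\n")

# Parse bytes; non-ASCII bytes are carried internally as surrogates.
msg = email.message_from_bytes(raw)

# Serialize back to bytes; unmodified content is written out unchanged.
out = BytesIO()
BytesGenerator(out, mangle_from_=False).flatten(msg)
roundtrip = out.getvalue()
```

For a short, unmodified message like this one, `roundtrip` should match `raw` byte for byte; long headers may be refolded, which is one reason the claim above is hedged with "theoretically".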
One nice feature of this patch is that once you've got the model built from surrogateescaped input, if you do a get_payload() on a message body whose Content-Transfer-Encoding is '8bit' you will get the body decoded to unicode using the charset declared in the Content-Type header (assuming Python supports that charset). You can always get at the bytes version of the body of a message part by using get_payload(decode=True) [*]. You can't really get at the bytes version of message headers, though...for safety if you access a header whose value contains non-ASCII chars (that aren't RFC2047 encoded to be ASCII) the 8bit characters get replaced with '?'s. (But BytesGenerator will emit the original 8bit characters if the headers haven't been modified.)

I do not propose that this is a *good* API, since it has the classic problem that if there are coding bugs in the email module strings may escape that have surrogates in them and we end up with programs that work most of the time...except when they fail with mysterious errors because of unusual bytes input data. On the other hand you always *know* when you have bytes data in an unknown encoding (because they are surrogate escaped), so it is ever so much better than the Python2 situation.

The advantage of this patch is that it means Python3.2 can have an email module that is capable of handling a significant proportion of the applications where the ability to process binary email data is required.

I've uploaded the patch to issue 4661 (http://bugs.python.org/issue4661). I uploaded it to rietveld as well just before Martin's announcement. After the announcement I uploaded the svn patch to the tracker, so hopefully there will be an automated review button as well. Here is your chance to exercise the new review tools :)

This patch does break two of Barry's patch-for-review rules: it is more than 800 lines of diff (but not a lot more, and less than 800 if you count only code diff and not docs), and it did not have a very extensive design discussion beforehand. I did talk with people on IRC, particularly Barry, before finishing the patch, and I did post a summary to the email-sig mailing list (but got no response). Now it is time to see what the wider community thinks. There is some question of whether this is a bending of the string/bytes separation that doesn't belong as part of the standard library, but after working my way through it I think it is a fairly clean hack[**], and most likely a case where practicality beats purity.

Regardless of whether or not this patch or a descendant thereof is accepted I still intend to continue working on email6. There are many other bugs in the current email package that require a rewrite of parts of its infrastructure, and the email-sig is agreed that the email API needs revision quite apart from the bytes/string issues. However, there is something pleasing about the simplicity of this way of handling bytes that
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
2010/10/2 R. David Murray rdmur...@bitdance.com:
> Regardless of whether or not this patch or a descendant thereof is accepted I still intend to continue working on email6. There are many other bugs in the current email package that require a rewrite of parts of its infrastructure, and the email-sig is agreed that the email API needs revision quite apart from the bytes/string issues. However, there is something pleasing about the simplicity of this way of handling bytes that I intend to consider carefully while we work further on email6.

And how would this addition interact with changes in email6?

-- Regards, Benjamin
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Sat, 02 Oct 2010 19:15:57 -0500, Benjamin Peterson benja...@python.org wrote:
> 2010/10/2 R. David Murray rdmur...@bitdance.com:
>> Regardless of whether or not this patch or a descendant thereof is accepted I still intend to continue working on email6. There are many other bugs in the current email package that require a rewrite of parts of its infrastructure, and the email-sig is agreed that the email API needs revision quite apart from the bytes/string issues. However, there is something pleasing about the simplicity of this way of handling bytes that I intend to consider carefully while we work further on email6.
>
> And how would this addition interact with changes in email6?

It will be no harder to do the backward compatibility support for this than for the rest of the email5 API, if that's what you are asking. Assuming my plan for backward compatibility works at all (which it should).

--David
Re: [Python-Dev] Patch making the current email package (mostly) support bytes
On Sun, Oct 3, 2010 at 9:00 AM, R. David Murray rdmur...@bitdance.com wrote:
> I do not propose that this is a *good* API, since it has the classic problem that if there are coding bugs in the email module strings may escape that have surrogates in them and we end up with programs that work most of the time...except when they fail with mysterious errors because of unusual bytes input data. On the other hand you always *know* when you have bytes data in an unknown encoding (because they are surrogate escaped), so it is ever so much better than the Python2 situation.

It's a similar concept to one Antoine and I (and some others) have been considering in the tracker for making urllib.parse able to handle ASCII-compatible bytes-encodings. I've already implemented a version of that patch which has parallel bytes and str versions of all the ASCII constants, and the result is pretty ugly. My next goal is to implement a version that uses the same trick you have here for email and see how the code complexity compares.

We do need to tread carefully to make sure the pseudo strings don't escape, but the other approach requires similar care all the way through the internal algorithms to make sure they aren't assuming bytes or str instances anywhere.

Cheers, Nick.
-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
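The shape of the approach Nick describes, a single str-based algorithm with bytes decoded on the way in and re-encoded on the way out, might look roughly like this. It is a toy sketch and not the urllib.parse patch: `split_query` and `_split_query_str` are hypothetical names, and the splitting logic is deliberately trivial.

```python
def _split_query_str(url):
    """The one real algorithm, written against str only."""
    path, _, query = url.partition('?')
    return path, query


def split_query(url):
    """Hypothetical wrapper accepting str or ASCII-compatible bytes.

    Bytes are decoded with surrogateescape, run through the str
    algorithm, and re-encoded so the pseudo strings never escape.
    """
    if isinstance(url, bytes):
        text = url.decode('ascii', 'surrogateescape')
        return tuple(part.encode('ascii', 'surrogateescape')
                     for part in _split_query_str(text))
    return _split_query_str(url)


bytes_result = split_query(b'/caf\xe9?q=1')
str_result = split_query('/docs?page=2')
```

The care Nick mentions shows up at the boundary: every value derived from the decoded text has to be re-encoded before it is returned, otherwise a string holding surrogates leaks to the caller.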