Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-13 Thread Stephen J. Turnbull

Steven D'Aprano writes:

  I don't think anyone has ever suggested change for change's sake. If 
  they have, I'd love to read the PEP for it.

Not to mention the BDFL's pronouncement message! <wink>
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-12 Thread lutz

ba...@python.org wrote in the full post below:
 I'm reminded of a survey Guido conducted at some long past
 Python conference.  He asked (paraphrasing): raise your hand 
 if you think Python is changing too fast.  Lots of hands went
 up.  Then he asked, raise your hand if you have a feature you 
 want to get in the next version.  Lots of hands went up.

When?  I doubt that you'd get the same reaction today given 
the schism that 3.X has created.  Regardless, this underscores
much of what I'm trying to get across here.  Python conference
attendees are hardly representative of the user base at large.
Even today, they are probably just 0.1% of the whole.  This 
list's readership is an order of magnitude smaller still.
Open doesn't mean all that much to those outside the 0.01%
whose preferences set the agenda.

I appreciate that some people here do indeed weigh compatibility
carefully, and realize that there are multiple valid viewpoints
on this issue.  And regrettably, I have neither solutions nor
time to give this thread the further attention it deserves.

So my point is just this: Change for change's sake is truly not
what most Python users want.  If Python core developers want 3.X
to become as popular as 2.X, they should be less concerned with 
posts on this list or hands at a conference, than with the feet
of the masses whose votes will ultimately decide 3.X's fate.

--Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)



 Date: Fri, 8 Oct 2010 14:20:32 -0400
 From: Barry Warsaw ba...@python.org
 To: python-dev@python.org
 Subject: Re: [Python-Dev] Patch making the current email package (mostly) 
 support bytes
 
 On Oct 08, 2010, at 03:44 PM, l...@rmi.net wrote:
 
 Ultimately, development in the open source world is driven by the 
 very few with time to show up, rather than by the very many who 
 depend on it.  This can unfortunately lead to the perception
 of thrashing by end users.  Some even come to see the net effect 
 as not that much different from closed models.  I have no solution
 to offer, except to underscore again that changes made here affect 
 very many people who are too busy using Python to participate here.  
 Especially given the still tentative state of 3.X, stability matters.
 
 I'm reminded of a survey Guido conducted at some long past Python conference.
 He asked (paraphrasing): raise your hand if you think Python is changing too
 fast.  Lots of hands went up.  Then he asked, raise your hand if you have a
 feature you want to get in the next version.  Lots of hands went up.
 
 I'm sympathetic to the view that changes in Python can be disruptive to end
 users.  The Python community itself takes this seriously too, as evidenced by
 the language moratorium[*].  But OTOH, Python cannot stagnate and even fixing
 things means changing things.  The reality too is that Python releases come
 out approximately every 18 months, and a year and a half can either seem like
 an excruciatingly long time, or a blink of the eye depending on which side of
 the fence you stand on.
  
 Yes, stability matters, but Python 3 is still a new snakeling and I suspect
 that as the pace of porting picks up, more changes will be necessary.  Adding
 new modules named like distutils2 or unittest2 is less than satisfying but
 useful for keeping older APIs around.
 
 I'm sad to hear that some people think that our development model differs
 little from closed source development.  To me, nothing could be further from
 the truth.  But the adage does go "(s)he who does the work, decides", and this
 is the forum for those who are doing the work.  I think everyone here welcomes
 advocates for under-represented Python communities, and their concerns should
 be taken into consideration when changes are discussed.  But ultimately, Python
 must evolve to stay relevant or it will die.  This is where competing design
 trade-offs must be discussed.  If not here, by us, then where and by whom?
 
 -Barry
 
 [*] Mostly instituted to allow alternative implementations to catch up, it
 does necessarily slow the pace of changes visible to end users.




Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-12 Thread Steven D'Aprano

On Wed, 13 Oct 2010 03:01:57 am l...@rmi.net wrote:
 So my point is just this: Change for change's sake is truly not
 what most Python users want.  If Python core developers want 3.X
 to become as popular as 2.X, they should be less concerned with
 posts on this list or hands at a conference, than with the feet
 of the masses whose votes will ultimately decide 3.X's fate.

I don't think anyone has ever suggested change for change's sake. If 
they have, I'd love to read the PEP for it.



-- 
Steven D'Aprano

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread Barry Warsaw

On Oct 08, 2010, at 12:37 PM, Stephen J. Turnbull wrote:

Ouch.  RFC 822 line wrapping is a bytes-bytes transformation, and the
client shouldn't see it at all unless it inspects the wire format.

Header wrapping sucks even more because it's supposed to take the semantic
context into account, which means that a generic Header wrapping algorithm
cannot work for everything.  E.g. Received: headers are supposed to wrap after
the semicolon.  The current email package does a pretty poor job of emulating
this requirement, though it often gets it right enough.  David has plans for
addressing this problem.

-Barry



Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread Stephen J. Turnbull

Barry Warsaw writes:

  Header wrapping sucks even more because it's supposed to take the
  semantic context into account, which means that a generic Header
  wrapping algorithm cannot work for everything.  E.g. Received:
  headers are supposed to wrap after the semicolon.

Received headers are an easy special case:

An Internet mail program MUST NOT change or delete a Received:
line that was previously added to the message header section.
(RFC 5321, sec. 4.4)

So you save them as bytes and Barry's your FLUFL, as they say.

If email wants to *produce* them (as a service to say smtplib), then
it wants to comply with the detailed recommendations in RFC 5321,
sec. 4.4 anyway; I don't think there's a good reason to treat Received
headers as text since they're conceptually part of the wire protocol.
(Except for the information of curious users, but then getting it
exactly right is best done by just passing the whole thing, folds and
all, to .decode('ascii'), I should think.)
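A minimal sketch of the "save them as bytes" idea above: the Received header is kept verbatim, folds and all, and only decoded as ASCII when a curious user wants to look at it. The header content and the variable names are made up for illustration.

```python
# Hypothetical Received header exactly as it arrived on the wire, folds included.
raw_received = (
    b"Received: from mail.example.org (mail.example.org [192.0.2.1])\r\n"
    b"\tby mx.example.net with ESMTP id abc123;\r\n"
    b"\tFri, 8 Oct 2010 14:20:32 -0400\r\n"
)

# Store the bytes untouched, so nothing is changed or deleted.
stored = raw_received

# For display only: pass the whole thing, folds and all, to .decode('ascii').
display = stored.decode("ascii")

# Because no unfolding or rewrapping happened, the round trip is exact.
round_tripped = display.encode("ascii")
```

Since the stored value is never parsed or refolded, the exact-preservation requirement is met without any header-specific logic.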

I should think you *want* addresses and suchlike structured headers
(Content-Type with several RFC 2231 parameters, anyone?) to line up
nicely, too.  So generic folding algorithms are really only applicable
to unstructured text fields like Subject and Summary anyway.

You can call that sucky if you like, I prefer to call it tasteful.
<wink>


Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread lutz

Thanks for both your reply and work, David.  I'm going to have
to test my email clients under the 3.2 patch when it gels.  It's
good to hear that email5 API support remains a goal.

I don't mean to single out this change unfairly, of course.  My 
real concern is not as much with the specific technical aspects 
of this proposal, as with the generally low priority that backward 
compatibility sometimes receives on this list.  The bytecode file 
model change in 3.2 comes to mind as another example; sound as it 
may be, I'm not sure this list has any idea how many users, systems,
or docs may be impacted by this.  Though not always true, the work 
here does sometimes appear to be conducted in a vacuum.

Ultimately, development in the open source world is driven by the 
very few with time to show up, rather than by the very many who 
depend on it.  This can unfortunately lead to the perception
of thrashing by end users.  Some even come to see the net effect 
as not that much different from closed models.  I have no solution
to offer, except to underscore again that changes made here affect 
very many people who are too busy using Python to participate here.  
Especially given the still tentative state of 3.X, stability matters.

--Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)


 -Original Message-
 From: R. David Murray rdmur...@bitdance.com
 To: l...@rmi.net
 Subject: Re: [Python-Dev] Patch making the current email package (mostly) 
 support bytes
 Date: Thu, 07 Oct 2010 13:46:02 -0400
 
 On Thu, 07 Oct 2010 16:03:18 -, l...@rmi.net wrote:
  I'm forwarding a link to the code of these clients to David by 
  private email in case they might be useful as a test case (O'Reilly
  has already posted them ahead of the book, but they may be a bit too
  heavy for use in formal testing).
 
 Thanks very much.  I will take a look, and expect they will
 be helpful.
 
  The email package is obviously less than ideal today, and there are
  many other clients for it besides my own, of course.  But making it 
  backward incompatible at this point is likely to be seen as a big 
  negative to newcomers evaluating 3.X viability.  And as I tried to 
  make clear in June, this list should carefully weigh the PR cost of 
  pulling the rug out from under those brave souls who have already 
  taken the time to accommodate the 3.X world you've mandated.
 
 Well, as I have said before the plan is to provide backward compatibility
 in email6, so that you only need to change your code if you want to
 take advantage of improved or new functionality.  If this turns out not
 to be possible for some reason, then we aren't going to suddenly stop
 supporting email5.  That's not the Python Way :)  (Example: we added
 argparse post-3.0, and lots of people wanted to deprecate optparse,
 but we aren't planning on removing optparse.)
 
 Do you see any issues with the patch I'm proposing?  My goal is to make
 things work that didn't work before, but nothing that worked before
 should stop working, if I do my job right.
 
 The one *potentially* backward-incompatible change that I'm consciously
 considering (that is, any other backward incompatibilities will be bugs)
 is having DecodedGenerator fully decode headers and emit full unicode,
 rather than the ASCII-only unicode that Generator emits.  Can you think
 of any problem that that would cause?  A quick grep indicates your own
 code does not use that generator (possibly because currently it does not
 do that decoding).  I could, of course, only enable header decoding if
 a flag is passed requesting it, and as I write this I realize that that
 is indeed what I should do.  Even though I haven't been able to think of a
 case where DecodedGenerator producing non-ASCII unicode would be an issue,
 that doesn't mean there isn't one :)
 
  To put that more strongly, the Python user base is much larger than 
  this list's readership.  If I'm using 3.1 email, so are many others.
  People will accept the 3.X world you make up to a point, but it's 
  impossible to code to a moving target, much less base a product on 
  it.  At some point, they'll simply stop trying to keep up; in fact, 
  some already have.
 
  Fixes are a Good Thing, of course, and this particular change's scope
  remains to be seen; but to channel most of the users I meet out there
  in the real world today: Enough with the 3.X changes already, eh?
 
 Now that Python3 is out, the backward compatibility policy for it is
 the same as it always was for Python2.  Only the transition from 2
 to 3 broke backward compatibility in a significant way.  From here
 on, we are as conservative as we always have been at making backward
 incompatible changes (that is, we don't do it intentionally without
 a good reason and a deprecation cycle, and if we do it unintentionally
 it is a regression and treated as such).
 
 --
 R. David Murray  www.bitdance.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread lutz

step...@xemacs.org wrote in the full message below:
 If having 1 *and* 2 is so important to particular users, but they come
 into conflict because of proposed changes in Python, then they're
 going to have to give up 3, come here, and articulate their needs. 

But I _did_ come here and articulate my needs, and received this
antagonistic response for my efforts.  If you really value user
input, you may want to explore the nature of your reaction to it.
Trust me: criticism goes with the territory any time your actions
impact a large group of people.  This seems inherent here.

Frankly, your view of the roles of developers and users seems so 
upside down to me that I doubt anything I could say here would
matter.  You're more than welcome to ignore an interjection of 
reality and adopt a closed group mindset, of course, but you do
so at the peril of the system you're working on.

For my part, one week from now I'll be standing up again in front 
of a group of 20 Python beginners, and basically apologizing for 
both the present and ongoing 3.X changes they must conform to in 
the near future.  Python may not be Perl 6 yet, but its image is 
already tarnished in the real world where people make technology 
choices, due to its rapid pace of change.  It's a genuine problem.

In the end, I suppose I'm just one of those lazy end users you
mentioned who are too busy to spend 24/7 hanging out on this 
list in order to head off changes that will break their code. 
(Yes, sarcasm intended.)

--Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)


 -Original Message-
 From: Stephen J. Turnbull step...@xemacs.org
 To: l...@rmi.net
 Subject: Re: [Python-Dev] Patch making the current email package
   (mostly)support bytes
 Date: Fri, 08 Oct 2010 14:33:22 +0900
 
 l...@rmi.net writes:
 
   To put that more strongly, the Python user base is much larger than 
   this list's readership.
 
 Agreed.  Nevertheless, this is the channel (not "a channel") that the
 developers listen on, and substantial effort is made to let Python
 users know that.  I think they do know it, too.
 
   If I'm using 3.1 email, so are many others.
 
 That's not obvious.  3.1 email is unusable for several applications.
 In fact, for human factors reasons (humans are very likely to
 communicate with other humans who use the same encodings, and to
 accept occasional glitches they must deal with manually), MUAs are
 likely to port relatively easily as good enough software.  But I
 doubt very much that folks writing MTAs or spam filters that must run
 unattended, often in long-lived, very active processes, are producing
 production versions using Python 3 email yet.
 
   People will accept the 3.X world you make up to a point, but it's 
   impossible to code to a moving target, much less base a product on 
   it.
 
 Impossible is nothing.  It's a decision that each individual
 developer makes for herself.  I haven't heard Mailman devs complain
 about the impossibility of dealing with the proposed changes, for
 example.  Quite the reverse, in fact.
 
   At some point, they'll simply stop trying to keep up; in fact, 
   some already have.
 
 Predictable and predicted.  Where's the balance?  I don't know, but
 channeling the users is not a lot of help.  There are three worthy
 goals here:
 
 1. Taking advantage of improvements in to-be-released Pythons.
 2. Not changing one's own working code.
 3. Not participating in python-dev/email-sig.
 
 Take any two; one can't have all three.
 
 More specifically, it's interesting that most of the users you talk to
 care enough to actually say they don't want more incompatible changes.
 But what are we supposed to take from that?  Some fixes have to be
 incompatible; do the users want the fix or the compatibility?  You
 waffle (as a good representative often must):
 
   Fixes are a Good Thing, of course, and this particular change's scope
   remains to be seen; but to channel most of the users I meet out there
   in the real world today: Enough with the 3.X changes already, eh?
 
 But that's also a decision each developer *can* make for himself.
 Python does not withdraw products, or even withdraw support, just
 because the core developers release something they consider better.
 
 If having 1 *and* 2 is so important to particular users, but they come
 into conflict because of proposed changes in Python, then they're
 going to have to give up 3, come here, and articulate their needs.  As
 you are doing -- but to have real influence, you're going to have to
 do the review of David's patch that he requests.
 
 I really don't see how the process can work any other way.
 




Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread Stephen J. Turnbull

Barry Warsaw writes:
  On Oct 07, 2010, at 04:40 AM, Stephen J. Turnbull wrote:

  I'm fairly certain that most of the modern causes of [Unicode
  errors in Mailman] are post-parse modifications of the message.
  IOW, in Mailman's architecture, we try to parse the raw data into a
  Message object tree very early in the pipeline, and then a pickled
  version of that gets passed between the queue runners.
  
  Where we've gotten into trouble before has been things like adding
  the Subject prefixes and such.

Not to mention those wonderful unremovable addresses containing TAB
etc.

But I'm pretty sure I've seen reports at least in 2.1.9, and probably
more recently than that, where there was 8-bit content in a header of
the incoming message and Mailman blew up on that.  This is stuff that
should have been shunted explicitly, but instead managed to get out of
the parser and then blow up.  I don't think the errors I'm thinking
about were due to Mailman manipulations, but rather insufficient
paranoia in handling incoming hazmat.

  That seems like application logic that the email package can't
  really get involved with, and indeed Mailman has built up a raft of
  defense for failures of this kind.

But adding Subject prefixes and the like shouldn't be a problem as
long as the internal representation of each message object (bytes vs
str) is fixed and the representation is opaque, so that the module can
do appropriate conversions when necessary.  The problem that you face
in Python 2 is that that separation is not properly made, and the same
values in the message object can often serve as text and as wire
format, and it's hard to tell which is which.   The Unicode handling
is tacked on as an afterthought.

That mess is entirely unnecessary in Python 3.  Text and wire format
can be easily distinguished with three different representations of
email: Unicode for the conceptual RFC 822 layer (of course this is an
extension, because RFC 822 itself is strictly limited to the ASCII
subset), bytes for wire format, and Message objects for modern
structured mail (including MIME, etc).
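A rough illustration of those three representations, using the standard library's email package as it exists in later Python 3 releases (`email.message_from_bytes` and `Message.as_bytes` are assumed to be available; the message content and the list prefix are invented for the example):

```python
import email

# Wire format: bytes, as read from a socket or a mailbox file.
wire = b"From: a@example.com\r\nSubject: Hello\r\n\r\nBody text\r\n"

# Structured representation: a Message object tree.
msg = email.message_from_bytes(wire)

# Text layer: header values come back as str and mix with other strings.
subject = msg["Subject"]
prefixed = "[list] " + subject

# Replace the header (assigning alone would add a second Subject header).
del msg["Subject"]
msg["Subject"] = prefixed

# Back to wire format for transmission.
out = msg.as_bytes()
```

Adding a Subject prefix here is plain str concatenation; the conversion to and from bytes happens only at the parse and serialize boundaries.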

*If* email6 is reengineered with that kind of structure, then you
should be able to dispense with almost all of the raft of defense,
because the email module will give you well-behaved Message objects,
whose text components (including the header) are well-behaved
character strings that mix seamlessly with other character strings.
Maybe even in email5 

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread R. David Murray

On Fri, 08 Oct 2010 15:51:45 -, l...@rmi.net wrote:
 For my part, one week from now I'll be standing up again in front 
 of a group of 20 Python beginners, and basically apologizing for 
 both the present and ongoing 3.X changes they must conform to in 
 the near future.  Python may not be Perl 6 yet, but its image is 
 already tarnished in the real world where people make technology 
 choices, due to its rapid pace of change.  It's a genuine problem.
 
 In the end, I suppose I'm just one of those lazy end users you
 mentioned who are too busy to spend 24/7 hanging out on this 
 list in order to head off changes that will break their code. 
 (Yes, sarcasm intended.)

What would be helpful would be to know what changes it is that we
have made between 3.1 and 3.2 that are raising backward compatibility
concerns.  What are we doing that is perceived as ongoing 3.X changes?
Generalities will not help, only by looking at specifics can we
re-evaluate our actions.

In a private message you mentioned the bytecode file model change, by
which I presume you mean PEP 3147.  Our view is that this is a backward
compatible change:  any Python program that was working should continue
to work.  Barry's original idea was that the new behavior would only be
turned on by a flag, but Guido (and others) wanted it to be the default
because in his view it is a superior arrangement for normal use.
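For readers unfamiliar with PEP 3147, here is a small sketch of where the bytecode file ends up under the new layout. It assumes a modern Python 3 where `importlib.util.cache_from_source` and `importlib.util.source_from_cache` exist (these helpers postdate this thread), and the `pkg/mod.py` path is hypothetical; no file needs to exist for the call to work.

```python
import importlib.util
import os

# Hypothetical source path; only the path arithmetic matters here.
source = os.path.join("pkg", "mod.py")

# Under PEP 3147 the bytecode lives in a __pycache__ directory next to the
# source, with the interpreter tag embedded in the file name.
cached = importlib.util.cache_from_source(source)

# The mapping is reversible, which tools that used to look for mod.pyc
# beside mod.py can use when adapting to the new layout.
original = importlib.util.source_from_cache(cached)
```

The exact file name depends on the interpreter and version, so it is not shown here.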

Perhaps we did not fully consider the effect on third party tools (and,
as you point out, documentation) that expect .pyc files alongside the
.py files.  Yet this change is nowhere near the level of change that
makes typical Python programs fail.  We feel like it is a worthwhile
trade-off (and Debian and Ubuntu at least may well backport it to earlier
Python versions).  But apparently you disagree.

So, engage us in dialog about it, please.  And *please* mention any
other specific changes you think are disruptive between 3.1 and 3.2.
We need to know about them, preferably *before* we release 3.2 beta
(currently targeted for the end of this month).  Because I assure you
that it is not our policy to be changing things any more rapidly than
we did between python 2.x versions[*].

If you feel like you are apologizing to your groups of beginners,
it would be wonderful if you could act as their advocate here.
Obviously the issues directly affect you, so hopefully it is worth
your time to engage us on this topic.

And thank you for the messages you have sent.  I know they have made
me even more careful than I was already trying to be.

--
R. David Murray  www.bitdance.com

[*] There may be a few exceptions to this where the 3.x library code
fails to work in real-world applications, so that a more radical change
is made but is, in reality, a bug fix.  But even there we try to be
conservative.

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread R. David Murray

On Sat, 09 Oct 2010 01:06:29 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 That mess is entirely unnecessary in Python 3.  Text and wire format
 can be easily distinguished with three different representations of
 email: Unicode for the conceptual RFC 822 layer (of course this is an
 extension, because RFC 822 itself is strictly limited to the ASCII
 subset), bytes for wire format, and Message objects for modern
 structured mail (including MIME, etc).
 
 *If* email6 is reengineered with that kind of structure, then you
 should be able to dispense with almost all of the raft of defense,
 because the email module will give you well-behaved Message objects,
 whose text components (including the header) are well-behaved
 character strings that mix seamlessly with other character strings.

That engineering is pretty much what we are looking at, although in
practice I think you have to hang wire-format and text-format bits off
of appropriate places in the model in order to keep everything properly
coordinated.

 Maybe even in email5 

I suspect that's pushing it.  Patches happily accepted, though :)

--
R. David Murray  www.bitdance.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread R. David Murray

On Fri, 08 Oct 2010 15:44:45 -, l...@rmi.net wrote:
 Thanks for both your reply and work, David.  I'm going to have
 to test my email clients under the 3.2 patch when it gels.  It's
 good to hear that email5 API support remains a goal.

I just landed the patch (though without the MIME encoding of unknown
header bytes or the 'yes-I-really-want-the-escaped-bytes' flags that
Stephen and I have been discussing).  So it will be present in alpha3.
I would greatly appreciate your testing it and making sure it doesn't
break any of your code.

 I don't mean to single out this change unfairly, of course.  My 
 real concern is not as much with the specific technical aspects 
 of this proposal, as with the generally low priority that backward 
 compatibility sometimes receives on this list.  The bytecode file 

I don't perceive that lack of priority myself.  Certainly I don't see
a lack of priority on backward compatibility in the bug tracker, quite
the reverse[*].  As I said in my public email, specific examples would be
most helpful.

 model change in 3.2 comes to mind as another example; sound as it 
 may be, I'm not sure this list has any idea how many users, systems,
 or docs may be impacted by this.  Though not always true, the work 
 here does sometimes appear to be conducted in a vacuum.

Well, we can only react to the input we find out about.  Developers *do*
read blogs and such about what's going on in the wider community and bring
that info back to python-dev, but as is inherent with projects structured
as volunteer efforts, what we get is only what someone decides to put in
time on.  Specific suggestions on how to improve the feedback loop are
always welcome; volunteer efforts to improve our fundamental procedures
are just as or perhaps more valuable than volunteer code writing (though
they probably involve even more politicking effort :).

 Ultimately, development in the open source world is driven by the 
 very few with time to show up, rather than by the very many who 
 depend on it.  This can unfortunately lead to the perception
 of thrashing by end users.  Some even come to see the net effect 
 as not that much different from closed models.  I have no solution

Well, the Python community takes it as a principle to avoid thrashing.
So if you see examples where we are failing in that goal, call us on it
(with specifics).

 to offer, except to underscore again that changes made here affect 
 very many people who are too busy using Python to participate here.  
 Especially given the still tentative state of 3.X, stability matters.

We do try to remain aware of that.  When we fail, someone needs to let
us know.

--
R. David Murray  www.bitdance.com

[*] I'm currently aware of one exception to this, the nntplib module.
It was pretty much unusable as it stood (I tried, as did Antoine; it had
no unit tests so massive breakage is not that surprising), so we broke
backward compatibility with 3.1 in order to fix that.

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread R. David Murray

On Fri, 08 Oct 2010 23:55:37 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 I should think you *want* addresses and suchlike structured headers
 (Content-Type with several RFC 2231 parameters, anyone?) to line up
 nicely, too.  So generic folding algorithms are really only applicable
 to unstructured text fields like Subject and Summary anyway.
 
 You can call that sucky if you like, I prefer to call it tasteful.

No, what's sucky is that email4/5 doesn't support that.  It only folds
headers as unstructured blobs, with a nod in the direction of structure
by breaking lines at obvious places like ';'s.  (Which line breaking
algorithm is the subject of at least one bug report)

I'd like to fix that in email6 by adding full support for structured
headers.
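To make the "unstructured blob" folding concrete, here is a small sketch using `email.header.Header` from the standard library on a plain-ASCII Subject. The subject text and the 40-column limit are arbitrary choices for illustration; the point is that the fold happens at whitespace without any knowledge of header structure.

```python
from email.header import Header

# An unstructured header value that is too long for one line.
subject = " ".join(["word"] * 30)

# Fold as generic text; 40 columns is an arbitrary limit for the example.
header = Header(subject, header_name="Subject", maxlinelen=40)
folded = header.encode()

# Unfolding (dropping the line breaks) should give back the original text,
# since the fold points are the existing whitespace.
unfolded = folded.replace("\n", "")
```

A structured header such as Content-Type or an address list would want fold points chosen by its own grammar, which is what this generic approach cannot provide.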

--
R. David Murray  www.bitdance.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread R. David Murray

On Fri, 08 Oct 2010 12:37:38 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 *If* you have an 8-bit value of unknown encoding on input, this will
 appear in the Header's value as a surrogate.  Hm, OK, I see the
 problem ... as usual, it's that the only efficient thing to do is
 encode using surrogate-escape which loses the information that these
 are invalid bytes.  Would it really be that bad to add an O(length)
 component where you examine the string for surrogates (and too-long
 words, for that matter), and chop off those pieces for MIME encoding?

Nope, and that's more or less what I think I'm going to do.  But I
haven't started writing the code yet.
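A sketch of the O(length) scan being discussed, not the code that was eventually written: bytes of unknown encoding are smuggled through str with the `surrogateescape` error handler, and a linear pass finds the words that carry them so they could be set aside for MIME encoding. The header value and the helper name `has_escaped_bytes` are hypothetical.

```python
# A header value with one 8-bit byte of unknown encoding (hypothetical input).
raw = b"caf\xe9 menu for today"

# surrogateescape maps each undecodable byte 0x80-0xFF to U+DC80-U+DCFF.
text = raw.decode("ascii", errors="surrogateescape")


def has_escaped_bytes(s):
    """Return True if s contains a lone surrogate produced by surrogateescape."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s)


# One linear pass over the words: these are the pieces that would need to be
# chopped off and MIME-encoded (or otherwise handled) before output.
words = text.split(" ")
flagged = [w for w in words if has_escaped_bytes(w)]

# The original bytes are recoverable, so nothing is lost by the detour.
restored = text.encode("ascii", errors="surrogateescape")
```

Which charset label to put on the flagged pieces is the undecidable part; the scan only locates them.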

  Presumably you are suggesting that email5 be smart enough to turn my
  example into properly UTF-8/CTE encoded text.

No, in general that's undecidable without asking the originator,
although humans can often make a good guess.
   
   I was talking about unicode input, though, where you do know (modulo
   the language differences that unicode hasn't yet sorted out).
 
 I don't understand why this is difficult.  As far as what Unicode has

It isn't difficult in principle.  It's just difficult in email5.

--
R. David Murray  www.bitdance.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread Stephen J. Turnbull

R. David Murray writes:
  On Sat, 09 Oct 2010 01:06:29 +0900, Stephen J. Turnbull 
  step...@xemacs.org wrote:
   That mess is entirely unnecessary in Python 3.  Text and wire format
   can be easily distinguished with three different representations of
   email: Unicode for the conceptual RFC 822 layer (of course this is an
   extension, because RFC 822 itself is strictly limited to the ASCII
   subset), bytes for wire format, and Message objects for modern
   structured mail (including MIME, etc).

  That engineering is pretty much what we are looking at, although in
  practice I think you have to hang wire-format and text-format bits off
  of appropriate places in the model in order to keep everything properly
  coordinated.

Right.  That's where I was going with my comment to Barry about the
Received headers.  Even if email isn't going to serve clients working
with wire format, it needs to deal with those headers.  But where I
think the headers defined by RFC 822 should be stored as str in
email6, I am leaning toward storing Received headers verbatim as bytes
(including any RFC 822 folding whitespace) because of the RFC 5321
requirement that they be preserved exactly.


Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread Barry Warsaw

On Oct 08, 2010, at 03:44 PM, l...@rmi.net wrote:

Ultimately, development in the open source world is driven by the 
very few with time to show up, rather than by the very many who 
depend on it.  This can unfortunately lead to the perception
of thrashing by end users.  Some even come to see the net effect 
as not that much different from closed models.  I have no solution
to offer, except to underscore again that changes made here affect 
very many people who are too busy using Python to participate here.  
Especially given the still tentative state of 3.X, stability matters.

I'm reminded of a survey Guido conducted at some long past Python conference.
He asked (paraphrasing): raise your hand if you think Python is changing too
fast.  Lots of hands went up.  Then he asked, raise your hand if you have a
feature you want to get in the next version.  Lots of hands went up.

I'm sympathetic to the view that changes in Python can be disruptive to end
users.  The Python community itself takes this seriously too, as evidenced by
the language moratorium[*].  But OTOH, Python cannot stagnate and even fixing
things means changing things.  The reality too is that Python releases come
out approximately every 18 months, and a year and a half can either seem like
an excruciatingly long time, or the blink of an eye, depending on which side of
the fence you stand on.

Yes, stability matters, but Python 3 is still a new snakeling and I suspect
that as the pace of porting picks up, more changes will be necessary.  Adding
new modules named like distutils2 or unittest2 is less than satisfying but
useful for keeping older APIs around.

I'm sad to hear that some people think that our development model differs
little from closed source development.  To me, nothing could be further from
the truth.  But the adage does go "(s)he who does the work, decides", and this
is the forum for those who are doing the work.  I think everyone here welcomes
advocates for under-represented Python communities, and their concerns should
be taken into consideration when changes are discussed.  But ultimately, Python
must evolve to stay relevant or it will die.  This is where competing design
trade-offs must be discussed.  If not here, by us, then where and by whom?

-Barry

[*] Mostly instituted to allow alternative implementations to catch up, it
does necessarily slow the pace of changes visible to end users.



Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-08 Thread R. David Murray

On Sat, 09 Oct 2010 02:48:23 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 R. David Murray writes:
   On Sat, 09 Oct 2010 01:06:29 +0900, Stephen J. Turnbull 
 step...@xemacs.org wrote:
That mess is entirely unnecessary in Python 3.  Text and wire format
can be easily distinguished with three different representations of
email: Unicode for the conceptual RFC 822 layer (of course this is an
extension, because RFC 822 itself is strictly limited to the ASCII
subset), bytes for wire format, and Message objects for modern
structured mail (including MIME, etc).
 
   That engineering is pretty much what we are looking at, although in
   practice I think you have to hang wire-format and text-format bits off
   of appropriate places in the model in order to keep everything properly
   coordinated.
 
 Right.  That's where I was going with my comment to Barry about the
 Received headers.  Even if email isn't going to serve clients working
 with wire format, it needs to deal with those headers.  But where I
 think the headers defined by RFC 822 should be stored as str in
 email6, I am leaning toward storing Received headers verbatim as bytes
 (including any RFC 822 folding whitespace) because of the RFC 5321
 requirement that they be preserved exactly.

Well, the plan for email6 is to *allow* clients to work with wire format,
though it will probably be a bit more awkward than working with the
text interface.  And my current strategy is in general to preserve the
input bytes and, as long as the header in question hasn't been modified,
emit those bytes when serialization back to bytes is done.  My current
plan is that conversion to text is only done at the point where text
is requested, at which point the conversion is cached for later use.
And if the header is modified, the source bytes version is discarded.
Conversely if the source of the header was text input (msg['Subject'] =
'Hi'), then the conversion to bytes is only done when serialization to
bytes is requested.

None of this is implemented yet.
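
The strategy described above can be pictured with a small sketch.  This is
only an illustration under stated assumptions: the class name LazyHeader and
its attributes are hypothetical and are not the email6 API, and a real
implementation would apply RFC 2047 encoding to non-ASCII text rather than
the plain ASCII encode used here.

```python
class LazyHeader:
    """Hypothetical header holder illustrating lazy conversion; not email6."""

    def __init__(self, name, *, source_bytes=None, text=None):
        self.name = name
        self._source_bytes = source_bytes  # bytes exactly as parsed, or None
        self._text = text                  # decoded or user-supplied text, or None

    @property
    def text(self):
        # Convert to text only when text is requested, then cache the result.
        if self._text is None:
            self._text = self._source_bytes.decode('ascii', 'surrogateescape')
        return self._text

    @text.setter
    def text(self, value):
        # Modifying the header discards the source bytes version.
        self._text = value
        self._source_bytes = None

    def as_bytes(self):
        # An unmodified header is emitted exactly as it arrived.
        if self._source_bytes is not None:
            return self._source_bytes
        # A text-sourced header is converted only at serialization time.
        # (Assumption: ASCII-only text; real code would MIME-encode here.)
        return self._text.encode('ascii', 'surrogateescape')
```

Reading .text on a header parsed from bytes leaves as_bytes() unchanged,
while assigning to .text switches the header over to the text-sourced path.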

--
R. David Murray  www.bitdance.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-07 Thread R. David Murray

On Thu, 07 Oct 2010 03:31:34 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 R. David Murray writes:
 
 5.  Return the content, with non-ASCII bytes replaced with ?
 characters.
 
 That hadn't occurred to me (and it makes me sick to contemplate it).
 
 That said, this is probably good enough for Mailman-like apps to limp
 along for most users.  It's certainly good enough for the "might
 kick your wife and elope with your dog" alpha ports of Mailman to
 Python 3 (well, as certain as I can be; of course in the end Barry
 decides).  Assuming reasonable backward compatibility of the API, of
 course!

Yeah, good enough is pretty much the goal here.

   In other words, my proposed patch only makes email5 1/8 to 1/4
   broken, instead of half broken as it is now.  But not un-broken
   enough for Mailman, it sounds like.
 
 IMO, not in the long run.  But realistically, in the applications I
 know of, most desired traffic is conformant, and since there aren't
 any Python 3 email apps yet, this isn't even a regression. :-/
 
 I do think that it's important that the parsed object be able to tell
 you what fields are there (except if the field name itself is invalid)
 and return field bodies parsed as far as possible.

Well, email doesn't currently parse the bodies any further by itself.
You have to call parsing routines to get further parsing.  So maybe
what I should do is work on finalizing the patch without addressing the
'give me the escaped bytes' issue, and then prepare a follow-on patch
that adds that keyword and adjusts the header parsing helpers accordingly.

   If we go this route (as opposed to only handling headers with 8bit data by
   sanitizing them), then we need to think about the email5 header parsers
   as well (decode_header and parseaddr).  They are of course going to have
   the same problems as the rest of the email package with parsing bytes,
   and you are suggesting that access to those header 8bit bytes is needed.
 
 Yes, that would be preferable to replacing them with ASCII junk.
 
 But I don't see any problem with parsing them; they're syntactically
 insignificant by definition.  The problem is purely on output: do I
 get verbatim escaped bytes, a sanitized str, or an exception?

Right, the needed changes should be sanitizing by default, and providing
the keyword to get the escaped bytes.  Mostly it'll be writing tests :)
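
A minimal sketch of "sanitize by default, escaped bytes on request", assuming
the header value arrives as raw bytes.  The function name header_text is
hypothetical, and its escape_bytes parameter only mirrors the keyword being
discussed; neither is taken from the patch itself.

```python
def header_text(raw, escape_bytes=False):
    """Decode raw header bytes to str (hypothetical helper, not the patch).

    By default non-ASCII bytes are replaced with '?'.  With
    escape_bytes=True they are kept as surrogate-escaped code points.
    """
    escaped = raw.decode('ascii', 'surrogateescape')
    if escape_bytes:
        return escaped
    # Lone surrogates cannot be encoded as ASCII, so the 'replace'
    # error handler turns each escaped byte into '?'.
    return escaped.encode('ascii', 'replace').decode('ascii')
```

With this sketch, header_text(b'p\xc3\xb6stal') gives 'p??stal', and passing
escape_bytes=True gives 'p\udcc3\udcb6stal' instead.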

   Does my proposal make sense?  But note, it raises exactly the backward
   compatibility concerns you mention in your next email (that I will reply
   to next).  It is an open question whether it is worth opening that door
   in order to be able to do extended handling on non-RFC conforming email
   (as opposed to just sanitizing it and soldiering on).
 
 Well, maybe not.  However, it is not obvious to me that you won't run
 into these issues again in Email6.  Applications that think of email
 as textual objects are going to want to make their own choices about
 handling of non-conforming email, and it's likely to be massively
 inconvenient to say "OK, but you have to use bytes interfaces
 exclusively, because the str interfaces don't handle that."

The strategy in email6 so far is for the application program to be
able to access *any piece* of the parsed data as either text or bytes,
and for the header parsers to record defects when there are non-ASCII
bytes where there aren't supposed to be.  So the application can check
for defects and retrieve, say, the comment field that has the non-ASCII
*as bytes* and decode it.  Or, if it doesn't care about parsing them,
it just modifies the fields it wants to modify that *are* valid, and the
invalid non-ASCII comment gets carried along and emitted when the message
is serialized as bytes.

This is more or less what we are talking about enabling in email5 with
the 'escape_bytes=True' keyword, it's just a less structured and more
error prone approach to it than what we have planned for email6.

--
R. David Murray  www.bitdance.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-07 Thread R. David Murray

Stephen J. Turnbull stephen at xemacs.org writes:
 R. David Murray writes:
   We're (in the current patch) not punting on handling non-conforming
   email, we're punting on handling non-conforming bytes *if the headers
   that contain them need to be modified*.  The headers can still be
   modified, you just (currently) lose the non-ASCII bytes in the process.
 
 Modified *or examined*.  I can't think of any important applications
 offhand that *need* to examine the non-ASCII bytes (in particular,
 Mailman doesn't need to do that).  Verbatim copying of the bytes
 themselves is almost always the desired usage.

Mmm.  Yes, or examined.  If we allow escaped bytes to be returned, perhaps
we also should provide a helper that unescapes the bytes and returns
the byte string (yes, this is just a call to encode, but by wrapping it
we continue to hide the surrogateescape implementation detail.)
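
Such a helper could be as small as the following sketch.  The name
unescape_bytes is made up for illustration; the only real machinery is the
standard 'surrogateescape' error handler from PEP 383.

```python
def unescape_bytes(value):
    """Return the original bytes for a surrogate-escaped str.

    Hypothetical helper name.  Raises UnicodeEncodeError if the string
    contains genuine non-ASCII characters rather than escaped bytes.
    """
    return value.encode('ascii', 'surrogateescape')
```

Callers then never need to know that the escaping was done with
surrogateescape, which is the point of wrapping the call.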

   And robustness is not the issue, only extended-beyond-the-RFCs handling
   of non-conforming bytes would be an issue.
 
 And with that, I'm certain that Jon Postel is really dead. 

A goal for email6 is to be *at least* as Postel compliant as email4.
The goal for my patch is to make email5.1 more Postel compliant than
email5.0 is :)

(Surely you are not saying that Generator.flatten can't DTRT with
non-ASCII content *at all*?)
   
   Yes, that is *exactly* what I am saying:
   
   >>> m = email.message_from_string("""\
   ... From: pöstal
   ...
   ... """)
   >>> str(m)
   Traceback (most recent call last):
 
   UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in
   position 1: ordinal not in range(128)
 
 But that's not interesting; you did that with Python 3.  We want to

Of course I did it with Python3.  It's the Python3 email codebase
I'm working with (and have to work *around*).

 know what people porting from Python 2 will expect.  So, in 2.5.5 or
 2.6.6 on Mac, with email v4.0.2, it *doesn't* raise, it returns
 
 wideload:~ 4:14$ python
 Python 2.5.5 (r255:77872, Jul 13 2010, 03:03:57) 
 [GCC 4.0.1 (Apple Inc. build 5490)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import email
 >>> m=email.message_from_string('From: pöstal\n\n')
 >>> str(m)
 'From nobody Thu Oct  7 04:18:25 2010\nFrom: p\xc3\xb6stal\n\n'
 >>> m['From']
 'p\xc3\xb6stal'
 >>>
 
 That's hardly helpful!  Surely we can and should do better than that
 now, especially since UTF-8 (with a proper CTE) is now almost
 universally acceptable to MUAs.  When would it be a problem for that
 to return
 
 'From nobody Thu Oct  7 04:18:25 2010\nFrom: =?UTF-8?Q?p=C3=B6stal?=\n\n'

What's wrong with that is that when we parse the bytes of the message
we don't know that b'\xc3\xb6' == '=?UTF-8?Q?=C3=B6?='.  It isn't even
all that likely to be true, since I would guess that latin1 is still
more common than utf-8 (but you might know better).
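
The ambiguity is easy to reproduce.  The sketch below (the function name
plausible_charsets and its default candidate list are assumptions made for
illustration) shows that the same two bytes decode without error under both
charsets, so the parser has no way to pick one.

```python
def plausible_charsets(raw, candidates=('utf-8', 'latin-1')):
    """Return the candidate charsets under which raw decodes without error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


raw = b'p\xc3\xb6stal'
as_utf8 = raw.decode('utf-8')      # 'pöstal'
as_latin1 = raw.decode('latin-1')  # 'pÃ¶stal'
```

Both decodings succeed, and only a human (or the originator) can say which
one was intended.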

   Remember, email5 is a direct translation of email4, and email4 only
   handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the-
   -ride-fine-we'll-pass-them-along.  So if you want to put non-ASCII
   data into a message you have to encode it properly to ASCII in
   exactly the same way that you did in email4:
 
 But if you do it right, then it will still work in a version that just
 encodes non-ASCII characters in UTF-8 with the appropriate CTE.  Since
 you'll never be passing it non-ASCII characters, it's already ASCII
 and UTF-8, and no CTE will be needed.

So you are suggesting that I should use U+FFFD encoded as UTF-8
rather than '?' as the substitution character?  But earlier you said
that people would probably rather not be forced to deal with Unicode
just because there are invalid bytes in the message.  So that's
probably not what you meant.

Presumably you are suggesting that email5 be smart enough to turn my
example into properly UTF-8/CTE encoded text.  But *that* problem is what
email6 is trying to address.  It just doesn't look practical to address it
directly in the email5 code base, because the email4 codebase that email5
inherits does not provide the correct distinction between bytes and text.
email5 is parsing the input stream *as if* it were ASCII-only CTE text.
I'm trying to extend it to also handle non-ASCII bytes gracefully.
Extending it to actually handle unicode input is a whole different kettle
of sushi[*].

   Yes, exactly.  I need to fix the patch to recode using, say,
   quoted-printable in that case.
 
 It really should check for proportions of non-ASCII.  QP would be
 horrible for Japanese or Chinese.

Noted.
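
One way to act on the proportion check is sketched below.  The function name
choose_cte and the 1/6 threshold are assumptions, not anything from the
patch.  The threshold comes from a rough size estimate: quoted-printable
spends 3 bytes on each non-ASCII byte, base64 spends 4/3 bytes on every
byte, so they break even when about one byte in six is non-ASCII (ignoring
'=' characters and line-length limits).

```python
def choose_cte(payload, threshold=1 / 6):
    """Pick a Content-Transfer-Encoding for a bytes payload (illustrative)."""
    if not payload:
        return '7bit'
    non_ascii = sum(1 for b in payload if b >= 0x80)
    if non_ascii == 0:
        return '7bit'
    if non_ascii / len(payload) <= threshold:
        # Mostly ASCII: quoted-printable stays readable and compact.
        return 'quoted-printable'
    # Mostly non-ASCII (e.g. Japanese or Chinese text): base64 is smaller.
    return 'base64'
```

Mostly-ASCII Western text ends up as quoted-printable, while CJK text
encoded as UTF-8 goes to base64.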

   DecodedGenerator could still produce the unicode, though, which is
   what I believe we want.  (Although that raises the question of
   whether DecodedGenerator should also decode the RFC2047 encoded
   headers, but that raises a backward compatibility issue).
 
 Can't really help you there.  While I would want the RFC 2047 headers
 decoded if I were writing new code (which is generally the case for
 me), I haven't really wrapped my

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-07 Thread R. David Murray

On Thu, 07 Oct 2010 15:00:04 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 R. David Murray writes:
 
But that's not interesting; you did that with Python 3.  We want to
   Of course I did it with Python3.  It's the Python3 email codebase
   I'm working with (and have to work *around*).
 
 Sure.  My point is that it has nothing to do with the expectations of
 people trying to upgrade their apps to Python 3, and meeting those
 expectations is an important requirement of the specification of
 email5, right?

Well, not necessarily, no.  Python3 broke backward compatibility.
*Some* changes are going to have to be made in user code to make it
work with email5.  Where we can minimize those changes we should,
but it isn't a requirement, no.  With my patch, the minimization will
be message_from_string --> message_from_bytes, message_from_file -->
message_from_binary_file, and in some cases Generator --> BytesGenerator,
for those programs that need to deal with wire format data that is not
7bit clean.  Programs that only *generate* emails should need few
if any changes, but that is already true (that's the half of email
that is working :).
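
For a program that handles wire-format data, the change would look roughly
like the sketch below.  It uses message_from_bytes and BytesGenerator, the
names given above, which did later ship in the standard library (Python 3.2
onward); the sample message itself is made up.

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# Wire-format input that is not 7bit clean (raw UTF-8 in a header).
wire = b'From: p\xc3\xb6stal\nSubject: test\n\nbody\n'

# Parse bytes instead of str ...
msg = email.message_from_bytes(wire)

# ... and serialize back to bytes instead of str.
out = BytesIO()
BytesGenerator(out).flatten(msg)
round_tripped = out.getvalue()
```

The unmodified header travels through as bytes, while ordinary ASCII
headers such as Subject are still available as str through msg['Subject'].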

 Actually, in context we were not talking about a random character that
 came in from outside, we were talking about U+FFFD that *we*
 generated, and *know* that it's the only non-ASCII character in the
 string because we replaced all the others with it.

Ah, so that *was* what you were suggesting.

 Of course the best we can do with 'From: =?UNKNOWN?Q?p=C3=B6stal' or
 'From: p\xc3\xb6stal' on input is to save the encoded or raw bytes
 representation and spit it back out on output.

Yes.  And I haven't actually dealt with what to do with non-ascii
characters or RFC2047 unknown-8bit characters when decoding
headers in email6.  In issue 6302 we are talking about adding a
decode_header_to_string method for email5 where the same issue arises,
and so we'll need to make a decision soon.  Presumably we'll use U+FFFD
to replace them (along with registering defects in email6).

 The MIME-charset = UNKNOWN dodge might be a better way of handling
 this.  The str is all ASCII, so won't raise exceptions unless the app
 itself objects to MIME encoded-words for some reason.  OTOH, the
 presence of encoded words will be a red flag to any human viewer, and
 after processing with .flatten(), the receiver is likely to DTRT (from
 the receiving human's point of view, per that human's configuration).

That is a very interesting idea.  It is the *right* thing to do, since it
would mean that a message parsed as bytes could be generated via Generator
and passed to, say, smtplib without losing any information.  However,
it's not exactly trivial to implement, since issues of runs of characters
and line re-wrapping need to be dealt with.  Perhaps Header can be
made to handle bytes in order to do this; I'll have to look into it.
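
As a rough illustration of the idea (and only of the easy part), the sketch
below wraps a whole surrogate-escaped value in a single RFC 2047 "Q"
encoded word.  The function name is hypothetical, the 'unknown-8bit' charset
label is an assumption, and the sketch deliberately ignores the runs and
re-wrapping issues mentioned above.

```python
def unknown_8bit_word(escaped):
    """Wrap a surrogate-escaped str in one 'Q' encoded word (illustrative)."""
    raw = escaped.encode('ascii', 'surrogateescape')
    # Characters that may appear literally inside a Q-encoded word.
    safe = (b'abcdefghijklmnopqrstuvwxyz'
            b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            b'0123456789!*+-/')
    out = []
    for b in raw:
        if b in safe:
            out.append(chr(b))
        elif b == 0x20:
            out.append('_')          # space is written as underscore
        else:
            out.append('=%02X' % b)  # everything else as =XX
    return '=?unknown-8bit?q?' + ''.join(out) + '?='
```

The result is pure ASCII, so it can pass through a text-based Generator,
and the original bytes can still be recovered from it.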

   So you are suggesting that I should use U+FFFD encoded as UTF-8
   rather than '?' as the substitution character?  But earlier you said
   that people would probably rather not be forced to deal with Unicode
   just because there are invalid bytes in the message.  So that's
   probably not what you meant.
 
 Suggest != recommend.  Talking to a wider base of users and
 developers, you might or might not find that to be a good idea.  I
 don't think the 800 million or so Chinese coming online in the next
 decade will much care whether you use U+FFFD or '?'.  The Japanese
 would prefer U+2639 WHITE FROWNING FACE or U+270C VICTORY HAND, no
 doubt (crassly cute is much beloved here).  Americans will likely
 prefer '?', as they probably have correspondents with legacy systems
 that won't like UTF-8 or perhaps don't have a font to display U+FFFD.

For the moment I think I'll stick with '?', with the idea of fixing
that bug by using the unknown charset trick at a later stage.

   Presumably you are suggesting that email5 be smart enough to turn my
   example into properly UTF-8/CTE encoded text.
 
 No, in general that's undecidable without asking the originator,
 although humans can often make a good guess.  But not always: Japanese
 are fond of four-character compound words, and I once found an
 8-byte sequence (four 2-byte characters) that is idiomatic in both
 Shift JIS and EUC-JP.  Even a dictionary lookup can't determine the
 intended encoding for that sequence.

I was talking about unicode input, though, where you do know (modulo
the language differences that unicode hasn't yet sorted out).

 I'm only saying that any Unicode email-N generates itself can be
 properly encoded.

Agreed.

   But *that* problem is what email6 is trying to address.  It just
   doesn't look practical to address it directly in the email5 code
   base, because the email4 codebase that email5 inherits does not
   provide the correct distinction between bytes and text.  email5 is
   parsing the input stream *as if* it were ASCII-only CTE text.
 
 I don't see how this is different from email6.

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-07 Thread lutz

Stephen J. Turnbull wrote (giving me an opening to jump in here):
 R. David Murray writes:
  In other words, my proposed patch only makes email5 1/8 to 1/4
  broken, instead of half broken as it is now.  But not un-broken
  enough for Mailman, it sounds like.

 IMO, not in the long run.  But realistically, in the applications I
 know of, most desired traffic is conformant, and since there aren't
 any Python 3 email apps yet, this isn't even a regression. :-/

Well, yes there are, and yes it is.  As I pointed out in a thread 
on this list back in June, there are multiple large Python 3 email 
apps in the new Programming Python, a book which is about to be 
released, and which will be read by at least tens of thousands of 
people, many of whom will be evaluating the stability of Python 3.

These apps include both a simple webmail site, as well as a more
sophisticated 5k-line tkinter email client -- one which I've been 
using for all my personal and business email over the last 6 months,
and which works well with the email package as it is in 3.1 (albeit
with a bit of workaround code).  This includes support for Unicode,
MIME, headers, attachments, and the lot.

I'm forwarding a link to the code of these clients to David by 
private email in case they might be useful as a test case (O'Reilly
has already posted them ahead of the book, but they may be a bit too
heavy for use in formal testing).

The email package is obviously less than ideal today, and there are
many other clients for it besides my own, of course.  But making it 
backward incompatible at this point is likely to be seen as a big 
negative to newcomers evaluating 3.X viability.  And as I tried to 
make clear in June, this list should carefully weigh the PR cost of 
pulling the rug out from under those brave souls who have already 
taken the time to accommodate the 3.X world you've mandated.

To put that more strongly, the Python user base is much larger than 
this list's readership.  If I'm using 3.1 email, so are many others.
People will accept the 3.X world you make up to a point, but it's 
impossible to code to a moving target, much less base a product on 
it.  At some point, they'll simply stop trying to keep up; in fact, 
some already have.

Fixes are a Good Thing, of course, and this particular change's scope
remains to be seen; but to channel most of the users I meet out there
in the real world today: Enough with the 3.X changes already, eh?

--Mark Lutz  (http://learning-python.com, http://rmi.net/~lutz)




Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-07 Thread R. David Murray

On Thu, 07 Oct 2010 16:03:18 -, l...@rmi.net wrote:
 I'm forwarding a link to the code of these clients to David by 
 private email in case they might be useful as a test case (O'Reilly
 has already posted them ahead of the book, but they may be a bit too
 heavy for use in formal testing).

Thanks very much.  I will take a look, and expect they will
be helpful.

 The email package is obviously less than ideal today, and there are
 many other clients for it besides my own, of course.  But making it 
 backward incompatible at this point is likely to be seen as a big 
 negative to newcomers evaluating 3.X viability.  And as I tried to 
 make clear in June, this list should carefully weigh the PR cost of 
 pulling the rug out from under those brave souls who have already 
 taken the time to accommodate the 3.X world you've mandated.

Well, as I have said before the plan is to provide backward compatibility
in email6, so that you only need to change your code if you want to
take advantage of improved or new functionality.  If this turns out not
to be possible for some reason, then we aren't going to suddenly stop
supporting email5.  That's not the Python Way :)  (Example: we added
ArgParse post-3.0, and lots of people wanted to deprecate OptParse,
but we aren't planning on removing OptParse.)

Do you see any issues with the patch I'm proposing?  My goal is to make
things work that didn't work before, but nothing that worked before
should stop working, if I do my job right.

The one *potentially* backward-incompatible change that I'm consciously
considering (that is, any other backward incompatibilities will be bugs)
is having DecodedGenerator fully decode headers and emit full unicode,
rather than the ASCII-only unicode that Generator emits.  Can you think
of any problem that that would cause?  A quick grep indicates your own
code does not use that generator (possibly because currently it does not
do that decoding).  I could, of course, only enable header decoding if
a flag is passed requesting it, and as I write this I realize that that
is indeed what I should do.  Even though I haven't been able to think of a
case where DecodedGenerator producing non-ASCII unicode would be an issue,
that doesn't mean there isn't one :)
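
An opt-in flag of that kind might behave like the sketch below.  The
function render_header and its decode_headers parameter are hypothetical
names chosen for illustration; decode_header and make_header are the
existing helpers in email.header.

```python
from email.header import decode_header, make_header


def render_header(value, decode_headers=False):
    """Return a header value, decoding RFC 2047 words only on request."""
    if not decode_headers:
        # Default: leave the ASCII-only encoded form untouched.
        return value
    # Opt-in: decode encoded words and return full unicode.
    return str(make_header(decode_header(value)))
```

Existing callers keep getting the ASCII-only form, and only code that asks
for decoding sees non-ASCII unicode.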

 To put that more strongly, the Python user base is much larger than 
 this list's readership.  If I'm using 3.1 email, so are many others.
 People will accept the 3.X world you make up to a point, but it's 
 impossible to code to a moving target, much less base a product on 
 it.  At some point, they'll simply stop trying to keep up; in fact, 
 some already have.

 Fixes are a Good Thing, of course, and this particular change's scope
 remains to be seen; but to channel most of the users I meet out there
 in the real world today: Enough with the 3.X changes already, eh?

Now that Python3 is out, the backward compatibility policy for it is
the same as it always was for Python2.  Only the transition from 2
to 3 broke backward compatibility in a significant way.  From here
on, we are as conservative as we always have been at making backward
incompatible changes (that is, we don't do it intentionally without
a good reason and a deprecation cycle, and if we do it unintentionally
it is a regression and treated as such).

--
R. David Murray  www.bitdance.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-07 Thread Barry Warsaw

On Oct 07, 2010, at 04:40 AM, Stephen J. Turnbull wrote:

  And the email API currently promises not to raise during parsing,
  which is a contract my patch does not change.

Which is a contract that has historically been broken frequently.
Unhandled UnicodeErrors have been one of the most common causes of
queue stoppage in Mailman (exceeded only by configuration errors
AFAICS).  I haven't seen any reports for a while, but with the email
package being reengineered from the ground up, the possibility of
regression can't be ignored.

I'm fairly certain that most of the modern causes of this are post-parse
modifications of the message.  IOW, in Mailman's architecture, we try to parse
the raw data into a Message object tree very early in the pipeline, and then a
pickled version of that gets passed between the queue runners.  If the initial
parse fails, there's almost literally nothing Mailman can do with the original
data other than delete it.

Where we've gotten into trouble before has been things like adding the Subject
prefixes and such.  That seems like application logic that the email package
can't really get involved with, and indeed Mailman has built up a raft of
defense for failures of this kind.

-Barry



Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-07 Thread Stephen J. Turnbull

R. David Murray writes:

   The MIME-charset = UNKNOWN dodge might be a better way of handling
   this.
  
  That is a very interesting idea.  It is the *right* thing to do, since it
  would mean that a message parsed as bytes could be generated via Generator
  and passed to, say, smtplib without losing any information.  However,
  it's not exactly trivial to implement, since issues of runs of characters
  and line re-wrapping need to be dealt with.  Perhaps Header can be
  made to handle bytes in order to do this; I'll have to look into
  it.

Ouch.  RFC 822 line wrapping is a bytes-bytes transformation, and the
client shouldn't see it at all unless it inspects the wire format.
MIME-encoding is a text-bytes transformation, again an internal
matter.  The constraints on the wire format mean that the MIME-
encoder needs to be careful about encoded-word length.  ISTM that all you
need to know, assuming that this is a method on a Header, and it's
normally invoked just before conversion to bytes, is the codec and the
CTE, and both can be optional (default to 'utf-8' and a value
depending on the proportion of encodable characters).

You take the header, encode according to the codec, then start
MIME-encoding according to the CTE.  The maximum size of encoded words
is chosen to fit on a line within 78 bytes.  The number of bytes
encoded in each word depends only on the size of metadata associated
with the word.  (Sure you could make it prettier for those reading it
with an MUA like less, but I don't think that's really worth
anybody's time.)
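
A sketch of that procedure, under some assumptions: the codec is UTF-8 (a
stateless encoding, so it is safe to encode one character at a time), the
CTE is base64, and each encoded word is held to the 75-character limit of
RFC 2047 so that a folded line stays within bounds.  The function name
encode_words is hypothetical.

```python
import base64


def encode_words(text, charset='utf-8', max_len=75):
    """Split text into base64 encoded words no longer than max_len."""
    # Metadata around the payload: '=?' charset '?b?' ... '?='
    overhead = len('=?%s?b??=' % charset)
    # base64 turns 3 bytes into 4 characters; keep to whole groups.
    max_bytes = (max_len - overhead) // 4 * 3
    chunks, chunk = [], b''
    for ch in text:
        encoded = ch.encode(charset)
        # Start a new word rather than split a character across words.
        if chunk and len(chunk) + len(encoded) > max_bytes:
            chunks.append(chunk)
            chunk = b''
        chunk += encoded
    if chunk:
        chunks.append(chunk)
    return ['=?%s?b?%s?=' % (charset, base64.b64encode(c).decode('ascii'))
            for c in chunks]
```

The number of bytes per word depends only on the metadata size, as
described above; the words would then be joined with folding whitespace
when the header is written out.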

*If* you have an 8-bit value of unknown encoding on input, this will
appear in the Header's value as a surrogate.  Hm, OK, I see the
problem ... as usual, it's that the only efficient thing to do is
encode using surrogate-escape which loses the information that these
are invalid bytes.  Would it really be that bad to add an O(length)
component where you examine the string for surrogates (and too-long
words, for that matter), and chop off those pieces for MIME encoding?
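
The O(length) scan could look something like this sketch, which assumes the
unknown bytes were decoded with the surrogateescape handler and therefore
sit in the range U+DC80..U+DCFF.  The name split_runs is made up for
illustration.

```python
import re

# Code points produced by surrogateescape for bytes 0x80..0xFF.
_ESCAPED = re.compile('([\udc80-\udcff]+)')


def split_runs(value):
    """Yield (is_escaped, piece) pairs covering value in order."""
    # With a capturing group, re.split alternates between ordinary
    # text (even indexes) and escaped runs (odd indexes).
    for i, piece in enumerate(_ESCAPED.split(value)):
        if piece:
            yield (i % 2 == 1, piece)
```

Each escaped run can then be handed to the MIME encoder as bytes of unknown
charset, while the clean pieces are encoded normally.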

 Presumably you are suggesting that email5 be smart enough to turn my
 example into properly UTF-8/CTE encoded text.
   
   No, in general that's undecidable without asking the originator,
   although humans can often make a good guess.
  
  I was talking about unicode input, though, where you do know (modulo
  the language differences that unicode hasn't yet sorted out).

I don't understand why this is difficult.  As far as what Unicode has
and hasn't sorted out, that's not your job AFAICS.  If clients want a
specific codec or other language-based style, they'd better specify it
themselves.  Else, you just stuff the Unicode into a UTF-8-encoded
bytes, and go from there.  This is *why* Unicode was designed, so that
software could do something standard and sane with text which needs to
be readable but not exquisitely crafted literary works.  No?  If you
want beauty, then use a markup language.

  Right, but I was talking about my python3 example, where I was using
  the email5 parser to (unsuccessfully) parse unicode.  *That's* the thing
  email5 can't really handle, but email6 will be able to.

For email5 it would be an extension, yes, but I don't see why it would
be hard to handle Unicode input, assuming it's *really* Unicode,
unless you want to cater to legacy systems that might not understand
Unicode (or at least would prefer an alternative encoding).  Since
it's an extension, I don't think that's your problem, and the people
who would really like this extension (eg, the Japanese) are used to
dealing with mojibake issues.  (Of course, as an extension, you don't
need to do it at all.  This is just speculation.)

The problem would be with careless clients of email5 that find a way
to hand it bogus Unicode (eg, by inappropriately using the latin-1
codec to get a binary representation of their bytes in Unicode), but I'm
not sure how big a problem that would be.
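
To make the "bogus Unicode" case concrete, here is a small sketch contrasting the latin-1 trick with surrogateescape. The sample string is illustrative.

```python
original = "pöstal"
wire_bytes = original.encode("utf-8")          # b'p\xc3\xb6stal'

# Careless: latin-1 maps every byte to a code point, so this never
# fails, but the result is ordinary-looking (and wrong) Unicode.
bogus = wire_bytes.decode("latin-1")

# Careful: the undecodable bytes stay recognisable as lone surrogates.
escaped = wire_bytes.decode("ascii", "surrogateescape")

print(ascii(bogus))      # 'p\xc3\xb6stal'
print(ascii(escaped))    # 'p\udcc3\udcb6stal'
```

A library receiving the first string cannot tell it from legitimate text containing U+00C3 and U+00B6; the second is unambiguous.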

  Thank you very much for this piece of perspective.  I hadn't thought
  about it that clearly before, but what you say makes perfect sense to me,
  and is in fact the implicit perspective I've been working from when
  working on the email6 stuff.

You're welcome, of course, and it makes me feel much better about
email6.  (Not that I had any real worries, but here we are about
halfway up a 100m cliff, and the trail just widened from 20cm to
2m. :-)


Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-07 Thread Stephen J. Turnbull

l...@rmi.net writes:

  To put that more strongly, the Python user base is much larger than 
  this list's readership.

Agreed.  Nevertheless, this is the channel (not a channel) that the
developers listen on, and substantial effort is made to let Python
users know that.  I think they do know it, too.

  If I'm using 3.1 email, so are many others.

That's not obvious.  3.1 email is unusable for several applications.
In fact, for human factors reasons (humans are very likely to
communicate with other humans who use the same encodings, and to
accept occasional glitches they must deal with manually), MUAs are
likely to port relatively easily as "good enough" software.  But I
doubt very much that folks writing MTAs or spam filters that must run
unattended, often in long-lived, very active processes, are producing
production versions using Python 3 email yet.

  People will accept the 3.X world you make up to a point, but it's 
  impossible to code to a moving target, much less base a product on 
  it.

Impossible is nothing.  It's a decision that each individual
developer makes for herself.  I haven't heard Mailman devs complain
about the impossibility of dealing with the proposed changes, for
example.  Quite the reverse, in fact.

  At some point, they'll simply stop trying to keep up; in fact, 
  some already have.

Predictable and predicted.  Where's the balance?  I don't know, but
channeling the users is not a lot of help.  There are three worthy
goals here:

1. Taking advantage of improvements in to-be-released Pythons.
2. Not changing one's own working code.
3. Not participating in python-dev/email-sig.

Take any two; one can't have all three.

More specifically, it's interesting that most of the users you talk to
care enough to actually say they don't want more incompatible changes.
But what are we supposed to take from that?  Some fixes have to be
incompatible; do the users want the fix or the compatibility?  You
waffle (as a good representative often must):

  Fixes are a Good Thing, of course, and this particular change's scope
  remains to be seen; but to channel most of the users I meet out there
  in the real world today: Enough with the 3.X changes already, eh?

But that's also a decision each developer *can* make for himself.
Python does not withdraw products, or even withdraw support, just
because the core developers release something they consider better.

If having 1 *and* 2 is so important to particular users, but they come
into conflict because of proposed changes in Python, then they're
going to have to give up 3, come here, and articulate their needs.  As
you are doing -- but to have real influence, you're going to have to
do the review of David's patch that he requests.

I really don't see how the process can work any other way.

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-06 Thread Stephen J. Turnbull

R. David Murray writes:

  version of headers to the email5 API, but since any such data would
  be non-RFC compliant anyway, [access to non-conforming headers by
  reparsing the bytes] will just have to be good enough for now.

But that's potentially unpleasant for, say, Mailman.  AFAICS, what
you're saying is that Mailman will have to implement a full header
parser and repair module, or shunt (and wait for administrator
intervention on) any mail that happens to contain even one byte of
non-RFC-conforming content in a header it cares about.  (Note that
we're not talking about moderator-level admins here; we're talking
about the Big Cheese with access to the command line on the list
host.)  That's substantially worse than the current system, where (in
theory, and in actual practice where it distributes its own version of
email) it can trap the Unicode exception on a per-header basis.

I also worry about the implications for backwards compatibility.
Eventually email-N needs to handle non-conforming mail in a sensible
way, or anybody who gets spam (ie, everybody) and wants a reliable
email system will need to implement their own.  If you punt completely
on handling non-conforming mail now, when is it going to be done?  And
when it is done, will the backward-compatible interface be able to
access the robust implementation, or will people who want robust APIs
have to use rather different ones?  The way you're going right now, I
have to worry about the answer to the second question, at least.

  [*] Why '?' and not the unicode invalid character character?  Well, the
  email5 Generator.flatten can be used to generate data for transmission over
  the wire *if* the source is RFC compliant and 7bit-only, and this would
  be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects
  ASCII-only strings as input!).  So the data generated by Generator.flatten
  should not include unicode...

I don't understand this at all.  Of course the byte stream generated
by Generator.flatten won't contain Unicode (in the headers, anyway);
it will contain only ASCII (that happens to conform to QP or Base64
encoding of Unicode in some appropriate UTF in many cases).  Why is
U+FFFD REPLACEMENT CHARACTER any different from any other non-ASCII
character in this respect?

(Surely you are not saying that Generator.flatten can't DTRT with
non-ASCII content *at all*?)

The only thing I can think of is that you might not want to introduce
non-ASCII characters into a string that looks like it might simply be
corrupted in transmission (eg, it contains only one non-ASCII byte).
That's reasonable; there are a lot of people who don't have to deal
with anything but ASCII and occasionally Latin-1, and they don't like
having Unicode crammed down their throats.

  which raises a problem for CTE 8bit sections
  that the patch doesn't currently address.

AFAIK, there's no requirement, implied or otherwise, that a conforming
implementation *produce* CTE 8bit.  So just don't do that; that will
keep smtplib happy, no?
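
A minimal sketch of producing a 7bit-safe body instead of CTE 8bit, using email.mime.text.MIMEText from the current standard library (which may differ from email5 as it stood at the time). The message text is made up.

```python
from email.mime.text import MIMEText

# Giving MIMEText an explicit charset makes it pick a transfer
# encoding for the body, so the flattened message is ASCII-only.
msg = MIMEText("Grüße aus Köln\n", "plain", "utf-8")
msg["Subject"] = "test"

wire = msg.as_string()
cte = msg["Content-Transfer-Encoding"]
print(cte)
print(wire)
```

Because the body is transfer-encoded, the flattened string can be handed to an ASCII-only transport.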

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-06 Thread R. David Murray

On Wed, 06 Oct 2010 12:22:18 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 Nick Coghlan writes:
 
   - if you pass in bytes data and know what you are doing, then you can
   access that raw bytes data and do your own decoding
 
 At what level, though?
 
 To take an interesting example I used to see frequently:
 
 From: t...@tokyo.jp
   (Taro Yamada in 8-bit Shift JIS)
 
 So I guess you are suggesting that the email module can RFC 822 parse
 that, and
 
 1.  Refuse to return the unwrapped (ie, single line) form of the whole
 field, except as bytes.
 2.  Refuse to return the content of the From field, except as bytes.
 3.  Return the email address parsed from the From field.
 4.  Refuse to return the comment, except as bytes.

  5.  Return the content, with non-ASCII bytes replaced with ?
  characters.

In other words, my proposed patch only makes email5 1/8 to 1/4
broken, instead of half broken as it is now.  But not un-broken
enough for Mailman, it sounds like.

 That's fine.  But suppose I have a private or newly defined header
 that is structured?  Now I have two choices:
 
 1.  Write a version of my private parser for both str (the normal
 case) and bytes (if accessing the value as str raises)

 2.  Always get the bytes and convert them to str (probably using the
 same .decode('ascii','surrogateescape') call that email uses but
 won't let me have the value of!), then use a common str parser.

Yes, this is exactly the dilemma faced by the entire email package.
The current email6 code attempts to do a variation on (1) by having a
common parser that handles both strings and bytes using a dual subclass
approach.  This patch is trying out (2).  If you have a private header
parser, you would ideally like to be able to use the same mechanism as the
email package to solve the problem.  For email6 you'd be able to register
your header parser and get handed the input like the built in parser and
be able to use the tools provided by the built in parser to do your work.

In email5 there is no way that I know of for you to register a private
parser, so you need access to the raw input for the header in one form
or another.
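
Option (2) from the quoted list can be sketched as follows. The header format and the parser parse_private_header are hypothetical; the point is only that a str-only parser can carry unknown 8-bit bytes through unchanged when they are decoded with surrogateescape.

```python
def parse_private_header(value):
    """Hypothetical str-only parser: split 'key=value; key=value' pairs."""
    result = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            result[key] = val
    return result

# 8-bit bytes in an unknown charset (illustrative input).
raw = b"owner=Taro \x8e\x52\x93\x63; list=users"

as_str = raw.decode("ascii", "surrogateescape")
fields = parse_private_header(as_str)

# The application can recover the exact original bytes of any field.
owner_bytes = fields["owner"].encode("ascii", "surrogateescape")
print(owner_bytes)
```

Only one parser is needed, and deciding what the non-ASCII bytes mean is left to the caller.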

If we go this route (as opposed to only handling headers with 8bit data by
sanitizing them), then we need to think about the email5 header parsers
as well (decode_header and parseaddr).  They are of course going to have
the same problems as the rest of the email package with parsing bytes,
and you are suggesting that access to those header 8bit bytes is needed.

One option would be to add a keyword to the get and get_all methods
that instructs it to return the string with the surrogate-escaped
bytes, which can then be passed onward to decode_header, parseaddr,
or a custom decoder.  Then I need to look at what needs to be added to
those methods to handle the escaped bytes, and from what you say they
too need a keyword telling them to preserve the escaped bytes on output
(a "yes I know what I'm doing" flag...'preserve_escaped_bytes=True'?).

 Note that this is more problematic than it looks, since the
 appropriate base codec may require information from higher-level
 structures (eg, qp codec tags or a Content-Type header's charset
 field).

You'll have to give me an example of where this is a problem but is
not already a problem in email4.

 Why should I reproduce email's logic here?  I don't care if the
 default or concise API raises on surrogates in the str value.  But I'm
 pretty sure that I will want to use str values containing surrogates
 in these contexts (for the same reasons that email module does, for
 example), rather than work with bytes sometimes and strs sometimes.
 
 Please provide a way to return strs-with-surrogates if I ask for them.

Does my proposal make sense?  But note, it raises exactly the backward
compatibility concerns you mention in your next email (that I will reply
to next).  It is an open question whether it is worth opening that door
in order to be able to do extended handling on non-RFC conforming email
(as opposed to just sanitizing it and soldiering on).

--
R. David Murray  www.bitdance.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-06 Thread R. David Murray

On Wed, 06 Oct 2010 22:55:00 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 R. David Murray writes:
 
   version of headers to the email5 API, but since any such data would
   be non-RFC compliant anyway, [access to non-conforming headers by
   reparsing the bytes] will just have to be good enough for now.
 
 But that's potentially unpleasant for, say, Mailman.  AFAICS, what
 you're saying is that Mailman will have to implement a full header
 parser and repair module, or shunt (and wait for administrator
 intervention on) any mail that happens to contain even one byte of
 non-RFC-conforming content in a header it cares about.  (Note that

No, it just means that such bytes would not be preserved for presentation
in the web UI.  They'd show up as '?'s  (Or U+FFFDs, perhaps, if I change
DecodedGenerator to use U+FFFD instead of ?s for the unknown bytes).
As long as BytesGenerator is used on the output side to send the messages,
the bytes will be preserved and presented to the moderator in their email.
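
A sketch of that round trip, using message_from_bytes and BytesGenerator as they shipped in the standard library from Python 3.2 onward (details may differ from the patch as posted). The address and the added header are illustrative.

```python
import io
from email import message_from_bytes
from email.generator import BytesGenerator

# A non-conforming header: raw 8-bit byte 0xF6 with no MIME encoding.
raw = b"From: p\xf6stal <postal@example.com>\nSubject: test\n\nbody\n"
msg = message_from_bytes(raw)

# Modify a *different* header; the dirty one is left untouched.
msg["X-Seen-By"] = "list-manager"   # hypothetical header added by the app

out = io.BytesIO()
BytesGenerator(out).flatten(msg)
wire = out.getvalue()
print(wire)
```

The non-ASCII byte in the From header comes out exactly as it went in, even though the application never looked at it.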

So the only parsing issue is if Mailman cares about *the non-ASCII
bytes* in the headers it cares about.  If it has to modify headers that
contain non-ASCII bytes (for example, addresses and Subject) and cares
about preserving the non-ASCII bytes, then there is indeed an issue;
see previous email for a possible way around that.

 we're not talking about moderator-level admins here; we're talking
 about the Big Cheese with access to the command line on the list
 host.)  That's substantially worse than the current system, where (in
 theory, and in actual practice where it distributes its own version of
 email) it can trap the Unicode exception on a per-header basis.

I thought mailman no longer distributed its own version of email?
And the email API currently promises not to raise during parsing,
which is a contract my patch does not change.

 I also worry about the implications for backwards compatibility.
 Eventually email-N needs to handle non-conforming mail in a sensible
 way, or anybody who gets spam (ie, everybody) and wants a reliable
 email system will need to implement their own.  If you punt completely
 on handling non-conforming mail now, when is it going to be done?  And

We're (in the current patch) not punting on handling non-conforming
email, we're punting on handling non-conforming bytes *if the headers
that contain them need to be modified*.  The headers can still be
modified, you just (currently) lose the non-ASCII bytes in the process.

 when it is done, will the backward-compatible interface be able to
 access the robust implementation, or will people who want robust APIs
 have to use rather different ones?  The way you're going right now, I
 have to worry about the answer to the second question, at least.

Well, this is still theory given the current state of the email6
code, but I *think* that working email5 code, even after this patch,
will continue to work using email6's backward compatibility interface.
And robustness is not the issue, only extended-beyond-the-RFCs handling
of non-conforming bytes would be an issue.

*But*, as I implied in my previous email, if we allow the surrogates
out so that custom header parsers can use them, then making *that*
code continue to work may require an extra layer in the compatibility
interface to produce the surrogateescaped strings.  Still, at the moment
I can't see any theoretical reason why that would not be possible,
so it may be worth the risk.

   [*] Why '?' and not the unicode invalid character character?  Well, the
   email5 Generator.flatten can be used to generate data for transmission over
   the wire *if* the source is RFC compliant and 7bit-only, and this would
   be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects
   ASCII-only strings as input!).  So the data generated by Generator.flatten
   should not include unicode...
 
 I don't understand this at all.  Of course the byte stream generated
 by Generator.flatten won't contain Unicode (in the headers, anyway);
 it will contain only ASCII (that happens to conform to QP or Base64
 encoding of Unicode in some appropriate UTF in many cases).  Why is
 U+FFFD REPLACEMENT CHARACTER any different from any other non-ASCII
 character in this respect?

 (Surely you are not saying that Generator.flatten can't DTRT with
 non-ASCII content *at all*?)

Yes, that is *exactly* what I am saying:

>>> m = email.message_from_string("""\
... From: pöstal
...
... """)
>>> str(m)
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128)

Remember, email5 is a direct translation of email4, and email4 only
handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the-
-ride-fine-we'll-pass-them-along.  So if you want to put non-ASCII
data into a message you have to encode it properly to ASCII in
exactly the same way that you did in email4:

 m = email.message.Message()
 m['From'] =

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-06 Thread Stephen J. Turnbull

R. David Murray writes:

5.  Return the content, with non-ASCII bytes replaced with ?
characters.

That hadn't occurred to me (and it makes me sick to contemplate it).

That said, this is probably good enough for Mailman-like apps to limp
along for most users.  It's certainly good enough for the "might
kick your wife and elope with your dog" alpha ports of Mailman to
Python 3 (well, as certain as I can be; of course in the end Barry
decides).  Assuming reasonable backward compatibility of the API, of
course!

  In other words, my proposed patch only makes email5 1/8 to 1/4
  broken, instead of half broken as it is now.  But not un-broken
  enough for Mailman, it sounds like.

IMO, not in the long run.  But realistically, in the applications I
know of, most desired traffic is conformant, and since there aren't
any Python 3 email apps yet, this isn't even a regression. :-/

I do think that it's important that the parsed object be able to tell
you what fields are there (except if the field name itself is invalid)
and return field bodies parsed as far as possible.

  If we go this route (as opposed to only handling headers with 8bit data by
  sanitizing them), then we need to think about the email5 header parsers
  as well (decode_header and parseaddr).  They are of course going to have
  the same problems as the rest of the email package with parsing bytes,
  and you are suggesting that access to those header 8bit bytes is needed.

Yes, that would be preferable to replacing them with ASCII junk.

But I don't see any problem with parsing them; they're syntactically
insignificant by definition.  The problem is purely on output: do I
get verbatim escaped bytes, a sanitized str, or an exception?

  One option would be to add a keyword to the get and get_all methods
  that instructs it to return the string with the surrogate-escaped
  bytes, which can then be passed onward to decode_header, parseaddr,
  or a custom decoder.  Then I need to look at what needs to be added
  to those methods to handle the escaped bytes, and from what you say
  they too need a keyword telling them to preserve the escaped bytes
  on output (a "yes I know what I'm doing" flag...
  'preserve_escaped_bytes=True'?).

The need is not absolute, but I would have a strong preference for
being able to get at those bytes.

  Does my proposal make sense?  But note, it raises exactly the backward
  compatibility concerns you mention in your next email (that I will reply
  to next).  It is an open question whether it is worth opening that door
  in order to be able to do extended handling on non-RFC conforming email
  (as opposed to just sanitizing it and soldiering on).

Well, maybe not.  However, it is not obvious to me that you won't run
into these issues again in Email6.  Applications that think of email
as textual objects are going to want to make their own choices about
handling of non-conforming email, and it's likely to be massively
inconvenient to say "OK, but you have to use bytes interfaces
exclusively, because the str interfaces don't handle that."

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-06 Thread Stephen J. Turnbull

R. David Murray writes:

  So the only parsing issue is if Mailman cares about *the non-ASCII
  bytes* in the headers it cares about.  If it has to modify headers that
  contain non-ASCII bytes (for example, addresses and Subject) and cares
  about preserving the non-ASCII bytes, then there is indeed an issue;
  see previous email for a possible way around that.

OK.

  I thought mailman no longer distributed its own version of email?

I believe so; the point is that it could do so again.

  And the email API currently promises not to raise during parsing,
  which is a contract my patch does not change.

Which is a contract that has historically been broken frequently.
Unhandled UnicodeErrors have been one of the most common causes of
queue stoppage in Mailman (exceeded only by configuration errors
AFAICS).  I haven't seen any reports for a while, but with the email
package being reengineered from the ground up, the possibility of
regression can't be ignored.

Granted, there should be no regression problem in the current model
for Email5, AIUI.

  We're (in the current patch) not punting on handling non-conforming
  email, we're punting on handling non-conforming bytes *if the headers
  that contain them need to be modified*.  The headers can still be
  modified, you just (currently) lose the non-ASCII bytes in the process.

Modified *or examined*.  I can't think of any important applications
offhand that *need* to examine the non-ASCII bytes (in particular,
Mailman doesn't need to do that).  Verbatim copying of the bytes
themselves is almost always the desired usage.

  And robustness is not the issue, only extended-beyond-the-RFCs handling
  of non-conforming bytes would be an issue.

And with that, I'm certain that Jon Postel is really dead. :-(

   (Surely you are not saying that Generator.flatten can't DTRT with
   non-ASCII content *at all*?)
  
  Yes, that is *exactly* what I am saying:
  
  >>> m = email.message_from_string("""\
  ... From: pöstal
  ...
  ... """)
  >>> str(m)
  Traceback (most recent call last):
    ...
  UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position
  1: ordinal not in range(128)

But that's not interesting; you did that with Python 3.  We want to
know what people porting from Python 2 will expect.  So, in 2.5.5 or
2.6.6 on Mac, with email v4.0.2, it *doesn't* raise, it returns

wideload:~ 4:14$ python
Python 2.5.5 (r255:77872, Jul 13 2010, 03:03:57) 
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> m=email.message_from_string('From: pöstal\n\n')
>>> str(m)
'From nobody Thu Oct  7 04:18:25 2010\nFrom: p\xc3\xb6stal\n\n'
>>> m['From']
'p\xc3\xb6stal'
>>>

That's hardly helpful!  Surely we can and should do better than that
now, especially since UTF-8 (with a proper CTE) is now almost
universally acceptable to MUAs.  When would it be a problem for that
to return

'From nobody Thu Oct  7 04:18:25 2010\nFrom: =?UTF-8?Q?p=C3=B6stal?=\n\n'

  Remember, email5 is a direct translation of email4, and email4 only
  handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the-
  -ride-fine-we'll-pass-them-along.  So if you want to put non-ASCII
  data into a message you have to encode it properly to ASCII in
  exactly the same way that you did in email4:

But if you do it right, then it will still work in a version that just
encodes non-ASCII characters in UTF-8 with the appropriate CTE.  Since
you'll never be passing it non-ASCII characters, it's already ASCII
and UTF-8, and no CTE will be needed.
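
For reference, "doing it right" by hand looks roughly like this in current Python 3. This is a sketch; forcing Q encoding through a Charset instance is just one way to get output shaped like the encoded word suggested earlier in this message.

```python
from email.charset import Charset, QP
from email.header import Header, decode_header
from email.message import Message

utf8_q = Charset("utf-8")
utf8_q.header_encoding = QP       # ask for Q encoding of header words

# Encode the non-ASCII display text into an RFC 2047 encoded word.
encoded = Header("pöstal", utf8_q).encode()

m = Message()
m["From"] = encoded               # already pure ASCII at this point
wire = m.as_string()

decoded = decode_header(encoded)
print(encoded)
print(decoded)
```

Since the value handed to the Message is already ASCII, a library that additionally encoded stray non-ASCII characters itself would leave code written this way unaffected.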

  Yes, exactly.  I need to fix the patch to recode using, say,
  quoted-printable in that case.

It really should check for proportions of non-ASCII.  QP would be
horrible for Japanese or Chinese.
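
A back-of-the-envelope sketch of that check: estimate the encoded size both ways and pick the smaller. The helper names and the exact set of "safe" bytes are assumptions for illustration, not the email package's own tables.

```python
import string

# Bytes the Q encoding can leave unescaped (a conservative subset).
_Q_SAFE = set((string.ascii_letters + string.digits + "!*+-/").encode("ascii"))

def q_length(data):
    # A space becomes "_" (1 char); every other unsafe byte becomes "=XX".
    return sum(1 if (b in _Q_SAFE or b == 0x20) else 3 for b in data)

def b_length(data):
    # Base64 emits 4 characters per 3 input bytes, padded.
    return 4 * ((len(data) + 2) // 3)

def pick_encoding(text, charset="utf-8"):
    data = text.encode(charset)
    return "q" if q_length(data) <= b_length(data) else "b"

for sample in ("pöstal", "山田太郎"):
    data = sample.encode("utf-8")
    print(ascii(sample), q_length(data), b_length(data), pick_encoding(sample))
```

For mostly-ASCII text Q wins and stays readable; for Japanese in UTF-8 every byte is escaped, so Q is more than twice the size of base64.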

  DecodedGenerator could still produce the unicode, though, which is
  what I believe we want.  (Although that raises the question of
  whether DecodedGenerator should also decode the RFC2047 encoded
  headers...but that raises a backward compatibility issue).

Can't really help you there.  While I would want the RFC 2047 headers
decoded if I were writing new code (which is generally the case for
me), I haven't really wrapped my head around the issues of porting old
code using Python2 str to Python3 str here.  My intuition says no
problem (there won't be any MIME-words so the app won't try to decode
them), but I'm not real sure of that. ;-)

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-05 Thread Nick Coghlan

On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull step...@xemacs.org wrote:
 R. David Murray writes:
   Only if the email package contains a coding error would the
   surrogates escape and cause problems for user code.

 I don't think it is reasonable to internalize surrogates that way;
 some applications *will* want to look at them and do something useful
 with them (delete them or replace them with U+FFFD or ...).  However,
 I argue below that the presence of surrogates already means the user
 code is under fire, and this puts the problem in a canonical form so
 the user code can prepare for it (if that is desirable).

Hang on here, this objection doesn't seem to quite mesh with what RDM
is proposing (and the similar trick I am considering for
urllib.parse).

The basic issue is having an algorithm that is designed to operate on
character data and depends on multiple ASCII constants stored as str
objects.

In Python 2.x, those algorithms could innately operate on str objects
in any ASCII compatible encoding, as well as on unicode objects (due
to the implicit promotion of the ASCII constants to unicode when
unicode input was encountered).

In Py3k, that trick broke. Now those algorithms only operate on str
objects, and bytes input fails, even when it uses an ASCII compatible
encoding.

For urllib.parse, the external API will be str in - str out, bytes
in - bytes out. Whether that is internally implemented by
duplicating all the ASCII constants with both bytes and str flavours
(as my current patch does), or implicitly (and temporarily) decoding
the bytes values using ascii+surrogateescape or latin-1 (a pair of
alternative approaches I plan to explore soon) should be completely
transparent to the user of the API. If a user can easily tell which of
these I am doing just through the external behaviour of the documented
API, then I'll have made a mistake somewhere.
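
The "implicitly decode, run the str algorithm, re-encode" approach can be sketched on a toy function. split_field is hypothetical and is not how urllib.parse is actually implemented; it only shows the str-in/str-out, bytes-in/bytes-out shape.

```python
def split_field(line):
    """Split 'Name: value' and return the same type that was passed in."""
    is_bytes = isinstance(line, (bytes, bytearray))
    # Temporarily decode bytes; 8-bit bytes become lone surrogates.
    text = line.decode("ascii", "surrogateescape") if is_bytes else line

    # The algorithm itself only ever uses str constants.
    name, _, value = text.partition(":")
    name, value = name.strip(), value.strip()

    if is_bytes:
        return (name.encode("ascii", "surrogateescape"),
                value.encode("ascii", "surrogateescape"))
    return name, value

print(split_field("Subject: hi"))
print(split_field(b"From: p\xf6stal"))
```

From the outside the caller cannot tell whether the constants were duplicated or the input was temporarily decoded, which is the property described above.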

My understanding is that email6 in 3.3 will essentially follow that
same model. What I believe RDM is suggesting is an in-between approach
for the 3.2 email module:

- if you pass in bytes data that isn't 7-bit clean and naively use the
str APIs to access the headers, then it will complain loudly if it is
about to return escaped data (but will decode the body in accordance
with the Content Transfer Encoding)
- if you pass in bytes data and know what you are doing, then you can
access that raw bytes data and do your own decoding

I've probably grossly oversimplified what RDM is suggesting, but it
sounds plausible as a useful interim stepping stone to the more
comprehensive type separation in email6.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-05 Thread R. David Murray

On Tue, 05 Oct 2010 22:05:33 +1000, Nick Coghlan wrote:
 On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull step...@xemacs.org 
 wrote:
  R. David Murray writes:
   Only if the email package contains a coding error would the
   surrogates escape and cause problems for user code.
 
  I don't think it is reasonable to internalize surrogates that way;
  some applications *will* want to look at them and do something useful
  with them (delete them or replace them with U+FFFD or ...). However,
  I argue below that the presence of surrogates already means the user
  code is under fire, and this puts the problem in a canonical form so
  the user code can prepare for it (if that is desirable).
 
 Hang on here, this objection doesn't seem to quite mesh with what RDM
 is proposing (and the similar trick I am considering for
 urllib.parse).

[snip Nick's clear explanation of the issue and using surrogates to
allow string-based algorithms to work]

 My understanding is that email6 in 3.3 will essentially follow that
 same model. What I believe RDM is suggesting is an in-between approach
 for the 3.2 email module:
 
 - if you pass in bytes data that isn't 7-bit clean and naively use the
 str APIs to access the headers, then it will complain loudly if it is
 about to return escaped data (but will decode the body in accordance
 with the Content Transfer Encoding)

Almost correct.  What it will do when it does not have the information
needed to decode the bytes correctly (ie: the message is not RFC
compliant) is to replace the unknown bytes with '?' characters.  This
means that you can render a dirty email to the terminal, for example,
and the invalid bytes will show as '?'s.[*]
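
The two sanitizing choices mentioned here ('?' versus U+FFFD) can be sketched with plain codec error handlers. This is an illustration of the idea, not the patch's actual code.

```python
raw = b"From: p\xf6stal\n"

# Internal form: unknown bytes carried as lone surrogates.
escaped = raw.decode("ascii", "surrogateescape")

# ASCII-safe rendering: each unknown byte becomes '?'.
with_question_marks = escaped.encode("ascii", "replace").decode("ascii")

# Unicode rendering: each unknown byte becomes U+FFFD instead.
with_replacement_char = raw.decode("ascii", "replace")

print(ascii(with_question_marks))
print(ascii(with_replacement_char))
```

Only the first form is still pure ASCII, which matters if the result is later passed to something that expects ASCII-only strings.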

 - if you pass in bytes data and know what you are doing, then you can
 access that raw bytes data and do your own decoding

With the current patch this is a true statement for message bodies, but
not for message headers.  There is no easy way to add access to the bytes
version of headers to the email5 API, but since any such data would be
non-RFC compliant anyway, that will just have to be good enough for now.

 I've probably grossly oversimplified what RDM is suggesting, but it
 sounds plausible as a useful interim stepping stone to the more
 comprehensive type separation in email6.

The more I look at the patch the more I think this can be an internal
implementation detail in email5 just like you might do for urllib.
So the email5 API will have a way to put bytes in, a way to get decoded
data out, and a way to get a bytes out (except for individual header
values).  The model object will be the same no matter what you put in
or take out.  The additional methods added to the email5 API to make
this possible will be:

message_from_bytes (and Parser.parsebytes)
message_from_binary_file
Feedparser.feedbytes
BytesGenerator

message_from_bytes and message_from_binary_file are currently part
of the proposed email6 API, and I was thinking about some version of
Feedparser.feedbytes[**].  BytesGenerator wasn't, but now perhaps it
will be (and certainly will be in the backward compatibility interface).
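
For orientation, incremental bytes parsing as it eventually shipped looks roughly like the sketch below. In the released standard library (Python 3.2+) the class is email.feedparser.BytesFeedParser with a feed() method, not the Feedparser.feedbytes spelling proposed in this message.

```python
from email.feedparser import BytesFeedParser

parser = BytesFeedParser()

# Chunks may split lines anywhere; the parser buffers partial lines.
for chunk in (b"Subject: chunk", b"ed input\n", b"\n", b"body line\n"):
    parser.feed(chunk)

msg = parser.close()
print(msg["Subject"])
print(repr(msg.get_payload()))
```

Only bytes are fed to this parser; the str-based FeedParser remains a separate class.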

--
R. David Murray  www.bitdance.com

[*] Why '?' and not the unicode invalid character character?  Well, the
email5 Generator.flatten can be used to generate data for transmission over
the wire *if* the source is RFC compliant and 7bit-only, and this would
be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects
ASCII-only strings as input!).  So the data generated by Generator.flatten
should not include unicode...which raises a problem for CTE 8bit sections
that the patch doesn't currently address.

[**] Benjamin asked how the patch would affect backward compatibility
support in email6, and I said it wouldn't make it harder.  However,
if feedbytes calls can be mixed with feed calls, which in the simplest
implementation they could be, then if email6 does *not* use surrogates
internally its feedparser algorithm would need to be considerably
more complicated to be backward compatible with this.  So when I add
Feedparser.feedbytes to my patch, I am at least initially going to
disallow mixing calls to feed and feedbytes.  Which is another reason
to add that method so as to keep the use of the surrogateescape an
implementation detail.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-05 Thread Stephen J. Turnbull

Nick Coghlan writes:

 > - if you pass in bytes data and know what you are doing, then you can
 >   access that raw bytes data and do your own decoding

At what level, though?

To take an interesting example I used to see frequently:

From: t...@tokyo.jp
  (Taro Yamada in 8-bit Shift JIS)

So I guess you are suggesting that the email module can RFC 822 parse
that, and

1.  Refuse to return the unwrapped (ie, single line) form of the whole
field, except as bytes.
2.  Refuse to return the content of the From field, except as bytes.
3.  Return the email address parsed from the From field.
4.  Refuse to return the comment, except as bytes.

That's fine.  But suppose I have a private or newly defined header
that is structured?  Now I have two choices:

1.  Write a version of my private parser for both str (the normal
case) and bytes (if accessing the value as str raises)

2.  Always get the bytes and convert them to str (probably using the
same .decode('ascii', 'surrogateescape') call that email uses but
won't let me have the value of!), then use a common str parser.
Note that this is more problematic than it looks, since the
appropriate base codec may require information from higher-level
structures (eg, qp codec tags or a Content-Type header's charset
field).
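
Choice 2 can be sketched as follows.  The header syntax
("key=value; key=value"), the function name parse_params and the sample
values are all hypothetical; the point is only that a single str parser
can serve both inputs once bytes are decoded with surrogateescape.

```python
def parse_params(value):
    """Parse a hypothetical 'key=value; key=value' header into a dict."""
    if isinstance(value, bytes):
        # Undecodable bytes become lone surrogates instead of raising.
        value = value.decode("ascii", "surrogateescape")
    params = {}
    for item in value.split(";"):
        key, sep, val = item.partition("=")
        if sep:
            params[key.strip().lower()] = val.strip()
    return params


name = "太郎"
clean = parse_params("id=42; owner=taro")
dirty = parse_params(b"id=42; owner=" + name.encode("shift_jis"))

# Once the real charset is known from elsewhere, the escaped bytes can
# be recovered exactly and decoded properly.
owner = dirty["owner"].encode("ascii", "surrogateescape").decode("shift_jis")
```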

Why should I reproduce email's logic here?  I don't care if the
default or concise API raises on surrogates in the str value.  But I'm
pretty sure that I will want to use str values containing surrogates
in these contexts (for the same reasons that email module does, for
example), rather than work with bytes sometimes and strs sometimes.

Please provide a way to return strs-with-surrogates if I ask for them.

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-04 Thread Scott Dial

On 10/2/2010 7:00 PM, R. David Murray wrote:
> The clever hack (thanks ultimately to Martin) is to accept 8bit data
> by decoding it using the ASCII codec and the surrogateescape error
> handler.

I've seen this idea pop up in a number of threads. I worry that you are
all inventing a new kind of dual that is a direct parallel to Python 2.x
strings. That is to say,

3.x >>> b = b'\xc2\xa1'
3.x >>> s = b.decode('utf8')
3.x >>> v = b.decode('ascii', 'surrogateescape')

, where s and v should be the same thing in 3.x but they are not due
to an encoding trick. I believe this trick generates more-or-less the
same issues as strings did in 2.x:

2.x >>> b = '\xc2\xa1'
2.x >>> s = b.decode('utf8')
2.x >>> v = b

Any reasonable 2.x code has to guard on str/unicode and it would seem in
3.x, if this idiom spreads, reasonable code will have to guard on
surrogate escapes (which actually seems like a more expensive test). As in,

3.x >>> print(v)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in
position 0: surrogates not allowed

It seems like this hack is about making the 3.x unicode type more like
the 2.x string type, and I thought we decided that was a bad idea. How
will developers not have to ask themselves whether a given string is a
real string or a byte sequence masquerading as a string? Am I missing
something here?
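
The difference between s and v, and the kind of guard being worried
about, can be sketched like this.  The helper name is_real_text is
hypothetical and is only one possible way to write such a guard.

```python
b = b"\xc2\xa1"
s = b.decode("utf-8")                     # '¡', one real character
v = b.decode("ascii", "surrogateescape")  # '\udcc2\udca1', two lone surrogates


def is_real_text(text: str) -> bool:
    """Return True if text encodes normally, i.e. holds no escaped bytes."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```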

-- 
Scott Dial
sc...@scottdial.com
scod...@cs.indiana.edu

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-04 Thread R. David Murray

On Mon, 04 Oct 2010 12:32:26 -0400, Scott Dial scott+python-...@scottdial.com 
wrote:
> On 10/2/2010 7:00 PM, R. David Murray wrote:
> > The clever hack (thanks ultimately to Martin) is to accept 8bit data
> > by decoding it using the ASCII codec and the surrogateescape error
> > handler.
>
> I've seen this idea pop up in a number of threads. I worry that you are
> all inventing a new kind of dual that is a direct parallel to Python 2.x
> strings.

Yes, that is exactly my worry.

> That is to say,
>
> 3.x >>> b = b'\xc2\xa1'
> 3.x >>> s = b.decode('utf8')
> 3.x >>> v = b.decode('ascii', 'surrogateescape')
>
> , where s and v should be the same thing in 3.x but they are not due
> to an encoding trick.

Why should they be the same thing in 3.x?  One is an ASCII string with
some escaped bytes in an unknown encoding, the other is a valid unicode
string.  The surrogateescape trick is used only when we don't *know*
the encoding (a priori) of the bytes in question.

> I believe this trick generates more-or-less the same issues as strings
> did in 2.x:
>
> 2.x >>> b = '\xc2\xa1'
> 2.x >>> s = b.decode('utf8')
> 2.x >>> v = b

The difference is that in 2.x people could and would operate on strings as
if they knew the encoding, and get in trouble.  In 3.x you can't do that.
If you've got escaped bytes you *know* that you don't know the encoding,
and the program can't get around that except by re-encoding to bytes
and properly decoding them.
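
A minimal sketch of "re-encoding to bytes and properly decoding them".
The function name redecode is hypothetical, and the charset has to come
from somewhere outside the string itself.

```python
def redecode(escaped: str, charset: str) -> str:
    """Recover the original bytes, then decode them with a known charset."""
    # The surrogateescape handler reverses the escaping exactly.
    original = escaped.encode("ascii", "surrogateescape")
    return original.decode(charset)


v = b"\xc2\xa1".decode("ascii", "surrogateescape")
fixed = redecode(v, "utf-8")
```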

> Any reasonable 2.x code has to guard on str/unicode and it would seem in
> 3.x, if this idiom spreads, reasonable code will have to guard on
> surrogate escapes (which actually seems like a more expensive test). As in,
>
> 3.x >>> print(v)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in
> position 0: surrogates not allowed

Right, I mentioned that concern in my post.

In this case at least, however, the *goal* is that the surrogates are
never seen outside the email internals.  In reflection of this, my latest
thought is that I should add a 'message_from_binary_file' helper method
and a 'feedbytes' method to feedparser, making the surrogates a 100%
internal implementation detail[*].  Only if the email package contains a
coding error would the surrogates escape and cause problems for user
code.

> It seems like this hack is about making the 3.x unicode type more like
> the 2.x string type, and I thought we decided that was a bad idea. How
> will developers not have to ask themselves whether a given string is a
> real string or a byte sequence masquerading as a string? Am I missing
> something here?

I think this question is something that needs to be considered any
time using surrogates is proposed.  I hope that in the email package
proposal I've addressed it.  What do you think?

--David

[*] And you are right that there is a performance concern as a result
of needing to detect surrogates at various points in the code.
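
To make the detection cost concrete, here is one way such a check could
look.  This is a sketch under the assumption that a regular-expression
scan is acceptable; the name has_escaped_bytes is hypothetical and not
taken from the patch.  The scan is linear in the length of the string,
which is where the performance concern comes from.

```python
import re

# Lone surrogates produced by surrogateescape fall in U+DC80-U+DCFF.
_ESCAPED = re.compile("[\udc80-\udcff]")


def has_escaped_bytes(text: str) -> bool:
    """Return True if text contains at least one surrogate-escaped byte."""
    return _ESCAPED.search(text) is not None


dirty = b"caf\xe9".decode("ascii", "surrogateescape")
```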

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-04 Thread Barry Warsaw

On Oct 02, 2010, at 07:00 PM, R. David Murray wrote:

> The advantage of this patch is that it means Python3.2 can have an
> email module that is capable of handling a significant proportion of
> the applications where the ability to process binary email data is
> required.

Like others, I'm concerned that we're perpetuating the Python 2 problems with
bytes vs. strings.  OTOH, I went down a similar road (though much more hacky
and less successful) in one of my failed branches, so I sympathize with this
nod to practicality that actually works.

If the choice is the current brokenness staying in Python 3.2 or this hack
being added for now, I'd go with the latter.  email6 will make it all better,
right? :)  In the meantime, I do think it would be good to give our users
something that's practical.

> I've uploaded the patch to issue 4661 (http://bugs.python.org/issue4661). I
> uploaded it to rietveld as well just before Martin's announcement. After the
> announcement I uploaded the svn patch to the tracker, so hopefully there will
> be an automated review button as well.  Here is your chance to exercise the
> new review tools :)

I see no automatically generated link to the review, but I did add some
comments to the Rietveld issue you linked to in one of your comments.

-Barry



Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-04 Thread Stephen J. Turnbull

R. David Murray writes:
 > On Mon, 04 Oct 2010 12:32:26 -0400, Scott Dial
 > scott+python-...@scottdial.com wrote:
 > > On 10/2/2010 7:00 PM, R. David Murray wrote:
 > > > The clever hack (thanks ultimately to Martin) is to accept 8bit data
 > > > by decoding it using the ASCII codec and the surrogateescape error
 > > > handler.
 > >
 > > I've seen this idea pop up in a number of threads. I worry that you are
 > > all inventing a new kind of dual that is a direct parallel to Python 2.x
 > > strings.
 >
 > Yes, that is exactly my worry.

I don't worry about this.  Strings generated by decoding with
surrogate-escape are *different* from other strings: they contain
invalid code units (the naked surrogates).  These cannot be encoded
except with a surrogate-escape flag to .encode(), and a sane developer
won't do that unless she knows precisely what she's doing.  This is
not true with Python 2 strings, where all bytes are valid.

 > > Any reasonable 2.x code has to guard on str/unicode and it would seem in
 > > 3.x, if this idiom spreads, reasonable code will have to guard on
 > > surrogate escapes (which actually seems like a more expensive test).
 >
 > Right, I mentioned that concern in my post.

Again, I don't worry about this.  It is *not* an *extra* cost.  Those
messages are *already broken*, they *will* crash the email module if
you fail to guard against them.  Decoding them to surrogates actually
makes it easier to guard, because you know that even if broken
encodings are present, the parser will still work.  Broken encodings
can no longer crash the parser.  That is a Very Good Thing IMHO.

 > Only if the email package contains a coding error would the
 > surrogates escape and cause problems for user code.

I don't think it is reasonable to internalize surrogates that way;
some applications *will* want to look at them and do something useful
with them (delete them or replace them with U+FFFD or ...).  However,
I argue below that the presence of surrogates already means the user
code is under fire, and this puts the problem in a canonical form so
the user code can prepare for it (if that is desirable).
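
The two treatments mentioned above (delete the escaped bytes, or replace
them with U+FFFD) can be sketched with a round trip through bytes.  The
function names are hypothetical.  Note the assumption that the string
holds only ASCII plus escaped bytes, which is what decoding with
surrogateescape produces; a string with other non-ASCII characters would
make the encode step raise.

```python
def replace_escaped(text: str) -> str:
    """Replace each surrogate-escaped byte with U+FFFD."""
    raw = text.encode("ascii", "surrogateescape")  # back to original bytes
    return raw.decode("ascii", "replace")


def drop_escaped(text: str) -> str:
    """Delete each surrogate-escaped byte."""
    raw = text.encode("ascii", "surrogateescape")
    return raw.decode("ascii", "ignore")


v = b"caf\xc3\xa9!".decode("ascii", "surrogateescape")
replaced = replace_escaped(v)
dropped = drop_escaped(v)
```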

 > > It seems like this hack is about making the 3.x unicode type more like
 > > the 2.x string type,

Not at all.  It's about letting the parser be a parser, and letting
the application handle broken content, or discard it, or whatever.
Modularity is improved.  This has been a major PITA for Mailman
support over the years: every time the spammers and virus writers come
up with a new idea, there's a chance it will leak out and the email
parser will explode, stopping the show.  These kinds of errors are a
FAQ on the Mailman lists (although much less so in recent years).

 > > How will developers not have to ask themselves whether a given
 > > string is a real string or a byte sequence masquerading as a
 > > string? Am I missing something here?

There are two things to say, actually.  First, you're in a war zone.
*All* email is bytes sequences masquerading as text, and if you're not
wearing armor, you're going to get burned.  The idea here is to have
the email package provide the armor and enough instrumentation so you
can do bomb detection yourself (or perhaps just let it blow, if you're
hacking up a quick and dirty script).

Second, there are developers who will not care whether strings are
real or byte sequences in drag, because they're writing MTAs and
the like.  Those people get really upset, and rightly so, when the
parser pukes on broken headers; it is not their app's job at all to
deal with that breakage.

 > I think this question is something that needs to be considered any
 > time using surrogates is proposed.

I don't agree.  The presence of naked surrogates is *always* (assuming
sane programmers) an indication of invalid input.  The question is,
should the parser signal invalidity, or should it allow the
application to decide?  The email module *doesn't have enough
information to decide* whether the invalid input is a real problem,
or how to handle it (cf the example of a MTA app).  Note that a
completely naive app doesn't care -- it will crash either way because
it doesn't handle the exception, whether it's raised by the parser or
by a codec when the app tries to do I/O.  A robust app *does* care: if
the parser raises, then the app must provide an alternative parser
good enough to find and fix the invalid bytes.  Clearly it's much
better to pass invalid (but fully parsed) text back to the app in this
case.

Note that if the app really wants the parser to raise rather than pass
on the input, that should be easy to implement at fairly low cost; you
just provide a variable rather than hardcoding the surrogate-escape
flag.
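
A sketch of what "a variable rather than hardcoding" could mean, assuming
the parser simply forwards an error-handler name to the codec.  The
function decode_header_line and its errors parameter are hypothetical and
not part of the email API.

```python
def decode_header_line(raw: bytes, errors: str = "surrogateescape") -> str:
    """Decode one raw header line; the caller picks the error handling."""
    return raw.decode("ascii", errors)


# Default: invalid bytes are passed through as lone surrogates.
lenient = decode_header_line(b"X-Note: caf\xe9")

# An app that wants the parser to raise asks for 'strict' instead.
try:
    decode_header_line(b"X-Note: caf\xe9", errors="strict")
    raised = False
except UnicodeDecodeError:
    raised = True
```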

[Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-02 Thread R. David Murray

A while back on some issue or another I remember telling someone that
if there was any sort of clever hack that would allow the current email
package (email5) to work with bytes we would have implemented it.

Well, I've come up with a clever hack.

The idea came out of a conversation with Antoine.  I was saying that it
was ironic that Unicode could only be used as a 7bit-clean data
transmission channel for email, and he remarked that by using
surrogate escape you *could* use unicode as a transmission channel
for 8bit data.  At first I dismissed this observation as irrelevant
to email, since email has to transform the 8bit data at some point.

But I started thinking.  And then I started experimenting.  And it turns
out that it works.

The clever hack (thanks ultimately to Martin) is to accept 8bit data
by decoding it using the ASCII codec and the surrogateescape error
handler.  Then, inside the email module at any point where bytes might
be meaningful or might be about to escape, it can check to see if there
are any surrogates and act accordingly.

The API additions are few, and in fact for most programs (he says bravely,
not really knowing) there are really only two changes you need to make
when converting a program that handles bytes data to py3k.  The first
is the decoding of binary input data as mentioned.  The second is that
when you want to get the bytes back out, you use the new BytesGenerator
instead of Generator.  BytesGenerator is just like Generator except
that it writes bytes to its file argument instead of strings, and it
recovers any bytes that were in the original input.

So given this sequence:

import email.generator

msg = email.message_from_file(open('myfile',
                                   encoding='ascii',
                                   errors='surrogateescape'))
email.generator.BytesGenerator(open('myfile2', 'wb')).flatten(msg)

myfile and myfile2 will theoretically be identical (modulo universal
newline and _mangle_from issues).
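
A self-contained version of the same round trip, written against the
email package as released rather than the patch under review.  The file
contents are made up for illustration, and temporary files stand in for
myfile and myfile2.

```python
import email
import os
import tempfile
from email.generator import BytesGenerator

raw = b"Subject: test\n\nbody with an 8bit byte: \xe9\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "myfile")
    dst = os.path.join(tmp, "myfile2")
    with open(src, "wb") as f:
        f.write(raw)

    # Text-mode read: non-ASCII bytes arrive as lone surrogates.
    with open(src, encoding="ascii", errors="surrogateescape") as f:
        msg = email.message_from_file(f)

    # Binary-mode write: BytesGenerator restores the original bytes.
    with open(dst, "wb") as f:
        BytesGenerator(f, mangle_from_=False).flatten(msg)

    with open(dst, "rb") as f:
        copy = f.read()
```

Using with blocks closes both files before they are compared, which the
one-line form above leaves to the garbage collector.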

I've additionally added a 'message_from_bytes' convenience function.

One nice feature of this patch is that once you've got the model built
from surrogateescaped input, if you do a get_payload() on a message body
whose Content-Transfer-Encoding is '8bit' you will get the body decoded
to unicode using the charset declared in the Content-Type header
(assuming Python supports that charset).

You can always get at the bytes version of the body of a message part by
using get_payload(decode=True) [*].  You can't really get at the bytes
version of message headers, though...for safety if you access a header
whose value contains non-ASCII chars (that aren't RFC2047 encoded to be
ASCII) the 8bit characters get replaced with '?'s.  (But BytesGenerator
will emit the original 8bit characters if the headers haven't been
modified.)
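
The two views of a body can be sketched as follows, again using the
email package as released.  The sample message is an assumption made up
for illustration.

```python
import email

raw = (
    b"Subject: menu\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9\n"
)
msg = email.message_from_bytes(raw)

# str view: decoded with the charset declared in Content-Type.
text = msg.get_payload()

# bytes view: the body exactly as it was received.
data = msg.get_payload(decode=True)
```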

I do not propose that this is a *good* API, since it has the classic
problem that if there are coding bugs in the email module strings may
escape that have surrogates in them and we end up with programs that
work most of the time...except when they fail with mysterious errors
because of unusual bytes input data.  On the other hand you always
*know* when you have bytes data in an unknown encoding (because they
are surrogate escaped), so it is ever so much better than the Python2
situation.

The advantage of this patch is that it means Python3.2 can have an
email module that is capable of handling a significant proportion of the
applications where the ability to process binary email data is required.

I've uploaded the patch to issue 4661 (http://bugs.python.org/issue4661).
I uploaded it to rietveld as well just before Martin's announcement.
After the announcement I uploaded the svn patch to the tracker, so
hopefully there will be an automated review button as well.  Here
is your chance to exercise the new review tools :)

This patch does break two of Barry's patch-for-review rules: it is
more than 800 lines of diff (but not a lot more, and less than 800
if you count only code diff and not docs), and it did not have a very
extensive design discussion beforehand.  I did talk with people on IRC,
particularly Barry, before finishing the patch, and I did post a summary
to the email-sig mailing list (but got no response).

Now it is time to see what the wider community thinks.  There is some
question of whether this is a bending of the string/bytes separation
that doesn't belong as part of the standard library, but after working
my way through it I think it is a fairly clean hack[**], and most
likely a case where practicality beats purity.

Regardless of whether or not this patch or a descendant thereof is
accepted I still intend to continue working on email6.  There are many
other bugs in the current email package that require a rewrite of parts
of its infrastructure, and the email-sig is agreed that the email API
needs revision quite apart from the bytes/string issues.  However, there
is something pleasing about the simplicity of this way of handling bytes
that I intend to consider carefully while we work further on email6.

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-02 Thread Benjamin Peterson

2010/10/2 R. David Murray rdmur...@bitdance.com:
> Regardless of whether or not this patch or a descendant thereof is
> accepted I still intend to continue working on email6.  There are many
> other bugs in the current email package that require a rewrite of parts
> of its infrastructure, and the email-sig is agreed that the email API
> needs revision quite apart from the bytes/string issues.  However, there
> is something pleasing about the simplicity of this way of handling bytes
> that I intend to consider carefully while we work further on email6.

And how would this addition interact with changes in email6?



-- 
Regards,
Benjamin

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-02 Thread R. David Murray

On Sat, 02 Oct 2010 19:15:57 -0500, Benjamin Peterson benja...@python.org 
wrote:
> 2010/10/2 R. David Murray rdmur...@bitdance.com:
> > Regardless of whether or not this patch or a descendant thereof is
> > accepted I still intend to continue working on email6.  There are many
> > other bugs in the current email package that require a rewrite of parts
> > of its infrastructure, and the email-sig is agreed that the email API
> > needs revision quite apart from the bytes/string issues.  However, there
> > is something pleasing about the simplicity of this way of handling bytes
> > that I intend to consider carefully while we work further on email6.
>
> And how would this addition interact with changes in email6?

It will be no harder to do the backward compatibility support for this
than for the rest of the email5 API, if that's what you are asking.
Assuming my plan for backward compatibility works at all (which it
should).

--David

Re: [Python-Dev] Patch making the current email package (mostly) support bytes

2010-10-02 Thread Nick Coghlan

On Sun, Oct 3, 2010 at 9:00 AM, R. David Murray rdmur...@bitdance.com wrote:
> I do not propose that this is a *good* API, since it has the classic
> problem that if there are coding bugs in the email module strings may
> escape that have surrogates in them and we end up with programs that
> work most of the time...except when they fail with mysterious errors
> because of unusual bytes input data.  On the other hand you always
> *know* when you have bytes data in an unknown encoding (because they
> are surrogate escaped), so it is ever so much better than the Python2
> situation.

It's a similar concept to one Antoine and I (and some others) have
been considering in the tracker for making urllib.parse able to handle
ASCII-compatible bytes-encodings. I've already implemented a version
of that patch which has parallel bytes and str versions of all the
ASCII constants, and the result is pretty ugly. My next goal is to
implement a version that uses the same trick you have here for email
and see how the code complexity compares.

We do need to tread carefully to make sure the pseudo strings don't
escape, but the other approach requires similar care all the way
through the internal algorithms to make sure they aren't assuming
bytes or str instances anywhere.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
