At 02:58 AM 6/22/2010 +0900, Stephen J. Turnbull wrote:
> Nick alluded to The One Obvious Way as a change in architecture.
>
> Specifically: Decode all bytes to typed objects (str, images, audio,
> structured objects) at input.  Do no manipulations on bytes ever
> except decode and encode (both to text, and to special-purpose objects
> such as images) in a program that does I/O.

This ignores the existence of use cases where what you have is text that can't be properly encoded in unicode. I know, it's a hard thing to wrap one's head around, since on the surface it sounds like unicode is the programmer's savior. Unfortunately, real-world text data exists which cannot be safely roundtripped to unicode, and must be handled in "bytes with encoding" form for certain operations.
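
(For the skeptical, here's a toy illustration from memory -- not the exact case we hit, but the classic one: Japanese mail labeled shift_jis that was really produced by a cp932-speaking client. The bytes decode just fine; the trouble is that what you get back depends on which codec you believe, and it won't necessarily encode back out under the label you were given:

    # Same two bytes, two different "right" answers, depending on which
    # codec you believe the sender actually used:
    data = b"\x81\x60"
    data.decode("shift_jis")    # -> U+301C WAVE DASH
    data.decode("cp932")        # -> U+FF5E FULLWIDTH TILDE

    # And the character the "official" codec handed you won't go back out
    # through the codec the real-world client actually implements:
    try:
        "\u301c".encode("cp932")
    except UnicodeEncodeError as exc:
        print(exc)              # 'cp932' codec can't encode character '\u301c' ...

Pick the wrong codec and your data silently mutates; pick the "right" one and something blows up downstream. Either way, "just decode it" didn't make the problem go away.)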

I personally do not have to deal with this *particular* use case any more -- I haven't been at NTT/Verio for six years now. But I do know it exists for e.g. Asian language email handling, which is where I first encountered it. At the time (this *may* have changed), many popular email clients did not actually support unicode, so you couldn't necessarily just send off an email in UTF-8. It drove us nuts on the project where this was involved (an i18n of an existing Python app), and I think we had to compromise a bit in some fashion (because we couldn't really avoid unicode roundtripping due to database issues), but the use case does actually exist.
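
(For concreteness -- and this is today's 3.x spelling of the workaround, not what we actually wrote back then -- the fix was to force the body out in ISO-2022-JP, because that's what the clients on the other end could actually read:

    from email.mime.text import MIMEText

    body = "こんにちは、世界"     # the Japanese text we actually need to send
    # UTF-8 would be the obvious choice, but the receiving clients couldn't
    # read it, so the body has to go out as ISO-2022-JP instead:
    msg = MIMEText(body, "plain", "iso-2022-jp")
    msg["Subject"] = "Test"
    print(msg.as_string())       # 7-bit, ISO-2022-JP-encoded body

Which works right up until somebody's name or address contains a character that ISO-2022-JP can't represent -- and then you're back to the problem above.)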

My current needs are simpler, thank goodness. ;-) However, they *do* involve situations where I'm dealing with *other* encoding-restricted legacy systems, such as software for interfacing with the US Postal Service that only works with a restricted subset of latin-1, while receiving mangled ASCII from an ecommerce provider, and storing things in what's effectively a latin-1 database. Being able to easily assert what kind of bytes I've got would actually let me catch errors sooner, *if* those assertions were checked when different kinds of strings or bytes were combined (i.e., at coercion time).
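
Something like this toy sketch is what I mean by "checked at coercion time" -- the class and names are made up, and a real version would have to cover slicing, formatting, and so on, but the idea is simply that bytes carry a declared encoding and that combining chunks with incompatible declarations fails loudly instead of silently producing mojibake:

    class ebytes(bytes):
        """Bytes tagged with the encoding they claim to be in (toy sketch)."""

        def __new__(cls, data, encoding):
            self = super().__new__(cls, data)
            self.encoding = encoding.lower()
            return self

        def __add__(self, other):
            # The "assertion at coercion time": refuse to mix declared encodings.
            if isinstance(other, ebytes) and other.encoding != self.encoding:
                raise ValueError("cannot combine %r data with %r data"
                                 % (self.encoding, other.encoding))
            return ebytes(bytes(self) + bytes(other), self.encoding)

        def decode(self, encoding=None, errors="strict"):
            return super().decode(encoding or self.encoding, errors)

    usps = ebytes("Straße 1".encode("latin-1"), "latin-1")
    web = ebytes("café".encode("utf-8"), "utf-8")

    usps + ebytes(b"\xe9", "latin-1")    # fine: same declared encoding
    try:
        usps + web                       # incompatible declarations...
    except ValueError as exc:
        print("caught at coercion time:", exc)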


> Yes, this is tedious if you live in an ASCII world, compared to using
> bytes as characters.  However, it works for the rest of us, which the
> old style doesn't.

I'm not trying to go back to the old style -- ideally, I want something that, if it were available in 2.x, would actually improve on the "it's not really unicode" use cases above.

I don't want to be "encoding agnostic" or "encoding implicit" -- I want to make it possible to be even *more* explicit and restrictive than it is currently possible to be in either 2.x OR 3.x. It's just that 3.x affords greater opportunity for doing this, and is an ideal place to make the switch -- i.e., at a point where you now have to get explicit about your encodings anyway!
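
By "more explicit and restrictive" I mean being able to say, right at the boundary, exactly what a given hunk of text is allowed to contain. A hypothetical helper (the allowed-character set below is illustrative, not the real USPS rule):

    import re

    # Hypothetical whitelist of characters the legacy interface accepts;
    # everything in it is latin-1-encodable by construction.
    USPS_OK = re.compile(r"^[A-Za-z0-9 .,#/\-ÀÂÇÉÈÊËÎÏÔÙÛÜàâçéèêëîïôùûü]*$")

    def to_usps(text):
        if not USPS_OK.match(text):
            raise ValueError("text the USPS interface cannot accept: %r" % text)
        return text.encode("latin-1")    # cannot fail once the check has passed

    to_usps("123 Main St.")              # -> b'123 Main St.'
    try:
        to_usps("Łódź 7")                # rejected at the boundary, not downstream
    except ValueError as exc:
        print(exc)

The point being that the error shows up where the bad data enters, not three layers later when some implicit coercion finally trips over it.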


> As for "Think Carefully About It Every Time", that is required only in
> Porting Programs That Mix Operation On Bytes With Operation On Str.
> If you write programs from scratch, however, the decode-process-encode
> paradigm quickly becomes second nature.

Which works if and only if your outputs are truly unicode-able. If you work with legacy systems (e.g. those Asian email clients and US postal software), you are really working with a *character set*, not unicode, and so putting your data in unicode form is actually *wrong* -- an expedient lie.

Heresy, I know, but there you go.  ;-)
