P.J. Eby writes:

> I know, it's a hard thing to wrap one's head around, since on the
> surface it sounds like unicode is the programmer's savior.
I don't need to wrap my head around it.  It's been deeply embedded,
point first, and the nasty barbs ensure that I have no desire to pull
it back out.  To wit, I've been dealing with Japanese encoding issues
on a daily basis for 20 years, and I'm well aware that programmers
have several good reasons (and a lot more bad ones) for avoiding
them, and even for avoiding Unicode when they must deal with
encodings at all.  I don't think any of the good reasons have been
offered here yet, that's all.

> Unfortunately, real-world text data exists which cannot be safely
> roundtripped to unicode, and must be handled in "bytes with
> encoding" form for certain operations.

Or in "Unicode with encoding" form.  See below for why this makes
sense in the context of Python.

> I personally do not have to deal with this *particular* use case
> any more -- I haven't been at NTT/Verio for six years now.

As mentioned, I have a bit of understanding of the specific problems
of Japanese-language computing.  In particular, roundtripping
Japanese from *any* encoding to *any other* encoding is problematic,
because the national standards provide only a proper subset of the
repertoire actually used by the Japanese people.  (Even JIS X 0213.)

> My current needs are simpler, thank goodness. ;-)  However, they
> *do* involve situations where I'm dealing with *other*
> encoding-restricted legacy systems, such as software for
> interfacing with the US Postal Service that only works with a
> restricted subset of latin1, while receiving mangled ASCII from an
> ecommerce provider, and storing things in what's effectively a
> latin-1 database.

Yes, I know of similar issues in other applications.  For example,
TeX error messages do not respect UTF-8 character boundaries, so
Emacs has to handle them specially (basically, a mechanism similar in
spirit to PEP 383 is used).

> Being able to easily assert what kind of bytes I've got would
> actually let me catch errors sooner, *if* those assertions were
> being checked when different kinds of strings or bytes were being
> combined (i.e., at coercion time).

I can see that this would make life a little easier for you, letting
you maintain the code without refactoring it.  I'd say it's a kludge,
but without a full list of requirements I'm in no position to claim
any authority <wink>.  E.g., for a non-kludgey suggestion, how about
defining a codec which takes Latin-1 bytes, checks (with an error on
failure) for the restricted subset, and converts to str?  (A sketch
of such a codec follows below.)  Then you can manipulate these things
as str with abandon internally.  Finally, you get another check in
the outgoing codec, which converts from str to "effective Latin-1
bytes", however that is defined.

But OK, maybe I'm just being naive, and you need this unlovely
artifice so you can put asserts in appropriate places.  Now, does it
belong in the stdlib?

It seems to me that in the case of Japanese roundtripping, *most* of
the time encoding back to a standard Japanese encoding will work.  If
you run into one of the problematic characters that JIS doesn't allow
but Japanese like to use, because they prefer that glyph to the
JIS-standard one, you get an occasional error on encoding to a
standard Japanese encoding, which you handle specially with a
database of such characters.  Knowing the specific encoding
originally used normally does *not* help unless you're replying to
that person, and *only* that person, because the extended repertoires
vary widely and the only standard they share is JIS.  I conclude that
ebytes does *no* good here.
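Concretely, here's a minimal sketch of the kind of codec I mean.
(The ALLOWED repertoire, the codec name 'usps_latin_1', and the use
of ValueError are placeholders of my own, not anything
USPS-specific.)

    import codecs

    # Hypothetical restricted repertoire: printable ASCII plus a few
    # Latin-1 letters the downstream system is assumed to accept.
    ALLOWED = set(range(0x20, 0x7F)) | {0xC9, 0xE9, 0xF1}

    def _check(raw):
        for i, b in enumerate(raw):
            if b not in ALLOWED:
                raise ValueError('byte %#04x at offset %d is outside '
                                 'the repertoire' % (b, i))

    def _decode(data, errors='strict'):
        raw = bytes(data)
        _check(raw)                      # check on the way in
        return raw.decode('latin-1', errors), len(raw)

    def _encode(text, errors='strict'):
        raw = text.encode('latin-1', errors)
        _check(raw)                      # and again on the way out
        return raw, len(text)

    def _search(name):
        if name == 'usps_latin_1':
            return codecs.CodecInfo(_encode, _decode, name=name)
        return None

    codecs.register(_search)

    street = b'123 Main St'.decode('usps_latin_1')  # OK; str from here on
    try:
        b'sm\xf8rrebr\xf8d'.decode('usps_latin_1')
    except ValueError as e:
        print(e)                         # 0xf8 is outside the repertoire

The point is that the repertoire check lives at the boundary, in the
codec; everything inside the program is plain str, and there is
nothing left to assert at coercion time.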
For the ecommerce/USPS case, well, actually you need special-purpose
encodings anyway (ISTM).  'latin-1' loses: the USPS is allergic to
some valid 'latin-1' characters.  'ascii' loses: apparently you need
some of the Latin-1 repertoire, and anyway AIUI the ecommerce
provider munges the ASCII.  So what does ebytes actually buy you
here, unless you write the codecs?  And if you've got the codecs,
what additional benefit do you get from ebytes?  Note that you would
*also* need to do explicit transcoding anyway if you were dealing
with Japan Post instead of the USPS, although I grant that your code
is probably general enough to deal with Deutsche Telekom (but the
German equivalent of your ecommerce provider probably has its own
ways of munging Latin-1).  I conclude that there may be genuine
benefits to ebytes here, but they're probably not general enough to
put in the stdlib (or the Python language).

> Which works if and only if your outputs are truly unicode-able.

With PEP 383, they always are, as long as you allow the str to be
encoded back to the same garbage your bytes-based program would have
produced anyway.

> If you work with legacy systems (e.g. those Asian email clients and
> US postal software), you are really working with a *character set*,
> not unicode,

I think you're missing something.  Namely, Unicode is a standard for
handling characters as integers, and a registry for mapping
characters to those integers.  It includes over 100,000 private-use
code points for making up your own mappings, and recent Python also
provides (as an internal extension) a way to embed non-characters in
a str.  Unicode does not impose a repertoire, however.  That's up to
the application, and Python provides a convenient way to restrict
repertoires by defining special-purpose codecs in Python.

It is then up to the program to ensure that all candidates claiming
to be text pass through the cleansing fire of a codec before being
allowed into the Pure Land of str.  This can be something of a
problem; there are a few ways for textual data to get into Python,
and not all of them were obvious to me.  But this problem would be
even worse for mechanisms like ebytes, where it's up to the
programmer to decide which things are put into ebytes.

> and so putting your data in unicode form is actually *wrong* -- an
> expedient lie.
>
> Heresy, I know, but there you go. ;-)

It's not heresy; it simply assumes a restriction on the use of
Unicode that just isn't true.  It *is* true that mapping the data to
Unicode according to some encoding is not always sufficient, and it
*is* often the case that further information must be provided to
ensure semantic correctness.  However, given the mapping (== properly
defined codecs), roundtripping *is* always possible, at least up to
the size of private space, which is certainly big enough to hold the
Post Office's repertoire.  And that mapping is a Python object which
will fit into a variable for later use.
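To make the PEP 383 claim above concrete, here is the roundtrip in
miniature (Python 3.1+; the byte string is just an arbitrary example
of non-UTF-8 garbage):

    raw = b'caf\xe9 \xff'              # not valid UTF-8
    s = raw.decode('utf-8', errors='surrogateescape')
    assert s.encode('utf-8', errors='surrogateescape') == raw

Each undecodable byte comes back as a lone surrogate, which encodes
back to exactly the byte it came from.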
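And a sketch of the private-space trick, written as a codec error
handler.  (The handler name 'pua-escape' and the plane 15 base are my
own choices; a real gaiji application would use a curated mapping
table rather than this byte-for-byte fallback.)

    import codecs

    PUA_BASE = 0xF0000   # start of the plane 15 Private Use Area

    def pua_escape(exc):
        # Undecodable bytes become private-use characters ...
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(PUA_BASE + b) for b in bad), exc.end
        # ... and (assuming only such characters fail to encode)
        # they encode back to the original bytes.
        if isinstance(exc, UnicodeEncodeError):
            chars = exc.object[exc.start:exc.end]
            return bytes(ord(c) - PUA_BASE for c in chars), exc.end
        raise exc

    codecs.register_error('pua-escape', pua_escape)

    raw = b'\x93\xfa\x96\x7b\x80'      # Shift JIS plus a stray 0x80
    s = raw.decode('shift_jis', errors='pua-escape')
    assert s.encode('shift_jis', errors='pua-escape') == raw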