At 04:49 PM 6/25/2010 +0900, Stephen J. Turnbull wrote:
> P.J. Eby writes:

> > This doesn't have to be in the functions; it can be in the
> > *types*.  Mixed-type string operations have to do type checking and
> > upcasting already, but if the protocol were open, you could make an
> > encoded-bytes type that would handle the error checking.

> Don't you realize that "encoded-bytes" is equivalent to use of a very
> limited profile of ISO 2022 coding extensions?  Such as Emacs/MULE
> internal encoding or TRON code?  It has been tried.  It does not work.
>
> I understand how types can do such checking; my point is that the
> encoded-bytes type doesn't have enough information to do it in the
> cases where you think it is better than converting to str.  There are
> *no useful operations* that can be done on two encoded-bytes with
> different encodings unless you know the ultimate target codec.

I do know the ultimate target codec -- that's the point.

IOW, I want to be able to do all my operations by passing target-encoded strings to polymorphic functions. Then, the moment something creeps in that won't go to the target codec, I'll be able to track down the hole in the legacy code that's letting the bad data in.
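
Rough, untested sketch of what I mean -- the name "ebytes" and every detail
of the semantics here are made up purely to illustrate the shape of it:

    # Untested sketch: bytes that know their encoding, and that push
    # anything mixed into them through that codec at the point of entry.
    class ebytes(bytes):
        def __new__(cls, data, encoding):
            if isinstance(data, str):
                data = data.encode(encoding)   # may raise UnicodeEncodeError
            self = super().__new__(cls, data)
            self.encoding = encoding
            return self

        def __add__(self, other):
            if isinstance(other, str):
                # A unicode string creeping in is checked right here,
                # not at some write() call much later.
                other = other.encode(self.encoding)
            elif isinstance(other, ebytes) and other.encoding != self.encoding:
                raise ValueError("mixed encodings: %s + %s"
                                 % (self.encoding, other.encoding))
            return ebytes(bytes(self) + bytes(other), self.encoding)

        def __radd__(self, other):
            if isinstance(other, str):
                other = other.encode(self.encoding)
            return ebytes(bytes(other) + bytes(self), self.encoding)

    page = ebytes("Hello, ", "euc-jp") + "\u65e5\u672c\u8a9e"  # fine
    bad  = ebytes("Hello, ", "ascii") + "\u65e5\u672c\u8a9e"   # raises here

The concatenation semantics aren't the point; the point is that the check
happens where the stray string shows up.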


> The only sensible way to define the concatenation of ('ascii', 'English')
> with ('euc-jp','日本語') is something like ('ascii', 'English',
> 'euc-jp','日本語'), and *not* ('euc-jp','English日本語'), because you
> don't know that the ultimate target codec is 'euc-jp'-compatible.
> Worse, you need to build in all the information about which codecs are
> mutually compatible into the encoded-bytes type.  For example, if the
> ultimate target is known to be 'shift_jis', it's trivially compatible
> with 'ascii' and 'euc-jp' requires a conversion, but latin-9 you can't
> have.
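
To put the quoted example in runnable terms (stock codecs only, nothing
hypothetical here): whether the merge is even representable can only be
decided once the ultimate target codec is known.

    # The two fragments from the example above:
    eng = ('ascii',  b'English')
    jpn = ('euc-jp', b'\xc6\xfc\xcb\xdc\xb8\xec')   # '\u65e5\u672c\u8a9e'

    text = eng[1].decode(eng[0]) + jpn[1].decode(jpn[0])

    text.encode('shift_jis')     # legal: ASCII passes through, kanji convert
    text.encode('euc-jp')        # legal
    text.encode('iso-8859-15')   # UnicodeEncodeError: latin-9 can't have it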

The interaction won't be with other encoded bytes; it'll be with other *unicode* strings -- ones coming from other code, and literals embedded in the stdlib.



> No, the problem is not with the Unicode, it is with the code that
> allows characters not encodable with the target codec.

And precisely which code that is may be very difficult to find, unless I can identify it at the first point where it enters (and corrupts) my output data. In a large code base, that is a nontrivial problem.
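
A toy illustration of the payoff -- every name here is hypothetical, and
the target codec is just whatever the downstream consumer demands:

    # Validate against the target codec the moment data enters the output
    # path, so the traceback names the code that let it in -- not a
    # socket write() three layers and twenty minutes later.
    TARGET = 'ascii'

    def checked(s):
        s.encode(TARGET)          # UnicodeEncodeError raised right here
        return s

    def legacy_lookup(key):
        return 'r\xe9sum\xe9'     # non-ASCII text lurking in old code

    def build_response(key):
        return checked('Subject: ') + checked(legacy_lookup(key))

    build_response('x')
    # -> UnicodeEncodeError inside build_response's call to checked(), so
    #    the traceback points at the exact line that admitted the bad data.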
