At 04:49 PM 6/25/2010 +0900, Stephen J. Turnbull wrote:
> P.J. Eby writes:

> > This doesn't have to be in the functions; it can be in the
> > *types*.  Mixed-type string operations have to do type checking and
> > upcasting already, but if the protocol were open, you could make an
> > encoded-bytes type that would handle the error checking.

> Don't you realize that "encoded-bytes" is equivalent to use of a very
> limited profile of ISO 2022 coding extensions?  Such as Emacs/MULE
> internal encoding or TRON code?  It has been tried.  It does not work.
>
> I understand how types can do such checking; my point is that the
> encoded-bytes type doesn't have enough information to do it in the
> cases where you think it is better than converting to str.  There are
> *no useful operations* that can be done on two encoded-bytes with
> different encodings unless you know the ultimate target codec.

I do know the ultimate target codec -- that's the point.

IOW, I want to be able to do all my operations by passing target-encoded strings to polymorphic functions. Then, the moment something creeps in that won't go to the target codec, I'll be able to track down the hole in the legacy code that's letting the bad data in.
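
Rough, untested sketch of what I mean -- the name "ebytes" and every detail
of the semantics here are made up purely to illustrate the shape of it:

    # Untested sketch: bytes that know their encoding, and that push
    # anything mixed into them through that codec at the point of entry.
    class ebytes(bytes):
        def __new__(cls, data, encoding):
            if isinstance(data, str):
                data = data.encode(encoding)   # may raise UnicodeEncodeError
            self = super().__new__(cls, data)
            self.encoding = encoding
            return self

        def __add__(self, other):
            if isinstance(other, str):
                # A unicode string creeping in is checked right here,
                # not at some write() call much later.
                other = other.encode(self.encoding)
            elif isinstance(other, ebytes) and other.encoding != self.encoding:
                raise ValueError("mixed encodings: %s + %s"
                                 % (self.encoding, other.encoding))
            return ebytes(bytes(self) + bytes(other), self.encoding)

        def __radd__(self, other):
            if isinstance(other, str):
                other = other.encode(self.encoding)
            return ebytes(bytes(other) + bytes(self), self.encoding)

    page = ebytes("Hello, ", "euc-jp") + "\u65e5\u672c\u8a9e"  # fine
    bad  = ebytes("Hello, ", "ascii") + "\u65e5\u672c\u8a9e"   # raises here

The concatenation semantics aren't the point; the point is that the check
happens where the stray string shows up.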


> The only sensible way to define the concatenation of ('ascii', 'English')
> with ('euc-jp','日本語') is something like ('ascii', 'English',
> 'euc-jp','日本語'), and *not* ('euc-jp','English日本語'), because you
> don't know that the ultimate target codec is 'euc-jp'-compatible.
> Worse, you need to build in all the information about which codecs are
> mutually compatible into the encoded-bytes type.  For example, if the
> ultimate target is known to be 'shift_jis', it's trivially compatible
> with 'ascii' and 'euc-jp' requires a conversion, but latin-9 you can't
> have.
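
To put the quoted example in runnable terms (stock codecs only, nothing
hypothetical here): whether the merge is even representable can only be
decided once the ultimate target codec is known.

    # The two fragments from the example above:
    eng = ('ascii',  b'English')
    jpn = ('euc-jp', b'\xc6\xfc\xcb\xdc\xb8\xec')   # '\u65e5\u672c\u8a9e'

    text = eng[1].decode(eng[0]) + jpn[1].decode(jpn[0])

    text.encode('shift_jis')     # legal: ASCII passes through, kanji convert
    text.encode('euc-jp')        # legal
    text.encode('iso-8859-15')   # UnicodeEncodeError: latin-9 can't have it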

The interaction won't be with other encoded bytes; it'll be with other *unicode* strings -- ones coming from other code, and literals embedded in the stdlib.



> No, the problem is not with the Unicode, it is with the code that
> allows characters not encodable with the target codec.

And precisely which code that is may be very difficult to find, unless I can identify it at the first point where it enters (and corrupts) my output data. In a large code base, that is a nontrivial problem.
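
A toy illustration of the payoff -- every name here is hypothetical, and
the target codec is just whatever the downstream consumer demands:

    # Validate against the target codec the moment data enters the output
    # path, so the traceback names the code that let it in -- not a
    # socket write() three layers and twenty minutes later.
    TARGET = 'ascii'

    def checked(s):
        s.encode(TARGET)          # UnicodeEncodeError raised right here
        return s

    def legacy_lookup(key):
        return 'r\xe9sum\xe9'     # non-ASCII text lurking in old code

    def build_response(key):
        return checked('Subject: ') + checked(legacy_lookup(key))

    build_response('x')
    # -> UnicodeEncodeError inside build_response's call to checked(), so
    #    the traceback points at the exact line that admitted the bad data.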
