[issue2857] add codec for java modified utf-8

Marc-Andre Lemburg Fri, 12 Aug 2011 03:26:43 -0700

Marc-Andre Lemburg <[email protected]> added the comment:

Tom Christiansen wrote:
> 
> Tom Christiansen <[email protected]> added the comment:
> 
> Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:
> 
>   http://unicode.org/reports/tr26/
> 
> CESU-8 is *not* a valid Unicode Transformation Format and should not be called 
> UTF-8. It is a real pain in the butt, caused by people who misunderstand 
> Unicode mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need 
> to be able to read it, but call it what it is, please.
> 
> Despite the talk about Lucene, I note that the Perl port of Lucene uses real 
> UTF-8, not CESU-8.


CESU-8 is a different encoding from the one we are talking about.

The only difference between UTF-8 and the modified variant is the
encoding of the U+0000 code point, which is changed so that the output
does not contain any NUL bytes.
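
To illustrate the U+0000 point, here is a minimal Python 3 sketch. It
assumes the modified form writes U+0000 as the two-byte sequence
0xC0 0x80. The function names are hypothetical and are not part of any
existing codec. It covers only the NUL difference described above and
is not a proposed implementation for this issue.

```python
def encode_modified_nul(text):
    """Encode text as UTF-8, but write U+0000 as C0 80 (hypothetical helper)."""
    # In standard UTF-8 the byte 0x00 appears only as the encoding of
    # U+0000, so a plain byte replacement cannot touch other characters.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_modified_nul(data):
    """Reverse encode_modified_nul (hypothetical helper)."""
    # The byte 0xC0 never occurs in valid standard UTF-8, so the pair
    # C0 80 can only be the modified NUL encoding.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


sample = "a\x00b"
encoded = encode_modified_nul(sample)
assert b"\x00" not in encoded
assert decode_modified_nul(encoded) == sample
```

Text without U+0000 comes out byte-for-byte identical to regular UTF-8
with this sketch.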

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue2857>
_______________________________________