Marc-Andre Lemburg <m...@egenix.com> added the comment:

Tom Christiansen wrote:
>
> Tom Christiansen <tchr...@perl.com> added the comment:
>
> Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:
>
>     http://unicode.org/reports/tr26/
>
> CESU-8 is *not* a valid Unicode Transformation Format and should not be called
> UTF-8. It is a real pain in the butt, caused by people who misunderstand
> Unicode mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need
> to be able to read it, but call it what it is, please.
>
> Despite the talk about Lucene, I note that the Perl port of Lucene uses real
> UTF-8, not CESU-8.

CESU-8 is a different encoding from the one we are talking about. The only
difference between UTF-8 and the modified variant is the encoding of the
U+0000 code point, which is changed so that the output never contains a
NUL byte.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2857>
_______________________________________
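
For illustration only, the U+0000 difference described in the comment can be sketched in Python as below. The function names are hypothetical and are not part of any codec proposed on the tracker. The sketch assumes the common convention of writing U+0000 as the two-byte overlong sequence C0 80, and it covers nothing beyond that one difference.

```python
def encode_nul_free_utf8(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the two bytes C0 80.

    Hypothetical helper: in standard UTF-8 the byte 0x00 only ever
    encodes U+0000, so a plain byte replacement is sufficient.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free_utf8(data: bytes) -> str:
    """Reverse encode_nul_free_utf8.

    The byte 0xC0 never occurs in valid standard UTF-8, so every
    C0 80 pair can be mapped back to a NUL byte before decoding.
    This sketch is lenient: a raw 0x00 byte in the input is also
    accepted and decoded as U+0000.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


if __name__ == "__main__":
    sample = "a\x00b"
    encoded = encode_nul_free_utf8(sample)
    print(encoded)                        # the NUL is written as C0 80
    print(decode_nul_free_utf8(encoded) == sample)
```

Text without U+0000 comes out byte-for-byte identical to ordinary UTF-8, which is the point made in the comment: only the NUL code point is treated differently.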