Ezio Melotti <ezio.melo...@gmail.com> added the comment:

>> We might keep the old public API for compatibility, but it should be
>> clearly marked as broken for non-BMP scalar values.
> That has always been the case. UCS2 doesn't support surrogates.
> However, we have been slowly moving into the direction of making
> the UCS2 storage appear like UTF-16 to the Python programmer.

UCS2 died long ago; is there any reason why we keep using a UCS2 that "appears" like UTF-16 instead of real UTF-16?

> This process is not yet complete and will likely never complete
> since it must still be possible to create things like lone
> surrogates for processing purposes, so care has to be taken
> when using non-BMP code points on narrow builds.

I don't know all the details of the current implementation, but -- from what I understand reading this (correct me if I'm wrong) -- it seems that the implementation is half-UCS2, to allow things like the processing of lone surrogates, and half-UTF-16 (or UTF-16-compatible), to work with surrogate pairs and hence with chars outside the BMP.

What are the use cases for processing lone surrogates? Wouldn't it be better to use real UTF-16 and disallow them (since they are illegal), and possibly provide some other way to deal with them (if it's really needed)?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue5127>
_______________________________________
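
To make the distinction between a surrogate pair and a lone surrogate concrete, here is a minimal sketch. It assumes a modern Python 3 (3.4 or later), not the narrow builds discussed in this issue, and the helper names `utf16_code_units` and `is_well_formed` are hypothetical, chosen only for illustration. It shows that a non-BMP code point becomes two UTF-16 code units, and that a strict UTF-16 encoder rejects a lone surrogate unless asked to pass it through.

```python
def utf16_code_units(text, errors="strict"):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be", errors)
    # Each UTF-16 code unit is two bytes, big-endian here.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_well_formed(text):
    """Return True if text can be encoded as strict (real) UTF-16."""
    try:
        text.encode("utf-16-be")
        return True
    except UnicodeEncodeError:
        # Raised for lone surrogates, which are illegal in UTF-16.
        return False


non_bmp = "\U00010000"   # first code point outside the BMP
lone = "\ud800"          # a lone high surrogate

# A non-BMP code point is stored as a high/low surrogate pair.
print([hex(u) for u in utf16_code_units(non_bmp)])

# A lone surrogate is not valid UTF-16 ...
print(is_well_formed(non_bmp), is_well_formed(lone))

# ... but can still be processed if explicitly requested.
print([hex(u) for u in utf16_code_units(lone, errors="surrogatepass")])
```

The `surrogatepass` error handler is one possible answer to "some other way to deal with them": strict by default, with an explicit opt-in for code that really needs to process lone surrogates.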