On 30 January 2018 at 06:54, Chris Barker <chris.bar...@noaa.gov> wrote:
> On Fri, Jan 26, 2018 at 5:27 PM, Steven D'Aprano <st...@pearwood.info>
> wrote:
>>
>> Tcl/Tk and JavaScript only support UCS-2 (16 bit) Unicode strings.
>> Dealing with the Supplementary Unicode Planes has the same problems
>> that older "narrow" builds of Python suffered from: single code points
>> were counted as len(2) instead of len(1), slicing could be wrong, etc.
>>
>> There are still many applications which assume Latin-1 data. For
>> instance, I use a media player which displays mojibake when passed
>> anything outside of Latin-1.
>>
>> Sometimes it is useful to know in advance when text you pass to another
>> application is going to run into problems because of the other
>> application's limitations.
>
> I'm confused -- isn't the way to do this to encode your text into the
> encoding the other application accepts?
>
> If you really want to know in advance, is it so hard to run it through an
> encode/decode sandwich?
>
> Wait -- I can't find UCS-2 in the built-in encodings -- am I dense or is it
> not there? Shouldn't it be? If only for this reason?

If you want to check whether or not something lies entirely within
the BMP, check for:

    2*len(text) == len(text.encode("utf-16-le")) # True iff text is UCS-2

The byte order has to be given explicitly here, because the plain
"utf-16" codec prepends a 2-byte byte order mark and the lengths would
never match. If there's an astral code point in there, then the encoded
version will need more than 2 bytes for at least one element, so the
result will end up being longer than it would for UCS-2 data.

You can also check for pure ASCII in much the same way:

    len(text) == len(text.encode("utf-8")) # True iff text is 7-bit ASCII

So this is partly an optimisation question:

- folks want to avoid allocating a bytes object just to throw it away
- folks want to avoid running the equivalent of "max(map(ord, text))"
- folks know that CPython (at least) tracks this kind of info internally
  to manage its own storage allocations

But it's also a readability question: "is_ascii()" and
"is_UCS2()/is_BMP()" just require knowing what 7-bit ASCII and UCS-2 (or
the basic multilingual plane) *are*, whereas the current ways of
checking for them require knowing how they *behave*.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
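
For illustration, here is a minimal sketch of what the two checks could look like as plain functions. The names is_ascii() and is_bmp() are hypothetical stand-ins for the methods being discussed, not existing str methods, and the pure-Python loops do not use the internal storage information that CPython tracks. The encode-based variant is included only to show that it agrees with the code point check.

```python
def is_ascii(text: str) -> bool:
    """Hypothetical helper: True if every code point is 7-bit ASCII."""
    return all(ord(ch) < 0x80 for ch in text)


def is_bmp(text: str) -> bool:
    """Hypothetical helper: True if every code point is in the BMP."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def is_bmp_via_encode(text: str) -> bool:
    """Same BMP check using the length comparison from the message.

    "utf-16-le" is used so that no byte order mark is prepended.
    This allocates a temporary bytes object, and it raises
    UnicodeEncodeError if the text contains a lone surrogate.
    """
    return 2 * len(text) == len(text.encode("utf-16-le"))


samples = ["", "hello", "caf\u00e9", "\u4e2d\u6587", "\U0001F600"]
for s in samples:
    print(ascii(s), is_ascii(s), is_bmp(s), is_bmp_via_encode(s))
```

The loop versions stop at the first offending character, while the encode versions always process the whole string and allocate a copy, which is the cost the bullet points above are concerned with.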