Steve D'Aprano <steve+pyt...@pearwood.info>:

> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>> Also, [surrogates] don't exist as Unicode code points. Python
>> shouldn't allow surrogate characters in strings.
>
> Not quite. This is where it gets a bit messy and confusing. The bottom
> line is: surrogates *are* code points, but they aren't *characters*.

All animals are equal, but some animals are more equal than others.

> Strings which contain surrogates are strictly speaking illegal,
> although some programming languages (including Python) allow them.

Python shouldn't allow them.

> The Unicode standard defines surrogates as follows:
> [...]
>
> - Surrogate Code Point. A Unicode code point in the range
>   U+D800..U+DFFF. Reserved for use by UTF-16,

The writer of the standard is playing word games, maybe to offer a fig
leaf to Windows, Java et al.

> By the letter of the Unicode standard, [Python] should not do this,
> but nevertheless it does and it appears to do no real harm and have
> some benefit.

I'm afraid Python's choice may lead to exploitable security holes in
Python programs.

>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> If Python was strictly conforming, that is correct, but it turns out
> there are some useful things you can do with strings if you allow
> surrogates.

Conceptual confusion is a high price to pay for such tricks.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
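
For anyone who wants to check the behaviour being argued about, here is
a minimal sketch, assuming CPython 3. It shows that a lone surrogate is
accepted in a str literal, that the default strict UTF-8 encoder refuses
it, and one of the "useful things" mentioned above: the surrogateescape
error handler from PEP 383. The helper name can_encode_utf8 is
hypothetical, introduced only for this illustration.

```python
# Sketch: how CPython 3 treats a lone surrogate code point in a str.
low = '\uDC37'  # accepted by the compiler; no SyntaxError is raised


def can_encode_utf8(s):
    """Return True if s encodes to UTF-8 under the default 'strict' handler."""
    try:
        s.encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False


# The surrogateescape error handler (PEP 383) maps each undecodable byte
# to a lone surrogate in U+DC80..U+DCFF, and maps it back on encoding.
raw = b'caf\xe9'  # Latin-1 bytes, not valid UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
round_trip = text.encode('utf-8', errors='surrogateescape')

print(len(low), hex(ord(low)))   # 1 0xdc37
print(can_encode_utf8(low))      # False
print(ascii(text))               # 'caf\udce9'
print(round_trip == raw)         # True
```

The string holding the surrogate exists happily in memory and only fails
once it meets a strict encoder, which may be far from where it was
created. Whether that counts as a convenience or a hazard is the point
in dispute in this thread.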