[Python-ideas] Re: New explicit methods to trim strings

Richard Damon Mon, 23 Mar 2020 19:30:47 -0700

On 3/23/20 3:31 PM, Andrew Barnert via Python-ideas wrote:
> On Mar 23, 2020, at 04:51, Chris Angelico <[email protected]> wrote:
>> Right, which is why for a proposal like this, it's best to start with
>> the simple and straight-forward option of case sensitivity and precise
>> matching. Removing a prefix of "a\u0301" will not remove a leading
>> "\xe1" and vice versa (just as those two strings don't compare equal).
> Agreed, but I think it’s not just “to start with”, but forever, or at least 
> as long as Python strings are sequences of Unicode code points. If 
> "Café".startswith("Cafe\u0301") is false, "Café".stripprefix("Cafe\u0301") 
> had better not strip anything. And as long as "é" in "Cafe\u0301" and 
> any(ch=="é" for ch in "Cafe\u0301") are false, startswith is correct.
>
> By comparison, in Swift, "Café".hasPrefix("Cafe\u{0301}") is true, because 
> "Cafe\u{0301}" is a sequence of four Unicode scalars, the fourth of which is 
> 'é', as opposed to Python where it’s a sequence of five Unicode code points. 
> And of course Swift also has a slew of methods to do things like localized 
> vs. default case-insensitive equality, substring, etc. testing, none of which 
> Python has, or should have, as long as its strings are made of code points 
> rather than scalars (or EGCs or whatever).


I wasn't familiar with the term "scalar" as used in Unicode, so I looked it
up, and I think you are incorrect here. From the Unicode Glossary:

Unicode Scalar Value. Any Unicode code point except high-surrogate and
low-surrogate code points. In other words, the ranges of integers 0 to
D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.

Thus scalars ARE just code points (excluding the surrogate code points).
What you may be thinking of is the grapheme. It may be that Swift does some
automatic conversion to a canonical form to make the strings match. In
fact, seeing the text displayed as Café doesn't tell you how many code
points there are, as the glyph/grapheme é can be expressed as either the
single code point \u00E9 (NFC) or the sequence e \u0301 (NFD), and Python
can represent either.
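
To make that concrete, here is a small sketch of my own using the standard
unicodedata module. The helper name is_scalar_value is just something I made
up for illustration; it only encodes the glossary definition quoted above.

```python
import unicodedata

# The same displayed text, in the two canonical normal forms.
nfc = unicodedata.normalize("NFC", "Cafe\u0301")  # composed: 'Caf\xe9'
nfd = unicodedata.normalize("NFD", "Caf\xe9")     # decomposed: 'Cafe\u0301'

def is_scalar_value(ch):
    """Illustrative helper: True unless ch is a surrogate code point."""
    return not (0xD800 <= ord(ch) <= 0xDFFF)

print(len(nfc), len(nfd))  # 4 5
print(nfc == nfd)          # False: different code point sequences
# Every code point in both forms is also a Unicode scalar value.
print(all(is_scalar_value(ch) for ch in nfc + nfd))  # True
```

So counting scalars instead of code points would not change either length
here; only the normal form does.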

A basic rule with Unicode strings is that if you are going to be doing these
sorts of comparisons, you should make sure both strings are in the same
normal form.
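
As a rough sketch of what I mean (startswith_normalized is a hypothetical
helper of mine, not an existing or proposed str method), normalizing both
sides first might look like this:

```python
import unicodedata

def startswith_normalized(text, prefix, form="NFC"):
    """Hypothetical helper: prefix test after normalizing both strings."""
    text = unicodedata.normalize(form, text)
    prefix = unicodedata.normalize(form, prefix)
    return text.startswith(prefix)

# Plain startswith compares code points, so mixed forms do not match.
print("Caf\xe9".startswith("Cafe\u0301"))              # False
# After normalizing both sides to the same form, they do.
print(startswith_normalized("Caf\xe9", "Cafe\u0301"))  # True
```

Note that the choice of form still matters for partial matches: with "NFD",
the prefix "Cafe" matches "Café" because the accent is a separate trailing
code point, while with "NFC" it does not.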

-- 
Richard Damon
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/BGJ52L4FLIN7XFI57DJRIXKVTI5IZ3VY/