On Wed, Jul 19, 2017 at 4:31 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>
>> On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>>> Yes. Also, not every letter can be normalized to a single codepoint so
>>> NFC is not a way out. For example,
>>>
>>>    re.match("^[q̈]$", "q̈")
>>>
>>> returns None regardless of normalization.
>>
>> In what language or context would you actually want to do this?
>
> I could have picked more realistic examples: Classic Greek or Hebrew,
> for example.
>
> However, someone might actually use even "q̈" in a real setting. First of
> all, it *is* a legal character. Secondly, people sometimes combine
> characters in an ad-hoc fashion. Thirdly, remember the case of
> Esperanto, which blessed the world with the letters
>
>    ĉ ĝ ĥ ĵ ŝ ŭ
>
> Esperanto's venerable history finally awarded those characters a
> code-point status in Unicode. However, around the year 2000, it was
> still commonplace to use all sorts of tricks to type them on the
> Internet:
>
>    ch gh hh jj sh u
>
>    ^c ^g ^h ^j ^s ^u
>
>    cx gx hx jx sx ux
>
> For all we know, someone somewhere might be cooking up a language that
> depends on "q̈".

Sure. And if they do, they'll have to contend with the fact that it's
going to be represented as multiple code points.

What I *think* you're asking for is for square brackets in a regex to
treat a base character and the combining characters that follow it as
a single unit. That would make a lot of sense, and would actually be a
reasonable feature to request. (Probably as an option, in case there's
a backward compatibility issue.)

ChrisA
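
For anyone who wants to reproduce this, here is a minimal sketch using only the standard library. It first shows that the bracketed pattern fails under every normalization form, then works around it by expanding the "character class" into an alternation of whole clusters. The helpers `clusters` and `char_class` are hypothetical names made up for this illustration, not part of `re`, and grouping by Unicode category "M" is a simplifying assumption rather than full grapheme-cluster segmentation.

```python
import re
import unicodedata

Q_DIAERESIS = "q\u0308"  # 'q' followed by COMBINING DIAERESIS

# No precomposed code point exists for this letter, so every
# normalization form leaves it as two code points, and a bracketed
# class sees two separate members: 'q' and U+0308.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    s = unicodedata.normalize(form, Q_DIAERESIS)
    print(form, len(s), re.match("^[" + s + "]$", s))

nfc = unicodedata.normalize("NFC", Q_DIAERESIS)


def clusters(text):
    """Hypothetical helper: attach each combining mark to the preceding base character."""
    out = []
    for ch in text:
        # Assumption: any character in category M* (Mn, Mc, Me) combines.
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def char_class(members):
    """Hypothetical helper: emulate [...] over clusters with an alternation."""
    # Longest alternatives first, so "q" + U+0308 is tried before a bare "q".
    alts = sorted(set(clusters(members)), key=len, reverse=True)
    return "(?:" + "|".join(re.escape(a) for a in alts) + ")"


pattern = "^" + char_class(nfc) + "$"
print(re.match(pattern, nfc) is not None)
```

The alternation is only a stopgap: it does not support ranges or negation the way a real bracket expression does, which is why having the regex engine itself treat a cluster as one unit would be the more useful feature.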