On Thu, 12 Jan 2017 14:12:09 +0100 Mark Davis ☕️ <m...@macchiato.com> wrote:
> I agree that comprehension is a goal. I'd imagine using a BNF regex,
> like the following. This is simple, since I'm just doing Latin, but
> you can see what I mean.
> word = base* ;
> base = (latinLetter latinMn*) ;
> latinLetter = [[:scx=Latn:]&[:L:]] ;
> latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
>
> which turns into the single regex expression:
>
> ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*

Ouch! That's alarmingly wrong. You've excluded the likes of English
'Caesar' with ZWJ, Welsh 'Llan͏gollen' with CGJ (the word doesn't
contain the letter 'ng') and the ISO-sanctioned transliteration of Thai
SO SUEA as 's̄'.

Fixing it isn't easy. At least, I assume Arabic harakat don't attach to
Latin letters in your conception of Latin script text, so replacing
'scx=Common' by 'sc=Inherited' doesn't work well. The problem may be
conflicting requirements on the Script_Extensions property.

Richard.
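
For anyone wanting to poke at the three counterexamples, here is a minimal Python sketch (not part of the original exchange). It assumes the characters in question are U+200D (ZWJ), U+034F (CGJ) and U+0304 (the combining macron in 's̄'), and it uses only the standard unicodedata module. That module does not expose Script or Script_Extensions, so the sketch can only show the [:Mn:] half of the latinMn test. The helper names describe and is_nonspacing_mark are illustrative, not from any library.

```python
import unicodedata

# Characters assumed to be behind the three counterexamples in the message.
SAMPLES = {
    "ZWJ": "\u200d",     # ZERO WIDTH JOINER
    "CGJ": "\u034f",     # COMBINING GRAPHEME JOINER
    "macron": "\u0304",  # COMBINING MACRON, as in the transliteration 's̄'
}


def describe(ch):
    """Return (code point, character name, General_Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def is_nonspacing_mark(ch):
    """True if the character would pass the [:Mn:] part of the latinMn class."""
    return unicodedata.category(ch) == "Mn"


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        verdict = "Mn" if is_nonspacing_mark(ch) else "not Mn"
        print(label, *describe(ch), verdict)
```

ZWJ has General_Category Cf, so it fails the `&[:Mn:]` intersection whatever its script values are. CGJ and the combining macron are both Mn, so if they are excluded it has to be on the script side of the class, which is where the Script_Extensions question above comes in.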